Turkish AI-Generated Text Detection

Budget: $300.0 FIXED / ⭐ 0.00 (0) Turkey

machine-learning, artificial-intelligence, tensorflow, deep-learning, data-science, neural-networks

We are seeking a skilled freelancer to improve our existing machine learning model (or build a new one) for detecting AI-generated text in Turkish. The ideal candidate has experience in natural language processing and machine learning, with a focus on improving model accuracy and efficiency. The work is done in Python and requires a strong understanding of AI-generated text detection techniques.

---

**What we have built:**

We have a working alpha system that detects AI-generated text in Turkish academic documents. The current pipeline:

- Fine-tuned XLM-RoBERTa (xlm-roberta-base) on a labeled Turkish dataset
- Chunk-based inference with sliding-window sentence scoring
- 94.2% accuracy on definitive decisions in production testing (unfortunately, real-world results are poor)
- Trained on 5,000 labeled texts across 25 academic disciplines (human vs AI-generated)

**What we need:**

Improve the model's performance using one or more of the following approaches. You choose the best method based on results:

1. **Temperature scaling / calibration**: Make confidence scores meaningful and well-calibrated
2. **Perplexity-based signal (Binoculars approach)**: Add a training-free, generator-agnostic signal using two open Turkish-capable LLMs (e.g. Qwen2.5). Fuse with the existing classifier.
3. **Paraphrase augmentation (RADAR approach)**: Augment AI training samples with paraphrased versions to improve robustness against humanization tools
4. **Stronger backbone**: Evaluate mDeBERTa-v3-base or XLM-R-large as a drop-in replacement
5. **Active learning**: Identify and prioritize the most informative uncertain examples for labeling

You may also build a brand-new model from scratch using the data. We accept that as well.

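
To make approach 1 concrete, here is a minimal sketch of temperature scaling for a binary detector. It is an illustration only, not our production code: the function names, the grid range, and the toy validation values are all assumptions, and it assumes the classifier exposes a single "AI vs human" logit per text (for a two-class softmax head, the difference of the two logits plays that role). A single temperature `T` is fitted on held-out validation data by minimizing the negative log-likelihood, then reused at inference.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mean_nll(logits, labels, temperature):
    """Mean negative log-likelihood of binary labels for logits divided by T."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / temperature
        # -log(sigmoid(s)) for label 1, -log(1 - sigmoid(s)) for label 0
        total += _softplus(-s) if y == 1 else _softplus(s)
    return total / len(logits)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature from a grid that minimizes validation NLL."""
    if grid is None:
        # Hypothetical search range: 0.05, 0.10, ..., 10.0
        grid = [0.05 * k for k in range(1, 201)]
    return min(grid, key=lambda t: mean_nll(logits, labels, t))


def calibrated_prob(logit, temperature):
    """Probability of the 'AI' class after temperature scaling."""
    s = logit / temperature
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


# Toy validation set (made up): very confident logits, only 6 of 8 correct.
val_logits = [4.0, -4.0, 4.0, -4.0, 4.0, -4.0, 4.0, -4.0]
val_labels = [1, 0, 1, 0, 0, 1, 1, 0]

T = fit_temperature(val_logits, val_labels)
print("fitted temperature:", T)
print("raw confidence:", calibrated_prob(4.0, 1.0))
print("calibrated confidence:", calibrated_prob(4.0, T))
```

In this toy case the detector is right 75% of the time while reporting about 98% confidence, so the fitted temperature is greater than 1 and pulls the reported confidence down to roughly 0.75. Temperature scaling does not change which class is predicted, so it improves the meaning of the scores but not the accuracy itself.
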
**Deliverables:**

- Python scripts only (no UI, no API, no deployment)
- Training script with the improved method
- Evaluation report: accuracy, FPR per category, comparison with baseline
- Saved model weights + inference script

**Data:**

- 5,000 labeled JSONL examples provided (title-matched human/AI pairs, 53 disciplines, 5 AI models)
- Can be increased if needed (I can expand it up to 10x)
- Format: `"discipline", "title", "text" (human), "ai_text" (AI), "model"`

The most important point is this: we test with data from outside the provided dataset (i.e., data that was not included in training). For example, we test with works written before 2020. Even though the scores look high in the internal test run in Python, they come out lower in real-world tests with data that was never part of the dataset. For that reason, "being done" does not mean passing a test run in Python on the selected/curated data. It means passing a real-world test that the developer can also run themselves.
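
As a rough illustration of the "FPR per category" item in the evaluation report, the sketch below expands records in the JSONL format above into labeled examples and computes the false positive rate per discipline, meaning the share of human-written texts that are flagged as AI. The helper names, the toy records, and the keyword-based `toy_predict` function are placeholders invented for the example; a real run would plug in the actual inference script and, for the real-world check, texts from outside the dataset such as works written before 2020.

```python
import json
from collections import defaultdict


def expand_pairs(jsonl_lines):
    """Turn each title-matched record into one human (0) and one AI (1) example."""
    examples = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        examples.append(
            {"discipline": rec["discipline"], "text": rec["text"], "label": 0}
        )
        examples.append(
            {"discipline": rec["discipline"], "text": rec["ai_text"], "label": 1}
        )
    return examples


def fpr_per_discipline(examples, predict):
    """False positive rate per discipline: human texts predicted as AI (1)."""
    humans = defaultdict(int)
    flagged = defaultdict(int)
    for ex in examples:
        if ex["label"] != 0:
            continue  # FPR only looks at human-written texts
        humans[ex["discipline"]] += 1
        flagged[ex["discipline"]] += int(predict(ex["text"]) == 1)
    return {d: flagged[d] / humans[d] for d in humans}


# Toy records in the described format (contents are made up).
sample_lines = [
    '{"discipline": "history", "title": "t1", "text": "insan metni", '
    '"ai_text": "yapay metin", "model": "m1"}',
    '{"discipline": "history", "title": "t2", "text": "yapay gibi duran insan metni", '
    '"ai_text": "yapay metin iki", "model": "m2"}',
    '{"discipline": "physics", "title": "t3", "text": "baska bir insan metni", '
    '"ai_text": "yapay metin uc", "model": "m1"}',
]


def toy_predict(text):
    """Placeholder detector: flags any text containing the word 'yapay'."""
    return int("yapay" in text)


examples = expand_pairs(sample_lines)
fpr = fpr_per_discipline(examples, toy_predict)
for discipline, rate in sorted(fpr.items()):
    print(discipline, rate)
```

Because pre-2020 works are human-written by construction, a set made only of such texts can be scored with the same `fpr_per_discipline` function, which makes the false positive rate a natural headline number for the real-world test.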
