Data Scientist

Budget: $45.0 - $70.0 HOURLY / FULL_TIME ⭐ 4.98 (11) United States

data-science, python, data-analysis

Data Scientist: AI Evaluation Specialist (Contract, Remote)

The work

YellowPad helps businesses turn information buried in documents into structured, auditable data. We're looking for an analytical data scientist to rigorously evaluate the quality of outputs from our AI-powered document data-extraction system. The core question is simple to ask and hard to answer: how good are the outputs, and where should we improve them next?

You will not build or deploy the production system. You will understand the relevant workflow, measure the quality of its outputs, identify where errors come from, and tell the team, with evidence, what's working, what's broken, and where to invest. Your main deliverable is a clear, prioritized written report backed by data.

Engagement

Contract, fully remote. ~20–30 hrs/week to start (flexible), with potential to extend based on fit. Async-friendly. A few hours of overlap with US Eastern time for check-ins is helpful, but the work is largely independent. Rate: $45–70/hr depending on experience.

What we're looking for

You've measured the quality of an AI, NLP, search, classification, or information-extraction system before, and you know a single accuracy number is rarely the whole truth. You can dig into distributions and underlying errors, reason about sampling and statistical confidence, and run before-and-after comparisons that show whether quality actually improved. You can read documentation, inspect data, ask good questions, and build an accurate mental model of a complex workflow without needing every step explained. And you communicate exceptionally well, turning messy data into a crisp, prioritized recommendation the team can act on.

Core skills (must-have)

- Designing evaluation metrics and methods from scratch
- Building, sampling, or validating ground-truth datasets
- Reasoning about sampling strategy, statistical confidence, and measurement quality
- Inspecting errors and explaining what's driving them
- Experience evaluating AI, NLP, search, classification, or extraction systems
- Strong SQL and Python for analysis (pandas, numpy, visualization, notebooks)
- Comfort with semi-structured data (JSON or similar)
- Strong analytical writing

Nice to have

- Experience with document AI, OCR, or information extraction
- Enough AI/NLP literacy to reason about why an extraction or classification system behaves the way it does

To apply

In your proposal, briefly describe one time you measured whether an ML/NLP system's quality actually improved: what metric you used, how you sampled, and what you concluded. Proposals that skip this will not be considered.
