Data Scientist
Budget: $45–$70/hr
Hourly / Full-time
⭐ 4.98 (11)
United States
data-science, python, data-analysis
Data Scientist — AI Evaluation Specialist (Contract, Remote)
The work
YellowPad helps businesses turn information buried in documents into structured, auditable data. We're looking for an analytical data scientist to rigorously evaluate the quality of outputs from our AI-powered document data-extraction system. The core question is simple to ask and hard to answer: how good are the outputs, and where should we improve them next?
You will not build or deploy the production system. You will understand the relevant workflow, measure the quality of its outputs, identify where errors come from, and tell the team — with evidence — what's working, what's broken, and where to invest. Your main deliverable is a clear, prioritized written report backed by data.
Engagement
Contract, fully remote. ~20–30 hrs/week to start (flexible), with potential to extend based on fit.
Async-friendly. A few hours of overlap with US Eastern time for check-ins is helpful, but the work is largely independent.
Rate: $45–70/hr depending on experience.
What we're looking for
You've measured the quality of an AI, NLP, search, classification, or information-extraction system before, and you know a single accuracy number is rarely the whole truth. You can dig into distributions and underlying errors, reason about sampling and statistical confidence, and run before-and-after comparisons that show whether quality actually improved. You can read documentation, inspect data, ask good questions, and build an accurate mental model of a complex workflow without needing every step explained. And you communicate exceptionally well — turning messy data into a crisp, prioritized recommendation the team can act on.
Core skills (must-have)
Designing evaluation metrics and methods from scratch
Building, sampling, or validating ground-truth datasets
Reasoning about sampling strategy, statistical confidence, and measurement quality
Inspecting errors and explaining what's driving them
Experience evaluating AI, NLP, search, classification, or extraction systems
Strong SQL and Python for analysis (pandas, numpy, visualization, notebooks)
Comfort with semi-structured data (JSON or similar)
Strong analytical writing
Nice to have
Experience with document AI, OCR, or information extraction
Enough AI/NLP literacy to reason about why an extraction or classification system behaves the way it does
To apply
In your proposal, briefly describe one time you measured whether an ML/NLP system's quality actually improved — what metric you used, how you sampled, and what you concluded. Proposals that skip this will not be considered.