# Data Scientist - Improve Cash Advance Repayment Prediction Model Using Amazon SageMaker Autopilot
Budget: $1000.0 (fixed price)
⭐ 4.88 (14)
United States
amazon-sagemaker, data-science, python, machine-learning
## About the Role
We are a fintech company providing cash advances (overdrafts) to consumers. Our ML-based decision engine predicts whether a user will repay their advance, and currently our approved book repays at **88%**. We're looking for an experienced data scientist to improve our decisioning using **Amazon SageMaker Autopilot**, while also helping us evaluate whether Autopilot is a fit for our future ML workflow.
**This is a tightly scoped ~2 week project with potential for ongoing work.** Strong candidates who deliver well may be invited to a longer engagement.
---
## Background & Current System
Our decision engine uses a model that produces scores mapped into bins, which determine the maximum overdraft amount a user qualifies for (e.g., $15 to $100 depending on score bin and flow).
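A minimal sketch of a score-to-bin lookup of the kind described above. The bin edges, the dollar limits between $15 and $100, and the assumption that a higher score means a higher repayment likelihood are all hypothetical placeholders, not our production values. The per-flow variation is left out for brevity.

```python
from bisect import bisect_right

# Hypothetical cut points and limits, for illustration only.
# A score below the first edge falls in bin 0 (no advance offered).
BIN_EDGES = [0.2, 0.4, 0.6, 0.8]      # lower-inclusive score cut points
BIN_LIMITS = [0, 15, 40, 70, 100]     # max advance in dollars, one per bin


def max_advance(score: float) -> int:
    """Return the maximum advance (in dollars) for a score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    # bisect_right counts how many edges are <= score, which is the bin index.
    return BIN_LIMITS[bisect_right(BIN_EDGES, score)]
```

A new model's scores will generally have a different distribution from the current model's, which is why the integration plan needs to revisit these cut points rather than reuse them.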
**Important context on the baseline:** The 88% figure is the repayment outcome of our *approved* population, not a model discrimination metric.
### Data Sources
- **Plaid API**: Bank transactions (amount, date, name, category hierarchy), account balances
- **Internal databases**: PostgreSQL (ML scores, repayment history) and MongoDB (user transactions)
- **User attributes**: Age, device OS, institution ID, neobank status
- **Payroll/salary detection**: Estimated payroll dates and intervals from transaction patterns
### Target Variable
Binary: whether a user repays their cash advance (repaid = 0, default = 1), so default is the positive class. Repayment timeliness is also tracked as `repayment_delay_days`, binned into 5 categories (0-7, 8-14, 15-21, 22-28, 29+ days).
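The five delay categories can be reproduced with `pandas.cut`. This is a sketch under the assumption that `repayment_delay_days` holds non-negative day counts; the function name and label strings are illustrative.

```python
import pandas as pd

# Left-inclusive edges: [0, 8), [8, 15), [15, 22), [22, 29), [29, inf)
DELAY_EDGES = [0, 8, 15, 22, 29, float("inf")]
DELAY_LABELS = ["0-7", "8-14", "15-21", "22-28", "29+"]


def bin_repayment_delay(delay_days: pd.Series) -> pd.Series:
    """Map repayment_delay_days to the five categories listed above.

    Negative or missing values fall outside every bin and come back as NaN.
    """
    return pd.cut(delay_days, bins=DELAY_EDGES, labels=DELAY_LABELS, right=False)
```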
### Current Tech Stack
- Python (XGBoost, scikit-learn, Prophet, pandas, numpy)
- Models served as local artifacts (`.json` for XGBoost, `.joblib` for scikit-learn), Docker containerized
- PostgreSQL + MongoDB
- No current SageMaker integration (models trained offline, loaded in-process)
---
## Working Environment
**All work will be performed on our internal AWS workspace.** Credentials and data access will be provisioned at project start. You will not download or move our data off our infrastructure. All notebooks, experiments, and artifacts must live in our workspace.
---
## Scope & Approach
**SageMaker Autopilot is required** — a core goal of this project is for us to evaluate the platform for future use. **That said, if you believe a manual modeling approach (custom feature engineering, tuned XGBoost/LightGBM, calibration) would outperform Autopilot, you are encouraged to run both in parallel and present a structured comparison.** We want to understand the gap between AutoML and a hands-on approach on our specific problem.
We expect a strong candidate to address, at minimum:
- **Cost-sensitive evaluation** — false negatives (approving a future defaulter) carry direct financial cost; aggregate accuracy is not the goal
- **Class imbalance** (~88/12) and probability calibration
- **User segmentation** — new users (Plaid-only) vs. returning users (rich behavioral history) have meaningfully different repayment behavior and data availability
- **Selection bias / reject inference** — our training data only contains outcomes for users the current model approved
- **Temporal integrity** — time-based train/test splits and strict feature cutoffs to prevent leakage
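To make the temporal-integrity point concrete, here is a minimal sketch of a time-based split and a feature cutoff filter. The column names (`advance_date`, `txn_date`, `decision_date`) are hypothetical and would need to be mapped to the real schema.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, time_col: str, cutoff) -> tuple:
    """Train on rows strictly before the cutoff, test on rows at or after it."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test


def drop_post_decision_rows(
    transactions: pd.DataFrame, event_col: str, decision_col: str
) -> pd.DataFrame:
    """Keep only transactions dated strictly before the decision they feed.

    Features aggregated from the result cannot see anything that happened
    at or after the moment the advance was decided.
    """
    return transactions[transactions[event_col] < transactions[decision_col]]
```

A random shuffle split would mix later advances into training and tend to overstate holdout performance, which is why the split is by date.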
---
## Definition of Done / What Success Means
To avoid any ambiguity at delivery, here is exactly what this engagement does and does not include.
**In scope — what you will deliver:**
- A trained, deployable model artifact (returns predictions from a SageMaker endpoint or a packaged deployment script)
- Evidence that the model outperforms our current model on an **offline, time-based holdout set**, measured on AUC-ROC, PR-AUC, KS, and FNR, broken out by user segment
- An integration and rollout plan (score-to-bin mapping, shadow-scoring and traffic-split recommendation, rollback criteria)
- Full documentation and a reproducible training pipeline
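The offline evidence could be computed along these lines. This is a sketch, not a prescribed implementation: it assumes a holdout DataFrame with hypothetical columns `segment`, `default` (1 = default) and `p_default` (predicted default probability), uses scikit-learn's average precision as the PR-AUC estimate, and takes KS as the maximum gap between the TPR and FPR curves.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve


def holdout_metrics(y_true, p_default, threshold: float = 0.5) -> dict:
    """AUC-ROC, PR-AUC, KS and FNR with default (1) as the positive class."""
    y_true = np.asarray(y_true)
    p_default = np.asarray(p_default)
    fpr, tpr, _ = roc_curve(y_true, p_default)
    # FNR: share of true defaulters scored below the threshold (would be approved).
    missed = p_default[y_true == 1] < threshold
    return {
        "auc_roc": float(roc_auc_score(y_true, p_default)),
        "pr_auc": float(average_precision_score(y_true, p_default)),
        "ks": float(np.max(tpr - fpr)),
        "fnr": float(np.mean(missed)),
    }


def metrics_by_segment(
    df: pd.DataFrame,
    segment_col: str = "segment",
    label_col: str = "default",
    score_col: str = "p_default",
    threshold: float = 0.5,
) -> pd.DataFrame:
    """One row of metrics per segment; each segment must contain both classes."""
    rows = {
        segment: holdout_metrics(group[label_col], group[score_col], threshold)
        for segment, group in df.groupby(segment_col)
    }
    return pd.DataFrame.from_dict(rows, orient="index")
```

Running the same function on the current model's scores and on each candidate's scores, over the identical holdout rows, gives a like-for-like comparison per segment.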
**Out of scope — what this engagement does NOT include:**
- **Validated production lift.** We do not expect you to prove the model improves our real-world 88% repayment rate during this engagement. Because our training data only contains outcomes for users the current model approved (selection bias), true production validation requires a shadow-scoring period and a live traffic split that we will run on our side *after* delivery. Your deliverable is a model that is better *offline* plus a credible plan to validate it in production — not a model proven to lift live repayment.
- Production deployment into our live decisioning path, ongoing monitoring, or model retraining infrastructure (these may be scoped as follow-on work).
In short: **success = a deployable model with demonstrated offline improvement and a sound integration plan**, not a model with proven production results. We call this out so expectations are aligned for both sides before work begins.
---
## Milestones & Payment
The project is structured into three milestones with concrete acceptance criteria. Payment is released per milestone upon meeting the criteria.
### Milestone 1 — Data Exploration & Baseline Validation
- EDA notebook (data quality, class distribution by cohort, segment comparison, drift analysis)
- Reproduce current model baseline within ±1% of stated performance
- Feature engineering plan
- SageMaker-ready CSV with documented time-based train/test split
- *Acceptance: notebook runs end-to-end on our workspace; baseline reproduced; split is time-based; CSV validated as Autopilot-compatible*
### Milestone 2 — Model Experimentation & Results
- 2+ Autopilot runs with full metrics; manual model results if proposed
- Comparison report: AUC-ROC, PR-AUC, KS, F1, FNR for baseline vs. candidates
- Metrics broken out by new vs. returning user segments
- Threshold analysis with estimated financial impact at 3+ operating points
- Feature importance (SHAP or equivalent); selection bias assessment
- *Acceptance: all models on identical time-based holdout; measurable improvement over baseline; segment breakout present; financial impact quantified*
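A threshold analysis with estimated financial impact might look like the sketch below. The per-advance dollar figures (`loss_per_default`, `margin_per_repaid`) are hypothetical inputs to be replaced with real unit economics, and the decision rule (approve when the predicted default probability is below the threshold) is an assumption.

```python
import numpy as np


def threshold_impact(y_true, p_default, thresholds, loss_per_default, margin_per_repaid):
    """Approval rate, FNR and net value at each candidate operating point."""
    y_true = np.asarray(y_true)
    p_default = np.asarray(p_default)
    n_defaults = int(np.sum(y_true == 1))
    rows = []
    for t in thresholds:
        approved = p_default < t  # assumed rule: approve below the threshold
        approved_defaults = int(np.sum(approved & (y_true == 1)))
        approved_repaid = int(np.sum(approved & (y_true == 0)))
        rows.append({
            "threshold": t,
            "approval_rate": float(np.mean(approved)),
            # Approved defaulters are the false negatives.
            "fnr": approved_defaults / n_defaults if n_defaults else float("nan"),
            "net_value": approved_repaid * margin_per_repaid
            - approved_defaults * loss_per_default,
        })
    return rows
```

Because the holdout only contains users the current model approved, these figures estimate the effect of tightening approvals within that population; they say nothing about users who were previously declined.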
### Milestone 3 — Deployment Artifacts & Documentation
- Selected model deployed to a SageMaker endpoint (or packaged with deployment script)
- Integration document (score-to-bin mapping, rollout strategy, latency benchmarks)
- Full reproducibility package (notebooks, scripts, configs)
- Model card (target definition, data range, limitations including selection bias, monitoring recommendations)
- *Acceptance: endpoint returns predictions on sample payload; integration doc addresses bin/limit logic; all code runs without modification*
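The "sample payload" acceptance check could be scripted roughly as follows. This sketch assumes a CSV-in, CSV-out endpoint and a `sagemaker-runtime` client created elsewhere with `boto3.client("sagemaker-runtime")`; the endpoint name and feature order are placeholders, and what each response line contains (label, probability, or both) depends on how the inference container is configured.

```python
import csv
import io


def build_csv_payload(rows) -> str:
    """Serialize feature rows to header-less CSV, one row per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def score_rows(runtime_client, endpoint_name: str, rows) -> list:
    """Send rows to the endpoint and return the raw response lines."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept="text/csv",
        Body=build_csv_payload(rows),
    )
    body = response["Body"].read().decode("utf-8")
    return [line for line in body.splitlines() if line]
```

Passing the client in as an argument keeps the function easy to exercise with a stub before real credentials are provisioned.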
---
## Required Skills & Experience
- 3+ years in data science / ML
- Proven SageMaker experience, specifically Autopilot
- Strong Python (pandas, numpy, scikit-learn, XGBoost)
- Binary classification, ideally in fintech, credit risk, or lending
- Imbalanced classification techniques and probability calibration
- Feature engineering from transactional/financial data
- Classification evaluation metrics (AUC-ROC, PR-AUC, KS, F1)
- AWS (S3, SageMaker, IAM)
## Nice-to-Have
- Plaid API or banking transaction data experience
- Credit scoring / underwriting background
- Reject inference / selection bias handling
- Model explainability (SHAP, LIME)
- SageMaker real-time endpoint deployment
---
## Timeline
~2 weeks of focused work (3 weeks upper bound for data quality issues or extra experimentation). Data access provided upfront.