Python Developer + Data Specialist — Multilingual OCR Test Dataset (30+ Languages)

Budget: - HOURLY / PART_TIME ⭐ 5.00 (1) USA

natural-language-processing, data-annotation, python, ocr-algorithms, data-collection

Project Overview

We are building an AI-powered document intelligence product and need a high-quality test dataset to benchmark accuracy across multiple languages and scripts. This is a data specialist role: you will not be writing software for our product. Your job is to collect, generate, organise, and label documents.

The dataset has two components:
• Handwriting samples: downloading and organising existing public datasets, plus collecting fresh samples from native-language writers for scripts not covered by public data
• Synthetic printed documents: a Python script that generates realistic fake documents (invoices, forms, contracts, etc.) in multiple languages, with matching ground-truth JSON files for accuracy measurement

Languages & Scripts in Scope

Coverage is required across five regions:

Region | Languages / Scripts
Europe | English, French, German, Italian, Spanish, Portuguese, Dutch, Polish, Romanian, Czech, Swedish, Norwegian, Russian, Ukrainian
Latin America | Spanish (Mexico, Argentina, Colombia), Portuguese (Brazil)
India | Hindi, Bengali, Marathi, Tamil, Telugu, Kannada, Malayalam, Punjabi, Gujarati, Odia, Urdu (9 distinct scripts)
China & Japan | Simplified Chinese, Traditional Chinese, Japanese (hiragana + katakana + kanji mixed)
Supplementary | Arabic, Hebrew, Thai (handwriting samples only)

Deliverables

You will deliver the following over a 6-week engagement (milestones below):

Milestone 1: Public Dataset Download & Organisation
Download and organise ~13 freely available academic handwriting datasets into a defined folder structure. Verify that all files are readable and uncorrupted.

Milestone 2: Document Generation Script
Write a Python script (faker + reportlab or weasyprint) that generates realistic fake documents in any locale. The script must handle right-to-left text (Arabic, Hebrew, Urdu) and all non-Latin scripts. Every generated PDF must have a matching .json file with the exact field values used.
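
To illustrate the Milestone 2 requirement that every PDF has a matching .json file, here is a minimal sketch using only the Python standard library. It is not a required design. The function names `text_direction` and `write_ground_truth`, the JSON layout, and the file-naming pattern are all assumptions made for this example. Value generation with faker and rendering with reportlab or weasyprint are left out.

```python
import json
import unicodedata
from pathlib import Path


def text_direction(text: str) -> str:
    """Return "rtl" if the first strongly directional character is
    right-to-left (Hebrew, Arabic, Urdu, ...), otherwise "ltr"."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):   # right-to-left letter / Arabic letter
            return "rtl"
        if bidi == "L":           # left-to-right letter
            return "ltr"
    return "ltr"                  # digits, punctuation, or empty string


def write_ground_truth(out_dir: Path, doc_id: str, locale: str,
                       doc_type: str, fields: dict[str, str]) -> Path:
    """Write <doc_id>.json, intended to sit next to <doc_id>.pdf.

    `fields` holds the exact strings that were drawn onto the page.
    The record layout below is hypothetical, not a format we prescribe.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "document": f"{doc_id}.pdf",
        "locale": locale,
        "doc_type": doc_type,
        "fields": {
            name: {"value": value, "direction": text_direction(value)}
            for name, value in fields.items()
        },
    }
    json_path = out_dir / f"{doc_id}.json"
    # ensure_ascii=False keeps non-Latin text human-readable in the file.
    json_path.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return json_path
```

A call such as `write_ground_truth(Path("out"), "invoice_he_IL_0001", "he_IL", "invoice", fields)` would be made with the same `fields` dictionary that is passed to the PDF renderer, so the JSON cannot drift from what is printed. Note that `text_direction` only detects direction. It does not perform the glyph shaping or reordering that correct Arabic or Urdu rendering requires.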

Milestone 3: Full Template Set
Run the script to generate ~1,610 documents across 8 document types (invoice, purchase order, contract, shipping, form, medical record, bank statement, immigration form) and all locales. Apply a scan-degradation pipeline to produce 3 degraded image variants per document (~6,440 total images when the clean original is counted with the 3 variants, i.e. 1,610 × 4). Produce all ground-truth JSON files.

Milestone 4: Fresh Handwriting Collection
Collect fresh handwriting from native writers for all languages not covered by public datasets (approximately 15 languages). You are responsible for sourcing and paying writers (include all costs in your M4 quote). Deliver 300 pages labelled as printed/handwritten/mixed for detection testing.

Required Skills

Please only apply if you can honestly confirm all of the following:
• Python 3.10+, with experience using reportlab or weasyprint for PDF generation
• Experience with the faker library, including non-English locales
• Image processing in Python: Pillow, pdf2image
• Ability to register for and download academic datasets
• Ability to source native-language writers on Fiverr or equivalent (needed for ~15 languages)
• Understanding of right-to-left text rendering (Arabic, Hebrew, Urdu)
• Good written English for documentation and README

Strong plus:
• Prior work on ML datasets, OCR, or document AI
• Multilingual font handling in PDF generation
• Based in, or with strong connections to, India, Europe, or East Asia (helps with writer sourcing)

Engagement Type

Fixed-price milestone project. Please quote your price per milestone (M1 through M4) in your proposal. Your M4 quote must include all writer and scanning costs; we will not make separate payments to third-party writers.

How to Apply

In your proposal please include:
• Relevant experience: any prior work with multilingual datasets, synthetic document generation, or OCR data
• Confirmation that you have used faker and either reportlab or weasyprint (or a GitHub link showing this)
• Your plan for sourcing writers for Indic scripts (writers for Tamil, Telugu, Kannada, Malayalam, Gujarati, Punjabi, and Odia are the hardest to find)
• Your quoted price per milestone (M1–M4), including all writer and scanning costs in M4
• Your estimated timeline, and whether you can start writer outreach in parallel with M1

We respond to all proposals within 48 hours. We are a small, fast-moving team. We want someone who communicates clearly, delivers clean, organised work, and proactively flags problems.

Terms

• All work product, scripts, and datasets become our exclusive property upon milestone payment
• All generated documents must carry a 'SAMPLE — TEST DATA' watermark; you are responsible for this
• Do not use any real personal, financial, or medical data; all content must be synthetic
• Do not generate anything that could constitute forgery of government documents
• You are responsible for complying with the terms of use of every public dataset you download
