Parse PDFs into Structured JSON
Budget: $8.0 - $25.0
HOURLY / PART_TIME
⭐ 4.83 (96)
Sweden
python, json, javascript, data-extraction
We are looking for an experienced Python developer with strong skills in PDF parsing and data extraction to help us process a large batch of educational PDF files (exam papers) into a structured JSON format.
The exams contain a mix of text, multiple-choice questions, math formulas, reading comprehension texts, and graphical elements (diagrams, tables, and images).
Your Responsibilities:
Data Extraction: Extract text, multiple-choice options, and correct answers from the PDF files.
JSON Structuring: Map the extracted data into a predefined, highly structured JSON schema.
Image Cropping/Extraction: Programmatically identify, crop, and save relevant images, diagrams, and graphs associated with specific questions.
Edge Case Handling: Handle complex layouts, including multi-column text, rotated pages, and questions that span multiple pages.
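To make the extraction and structuring steps concrete, here is a minimal sketch of turning already-extracted page text into question records. It assumes a simple layout ("1. question", "A) option", "Answer: B"), and the field names (number, text, options, correct_answer) are hypothetical placeholders, since the actual schema will be provided with the test PDFs. Real exam layouts will likely need more than regular expressions.

```python
import json
import re

# Assumed line formats; the real PDFs may differ.
QUESTION_RE = re.compile(r"^(\d+)\.\s+(.*)$")
OPTION_RE = re.compile(r"^([A-E])[).]\s+(.*)$")
ANSWER_RE = re.compile(r"^Answer:\s*([A-E])\s*$", re.IGNORECASE)


def parse_questions(text):
    """Turn already-extracted page text into a list of question dicts."""
    questions = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if m := ANSWER_RE.match(line):
            if current is not None:
                current["correct_answer"] = m.group(1).upper()
            continue
        if m := OPTION_RE.match(line):
            if current is not None:
                current["options"].append(
                    {"label": m.group(1), "text": m.group(2)}
                )
            continue
        if m := QUESTION_RE.match(line):
            # Hypothetical field names, not the client's schema.
            current = {
                "number": int(m.group(1)),
                "text": m.group(2),
                "options": [],
                "correct_answer": None,
            }
            questions.append(current)
            continue
        # Anything else is treated as a wrapped continuation line.
        if current is not None:
            if current["options"]:
                current["options"][-1]["text"] += " " + line
            else:
                current["text"] += " " + line
    return questions


if __name__ == "__main__":
    sample = "1. What is 2 + 2?\nA) 3\nB) 4\nAnswer: B"
    print(json.dumps(parse_questions(sample), indent=2))
```

The text passed to parse_questions would come from whichever extraction library is chosen; this sketch deliberately leaves that step out.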
Required Skills & Experience:
Proven experience working with PDF extraction libraries in Python (e.g., PyMuPDF / fitz, pdfplumber, or similar).
Experience with OCR tools or Vision-Language Models (e.g., OpenAI GPT-4o, Claude 3.5 Sonnet) for parsing complex graphical layouts is a huge plus.
Strong understanding of JSON and data structuring.
Attention to detail – the output JSON must be 100% accurate and ready for production use.
Project Scope:
We will provide a set of test PDFs and the desired JSON schema.
You will develop a scalable script/pipeline to process these files.
Once the pipeline is validated, it will be run across our entire library of PDFs.
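As an illustration of what validating the pipeline output could involve, the sketch below checks each question record for required fields and internal consistency before it is accepted. The field names and rules are assumptions made for the example, not the schema that will be supplied.

```python
# Hypothetical required fields and their expected types.
REQUIRED_FIELDS = {
    "number": int,
    "text": str,
    "options": list,
    "correct_answer": str,
}


def validate_question(question):
    """Return a list of problems found in one question dict (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in question:
            problems.append(f"missing field: {field}")
        elif not isinstance(question[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    if problems:
        # Skip consistency checks when the basic shape is already wrong.
        return problems

    labels = [
        opt.get("label") for opt in question["options"] if isinstance(opt, dict)
    ]
    if len(labels) < 2:
        problems.append("fewer than two options")
    if len(set(labels)) != len(labels):
        problems.append("duplicate option labels")
    if question["correct_answer"] not in labels:
        problems.append("correct_answer does not match any option label")
    return problems


def validate_batch(questions):
    """Map the list index of each failing record to its problems."""
    report = {}
    for index, question in enumerate(questions):
        problems = validate_question(question)
        if problems:
            report[index] = problems
    return report
```

An empty report from validate_batch on the test PDFs would be one reasonable signal that the pipeline is ready to run on the full library, alongside manual spot checks.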