Parse PDFs into Structured JSON

Budget: $8.0 - $25.0 HOURLY / PART_TIME ⭐ 4.83 (96) Sweden

python, json, javascript, data-extraction

We are looking for an experienced Python developer with strong skills in PDF parsing and data extraction to help us process a large batch of educational PDF files (exam papers) into a structured JSON format. The exams contain a mix of text, multiple-choice questions, math formulas, reading comprehension texts, and graphical elements (diagrams, tables, and images).

Your Responsibilities:
- Data Extraction: Extract text, multiple-choice options, and correct answers from the PDF files.
- JSON Structuring: Map the extracted data into a predefined, highly structured JSON schema.
- Image Cropping/Extraction: Programmatically identify, crop, and save relevant images, diagrams, and graphs associated with specific questions.
- Edge Case Handling: Handle complex layouts, including multi-column text, rotated pages, and questions that span across multiple pages.

Required Skills & Experience:
- Proven experience working with PDF extraction libraries in Python (e.g., PyMuPDF / fitz, pdfplumber, or similar).
- Experience with OCR tools or Vision-Language Models (e.g., OpenAI GPT-4o, Claude 3.5 Sonnet) for parsing complex graphical layouts is a huge plus.
- Strong understanding of JSON and data structuring.
- Attention to detail: the output JSON must be 100% accurate and ready for production use.

Project Scope:
We will provide a set of test PDFs and the desired JSON schema. You will develop a scalable script/pipeline to process these files. Once the pipeline is validated, it will be run across our entire library of PDFs.
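
To give applicants a rough idea of the JSON structuring step, here is a minimal sketch in Python. It assumes the page text has already been extracted to a plain string (for example with PyMuPDF's `page.get_text()`), that questions are numbered like `1.`, and that options are lettered like `A)`. The field names (`number`, `text`, `options`) and the `parse_questions` helper are hypothetical placeholders, not the schema we will provide, and the sketch does not cover correct answers, images, multi-column layouts, or rotated pages.

```python
import json
import re

# Hypothetical patterns: assumes "1. Question text" and "A) option text".
# Real exam papers will likely need different or more tolerant rules.
QUESTION_RE = re.compile(r"^(\d+)\.\s+(.*)$")
OPTION_RE = re.compile(r"^([A-E])\)\s+(.*)$")


def parse_questions(text):
    """Turn already-extracted plain text into a list of question dicts."""
    questions = []
    current = None      # question currently being filled in
    last_option = None  # letter of the most recent option, if any

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue

        q_match = QUESTION_RE.match(line)
        o_match = OPTION_RE.match(line)

        if q_match:
            # A new numbered question starts here.
            current = {
                "number": int(q_match.group(1)),
                "text": q_match.group(2),
                "options": {},
            }
            questions.append(current)
            last_option = None
        elif o_match and current is not None:
            # A lettered option belonging to the current question.
            last_option = o_match.group(1)
            current["options"][last_option] = o_match.group(2)
        elif current is not None:
            # Continuation line: append to the option or question text.
            if last_option is None:
                current["text"] += " " + line
            else:
                current["options"][last_option] += " " + line
        # Lines before the first question are ignored in this sketch.

    return questions


# Illustrative sample text, not taken from any real exam file.
sample = """1. What is 2 + 2?
A) 3
B) 4
C) 5
2. Which planet is closest
to the Sun?
A) Venus
B) Mercury
"""

print(json.dumps(parse_questions(sample), indent=2))
```

A line-based approach like this only works once the reading order is already correct, so a real pipeline would likely need a layout-aware extraction step in front of it.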
