PDF Price List Data Extraction Pipeline (AI/Python) — 30 Brands

Budget: $500.0 FIXED / ⭐ 0.00 (0) United Arab Emirates

python, data-extraction, etl-pipelines, microsoft-excel, automation, computer-vision

I am part of a commercial kitchen equipment distribution business and manage price lists from 30+ international brands. Each brand provides an annual PDF price list (200–300 pages each) that contains a mix of product images, multilingual descriptions, technical drawings, and price tables. I need a professional Python developer with AI/LLM experience to build an automated extraction pipeline that pulls structured product and pricing data from these PDFs into a standardized Master Excel database, which will then feed into custom quotation software I am building.

THE PROBLEM

These PDFs are NOT simple text tables. They are professionally designed catalogues (exported from Adobe InDesign) with:

- Full-bleed marketing/image pages (must be skipped)
- Multilingual product description pages (4 languages: IT, EN, FR, DE)
- Technical drawing pages with dimensions
- Price table pages (the TARGET) containing: SKU code, model name, dimensions (mm), weight (kg), power specs (W/V/Hz), energy class, refrigerant gas, and list price (€)

Layouts vary significantly between brands. A traditional PDF text parser will not work reliably. This requires an AI Vision approach.

WHAT I NEED BUILT

A repeatable Python pipeline that does the following:

1. Page classification: Convert each PDF page to an image and use AI (GPT-4o Vision or Claude API) to classify each page as intro, spec, drawing, or price_table. Only price_table pages proceed.
2. Structured data extraction: Send each price_table page image to the AI Vision API with a structured prompt that returns clean JSON: SKU, model name, dimensions, weight, power, energy class, temperature range, list price.
3. Data normalization: A Python script cleans the output: standardizes units (mm, kg, W), handles multi-line model names, removes duplicate header rows, and validates numeric price fields.
4. Excel output: Exports to a Master Excel file with consistent columns across all brands.
5. Update-ready: The pipeline must be reusable. When a new annual price list arrives, I re-run it on just that PDF and the master database updates.

REQUIRED MASTER EXCEL COLUMNS

Brand | SKU / Item Code | Model Name | Product Family / Series | Product Category | Width (mm) | Depth (mm) | Height (mm) | Net Weight (kg) | Gross Weight (kg) | Volume (L) | Power Supply (V/Hz) | Power Consumption (W) | Refrigerant Gas | Energy Class | Temperature Range (°C) | List Price (€) | Currency | Price List Version | Price List Date | Source Page | Notes

DELIVERABLES

- Working Python script/pipeline (clean, commented code)
- Successfully extracted Excel output for 1 full brand PDF (POC first)
- Documentation on how to run the pipeline for each new brand
- Handover call to walk me through the process

PROJECT PHASES

- Phase 1 (POC), start here: Process 1 sample brand PDF (~280 pages). Deliver clean Excel output. I review accuracy. If 90%+ accurate, we proceed.
- Phase 2 (Full rollout): Process remaining 29 brand PDFs. Refine extraction prompts per brand layout. Final master database delivered.

SKILLS REQUIRED

- Python (strong)
- OpenAI GPT-4o Vision API or Anthropic Claude API
- PDF processing (PyMuPDF, pdfplumber, pdf2image)
- JSON parsing and data normalization
- Excel/openpyxl output
- Experience with document AI / OCR pipelines

WHAT I WILL PROVIDE

- 2–3 sample brand PDFs to start
- The required Excel schema
- Clear feedback on extraction accuracy during QA
- API keys for OpenAI or Anthropic

BUDGET

- Phase 1 (POC): Fixed price, to be discussed
- Phase 2 (Full 30 brands): Discuss after POC approval
- API usage costs will be covered by me separately.
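
ILLUSTRATIVE SKETCH (NOT A REQUIRED IMPLEMENTATION)

As a rough illustration of the data normalization and update-ready steps described under WHAT I NEED BUILT, the following minimal Python sketch shows one possible shape for that stage. Everything in it is an assumption made for illustration: the JSON keys (`sku`, `model`, `dimensions`, `price`), the function names, the width x depth x height ordering of the dimension string, and the rule used to detect repeated header rows. Only a subset of the master columns is filled in, and it uses the standard library only; writing the result to Excel (for example with openpyxl) is left out.

```python
import re
from typing import Optional


def parse_price(raw) -> Optional[float]:
    """Turn strings such as '€ 1.234,50' or '1,234.50' into a float, else None."""
    cleaned = re.sub(r"[^\d.,]", "", "" if raw is None else str(raw))
    if not re.search(r"\d", cleaned):
        return None
    if "." in cleaned and "," in cleaned:
        # Both separators present: the rightmost one is the decimal mark.
        if cleaned.rfind(",") > cleaned.rfind("."):
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif "." in cleaned or "," in cleaned:
        sep = "," if "," in cleaned else "."
        tail = cleaned.rsplit(sep, 1)[1]
        # Assumption: a lone separator followed by exactly 3 digits marks thousands.
        cleaned = cleaned.replace(sep, "" if len(tail) == 3 else ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_dimensions(raw):
    """Split '600 x 700 x 850 mm' into (width, depth, height) in mm.

    Assumption: the order is width x depth x height. Anything that does not
    contain exactly three integers is returned as (None, None, None) for review.
    """
    nums = re.findall(r"\d+", "" if raw is None else str(raw))
    if len(nums) != 3:
        return (None, None, None)
    return tuple(int(n) for n in nums)


def normalize_rows(rows, brand, source_page):
    """Map raw extracted rows (hypothetical keys) onto master column names."""
    out = []
    for row in rows:
        sku = str(row.get("sku") or "").strip()
        price = parse_price(row.get("price"))
        # Assumed header rule: no SKU, or no digits in either the SKU or the price.
        if not sku or (price is None and not re.search(r"\d", sku)):
            continue
        width, depth, height = parse_dimensions(row.get("dimensions"))
        out.append({
            "Brand": brand,
            "SKU / Item Code": sku,
            # Collapse line breaks and repeated spaces in multi-line model names.
            "Model Name": " ".join(str(row.get("model") or "").split()),
            "Width (mm)": width,
            "Depth (mm)": depth,
            "Height (mm)": height,
            "List Price (€)": price,
            "Source Page": source_page,
            "Notes": "" if price is not None else "price needs review",
        })
    return out


def upsert(master, rows):
    """Insert or replace rows keyed by (Brand, SKU) so re-runs do not duplicate."""
    for r in rows:
        master[(r["Brand"], r["SKU / Item Code"])] = r
    return master
```

In this sketch, re-running a brand's new price list would call `normalize_rows` per extracted page and then `upsert` into the same dictionary, so existing SKUs are overwritten rather than appended. Rows with a plausible SKU but an unreadable price are kept and flagged in Notes instead of being dropped, which keeps them visible during QA.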
