PDF Price List Data Extraction Pipeline (AI/Python) — 30 Brands
Budget: $500.00 (fixed price)
United Arab Emirates
python, data-extraction, etl-pipelines, microsoft-excel, automation, computer-vision
I am part of a commercial kitchen equipment distribution business and manage price lists from 30+ international brands. Each brand provides an annual PDF price list (200–300 pages each) that contains a mix of product images, multilingual descriptions, technical drawings, and price tables.
I need a professional Python developer with AI/LLM experience to build an automated extraction pipeline that pulls structured product and pricing data from these PDFs into a standardized Master Excel database, which will then feed into the custom quotation software I am building.
THE PROBLEM
These PDFs are NOT simple text tables. They are professionally designed catalogues (exported from Adobe InDesign) with:
Full-bleed marketing/image pages (must be skipped)
Multilingual product description pages (4 languages: IT, EN, FR, DE)
Technical drawing pages with dimensions
Price table pages (the TARGET) containing: SKU code, model name, dimensions (mm), weight (kg), power specs (W/V/Hz), energy class, refrigerant gas, and list price (€)
Layouts vary significantly between brands. A traditional PDF text parser will not work reliably. This requires an AI Vision approach.
WHAT I NEED BUILT
A repeatable Python pipeline that does the following:
Page classification — Convert each PDF page to an image and use AI (GPT-4o Vision or Claude API) to classify each page as: intro, spec, drawing, or price_table. Only price_table pages proceed.
Structured data extraction — Send each price_table page image to the AI Vision API with a structured prompt that returns clean JSON: SKU, model name, dimensions, weight, power, energy class, temperature range, list price.
Data normalization — Python script cleans the output: standardizes units (mm, kg, W), handles multi-line model names, removes duplicate header rows, validates numeric price fields.
Excel output — Exports to a Master Excel file with consistent columns across all brands.
Update-ready — Pipeline must be reusable. When a new annual price list arrives, I re-run it on just that PDF and the master database updates.
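To make the normalization step more concrete, here is a minimal sketch using only the Python standard library. The input keys (`sku`, `model_name`, `width`, `list_price`), the `HEADER_TOKENS` set, and the rule that a lone separator followed by exactly three digits is a thousands mark are all assumptions for illustration; the real JSON shape will depend on the extraction prompt, and each brand may need its own rules.

```python
import re

# Hypothetical header words; a real run would need a per-brand list.
HEADER_TOKENS = {"sku", "code", "codice", "item", "art."}
UNIT_TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}


def parse_number(text):
    """Parse '1.234,50 €' or '1,234.50' into a float; return None if unparseable."""
    cleaned = re.sub(r"[^\d.,]", "", str(text)).strip(".,")
    if not any(ch.isdigit() for ch in cleaned):
        return None
    last_dot, last_comma = cleaned.rfind("."), cleaned.rfind(",")
    if last_dot >= 0 and last_comma >= 0:
        # Both marks present: the rightmost one is the decimal mark.
        decimal = "." if last_dot > last_comma else ","
    else:
        sep = "." if last_dot >= 0 else ","
        tail = cleaned.rsplit(sep, 1)[-1]
        # Assumption: a repeated mark, or one followed by exactly three
        # digits, is a thousands separator rather than a decimal mark.
        decimal = None if (cleaned.count(sep) > 1 or len(tail) == 3) else sep
    for mark in {".", ","} - {decimal}:
        cleaned = cleaned.replace(mark, "")
    if decimal:
        cleaned = cleaned.replace(decimal, ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def to_mm(text):
    """Convert '60 cm' or '1,2 m' to millimetres; bare numbers are assumed to be mm."""
    match = re.fullmatch(r"\s*([\d.,]+)\s*(mm|cm|m)?\s*", str(text).lower())
    if not match:
        return None
    value = parse_number(match.group(1))
    if value is None:
        return None
    return round(value * UNIT_TO_MM[match.group(2) or "mm"], 1)


def normalize_rows(rows):
    """Drop repeated header rows, collapse multi-line names, validate prices."""
    cleaned = []
    for row in rows:
        sku = str(row.get("sku", "")).strip()
        if not sku or sku.lower() in HEADER_TOKENS:
            continue  # header row repeated mid-table, or empty row
        price = parse_number(row.get("list_price", ""))
        cleaned.append({
            "sku": sku,
            "model_name": " ".join(str(row.get("model_name", "")).split()),
            "width_mm": to_mm(row.get("width", "")),
            "list_price_eur": price,
            "needs_review": price is None,  # flag instead of silently dropping
        })
    return cleaned
```

Rows with an unreadable price are kept and flagged rather than discarded, so that they surface during QA instead of disappearing from the master file.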
REQUIRED MASTER EXCEL COLUMNS
Brand | SKU / Item Code | Model Name | Product Family / Series | Product Category | Width (mm) | Depth (mm) | Height (mm) | Net Weight (kg) | Gross Weight (kg) | Volume (L) | Power Supply (V/Hz) | Power Consumption (W) | Refrigerant Gas | Energy Class | Temperature Range (°C) | List Price (€) | Currency | Price List Version | Price List Date | Source Page | Notes
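As an illustration of how the column list above could be enforced in code, the sketch below keeps the schema in one place and merges a re-run into the existing records. The function names and the choice of Brand plus SKU as the merge key are assumptions; whether older price list versions should be overwritten or kept side by side is a decision for the project, not something this sketch settles.

```python
MASTER_COLUMNS = [
    "Brand", "SKU / Item Code", "Model Name", "Product Family / Series",
    "Product Category", "Width (mm)", "Depth (mm)", "Height (mm)",
    "Net Weight (kg)", "Gross Weight (kg)", "Volume (L)",
    "Power Supply (V/Hz)", "Power Consumption (W)", "Refrigerant Gas",
    "Energy Class", "Temperature Range (°C)", "List Price (€)", "Currency",
    "Price List Version", "Price List Date", "Source Page", "Notes",
]


def to_master_row(record):
    """Return the record's values in master column order; missing fields become ''."""
    unknown = set(record) - set(MASTER_COLUMNS)
    if unknown:
        # Fail loudly so a renamed field cannot silently vanish from the output.
        raise KeyError(f"Fields not in master schema: {sorted(unknown)}")
    return [record.get(column, "") for column in MASTER_COLUMNS]


def upsert(master, new_records, key=("Brand", "SKU / Item Code")):
    """Merge new records into master: same key replaces, new key appends.

    Assumption: one row per Brand + SKU, so a new annual list overwrites
    the previous year's row for the same item.
    """
    index = {tuple(row.get(k) for k in key): row for row in master}
    for record in new_records:
        index[tuple(record.get(k) for k in key)] = record
    return list(index.values())
```

Each list returned by `to_master_row` can then be written as one spreadsheet row, for example with openpyxl's `Worksheet.append`, after writing `MASTER_COLUMNS` as the header row.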
DELIVERABLES
Working Python script/pipeline (clean, commented code)
Successfully extracted Excel output for 1 full brand PDF (POC first)
Documentation on how to run the pipeline for each new brand
Handover call to walk me through the process
PROJECT PHASES
Phase 1 (POC) — Start here:
Process 1 sample brand PDF (~280 pages). Deliver clean Excel output. I review accuracy. If 90%+ accurate, we proceed.
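The 90% threshold needs an agreed definition before the review. One possible definition, sketched below, is field-level accuracy: the share of fields in a hand-verified sample of rows that the pipeline reproduced exactly. The function name, the `sku` key, and the field names are hypothetical, and a row-level measure (every field in a row correct) would be stricter.

```python
def field_accuracy(extracted, verified, fields, key="sku"):
    """Share of verified field values that the extraction matched exactly.

    A verified row that is missing from the extraction counts every one
    of its fields as wrong.
    """
    by_key = {row[key]: row for row in extracted}
    total = correct = 0
    for truth in verified:
        got = by_key.get(truth[key], {})
        for field in fields:
            total += 1
            if field in got and got[field] == truth.get(field):
                correct += 1
    return correct / total if total else 0.0


# Example: two hand-checked rows, two fields each, one price misread.
verified = [
    {"sku": "AB-100", "list_price": 1234.5, "width_mm": 600.0},
    {"sku": "AB-200", "list_price": 980.0, "width_mm": 1200.0},
]
extracted = [
    {"sku": "AB-100", "list_price": 1234.5, "width_mm": 600.0},
    {"sku": "AB-200", "list_price": 930.0, "width_mm": 1200.0},
]
score = field_accuracy(extracted, verified, ["list_price", "width_mm"])
print(f"{score:.0%}")  # 75%
```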
Phase 2 — Full rollout:
Process remaining 29 brand PDFs. Refine extraction prompts per brand layout. Final master database delivered.
SKILLS REQUIRED
Python (strong)
OpenAI GPT-4o Vision API or Anthropic Claude API
PDF processing (PyMuPDF, pdfplumber, pdf2image)
JSON parsing and data normalization
Excel/openpyxl output
Experience with document AI / OCR pipelines
WHAT I WILL PROVIDE
2–3 sample brand PDFs to start
The required Excel schema
Clear feedback on extraction accuracy during QA
API keys for OpenAI or Anthropic
BUDGET
Phase 1 (POC): Fixed price, to be discussed
Phase 2 (Full 30 brands): Discuss after POC approval
API usage costs will be covered by me separately.