Data Scraper & Engineer Needed: Web Scraping, Data Parsing (Nested to Flat), and QA
Budget: -
HOURLY / PART_TIME
⭐ 5.00 (6)
SAU
data-scraping, python, data-extraction, microsoft-excel, selenium-webdriver, selenium, scrapy-framework, etl-pipelines, data-mining
Overview
We need a skilled Data Engineer / Web Scraper to extract, clean, and structure nutritional data from a public directory. Because this data will be used to generate commercial food labels, 100% data accuracy and meticulous attention to detail are absolute requirements.
Scope of Work
Automated Scraping: The target website hosts thousands of food items. However, there is no bulk download feature; each item's data must be exported individually as a CSV file. You will write a robust, respectful automation script (Python/Selenium/Playwright/Scrapy) to download all available item files without overloading the host servers.
Data Parsing & Flattening: The exported CSVs use a deeply nested, hierarchical classification system packed into a semicolon-separated (;) format. You will write a script (e.g., using Python/Pandas) to parse this hierarchy and flatten it into a clean, unified relational format (SQL tables or structured JSON), capturing the lowest-level nutrient metrics available.
Quality Assurance (QA): Perform rigorous validation and spot-checks against the source website to ensure no numbers, decimal points, or units (grams, milligrams, kcal) were corrupted or shifted during the extraction and flattening process.
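For illustration only, the flattening and QA steps above could be sketched roughly as follows. The real export layout has not been shared, so the column names (level, name, value, unit), the sample rows, and the helper names flatten and check_round_trip are all assumptions made for this sketch. Values are parsed with Decimal rather than float so that decimal points and trailing zeros survive unchanged.

```python
import csv
import io
from decimal import Decimal

# Hypothetical export layout: the real CSV has not been shared, so the
# columns (level;name;value;unit) and these rows are assumptions.
SAMPLE = """level;name;value;unit
1;Carbohydrates;12.5;g
2;Sugars;4.2;g
3;Fructose;1.10;g
2;Fibre;2.0;g
1;Energy;52;kcal
"""

def flatten(text):
    """Return one flat record per leaf row, with its full ancestor path."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=";"))
    stack = []  # names of the current ancestors, index = depth - 1
    flat = []
    for i, row in enumerate(rows):
        level = int(row["level"])
        del stack[level - 1:]      # drop the previous sibling and deeper nodes
        stack.append(row["name"])
        next_level = int(rows[i + 1]["level"]) if i + 1 < len(rows) else 0
        if next_level <= level:    # no children follow, so this row is a leaf
            flat.append({
                "path": " > ".join(stack),
                "value": Decimal(row["value"]),  # exact, no float rounding
                "unit": row["unit"],
            })
    return flat

def check_round_trip(text, flat):
    """QA sketch: each flattened value and unit must match the source text.

    Assumes nutrient names are unique within one file (true for SAMPLE).
    """
    source = {r["name"]: (r["value"], r["unit"])
              for r in csv.DictReader(io.StringIO(text), delimiter=";")}
    for rec in flat:
        leaf_name = rec["path"].split(" > ")[-1]
        raw_value, raw_unit = source[leaf_name]
        assert str(rec["value"]) == raw_value, rec
        assert rec["unit"] == raw_unit, rec

records = flatten(SAMPLE)
check_round_trip(SAMPLE, records)
```

In this sketch only rows with no children are emitted, which is one way to read "lowest-level nutrient metrics"; the actual rule would need to be confirmed against the sample CSV and the source website.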
Required Skills
High proficiency in Web Scraping tools (Python, Selenium, Playwright, or Scrapy).
Strong background in Data Engineering and Data Cleaning (Pandas, NumPy).
Experience handling nested, hierarchical, or poorly formatted raw text/CSV files.
Extremely detail-oriented with a proven track record in Data QA/Validation.
To Apply, Please Provide:
A brief explanation of the tools/libraries you would choose for this specific workflow (scraping + flattening).
Your estimated timeline & budget to complete the scraping and cleaning phases.
Examples of past projects where you successfully scraped complex structures and delivered highly accurate, flattened datasets.
Note: A sample raw CSV and a screenshot of the target web layout are ready to be shared with shortlisted candidates for a precise technical assessment.
Also, access to the portal may be limited based on IP location. The candidate must resolve this without violating any laws.