Python Web Scraper – Extract ~8000 products & ~1000 categories from login-protected B2B site
Budget: $160.00 (fixed price)
Location: Hungary
Skills: selenium, python, microsoft-excel, data-extraction, data-scraping
Hi,
I need an experienced Python web scraper developer to extract data from a Hungarian industrial B2B e-commerce website. I will share the exact target URL and credentials in private chat with the selected freelancer.
The target website has approximately 8,000 products and a very deep category tree with approximately 1,000 categories/subcategories. Quality and perfection are my top priorities for this data extraction.
IMPORTANT NOTE: The website requires a username and password login to see the product prices. Your script must be able to handle session login/cookies to extract the data. I will provide temporary login credentials in private chat.
CRITICAL REQUIREMENT: I am only interested in hiring a freelancer who can deliver a complete and flawless data extraction. Every single product, full description, and every single image must be scraped without any missing data.
You will need to deliver TWO separate Excel files based on my specific template layouts, and multiple ZIP files containing the product and category images.
---
### SYSTEM RULES & LOGIC FOR COMPATIBILITY:
The script must follow these exact logic rules to match my e-commerce system layout:
#### 1. Product Sheet & Variant Grouping Logic:
- Parent-Child Mapping: Many products have multiple variants listed in a table. Each variant must become a separate row in the Excel. You must define the main product as the "Parent" and map the variants as "Children" by creating a Parent SKU column. The variants must point to their parent's SKU.
- SKU Generation & Cleaning: Take the original product code and strip out all spaces, commas, and slashes to make a clean SKU. Hyphens stay in place: for example, a code like ABC-123/XYZ must become ABC-123XYZ. This clean SKU must be used in the SKU column and for image renaming.
- Manufacturer: Create a Manufacturer column and hardcode the site's brand name (I will provide the brand name in private chat) for all rows.
- Canonical URLs: Create a Canonical URL column. For all child/variant rows, this column must map back to the Parent product's SEO URL to avoid duplicate content issues.
- Category Page Visibility: Create a column for category page visibility. Hardcode the Parent row as 1 (visible on listing pages) and all Child/Variant rows as 0 (hidden from listing pages).
- Mandatory Price Columns: The import requires both Net Price and Gross Price columns. Prices must contain raw numbers only, without currency symbols or spaces. Calculate the Gross price using the 27% Hungarian VAT (Gross = Net × 1.27).
- Discount Prices: If you detect promotional prices, map the original price to the standard price column and the promotional price to the discount price column. Otherwise, leave the discount price column blank.
- Descriptions & Link Cleaning: Extract descriptions keeping their original HTML source formatting (tags like p, ul, li must remain intact). Crucially, your script must clean or convert any relative internal links into absolute URLs pointing back to the source domain.
- Image Links & Versioning: Map the product image URLs into the image link column (separate multiple images with a vertical bar). Create an image version column and hardcode its value as 1 for all rows.
- Image Renaming: Main images must be renamed as the clean SKU (e.g., ABC-123XYZ.jpg). Additional images must use: [Clean_SKU]_altpic_[num].jpg (e.g., ABC-123XYZ_altpic_1.jpg).
- Status & Stock Columns: Hardcode Status as 1 (active), Stock Management as 1, Quantity as 0, and Purchasable if out of stock as 1.
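
A minimal Python sketch of how several of the product rules above could be implemented, offered as one possible improvement idea rather than a required design. It covers SKU cleaning, image renaming, gross price calculation, link cleaning, and parent/child rows. The function names, the dictionary keys (`code`, `url`, `sku`, `parent_sku`, `canonical_url`, `visible_in_category`), the `example.com` URLs, and the assumed price format (`12 500,50 Ft`, comma as decimal separator) are all hypothetical; the real column names come from my templates.

```python
import re
from decimal import Decimal, ROUND_HALF_UP
from urllib.parse import urljoin

VAT_MULTIPLIER = Decimal("1.27")  # 27% Hungarian VAT, as stated in the rules
MAX_IMAGES = 10                   # 1 main image + 9 additional images


def clean_sku(code: str) -> str:
    """Remove whitespace, commas and slashes; hyphens stay (ABC-123/XYZ -> ABC-123XYZ)."""
    return re.sub(r"[\s,/]", "", code)


def image_names(sku: str, image_count: int, ext: str = "jpg") -> list[str]:
    """Main image is <SKU>.<ext>; extras are <SKU>_altpic_<n>.<ext>, capped at MAX_IMAGES."""
    kept = min(image_count, MAX_IMAGES)
    if kept <= 0:
        return []
    return [f"{sku}.{ext}"] + [f"{sku}_altpic_{n}.{ext}" for n in range(1, kept)]


def parse_net_price(text: str) -> Decimal:
    """Assumption: prices look like '12 500,50 Ft' (comma is the decimal separator)."""
    digits = re.sub(r"[^\d,]", "", text).replace(",", ".")
    return Decimal(digits)


def gross_price(net: Decimal) -> Decimal:
    """Net x 1.27, rounded to two decimals."""
    return (net * VAT_MULTIPLIER).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Matches quoted href/src attributes so only the URL part is rewritten.
_LINK_ATTR = re.compile(r"""(\b(?:href|src)\s*=\s*)(["'])(.*?)\2""", re.IGNORECASE)


def absolutize_links(html: str, base_url: str) -> str:
    """Turn relative href/src values into absolute URLs; other markup is untouched."""
    def fix(match: re.Match) -> str:
        prefix, quote, url = match.groups()
        return f"{prefix}{quote}{urljoin(base_url, url)}{quote}"
    return _LINK_ATTR.sub(fix, html)


def variant_rows(parent: dict, variants: list[dict]) -> list[dict]:
    """One row for the parent (visible), one row per variant (hidden, pointing to parent)."""
    parent_sku = clean_sku(parent["code"])
    rows = [{"sku": parent_sku, "parent_sku": "",
             "canonical_url": parent["url"], "visible_in_category": 1}]
    for variant in variants:
        rows.append({"sku": clean_sku(variant["code"]), "parent_sku": parent_sku,
                     "canonical_url": parent["url"], "visible_in_category": 0})
    return rows
```

For instance, `clean_sku("ABC-123/XYZ")` returns `ABC-123XYZ`, and `image_names("ABC-123XYZ", 3)` returns `ABC-123XYZ.jpg`, `ABC-123XYZ_altpic_1.jpg`, and `ABC-123XYZ_altpic_2.jpg`. The regex-based link rewrite is a shortcut that leaves the original HTML formatting untouched; a full HTML parser would be more robust but may re-serialize the markup.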
#### 2. Product Filter Parameters Sheet / Logic:
- Extract technical parameters from the product specification tables as clean Key-Value pairs mapped to the parameters column, so I can use them as filters.
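
As an illustration of the Key-Value extraction, here is a small sketch that reads two-column rows from a specification table using only the Python standard library. It assumes each parameter sits in its own table row with the name in the first cell and the value in the second; the real markup of the target site may differ, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class SpecTableParser(HTMLParser):
    """Collects {name: value} pairs from table rows that have exactly two cells."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.pairs: dict[str, str] = {}
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so values come out clean.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if len(self._row) == 2 and self._row[0]:
                self.pairs[self._row[0].rstrip(":")] = self._row[1]
            self._row = None


def extract_parameters(table_html: str) -> dict[str, str]:
    parser = SpecTableParser()
    parser.feed(table_html)
    parser.close()
    return parser.pairs
```

Rows with more or fewer than two cells are skipped in this sketch, so header rows or merged cells would need separate handling.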
#### 3. Category Sheet Logic:
- Hierarchical Order: In the Excel rows, Parent Categories MUST be listed above/before their Child/Subcategories. This is critical because there are around 1,000 categories to map.
- ID Generation: Dynamically generate unique, incremental numeric IDs for each category and subcategory (e.g., 1001, 1002) and include Category ID and Parent ID columns to map the tree correctly.
- Category Image Renaming: Category images must be downloaded and renamed strictly using the newly generated Category ID (e.g., 1001.jpg) and bundled into a separate ZIP (Max 50MB).
- SEO Friendly URL: Generate an SEO URL for each category without any accents or special characters.
- SEO Tags: Populate TITLE tag and DESCRIPTION META tag columns using the category texts.
- Visibility Switches: Hardcode all category visibility columns as 1.
- Category Formatting: For multi-level categories, separate the levels using the vertical bar character.
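
The category rules above could be sketched roughly as follows. The input is assumed to be a list of category paths (one tuple of names per category, root first), and the row keys, the starting ID of 1001, the Parent ID of 0 for top-level categories, and the slug format are hypothetical choices, not part of my template.

```python
import re
import unicodedata
from itertools import count


def slugify(text: str) -> str:
    """Lowercase ASCII slug: accents are stripped, other characters become hyphens."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def build_category_rows(paths, start_id: int = 1001) -> list[dict]:
    """Assign incremental IDs so that every parent row appears before its children."""
    ids: dict[tuple, int] = {}
    rows: list[dict] = []
    next_id = count(start_id)
    # Sorting tuples puts ("A",) before ("A", "B"), so parents come first.
    for path in sorted(set(map(tuple, paths))):
        for depth in range(1, len(path) + 1):
            prefix = path[:depth]
            if prefix in ids:
                continue  # this ancestor already has a row
            ids[prefix] = next(next_id)
            rows.append({
                "category_id": ids[prefix],
                "parent_id": ids.get(prefix[:-1], 0),  # assumption: 0 marks a top-level category
                "name": prefix[-1],
                "category_path": "|".join(prefix),
                "seo_url": slugify(" ".join(prefix)),
                "image_file": f"{ids[prefix]}.jpg",
            })
    return rows
```

Because missing ancestors are created on the fly, a subcategory that is scraped before its parent still ends up below the parent row in the output.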
#### 4. Product Image Renaming & ZIP Splitting Rules:
- Max 10 Images Limit: A maximum of 10 images is allowed per product row (1 main image + 9 additional images). Do not extract more than 10 images for any product or variant.
- ZIP File Size Limit: There is a strict 100MB limit per uploaded ZIP file for images. Your script must automatically split the downloaded images into multiple separate ZIP files, ensuring each ZIP file stays under 100MB.
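
One way the automatic ZIP splitting could work is a simple greedy batching by file size, sketched below with the standard `zipfile` module. The default limit of 95,000,000 bytes and the 512-byte per-file overhead are hypothetical safety margins chosen to stay under 100MB, and the function names and archive naming pattern are illustrative only.

```python
import zipfile
from pathlib import Path


def plan_batches(sized_files, max_bytes: int, overhead: int = 512) -> list[list]:
    """Greedily group (item, size_in_bytes) pairs so each batch stays under max_bytes.

    `overhead` is a rough per-file allowance for ZIP headers (an assumption).
    A single file larger than max_bytes still gets its own batch.
    """
    batches, current, current_size = [], [], 0
    for item, size in sized_files:
        cost = size + overhead
        if current and current_size + cost > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += cost
    if current:
        batches.append(current)
    return batches


def write_zip_batches(files, out_dir: Path, prefix: str,
                      max_bytes: int = 95_000_000) -> list[Path]:
    """Write <prefix>_001.zip, <prefix>_002.zip, ... and return the archive paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    sized = [(path, path.stat().st_size) for path in files]
    archives = []
    for number, batch in enumerate(plan_batches(sized, max_bytes), start=1):
        zip_path = out_dir / f"{prefix}_{number:03d}.zip"
        # JPEGs are already compressed, so storing them keeps sizes predictable.
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as archive:
            for path in batch:
                archive.write(path, arcname=path.name)
        archives.append(zip_path)
    return archives
```

Storing files uncompressed means the archive size is close to the sum of the image sizes, which makes the split easy to plan before anything is written.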
#### 5. General System Rules:
- Language & Encoding: Hungarian language text. UTF-8 encoding is strictly mandatory. No broken characters allowed.
- File Size Limit: There is a 20MB upload limit for the Excel file. If the final Excel exceeds 20MB, deliver it as a compressed ZIP file.
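
For the two general rules above, a pre-delivery check might look like the sketch below. The broken-character check is only a heuristic: it flags the Unicode replacement character and the `Ã` / `Å` sequences that typically appear when UTF-8 Hungarian text is decoded with the wrong codec. The function names are hypothetical.

```python
import zipfile
from pathlib import Path

# Typical traces of mis-decoded UTF-8 (heuristic, not exhaustive).
MOJIBAKE_MARKERS = ("\ufffd", "Ã", "Å")


def find_broken_text(values) -> list[str]:
    """Return the strings that look like they contain broken characters."""
    return [v for v in values if any(marker in v for marker in MOJIBAKE_MARKERS)]


def package_for_upload(xlsx_path: Path, max_bytes: int = 20 * 1024 * 1024) -> Path:
    """Return the Excel file itself if it fits the limit, else a ZIP containing it."""
    if xlsx_path.stat().st_size <= max_bytes:
        return xlsx_path
    zip_path = xlsx_path.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(xlsx_path, arcname=xlsx_path.name)
    return zip_path
```

Note that `.xlsx` files are already ZIP-compressed internally, so zipping one again may shrink it only slightly.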
---
**How to apply:**
- Please provide your fixed-price quote for the entire project.
- Confirm experience scraping login-protected websites.
- Confirm that your script handles all of the following automatically: the custom SKU stripping, Parent-Child mapping, HTML link cleaning, sequential Category ID mapping, Key-Value parameter extraction, the 10-image limit, and the 100MB ZIP file splitting.
- Propose any ideas or improvements on how to make the data extraction cleaner or more robust.
- Tell me which libraries you would use (Playwright/Selenium preferred).