Product Catalog + Review Data Extraction and Normalization for Faucet Manufacturer Sites

Budget: $400.0 FIXED / ⭐ 4.99 (30) United States

data-scraping

Title: Web Data Extraction Specialist Needed: Product Catalogs + Reviews from Moen and Competitor Faucet Sites (3-Day Turnaround)
Job Type: Fixed-price project
Turnaround: 3 calendar days from kickoff

Project Summary:

We need a data extraction specialist to pull publicly available product catalog, product detail, media, document, review, and Q&A data from several faucet manufacturer websites. The goal is not only to capture the raw source data, but also to normalize it into a consistent structure that can be compared across brands and websites.

This project is for manufacturer sites, not retailer marketplaces. The target brands are:

* Moen
* Delta Faucet
* Kohler
* American Standard
* Pfister
* GROHE

We need the extracted data to be harmonized across all sites, even when each site uses different labels, layouts, review widgets, or product specifications.

⸻

Scope of Work

The freelancer will crawl or manually extract publicly available data from each brand's main product website, focusing on kitchen and bathroom faucet categories. The work should include:

1. Product catalog data

* Category pages
* Collection pages
* Product listing pages
* Product detail pages
* Product variants and finishes
* Model numbers and SKU identifiers
* Product availability or discontinued status when visible

2. Product specification data

* Product title
* Brand
* Collection
* Model number
* SKU or part number
* Finish name
* Faucet type
* Installation type
* Number of handles
* Hole count
* Flow rate
* Spout height
* Spout reach
* Overall height
* Hose length
* Material
* Certifications
* Warranty
* Included parts
* Compatible parts or replacement parts

3. Media and documents

* Product image URLs
* Image type where inferable: hero, lifestyle, diagram, dimensional, finish swatch, installation, parts
* Video URLs if available
* Spec sheet URLs
* Installation guide URLs
* Warranty document URLs
* Parts diagram URLs
* CAD, BIM, Revit, or technical file URLs where available

4. Reviews and Q&A

* Review title
* Reviewer display name
* Review rating
* Review date
* Review body/text
* Verified purchase flag, if shown
* Incentivized or syndicated review flag, if shown
* Helpful vote count, if shown
* Brand response text, if shown
* Brand response date, if shown
* Q&A question text
* Q&A answer text
* Answer source: brand, retailer, customer, unknown
* Q&A date, if shown

5. Raw source capture

* Raw HTML, JSON, embedded structured data, or page text where available
* Source URL for every extracted row
* Crawl timestamp
* Capture method used: static HTML, browser-rendered, API/JSON endpoint, manual extraction
* Notes for blocked, partially loaded, or JavaScript-dependent sections

⸻

Required Deliverables

Please deliver both raw data and normalized data.

1. Raw Data Folder

Include raw exports by brand and page type. Example folder structure:

raw/
  moen/
    category_pages/
    product_pages/
    reviews/
    documents/
  delta/
  kohler/
  american_standard/
  pfister/
  grohe/

Raw files may be delivered as: .html, .json, .csv, .txt, .xlsx

Each raw file should include the original source URL and timestamp.

⸻

2. Normalized Data Files

Please deliver the normalized output as CSV and JSON. Excel is also preferred.

Required normalized tables:

brands.csv
categories.csv
products.csv
product_variants.csv
product_specs.csv
product_assets.csv
product_documents.csv
reviews.csv
qa.csv
crawl_coverage.csv
data_dictionary.csv

⸻

Required Normalized Schema

The exact site labels may differ, but the final output should harmonize all fields into the following structure.
brands.csv

brand_id
brand_name
brand_site_url
country_or_region
crawl_status
notes

categories.csv

brand_id
category_id
category_name
parent_category
category_url
product_count_visible
crawl_timestamp

products.csv

brand_id
product_id
source_product_id
product_url
product_title
brand_name
collection_name
category_name
subcategory_name
model_number
sku
mpn
upc_gtin
product_type
faucet_type
mounting_type
installation_type
status
list_price
sale_price
currency
description_raw
description_normalized
crawl_timestamp

product_variants.csv

brand_id
product_id
variant_id
model_number
sku
finish_name_raw
finish_name_normalized
finish_code
color_family
variant_url
primary_image_url
availability_status

product_specs.csv

Use one row per product/specification field.

brand_id
product_id
variant_id
spec_group_raw
spec_name_raw
spec_value_raw
spec_name_normalized
spec_value_normalized
unit_raw
unit_normalized
confidence_score
source_url
crawl_timestamp

Examples of normalized spec names:

flow_rate_gpm
spout_reach_in
spout_height_in
overall_height_in
hole_count
handle_count
handle_type
spray_functions
deckplate_included
connection_size
connection_type
hose_length_in
material
warranty
certifications
ada_compliant
watersense_certified

product_assets.csv

brand_id
product_id
variant_id
asset_id
asset_type
asset_url
asset_position
asset_title
alt_text
image_width
image_height
is_primary
asset_category_normalized
crawl_timestamp

Suggested values for asset_category_normalized:

hero
lifestyle
silo
dimension_diagram
installation
parts_diagram
finish_swatch
feature_infographic
video
unknown

product_documents.csv

brand_id
product_id
variant_id
document_id
document_type_raw
document_type_normalized
document_title
document_url
file_type
language
document_date
crawl_timestamp

Suggested values for document_type_normalized:

spec_sheet
installation_guide
warranty
parts_diagram
care_guide
cad
bim
revit
technical_drawing
unknown

reviews.csv

This is especially important. Please harmonize review data across all sites.

brand_id
product_id
variant_id
review_id
source_review_id
review_url
review_title
reviewer_display_name
reviewer_location
rating
rating_scale
review_date
review_body
pros
cons
verified_purchase_flag
incentivized_review_flag
syndicated_review_flag
helpful_vote_count
not_helpful_vote_count
brand_response_flag
brand_response_text
brand_response_date
source_platform
crawl_timestamp

Use the exact public reviewer name as displayed on the site. If no name is shown, use: anonymous

qa.csv

brand_id
product_id
variant_id
qa_id
question_id
question_text
question_author_display_name
question_date
answer_id
answer_text
answer_author_display_name
answer_source_type
answer_date
helpful_vote_count
source_url
crawl_timestamp

Suggested values for answer_source_type:

brand
manufacturer
retailer
customer
staff
unknown

crawl_coverage.csv

brand_id
site_section
source_url
crawl_status
fields_expected
fields_extracted
reviews_visible
reviews_extracted
qa_visible
qa_extracted
documents_visible
documents_extracted
render_required
blocked_flag
notes
crawl_timestamp

Suggested crawl_status values:

complete
partial
blocked
not_available
needs_browser_render
manual_review_required

⸻

Timeline

This project requires a 3-day turnaround.
Day 1: Source Mapping and Sample Pull

Deliver:

* Confirmed crawl approach for each site
* 3–5 sample products per brand
* Sample raw data
* Sample normalized rows
* Notes on review widgets, JavaScript rendering, blocked sections, or missing data

Day 2: Full Extraction

Deliver:

* Full catalog pull for target faucet categories
* Product detail data
* Specs
* Assets
* Documents
* Review and Q&A extraction where publicly available
* Raw files organized by brand

Day 3: Normalization, QA, and Final Delivery

Deliver:

* Final normalized CSV/JSON files
* Raw source files
* Coverage report
* Data dictionary
* QA notes
* List of fields that could not be captured and why

⸻

Required Skills

The ideal freelancer should have experience with:

* Web scraping and data extraction
* Python, Playwright, Puppeteer, Selenium, BeautifulSoup, or similar tools
* JavaScript-rendered websites
* Product catalog data
* E-commerce or manufacturer PDPs
* Review and Q&A extraction
* CSV/JSON data normalization
* Data QA and schema mapping

⸻

Proposal Instructions

In your proposal, please include:

1. Your experience scraping product catalogs or manufacturer sites.
2. Your experience extracting reviews and Q&A.
3. The tools you would use.
4. Whether you can complete this in 3 calendar days.
5. A short description of how you would preserve both raw data and normalized data.
6. One example of a normalized review schema you have used before, or how you would structure one.
7. Any risks you see with JavaScript-rendered review widgets.

⸻

Success Criteria

The project will be considered successful if:

* Raw data is preserved by brand and source URL.
* Normalized data uses one harmonized structure across all sites.
* Product, review, Q&A, asset, and document fields are consistently mapped.
* Coverage gaps are clearly documented.
* Each row includes a source URL and crawl timestamp.
* The output can be loaded into a dashboard or database without additional restructuring.
* Delivery is completed within 3 calendar days.

⸻

Optional Fixed-Price Milestones

Milestone 1, Day 1 Sample and Crawl Map: Sample extraction from all six sites, schema confirmation, and coverage notes.

Milestone 2, Day 2 Full Raw Pull: Raw product catalog, PDP, review, Q&A, asset, and document data.

Milestone 3, Day 3 Normalized Delivery: Final CSV/JSON/XLSX files, coverage report, and data dictionary.

⸻

Short Screening Questions

1. Can you capture both raw source data and normalized structured data?
2. Can you complete the project in 3 calendar days?
3. Which tools would you use for JavaScript-rendered product and review pages?
4. How would you normalize review title, reviewer name, review body, rating, date, and brand response across different sites?
5. How do you document fields that are visible on a page but fail to extract automatically?
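
To make screening question 4 concrete, here is a minimal Python sketch of one way a raw review record could be mapped onto the reviews.csv columns. It is an illustration only, not a required implementation. The function name normalize_review, the per-site field_map, and the raw keys in the usage note are hypothetical; each site's real field names and date format would need to be confirmed during the Day 1 source mapping.

```python
import re
from datetime import datetime

ANONYMOUS = "anonymous"  # fallback required by the posting when no name is shown


def _clean(value):
    """Collapse runs of whitespace; return None for empty or missing values."""
    if value is None:
        return None
    text = re.sub(r"\s+", " ", str(value)).strip()
    return text or None


def normalize_review(raw, field_map, date_format, rating_scale,
                     review_url, crawl_timestamp):
    """Map one raw review dict onto reviews.csv column names.

    field_map is a hypothetical per-site mapping {raw_key: normalized_column}.
    date_format is the strptime pattern the site uses (an assumption per site).
    """
    row = {target: _clean(raw.get(source)) for source, target in field_map.items()}

    # Keep the public display name exactly as shown, or fall back to "anonymous".
    row["reviewer_display_name"] = row.get("reviewer_display_name") or ANONYMOUS

    # Store the rating as a number and record the scale it was given on.
    if row.get("rating") is not None:
        row["rating"] = float(row["rating"])
    row["rating_scale"] = rating_scale

    # Convert site-specific date strings to ISO 8601 (YYYY-MM-DD).
    if row.get("review_date") is not None:
        parsed = datetime.strptime(row["review_date"], date_format)
        row["review_date"] = parsed.date().isoformat()

    # Derive the flag from whether any brand response text was captured.
    row["brand_response_flag"] = row.get("brand_response_text") is not None

    row["review_url"] = review_url
    row["crawl_timestamp"] = crawl_timestamp
    return row
```

A site whose widget exposes keys such as "headline", "nickname", "stars", "text", "submitted", and "reply" (made-up names) would supply a field_map from those keys to review_title, reviewer_display_name, rating, review_body, review_date, and brand_response_text. Only that mapping and the date format change from site to site.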

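For screening question 5, one possible way to document fields that are visible but fail to extract is to generate a crawl_coverage.csv row per page. The Python sketch below is a rough illustration under stated assumptions: the helper name build_coverage_row is hypothetical, fields_expected and fields_extracted are treated as counts (the posting does not specify their type), and only a subset of the crawl_coverage.csv columns is filled in.

```python
def build_coverage_row(brand_id, site_section, source_url, expected_fields,
                       extracted, crawl_timestamp,
                       render_required=False, blocked=False):
    """Build one crawl_coverage.csv row (subset of columns) for a page.

    expected_fields: names of fields visible on the page.
    extracted: dict of field name -> extracted value (None or "" means failed).
    """
    # A field counts as missing if it was expected but came back empty.
    missing = [f for f in expected_fields if extracted.get(f) in (None, "")]

    # Uses three of the crawl_status values suggested in the posting.
    if blocked:
        status = "blocked"
    elif not missing:
        status = "complete"
    else:
        status = "partial"

    return {
        "brand_id": brand_id,
        "site_section": site_section,
        "source_url": source_url,
        "crawl_status": status,
        "fields_expected": len(expected_fields),  # assumption: stored as a count
        "fields_extracted": len(expected_fields) - len(missing),
        "render_required": render_required,
        "blocked_flag": blocked,
        # Name the gaps so they can be reviewed or captured manually.
        "notes": ("missing: " + ", ".join(missing)) if missing else "",
        "crawl_timestamp": crawl_timestamp,
    }
```

Listing the missing field names in notes keeps the gap visible to whoever reviews the coverage report, which lines up with the Day 3 deliverable "List of fields that could not be captured and why".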