Product Catalog + Review Data Extraction and Normalization for Faucet Manufacturer Sites
Budget: $400.00
Fixed price
⭐ 4.99 (30)
United States
data-scraping
Title:
Web Data Extraction Specialist Needed: Product Catalogs + Reviews from Moen and Competitor Faucet Sites — 3-Day Turnaround
Job Type:
Fixed-price project
Turnaround:
3 calendar days from kickoff
Project Summary:
We need a data extraction specialist to pull publicly available product catalog, product detail, media, document, review, and Q&A data from several faucet manufacturer websites. The goal is not only to capture the raw source data, but also to normalize it into a consistent structure that can be compared across brands and websites.
This project is for manufacturer sites, not retailer marketplaces. The target brands are:
* Moen
* Delta Faucet
* Kohler
* American Standard
* Pfister
* GROHE
We need the extracted data to be harmonized across all sites, even when each site uses different labels, layouts, review widgets, or product specifications.
⸻
Scope of Work
The freelancer will crawl or manually extract publicly available data from each brand’s main product website, focusing on kitchen and bathroom faucet categories.
The work should include:
1. Product catalog data
* Category pages
* Collection pages
* Product listing pages
* Product detail pages
* Product variants and finishes
* Model numbers and SKU identifiers
* Product availability or discontinued status when visible
2. Product specification data
* Product title
* Brand
* Collection
* Model number
* SKU or part number
* Finish name
* Faucet type
* Installation type
* Number of handles
* Hole count
* Flow rate
* Spout height
* Spout reach
* Overall height
* Hose length
* Material
* Certifications
* Warranty
* Included parts
* Compatible parts or replacement parts
3. Media and documents
* Product image URLs
* Image type where inferable: hero, lifestyle, diagram, dimensional, finish swatch, installation, parts
* Video URLs if available
* Spec sheet URLs
* Installation guide URLs
* Warranty document URLs
* Parts diagram URLs
* CAD, BIM, Revit, or technical file URLs where available
4. Reviews and Q&A
* Review title
* Reviewer display name
* Review rating
* Review date
* Review body/text
* Verified purchase flag, if shown
* Incentivized or syndicated review flag, if shown
* Helpful vote count, if shown
* Brand response text, if shown
* Brand response date, if shown
* Q&A question text
* Q&A answer text
* Answer source: brand, retailer, customer, unknown
* Q&A date, if shown
5. Raw source capture
* Raw HTML, JSON, embedded structured data, or page text where available
* Source URL for every extracted row
* Crawl timestamp
* Capture method used: static HTML, browser-rendered, API/JSON endpoint, manual extraction
* Notes for blocked, partially loaded, or JavaScript-dependent sections
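For illustration only, here is a minimal Python sketch of how each raw capture could carry its source URL, crawl timestamp, capture method, and notes. The RawCapture class, the make_capture helper, and the exact capture-method strings are assumptions made for this example, not a required implementation.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Assumed spellings for the four capture methods listed above.
CAPTURE_METHODS = {"static_html", "browser_rendered", "api_json", "manual"}


@dataclass
class RawCapture:
    source_url: str
    capture_method: str
    crawl_timestamp: str  # ISO 8601, UTC
    raw_body: str         # raw HTML, JSON, or page text
    notes: str = ""       # e.g. "review widget did not load"


def make_capture(source_url, capture_method, raw_body, notes="", now=None):
    """Build a capture record, stamping it with the current UTC time."""
    if capture_method not in CAPTURE_METHODS:
        raise ValueError(f"unknown capture method: {capture_method}")
    now = now or datetime.now(timezone.utc)
    return RawCapture(
        source_url=source_url,
        capture_method=capture_method,
        crawl_timestamp=now.isoformat(timespec="seconds"),
        raw_body=raw_body,
        notes=notes,
    )
```

A record built this way can be turned into a dictionary with asdict() and saved next to the raw file it describes.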
⸻
Required Deliverables
Please deliver both raw data and normalized data.
1. Raw Data Folder
Include raw exports by brand and page type.
Example folder structure:
raw/
  moen/
    category_pages/
    product_pages/
    reviews/
    documents/
  delta/
  kohler/
  american_standard/
  pfister/
  grohe/
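As a rough sketch of how files could be placed in this layout, the Python helper below builds a path from the brand, the page type, and a short hash of the source URL. The raw_path function and the hash-based file name are assumptions for this example; any stable naming scheme is acceptable.

```python
import hashlib
from pathlib import PurePosixPath

# Page-type folders taken from the example structure above.
PAGE_TYPES = {"category_pages", "product_pages", "reviews", "documents"}


def raw_path(brand, page_type, source_url, extension="html"):
    """Return raw/<brand>/<page_type>/<url-hash>.<extension>."""
    if page_type not in PAGE_TYPES:
        raise ValueError(f"unknown page type: {page_type}")
    slug = brand.strip().lower().replace(" ", "_")
    # A short hash keeps file names stable and unique per source URL.
    digest = hashlib.sha1(source_url.encode("utf-8")).hexdigest()[:12]
    return PurePosixPath("raw") / slug / page_type / f"{digest}.{extension}"
```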
Raw files may be delivered as:
.html
.json
.csv
.txt
.xlsx
Each raw file should include the original source URL and timestamp.
⸻
2. Normalized Data Files
Please deliver the normalized output as CSV and JSON. An Excel (.xlsx) copy is also preferred.
Required normalized tables:
brands.csv
categories.csv
products.csv
product_variants.csv
product_specs.csv
product_assets.csv
product_documents.csv
reviews.csv
qa.csv
crawl_coverage.csv
data_dictionary.csv
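To show what we mean by delivering each table as both CSV and JSON, here is a minimal Python sketch using only the standard library. The write_table helper and its arguments are hypothetical names for this example.

```python
import csv
import json
from pathlib import Path


def write_table(rows, columns, out_dir, name):
    """Write the same rows to <name>.csv and <name>.json with a fixed column order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{name}.csv"
    json_path = out_dir / f"{name}.json"

    with csv_path.open("w", newline="", encoding="utf-8") as f:
        # Missing columns become empty cells; extra keys are dropped.
        writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore", restval="")
        writer.writeheader()
        writer.writerows(rows)

    with json_path.open("w", encoding="utf-8") as f:
        records = [{c: r.get(c) for c in columns} for r in rows]
        json.dump(records, f, ensure_ascii=False, indent=2)

    return csv_path, json_path
```

Using one column list for both outputs keeps the CSV header and the JSON keys identical, which makes loading into a database simpler.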
⸻
Required Normalized Schema
The exact site labels may differ, but the final output should harmonize all fields into the following structure.
brands.csv
brand_id
brand_name
brand_site_url
country_or_region
crawl_status
notes
categories.csv
brand_id
category_id
category_name
parent_category
category_url
product_count_visible
crawl_timestamp
products.csv
brand_id
product_id
source_product_id
product_url
product_title
brand_name
collection_name
category_name
subcategory_name
model_number
sku
mpn
upc_gtin
product_type
faucet_type
mounting_type
installation_type
status
list_price
sale_price
currency
description_raw
description_normalized
crawl_timestamp
product_variants.csv
brand_id
product_id
variant_id
model_number
sku
finish_name_raw
finish_name_normalized
finish_code
color_family
variant_url
primary_image_url
availability_status
product_specs.csv
Use one row per product/specification field.
brand_id
product_id
variant_id
spec_group_raw
spec_name_raw
spec_value_raw
spec_name_normalized
spec_value_normalized
unit_raw
unit_normalized
confidence_score
source_url
crawl_timestamp
Examples of normalized spec names:
flow_rate_gpm
spout_reach_in
spout_height_in
overall_height_in
hole_count
handle_count
handle_type
spray_functions
deckplate_included
connection_size
connection_type
hose_length_in
material
warranty
certifications
ada_compliant
watersense_certified
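As an example of the kind of spec harmonization we have in mind, the Python sketch below maps a raw label and value to a normalized name, value, and unit. The label mapping, unit aliases, and the normalize_spec helper are illustrative assumptions; real sites will need a much larger mapping.

```python
import re

# Hypothetical raw labels mapped to the normalized spec names above.
SPEC_NAME_MAP = {
    "flow rate": "flow_rate_gpm",
    "max flow rate": "flow_rate_gpm",
    "spout reach": "spout_reach_in",
    "spout height": "spout_height_in",
}

# Conversion factors from (source unit, target unit).
UNIT_FACTORS = {
    ("gpm", "gpm"): 1.0,
    ("l/min", "gpm"): 0.264172,  # litres per minute to US gallons per minute
    ("in", "in"): 1.0,
    ("mm", "in"): 1 / 25.4,
}

UNIT_ALIASES = {'"': "in", "in.": "in", "inch": "in", "inches": "in", "lpm": "l/min"}

VALUE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(.*?)\s*$")


def normalize_spec(spec_name_raw, spec_value_raw):
    """Return one product_specs row fragment; unmapped fields stay None."""
    name = SPEC_NAME_MAP.get(spec_name_raw.strip().lower())
    row = {
        "spec_name_raw": spec_name_raw,
        "spec_value_raw": spec_value_raw,
        "spec_name_normalized": name,
        "spec_value_normalized": None,
        "unit_raw": None,
        "unit_normalized": None,
    }
    match = VALUE_RE.match(spec_value_raw)
    if name is None or match is None:
        return row
    number, unit_raw = float(match.group(1)), match.group(2)
    row["unit_raw"] = unit_raw
    unit = UNIT_ALIASES.get(unit_raw.lower(), unit_raw.lower())
    target = name.rsplit("_", 1)[1]  # unit suffix of the normalized name
    factor = UNIT_FACTORS.get((unit, target))
    if factor is not None:
        row["spec_value_normalized"] = round(number * factor, 2)
        row["unit_normalized"] = target
    return row
```

The raw name and value are always kept alongside the normalized ones, so any mapping mistake can be traced back and corrected.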
product_assets.csv
brand_id
product_id
variant_id
asset_id
asset_type
asset_url
asset_position
asset_title
alt_text
image_width
image_height
is_primary
asset_category_normalized
crawl_timestamp
Suggested values for asset_category_normalized:
hero
lifestyle
silo
dimension_diagram
installation
parts_diagram
finish_swatch
feature_infographic
video
unknown
product_documents.csv
brand_id
product_id
variant_id
document_id
document_type_raw
document_type_normalized
document_title
document_url
file_type
language
document_date
crawl_timestamp
Suggested values for document_type_normalized:
spec_sheet
installation_guide
warranty
parts_diagram
care_guide
cad
bim
revit
technical_drawing
unknown
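One simple way to fill document_type_normalized is a keyword lookup over the raw label and the document URL. The Python sketch below is a naive illustration; the keyword lists and the normalize_document_type helper are assumptions, and substring matching will need tuning against real data.

```python
# Ordered rules: the first matching keyword wins. Keywords are guesses.
DOC_TYPE_RULES = [
    ("installation_guide", ("install",)),
    ("spec_sheet", ("spec",)),
    ("parts_diagram", ("parts", "exploded")),
    ("warranty", ("warranty",)),
    ("care_guide", ("care", "cleaning")),
    ("revit", ("revit", ".rfa")),
    ("bim", ("bim",)),
    ("cad", ("cad", ".dwg", ".dxf")),
    ("technical_drawing", ("drawing", "dimension")),
]


def normalize_document_type(document_type_raw, document_url=""):
    """Map a raw document label or URL to one of the suggested values."""
    haystack = f"{document_type_raw} {document_url}".lower()
    for normalized, keywords in DOC_TYPE_RULES:
        if any(keyword in haystack for keyword in keywords):
            return normalized
    return "unknown"
```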
reviews.csv
This is especially important. Please harmonize review data across all sites.
brand_id
product_id
variant_id
review_id
source_review_id
review_url
review_title
reviewer_display_name
reviewer_location
rating
rating_scale
review_date
review_body
pros
cons
verified_purchase_flag
incentivized_review_flag
syndicated_review_flag
helpful_vote_count
not_helpful_vote_count
brand_response_flag
brand_response_text
brand_response_date
source_platform
crawl_timestamp
Use the exact public reviewer name as displayed on the site. If no name is shown, use:
anonymous
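To make the review harmonization concrete, here is a minimal Python sketch that trims text, falls back to anonymous when no name is shown, and converts dates to ISO format. The input keys (title, name, rating, scale, date, body), the default 5-point scale, and the list of date formats are assumptions for this example; each review widget will expose its own field names.

```python
from datetime import datetime

# Assumed date formats; month names are parsed with the default English locale.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%b %d, %Y")


def parse_review_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_review(raw, source_platform="unknown"):
    """Map one raw review dict to a subset of the reviews.csv columns."""
    name = (raw.get("name") or "").strip()
    rating = raw.get("rating")
    return {
        "review_title": (raw.get("title") or "").strip(),
        # Keep the public name exactly as displayed; otherwise use "anonymous".
        "reviewer_display_name": name or "anonymous",
        "rating": float(rating) if rating not in (None, "") else None,
        "rating_scale": raw.get("scale", 5),  # assumed default of 5 stars
        "review_date": parse_review_date(raw.get("date") or ""),
        # Collapse line breaks and repeated spaces in the body text.
        "review_body": " ".join((raw.get("body") or "").split()),
        "source_platform": source_platform,
    }
```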
qa.csv
brand_id
product_id
variant_id
qa_id
question_id
question_text
question_author_display_name
question_date
answer_id
answer_text
answer_author_display_name
answer_source_type
answer_date
helpful_vote_count
source_url
crawl_timestamp
Suggested values for answer_source_type:
brand
manufacturer
retailer
customer
staff
unknown
crawl_coverage.csv
brand_id
site_section
source_url
crawl_status
fields_expected
fields_extracted
reviews_visible
reviews_extracted
qa_visible
qa_extracted
documents_visible
documents_extracted
render_required
blocked_flag
notes
crawl_timestamp
Suggested crawl_status values:
complete
partial
blocked
not_available
needs_browser_render
manual_review_required
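As a sketch of how crawl_status could be assigned consistently, the Python function below derives a status from visible and extracted counts. The decision order and the derive_crawl_status name are assumptions for this example; if you use a different rule, please describe it in the coverage notes.

```python
def derive_crawl_status(visible, extracted, blocked=False,
                        render_required=False, rendered=False):
    """Pick one of the suggested crawl_status values for a site section."""
    if blocked:
        return "blocked"
    if render_required and not rendered:
        return "needs_browser_render"
    if visible == 0:
        return "not_available"
    if extracted == 0:
        # Items are visible on the page but nothing was captured automatically.
        return "manual_review_required"
    if extracted >= visible:
        return "complete"
    return "partial"
```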
⸻
Timeline
This project requires a 3-day turnaround.
Day 1: Source Mapping and Sample Pull
Deliver:
* Confirmed crawl approach for each site
* 3–5 sample products per brand
* Sample raw data
* Sample normalized rows
* Notes on review widgets, JavaScript rendering, blocked sections, or missing data
Day 2: Full Extraction
Deliver:
* Full catalog pull for target faucet categories
* Product detail data
* Specs
* Assets
* Documents
* Review and Q&A extraction where publicly available
* Raw files organized by brand
Day 3: Normalization, QA, and Final Delivery
Deliver:
* Final normalized CSV/JSON files
* Raw source files
* Coverage report
* Data dictionary
* QA notes
* List of fields that could not be captured and why
⸻
Required Skills
The ideal freelancer should have experience with:
* Web scraping and data extraction
* Python, Playwright, Puppeteer, Selenium, BeautifulSoup, or similar tools
* JavaScript-rendered websites
* Product catalog data
* E-commerce or manufacturer PDPs
* Review and Q&A extraction
* CSV/JSON data normalization
* Data QA and schema mapping
⸻
Proposal Instructions
In your proposal, please include:
1. Your experience scraping product catalogs or manufacturer sites.
2. Your experience extracting reviews and Q&A.
3. The tools you would use.
4. Whether you can complete this in 3 calendar days.
5. A short description of how you would preserve both raw data and normalized data.
6. One example of a normalized review schema you have used before, or how you would structure one.
7. Any risks you see with JavaScript-rendered review widgets.
⸻
Success Criteria
The project will be considered successful if:
* Raw data is preserved by brand and source URL.
* Normalized data uses one harmonized structure across all sites.
* Product, review, Q&A, asset, and document fields are consistently mapped.
* Coverage gaps are clearly documented.
* Each row includes a source URL and crawl timestamp.
* The output can be loaded into a dashboard or database without additional restructuring.
* Delivery is completed within 3 calendar days.
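A quick check like the Python sketch below could be run before delivery to confirm that each row carries a source URL and crawl timestamp. The find_incomplete_rows helper is a hypothetical name, and the required column names should be adjusted per table (for example, product_url in products.csv).

```python
def find_incomplete_rows(rows, required=("source_url", "crawl_timestamp")):
    """Return (row_index, missing_columns) for rows with empty required fields."""
    problems = []
    for index, row in enumerate(rows):
        missing = [c for c in required if not str(row.get(c) or "").strip()]
        if missing:
            problems.append((index, missing))
    return problems
```

An empty result means every row passed; anything else can be copied into the QA notes.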
⸻
Optional Fixed-Price Milestones
Milestone 1 — Day 1 Sample and Crawl Map:
Sample extraction from all six sites, schema confirmation, and coverage notes.
Milestone 2 — Day 2 Full Raw Pull:
Raw product catalog, PDP, review, Q&A, asset, and document data.
Milestone 3 — Day 3 Normalized Delivery:
Final CSV/JSON/XLSX files, coverage report, and data dictionary.
⸻
Short Screening Questions
1. Can you capture both raw source data and normalized structured data?
2. Can you complete the project in 3 calendar days?
3. Which tools would you use for JavaScript-rendered product and review pages?
4. How would you normalize review title, reviewer name, review body, rating, date, and brand response across different sites?
5. How do you document fields that are visible on a page but fail to extract automatically?