Web Scraping Expert — Large-Scale Data Extraction from a Next.js Marketplace
Budget: -
Hourly / Full-time
⭐ 4.98 (81)
United States
data-scraping, python, javascript, scrapy-framework, node.js, data-extraction, data-mining, microsoft-excel, php
Project Overview
We're looking for an experienced web scraping developer to build a reliable, maintainable data pipeline that extracts publicly available product and distributor data from a large e-commerce / electronic component marketplace. All data we want is publicly accessible — we're focused on doing this reliably and responsibly, at scale and within reasonable request limits.
What We Need
We want to collect structured data from across the marketplace and be able to filter it flexibly. The dataset should cover:
Manufacturer Part Numbers (MPNs) — the part numbers listed under each manufacturer
Parts per manufacturer — how many parts/SKUs each manufacturer carries
Complete catalog coverage — surfacing the full set of public records, including listings that aren't shown in the default browse view (e.g., out-of-stock or inactive parts that still have public pages)
The goal is a clean, deduplicated dataset we can filter and query on the dimensions above.
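To illustrate what "clean and deduplicated" could mean in practice, here is a minimal sketch rather than a required design. It assumes each record is a dictionary with hypothetical `manufacturer` and `mpn` fields, and it treats two records as duplicates when their manufacturer and normalized MPN match. The real schema and matching rules are open for discussion.

```python
from collections import Counter


def normalize_mpn(mpn: str) -> str:
    """Collapse case and surrounding whitespace so trivially different spellings match."""
    return mpn.strip().upper()


def deduplicate(records):
    """Keep the first record seen for each (manufacturer, normalized MPN) pair."""
    seen = set()
    unique = []
    for rec in records:
        # Field names are hypothetical; adapt them to the real schema.
        key = (rec["manufacturer"].strip().lower(), normalize_mpn(rec["mpn"]))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def parts_per_manufacturer(records):
    """Count parts per manufacturer in an already deduplicated list."""
    return Counter(rec["manufacturer"].strip() for rec in records)


# Made-up sample data, for illustration only.
sample = [
    {"manufacturer": "Acme", "mpn": "ab-100"},
    {"manufacturer": "Acme", "mpn": " AB-100 "},  # duplicate after normalization
    {"manufacturer": "Acme", "mpn": "AB-200"},
    {"manufacturer": "Globex", "mpn": "GX-1"},
]
unique = deduplicate(sample)
counts = parts_per_manufacturer(unique)
```

Counting after deduplication keeps the parts-per-manufacturer figure from being inflated by repeated listings of the same part.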
Technical Considerations
The target is a large, JavaScript-heavy Next.js / React marketplace. We're open to an approach that works with the site's underlying data/JSON layer rather than only scraping rendered HTML.
The site has aggressive anti-bot and rate-limiting measures, so we need someone experienced with resilient, respectful scraping at scale — proper session handling, sensible request pacing, and proxy management.
The full public dataset isn't easily reachable through normal page navigation. We need a strategy that reliably captures the complete set of public records, including listings that aren't surfaced in the default browse view.
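As one example of working with the data/JSON layer, Next.js sites built on the Pages Router embed page data in a `<script id="__NEXT_DATA__">` tag. The sketch below reads that payload from an HTML string using only the Python standard library. Whether the target site exposes its data this way is an assumption we have not confirmed, and the `part` key in the sample payload is hypothetical.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__">, if present."""

    def __init__(self):
        super().__init__()
        self._in_target = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._in_target = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_target = False

    def handle_data(self, data):
        if self._in_target:
            self._chunks.append(data)

    def payload(self):
        raw = "".join(self._chunks)
        return json.loads(raw) if raw else None


def extract_next_data(html: str):
    """Return the parsed __NEXT_DATA__ JSON, or None when the tag is absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    return parser.payload()


# Made-up page for illustration; the "part" key is a hypothetical example.
sample_html = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"part": {"mpn": "AB-100", "manufacturer": "Acme"}}}}'
    '</script></body></html>'
)
data = extract_next_data(sample_html)
part = data["props"]["pageProps"]["part"]
```

Reading structured JSON in this way tends to be less brittle than parsing rendered markup, but a site using the App Router or client-side fetching would need a different approach.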
Ideal Skills
Proven experience scraping large, well-protected sites with sophisticated anti-bot systems — please cite a specific example in your application
Strong experience with JavaScript-heavy sites (Playwright/Puppeteer, Scrapy, or similar)
Reverse-engineering of internal/undocumented API endpoints and Next.js data payloads
Proxy rotation, session/cookie management, and request-pacing strategy
Data cleaning, deduplication, and structured output (CSV / JSON / database)
Comfortable working only with publicly accessible data and respecting reasonable request limits
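For a sense of what we mean by request pacing, here is a minimal sketch and not a prescribed implementation. It pairs a minimum interval between requests with jittered exponential backoff for retries. The class and function names, the intervals, and the caps are all illustrative assumptions. Real limits would need to be set conservatively against the site's observed behavior.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


class Pacer:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the pacing logic can be tested without waiting
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A caller would invoke `Pacer.wait()` before each request and sleep for `backoff_delay(attempt)` after a rate-limit response before retrying.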
Code Ownership & Workflow
This scraper is core infrastructure we'll continue to build on internally, so a few requirements:
All work is work-for-hire — full source code, IP, and rights transfer to us on payment.
Development happens directly in our private GitHub repository (we'll add you as a collaborator). Please commit incrementally with clear messages, not one final dump.
We need a documented system architecture alongside the code: how the pipeline is structured, where the extraction logic lives, how session/rate-limit handling works, how to configure proxies, and how to run and extend it. The goal is that our in-house team can maintain and scale this without you after handoff.
No reuse of this codebase or the extracted data for other clients or projects.
Deliverables
A working scraper committed to our GitHub repo, with documentation
Clean structured output filterable by distributor, part count, and manufacturer
A documented method for achieving complete catalog coverage at scale
A clear system architecture write-up for in-house maintenance
To Apply
Please briefly describe:
A similar scraping project you've delivered — especially involving large sites with strong anti-bot measures
Your recommended approach for capturing a complete dataset where normal page navigation falls short
Your typical tech stack and an estimated timeline