Web Scraping Expert — Large-Scale Data Extraction from a Next.js Marketplace

Budget: - HOURLY / FULL_TIME ⭐ 4.98 (81) United States

data-scraping, python, javascript, scrapy-framework, node.js, data-extraction, data-mining, microsoft-excel, php

Project Overview

We're looking for an experienced web scraping developer to build a reliable, maintainable data pipeline that extracts publicly available product and distributor data from a large e-commerce / electronic component marketplace. All data we want is publicly accessible; we're focused on doing this reliably and responsibly, at scale and within reasonable request limits.

What We Need

We want to collect structured data and apply flexible filtering across the marketplace, including:

- Manufacturer Part Numbers (MPNs): the part numbers listed under each manufacturer
- Parts per manufacturer: how many parts/SKUs each manufacturer carries
- Complete catalog coverage: surfacing the full set of public records, including listings that aren't shown in the default browse view (e.g., out-of-stock or inactive parts that still have public pages)

The goal is a clean, deduplicated dataset we can filter and query on the dimensions above.

Technical Considerations

- The target is a large, JavaScript-heavy Next.js / React marketplace. We're open to an approach that works with the site's underlying data/JSON layer rather than only scraping rendered HTML.
- The site has aggressive anti-bot and rate-limiting measures, so we need someone experienced with resilient, respectful scraping at scale: proper session handling, sensible request pacing, and proxy management.
- The full public dataset isn't easily reachable through normal page navigation. We need a strategy that reliably captures the complete set of public records, including listings that aren't surfaced in the default browse view.

Ideal Skills

- Proven experience scraping large, well-protected sites with sophisticated anti-bot systems (please cite a specific example in your application)
- Strong experience with JavaScript-heavy sites (Playwright/Puppeteer, Scrapy, or similar)
- Reverse-engineering of internal/undocumented API endpoints and Next.js data payloads
- Proxy rotation, session/cookie management, and request-pacing strategy
- Data cleaning, deduplication, and structured output (CSV / JSON / database)
- Comfortable working only with publicly accessible data and respecting reasonable request limits

Code Ownership & Workflow

This scraper is core infrastructure we'll continue to build on internally, so a few requirements:

- All work is work-for-hire: full source code, IP, and rights transfer to us on payment.
- Development happens directly in our private GitHub repository (we'll add you as a collaborator). Please commit incrementally with clear messages, not one final dump.
- We need a documented system architecture alongside the code: how the pipeline is structured, where the extraction logic lives, how session/rate-limit handling works, how to configure proxies, and how to run and extend it. The goal is that our in-house team can maintain and scale this without you after handoff.
- No reuse of this codebase or the extracted data for other clients or projects.

Deliverables

- A working scraper committed to our GitHub repo, with documentation
- Clean structured output filterable by distributor, part count, and manufacturer
- A documented method for achieving complete catalog coverage at scale
- A clear system architecture write-up for in-house maintenance

To Apply

Please briefly describe:

- A similar scraping project you've delivered, especially one involving large sites with strong anti-bot measures
- Your recommended approach for capturing a complete dataset where normal page navigation falls short
- Your typical tech stack and an estimated timeline
