Backend Data Acquisition, also known as ETL for a novel startup
Budget: $30.0 - $40.0
HOURLY / AS_NEEDED
⭐ 4.99 (12)
United States
etl-pipelines, python, data-mining
Data Acquisition Engineer — Web Scraping & ETL (Python, Playwright/Scrapy, etc.)
i. About the project
We run a subscription service for collectibles. Our core need is a reliable data pipeline that collects auction listings from a defined set of auction-house websites, normalizes the data into a consistent structure, and keeps it current. This is a long-term, ongoing project with steady work for the right person.
ii. What we need from YOU specifically
This role is focused on the **Extract** and **Transform** stages of ETL — getting structured data reliably out of websites that were not built to be scraped, and keeping it current as new auctions are posted. We have other team members for front-end and other work. We need a specialist who is genuinely strong at data acquisition. Please do NOT apply as a full-stack generalist; we want depth in scraping ability, not breadth.
iii. Core responsibilities
- Build and maintain Python scrapers that extract auction-lot data (lot number, title, description, estimates, images, sale dates, bidding deadlines) from a set of auction-house websites, each with a different structure.
- Use appropriate tools to handle JavaScript-rendered, dynamic pages where needed (browser-automation engines such as Playwright or Selenium), and lighter approaches for simpler static pages — your judgment on what each site requires.
- Output clean, well-structured JSON with consistent fields across all sources.
- Write transformation logic that normalizes inconsistent catalog text into standard fields.
- **Reliable catalog discovery is the priority:** the pipeline must detect when a new auction has been posted and capture its catalog well before the sale date. Per-lot churn is minor (lots are rarely added late, occasionally withdrawn), so we need dependable periodic re-checks, not high-frequency real-time polling.
- Handle pagination, responsible rate-limiting, retries, and error logging.
- Document each scraper so it can be maintained by others.
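To illustrate what we mean by normalizing inconsistent catalog text into consistent JSON, here is a minimal sketch. It is only an example: the `Lot` fields, the `normalize_lot` and `parse_estimate` helpers, the raw input keys, and the estimate formats the pattern handles are all assumptions made for illustration, not our actual schema.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical pattern: handles strings such as "Est. $1,200 - $1,800" or "USD 500-700".
_ESTIMATE_RE = re.compile(
    r"(?P<cur>\$|USD|EUR|GBP|€|£)?\s*(?P<low>\d[\d,]*)\s*(?:-|–|to)\s*"
    r"(?:\$|€|£)?\s*(?P<high>\d[\d,]*)",
    re.IGNORECASE,
)
_SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP"}


@dataclass
class Lot:
    """Illustrative target record; the real field list would be agreed per project."""
    source: str
    lot_number: str
    title: str
    estimate_low: Optional[int]
    estimate_high: Optional[int]
    currency: Optional[str]


def parse_estimate(text):
    """Return (low, high, currency), or (None, None, None) if no range is found."""
    match = _ESTIMATE_RE.search(text or "")
    if not match:
        return None, None, None
    cur = match.group("cur")
    currency = _SYMBOL_TO_CODE.get(cur, cur.upper()) if cur else None
    low = int(match.group("low").replace(",", ""))
    high = int(match.group("high").replace(",", ""))
    return low, high, currency


def normalize_lot(source, raw):
    """Map one site-specific raw dict onto the shared Lot structure."""
    low, high, currency = parse_estimate(raw.get("estimate", ""))
    return Lot(
        source=source,
        lot_number=str(raw.get("lot", "")).strip().lstrip("#"),
        title=" ".join(str(raw.get("title", "")).split()),  # collapse stray whitespace
        estimate_low=low,
        estimate_high=high,
        currency=currency,
    )


if __name__ == "__main__":
    raw = {"lot": "#42 ", "title": "  Vintage   card ", "estimate": "Est. $1,200 - $1,800"}
    print(json.dumps(asdict(normalize_lot("house_a", raw)), indent=2))
```

Each source would get its own extraction step, but all of them would feed the same normalizer so the output fields stay identical across sites.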
iv. Respectful scraping only
- Public pages only. **No logging in, no credential use, no circumvention of access controls.**
- Built-in rate limiting and polite request pacing are mandatory. We have relationships with these auction houses. You must never get us blocked or flagged.
- If a site has strong anti-scraping protection, the answer is to flag it for us to pursue an API arrangement — NOT to defeat the protection.
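As a rough sketch of the polite pacing we expect, the example below enforces a minimum interval between requests to the same host and retries transient failures with exponential backoff. The `PoliteFetcher` class, its parameter names, and the default values are hypothetical and chosen only for illustration; the `fetch` callable is injected, so nothing here is tied to a particular HTTP library.

```python
import logging
import time
from urllib.parse import urlparse

logger = logging.getLogger("polite_fetcher")


class PoliteFetcher:
    """Per-host pacing plus bounded retries. All defaults are illustrative assumptions."""

    def __init__(self, fetch, min_interval=5.0, max_retries=3, backoff=2.0,
                 sleep=time.sleep, clock=time.monotonic):
        self._fetch = fetch                # callable(url) -> response; raises on failure
        self._min_interval = min_interval  # seconds between requests to one host
        self._max_retries = max_retries    # retries after the first attempt
        self._backoff = backoff            # base delay, doubled on each retry
        self._sleep = sleep
        self._clock = clock
        self._last_request = {}            # host -> time of the most recent request

    def _wait_for_slot(self, host):
        """Sleep until at least min_interval has passed since the last request to host."""
        last = self._last_request.get(host)
        if last is not None:
            remaining = self._min_interval - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request[host] = self._clock()

    def get(self, url):
        host = urlparse(url).netloc
        for attempt in range(self._max_retries + 1):
            self._wait_for_slot(host)
            try:
                return self._fetch(url)
            except Exception as exc:
                if attempt == self._max_retries:
                    # Give up and surface the error so the site can be flagged for review.
                    logger.error("giving up on %s: %s", url, exc)
                    raise
                delay = self._backoff * (2 ** attempt)
                logger.warning("attempt %d for %s failed (%s); retrying in %.1fs",
                               attempt + 1, url, exc, delay)
                self._sleep(delay)
```

In practice `fetch` might wrap a plain HTTP call or a browser page load, depending on the site. A persistent failure is logged and re-raised rather than worked around, which matches the rule above: a site that pushes back gets flagged, not defeated.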
v. Required skills
- **Strong Python** — this is the one firm requirement.
- Demonstrated, real-world web scraping of dynamic/JS-heavy sites (please show examples).
- Solid command of the scraping toolkit and good judgment about which tool fits which site. Strong candidates typically work with some combination of a browser-automation engine (Playwright, Selenium, Puppeteer), a scraping framework (e.g. Scrapy), a plain HTTP fetcher (requests, httpx) for static pages, and a parser (BeautifulSoup, lxml, parsel). **We are not married to any specific library** — if you have a different approach you believe is better for a multi-site auction pipeline, tell us why. We care about reliable results and respectful scraping, not a particular tool.
- Clean JSON output and data-normalization experience.
- **Clear, fluent written AND spoken English is essential.** We work closely over Slack and video calls and need precise, direct, real-time communication.
vi. Nice to have (not required)
- AWS familiarity (S3, Lambda, basic scheduling) — we can handle infrastructure ourselves but it helps.
- Experience turning a scraper into a scheduled, repeatable job.
- Some backend logic experience (data matching, notifications) is a plus for possible future work — but this posting is for data acquisition, not full-stack.
vii. NOT required
- Full-stack. We are NOT looking for a full-stack developer.
- Front-end / React.
- Design.
viii. Interviews
All interviews are conducted over **video with camera on.** Please be prepared to discuss your scraping experience live and in English, and to talk through your past projects in real time. We require the camera to stay on because, with it off, an applicant who does not speak English well could rely on a translation program.
ix. Screening & test task
First, in your proposal, link any **existing scraper code** you can show (GitHub, portfolio, past work). GitHub is preferred.
If we move forward, there is a small, **paid**, bounded test task before any ongoing engagement: scrape one specific auction page we name privately, and return clean JSON plus a short note on your approach. This proves fit for both sides quickly.
x. To apply
In your proposal, briefly answer:
1. Describe one dynamic/JS-heavy site you have scraped and how you handled it.
2. What is your preferred tech stack for a multi-site scraping pipeline, and why? (Tell us your approach — we want to see your reasoning, not a specific buzzword.)
3. How do you rate-limit and avoid getting blocked?
xi. Please start your reply with the word **"BROCCOLI"** (without the quotation marks or asterisks) so we know you read this posting.