Scrape Data/Photos From A Webpage - Cronjob/Python?
Budget: $500.0
Fixed price
⭐ 4.98 (56)
United States
python, data-scraping, data-extraction
I need you to go to a page like this (EXAMPLE): https://www.machinerytrader.com/listings/for-sale/21st-century-equipment-llc-alliance-ne/construction-equipment?LocationID=350000078660
This is a dealer's page with their inventory. I want you to extract the year, make, model, miles, and price for every listing.
Then you will send me a CSV file with all the information.
The goal is to send you a link, and you collect all the information above and send it to me via CSV file.
I'd like to automate this.
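As a rough illustration of the output described above, here is a minimal sketch of turning already-scraped listing text into CSV rows with the five requested columns. It assumes (none of this comes from the actual page) that each listing exposes a title such as "2019 CATERPILLAR 320" plus free-text mileage and price strings; `parse_listing`, `to_number`, and `listings_to_csv` are hypothetical helper names, and fetching the page is left out entirely.

```python
import csv
import io
import re

# Assumed title shape: "<year> <make> <model...>". Real titles may differ,
# and multi-word makes (e.g. "JOHN DEERE") would need a lookup table.
TITLE_RE = re.compile(
    r"^\s*(?P<year>(?:19|20)\d{2})\s+(?P<make>\S+)\s+(?P<model>.+?)\s*$"
)

FIELDS = ["year", "make", "model", "miles", "price"]


def to_number(text):
    """Return the first integer in text (ignoring commas), or None."""
    match = re.search(r"\d[\d,]*", text or "")
    return int(match.group().replace(",", "")) if match else None


def parse_listing(title, miles_text, price_text):
    """Build one row dict from raw listing strings, or None if unparseable."""
    match = TITLE_RE.match(title)
    if not match:
        return None
    return {
        "year": int(match["year"]),
        "make": match["make"],
        "model": match["model"],
        "miles": to_number(miles_text),
        "price": to_number(price_text),
    }


def listings_to_csv(rows):
    """Serialize row dicts to CSV text, skipping rows that failed to parse."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row is not None:
            writer.writerow(row)
    return buffer.getvalue()
```

A listing whose price reads "Call for price" would come through with an empty price cell rather than being dropped, which keeps the row count equal to the number of parseable listings.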
You need to build and maintain an automated system that collects structured listing data from a set of public websites, cleans and standardizes it, and refreshes it on a recurring schedule. The goal is a reliable, growing dataset — not a one-time export.
What you'll do:
- Build automated scrapers that pull structured data (categories, brand/model, year, condition, price, specs, images) from multiple public websites.
- Reliably and responsibly handle sites that use bot protection (rate limiting, retries, proxies as needed).
- Normalize and de-duplicate data across different sources into one consistent format — e.g. recognizing that the same item listed two different ways is the same item.
- Set up the pipeline to **run automatically on a weekly schedule**, appending to a historical dataset so it grows over time rather than overwriting.
- Store everything in a clean database and produce summary statistics from it.
- Deliver documented, runnable source code, not just spreadsheets of output, so the system can be maintained and extended.
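The rate limiting and retries mentioned in the list above could be sketched roughly as follows. This is one possible shape, not a prescribed design: `PoliteFetcher` is a hypothetical class, the `fetch` callable is supplied by the caller (for example a thin wrapper around `urllib.request.urlopen` or a browser-automation page load), and the default interval and retry counts are placeholder values. Proxies are not covered.

```python
import random
import time


class PoliteFetcher:
    """Wrap a fetch callable with a minimum request spacing and retries.

    Hypothetical helper: `fetch` is any function taking a URL and returning
    a result, raising one of `retry_on` when the request fails.
    """

    def __init__(self, fetch, min_interval=2.0, max_retries=4, base_delay=1.0,
                 retry_on=(OSError,), sleep=time.sleep, clock=time.monotonic):
        self.fetch = fetch
        self.min_interval = min_interval
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.retry_on = retry_on
        self.sleep = sleep
        self.clock = clock
        self._last_request = None

    def _wait_for_slot(self):
        # Rate limiting: keep at least min_interval seconds between requests.
        if self._last_request is not None:
            wait = self._last_request + self.min_interval - self.clock()
            if wait > 0:
                self.sleep(wait)
        self._last_request = self.clock()

    def get(self, url):
        for attempt in range(self.max_retries + 1):
            self._wait_for_slot()
            try:
                return self.fetch(url)
            except self.retry_on:
                if attempt == self.max_retries:
                    raise  # out of retries: surface the last error
                # Exponential backoff with jitter: 1x, 2x, 4x... base_delay.
                backoff = self.base_delay * (2 ** attempt)
                self.sleep(backoff + random.uniform(0, self.base_delay))
```

Injecting `sleep` and `clock` keeps the class testable without real waiting. Catching only `OSError` by default is a deliberate narrowing (network errors from the standard library derive from it), so programming errors are not silently retried.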
Must-have:
- Strong Python and hands-on scraping experience (Playwright, Selenium, DrissionPage, BeautifulSoup, or similar).
- Proven experience getting past anti-bot measures reliably at scale.
- Data cleaning, normalization, and cross-source record matching.
- Databases (PostgreSQL/MySQL) and scheduled pipelines (cron, Airflow, or similar).
- A track record of shipping maintainable, well-documented code.
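To make the normalization, record matching, and "append, don't overwrite" requirements concrete, here is a small sketch under stated assumptions. It uses the standard-library `sqlite3` module only so the example is self-contained (the posting asks for PostgreSQL/MySQL); the `snapshots` table, its columns, and the helpers `normalize`, `item_key`, and `append_snapshot` are all hypothetical. The matching shown is exact matching on cleaned-up fields, which would not catch aliases such as "CAT" versus "Caterpillar" without an extra lookup table or fuzzy comparison.

```python
import hashlib
import re
import sqlite3
from datetime import date

SCHEMA = """
CREATE TABLE IF NOT EXISTS snapshots (
    item_key   TEXT NOT NULL,
    source     TEXT NOT NULL,
    scraped_on TEXT NOT NULL,
    price      INTEGER,
    PRIMARY KEY (item_key, source, scraped_on)
)
"""


def normalize(text):
    """Lowercase and collapse punctuation/whitespace runs to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def item_key(year, make, model):
    """Stable key so '320-GC' and '320 GC' from two sites map to one item."""
    compact_model = normalize(model).replace(" ", "")
    raw = f"{year}|{normalize(make)}|{compact_model}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def append_snapshot(conn, source, listing, scraped_on):
    """Add one observation; re-running the same day is a no-op, not a dupe."""
    key = item_key(listing["year"], listing["make"], listing["model"])
    conn.execute(
        "INSERT OR IGNORE INTO snapshots VALUES (?, ?, ?, ?)",
        (key, source, scraped_on.isoformat(), listing.get("price")),
    )
    conn.commit()
    return key


# Usage: two sources list the same machine slightly differently.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
key_a = append_snapshot(
    conn, "site_a",
    {"year": 2019, "make": "CAT", "model": "320-GC", "price": 145000},
    date(2024, 1, 1),
)
key_b = append_snapshot(
    conn, "site_b",
    {"year": 2019, "make": "Cat", "model": "320 GC", "price": 143500},
    date(2024, 1, 1),
)
```

Because the primary key includes the scrape date, each weekly run adds new rows for every item it sees, so price history accumulates over time. A scheduler such as cron or Airflow would invoke the script that calls `append_snapshot`.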
Nice-to-have:
- Statistical analysis / building estimate or scoring models from aggregated data.
- Building a simple API to query the dataset.
- Experience enriching data with AI/LLMs.
To apply, please answer briefly:
1. How do you keep a scraper running reliably against a site that actively blocks bots?
2. If the same product is listed on two sites with slightly different names and specs, how do you detect they're the same item?
3. Share an example of a pipeline you built that runs on a schedule and grows a dataset over time.