Expert Scraper — Bulk Image Download from Database
Budget: $1000.0
Fixed price
⭐ 5.00 (4)
United States
data-scraping, python, data-mining
Summary
I need an experienced scraper to handle bulk retrieval of scanned document images, organize them into a structured directory, extract a small number of fields from each image, and produce a manifest linking every file to its source identifier and metadata. This is a high-volume, long-running task that requires care, attention to file integrity, and disciplined monitoring of a multi-week pipeline.
Scope of work
1. Scope is limited to four jurisdictions — California, New York City, Ohio, and Michigan — within a single collection.
2. Download the FULL-RESOLUTION images, not thumbnails. Throughput is expected to be ~5–6 sec/image, so plan for a continuous multi-week run (~2–3 weeks).
3. Maintain a persistent task database with resume support: an interruption or block must not require re-downloading completed files.
4. Store images in a directory hierarchy mirroring the source collection structure, sharded to avoid filesystem performance issues at scale.
5. For each image, record the requested variables in a manifest (CSV or Parquet).
6. Verify file integrity (non-zero size, valid format) and re-download failures.
7. Deliver images to a researcher-provided private AWS S3 bucket. Provide weekly progress reports: images downloaded, error rate, estimated completion date, and any issues encountered.
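To make items 3, 4, and 6 concrete, here is a minimal sketch of what a resumable task table, a sharded file path, and a basic integrity check could look like. Everything in it is an assumption for illustration only: the SQLite schema, the function names, the hash-prefix sharding scheme, and the idea that source identifiers are safe to use as file names. The actual layout should follow the source collection structure once that is known.

```python
import hashlib
import sqlite3
from pathlib import Path

# Hypothetical schema: one row per image, keyed by the source identifier.
# status is one of: pending, done, failed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    source_id TEXT PRIMARY KEY,
    status    TEXT NOT NULL DEFAULT 'pending',
    path      TEXT,
    attempts  INTEGER NOT NULL DEFAULT 0
)
"""

# Leading bytes of common scan formats: JPEG, PNG, TIFF (little and big endian).
MAGIC = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"II*\x00", b"MM\x00*")


def open_db(db_path):
    """Open (or create) the task database."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def add_tasks(conn, source_ids):
    """Seed tasks. INSERT OR IGNORE keeps existing rows, so re-seeding
    after a restart never resets work that is already recorded."""
    conn.executemany(
        "INSERT OR IGNORE INTO tasks (source_id) VALUES (?)",
        [(s,) for s in source_ids],
    )
    conn.commit()


def pending(conn):
    """Return every source_id not yet marked done (pending or failed)."""
    rows = conn.execute(
        "SELECT source_id FROM tasks WHERE status != 'done' ORDER BY source_id"
    )
    return [r[0] for r in rows]


def shard_path(root, jurisdiction, source_id, ext="jpg"):
    """Two levels of hash-prefix directories keep any one folder small."""
    digest = hashlib.sha1(source_id.encode("utf-8")).hexdigest()
    return Path(root) / jurisdiction / digest[:2] / digest[2:4] / f"{source_id}.{ext}"


def looks_valid(data):
    """Non-zero size and a recognised image signature."""
    return len(data) > 0 and data.startswith(MAGIC)


def record(conn, source_id, path, data):
    """Write the file only if it passes the check, then update the task row."""
    ok = looks_valid(data)
    if ok:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    conn.execute(
        "UPDATE tasks SET status = ?, path = ?, attempts = attempts + 1 "
        "WHERE source_id = ?",
        ("done" if ok else "failed", str(path) if ok else None, source_id),
    )
    conn.commit()
    return ok
```

On restart, the pipeline would call pending() and continue from there. Failed rows stay in the queue for re-download, and the attempts counter gives the weekly report an error rate. A signature check like this only catches empty or obviously wrong files; fully decoding each image would be a stricter test.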
Deliverables (two milestones, 50% each, subject to review)
- Milestone 1 — New York City + Michigan: manifest + full-res images to S3.
- Milestone 2 — Ohio + California: manifest + full-res images to S3.
Required skills
1. Strong Python, including authenticated session handling (requests / Playwright or equivalent).
2. Experience building resumable, rate-limited bulk-download pipelines that handle network interruptions, server errors, and auth refresh without losing progress.
3. File-system organization at scale (millions of files; directory sharding).
4. In-image text / OCR extraction.
5. Logging and progress-monitoring discipline.
6. Familiarity with manifest formats (CSV/Parquet) and metadata management.
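As a rough illustration of skill 2, the sketch below shows one way to retry transient failures with capped exponential backoff. The names RetryableError, backoff_delays, and fetch_with_retry are hypothetical, and the fetch function is supplied by the caller, so nothing here assumes a particular HTTP library or site. The delay values are placeholders, not a recommendation for this collection.

```python
import random
import time


class RetryableError(Exception):
    """Raised by the caller's fetch function for transient failures
    (timeouts, 5xx responses, expired sessions)."""


def backoff_delays(base=2.0, cap=300.0, attempts=6):
    """Exponential schedule in seconds: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]


def fetch_with_retry(fetch, source_id, delays=None,
                     sleep=time.sleep, jitter=random.random):
    """Call fetch(source_id) once, then retry after each delay in turn.

    Only RetryableError triggers a retry; any other exception propagates
    immediately. If every attempt fails, the last error is re-raised.
    """
    delays = backoff_delays() if delays is None else delays
    last_error = None
    for delay in [0.0] + list(delays):
        if delay:
            # Up to one second of random jitter avoids a fixed request rhythm.
            sleep(delay + jitter())
        try:
            return fetch(source_id)
        except RetryableError as exc:
            last_error = exc
    raise last_error
```

Passing sleep and jitter as parameters keeps the function easy to test without waiting in real time. In a long run, a final failure here would simply leave the task unfinished in the task database so a later pass picks it up again.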