
Expert Scraper: Bulk Image Download from Database

Budget: $1000.0 FIXED / ⭐ 5.00 (4) United States

data-scraping, python, data-mining

Summary

I need an experienced scraper to handle bulk retrieval of scanned document images, organize them into a structured directory, extract a small number of fields from each image, and produce a manifest linking every file to its source identifier and metadata. This is a high-volume, long-running task that requires care, attention to file integrity, and disciplined monitoring of a multi-week pipeline.

Scope of work

1. Scope is limited to four jurisdictions (California, New York City, Ohio, and Michigan) within a single collection.
2. Download the full-resolution images, not thumbnails. Expect roughly 5–6 seconds per image, so plan for a continuous multi-week run (~2–3 weeks).
3. Maintain a persistent task database with resume support: an interruption or block must not require re-downloading completed files.
4. Store images in a directory hierarchy mirroring the source collection structure, sharded to avoid filesystem performance issues at scale.
5. For each image, record a row in a manifest (CSV or parquet) containing the requested variables.
6. Verify file integrity (non-zero size, valid format) and re-download failures.
7. Deliver images to a researcher-provided private AWS S3 bucket.

Provide weekly progress reports: images downloaded, error rate, estimated completion date, and any issues encountered.

Deliverables (two milestones, 50% each, subject to review)

- Milestone 1 (New York City + Michigan): manifest + full-res images to S3.
- Milestone 2 (Ohio + California): manifest + full-res images to S3.

Required skills

1. Strong Python, including authenticated session handling (requests / Playwright or equivalent).
2. Recoverable, resumable, rate-limited bulk-download pipelines that handle network interruptions, server errors, and auth refresh without losing progress.
3. File-system organization at scale (millions of files; directory sharding).
4. In-image text / OCR extraction.
5. Logging and progress-monitoring discipline.
6. Familiarity with manifest formats (CSV/parquet) and metadata management.
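For applicants, here is a rough sketch of how scope items 3, 4, and 6 (resumable task database, sharded storage, integrity check) might fit together. It is only an illustration under assumptions, not a required design: the use of SQLite, the `tasks` table, the hash-prefix sharding scheme, the list of accepted image formats, and every function name below are hypothetical and not taken from the source collection.

```python
import hashlib
import sqlite3
from pathlib import Path

# Magic-byte prefixes for JPEG, PNG, and TIFF (assumed formats; the posting
# does not say which format the source serves).
MAGIC = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"II*\x00", b"MM\x00*")


def shard_path(root: Path, jurisdiction: str, image_id: str, ext: str = ".jpg") -> Path:
    """Return root/jurisdiction/ab/cd/<image_id><ext>, sharded by hash prefix."""
    digest = hashlib.sha1(image_id.encode("utf-8")).hexdigest()
    return root / jurisdiction / digest[:2] / digest[2:4] / f"{image_id}{ext}"


def open_db(db_path: str) -> sqlite3.Connection:
    """Open (or create) the persistent task database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tasks (
               image_id     TEXT PRIMARY KEY,
               jurisdiction TEXT NOT NULL,
               status       TEXT NOT NULL DEFAULT 'pending',
               attempts     INTEGER NOT NULL DEFAULT 0
           )"""
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, jurisdiction: str, image_ids: list[str]) -> None:
    """Register image IDs; IDs already present keep their existing status."""
    conn.executemany(
        "INSERT OR IGNORE INTO tasks (image_id, jurisdiction) VALUES (?, ?)",
        [(image_id, jurisdiction) for image_id in image_ids],
    )
    conn.commit()


def pending(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    """Everything not yet 'done', so a restart resumes where it stopped."""
    return conn.execute(
        "SELECT image_id, jurisdiction FROM tasks "
        "WHERE status != 'done' ORDER BY image_id"
    ).fetchall()


def is_valid_image(path: Path) -> bool:
    """Non-zero size and a recognized magic-byte header."""
    if not path.is_file() or path.stat().st_size == 0:
        return False
    with path.open("rb") as fh:
        head = fh.read(8)
    return head.startswith(MAGIC)


def record_result(conn: sqlite3.Connection, image_id: str, path: Path) -> str:
    """Mark a task 'done' only if the file passes the integrity check."""
    status = "done" if is_valid_image(path) else "failed"
    conn.execute(
        "UPDATE tasks SET status = ?, attempts = attempts + 1 WHERE image_id = ?",
        (status, image_id),
    )
    conn.commit()
    return status
```

In this sketch a download loop would iterate over `pending(conn)`, write each file to `shard_path(...)`, and call `record_result(...)`. Failed files stay in the pending set and are retried on the next pass, while completed files are never fetched again. A header check only catches empty or obviously wrong files; fully decoding each image would be a stricter test.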