Data pipeline engineer to scale ingestion of publicly available job data (APIs + open web)

Budget: $15.0 - $50.0 HOURLY / FULL_TIME ⭐ 4.94 (44) United States

crawlers, scrapy-framework, python, distributed-computing, api-development

We're looking for an experienced data pipeline engineer to architect and scale the ingestion system behind our job data platform. The individual collectors mostly exist; what we need is the layer above them: orchestration, reliability, cost control, and self-healing so the system runs at scale without constant babysitting.

Hirebase is a labor market data company. We collect publicly available job postings from official job-board APIs, partner feeds, sitemaps, and open company career pages, and turn them into clean, structured data. That data powers APIs used by recruiting, sales, and financial customers, so coverage, freshness, and data quality directly drive revenue. Your work will determine whether we can scale to hundreds of thousands of sources while keeping infrastructure costs sane.

**How we work (and expect you to work):**

We collect only publicly available data. We prioritize official APIs and public endpoints wherever they exist, respect robots.txt and published rate limits, and we do not access login-walled, paywalled, or otherwise access-controlled content.

This is an ongoing hourly engagement, starting around 20–30 hours/week with room to grow. You'd work directly with the founders and alongside an existing engineer: you set the architecture, and you're not alone on implementation.

**What you'd own:**

- Orchestration and scheduling across 200,000+ sources: queue/worker design, partitioning strategy, stateless and disposable workers
- Reliability and respectful collection: sensible per-source rate limiting, backoff, and prioritizing official APIs and public endpoints
- Failure detection and self-healing: error classification (rate-limited vs. source-changed vs. parser bug vs. network), coverage monitoring (expected vs. observed volume), alerts that only fire when they matter
- Cost control across compute, retries, and storage (MongoDB)

**Must-have experience:**

- You've personally operated a large-scale data collection system across thousands of heterogeneous public sources: not just built one, but kept it running and dealt with things silently breaking
- Strong, hands-on experience with high-volume HTTP data collection and structured parsing at scale (Python, Scrapy or similar)
- Distributed queue/worker systems with per-source rate limiting and idempotent jobs

Backgrounds we tend to like: data-infrastructure companies, alt-data or market-intelligence providers, SEO/analytics tooling, or a serious solo operation. We care about demonstrated skill and judgment, not pedigree.

One honest note on fit: if your experience is mainly batch ETL or warehouse pipelines on already-clean data (Airflow/dbt/Snowflake), this is probably a different job than the one you're great at; high-volume, heterogeneous public-data ingestion is its own discipline.

Shortlisted candidates will get a small paid trial project on real public sources (paid at your proposed rate) before we commit to the longer engagement. We'll evaluate coverage, cost efficiency, and the quality of your architectural reasoning, not just whether the code runs.

If this sounds like your kind of problem, please answer the screening questions below with specifics (they matter more to us than your profile stats) and include links to anything you can share publicly. Questions about the project are welcome in your proposal.
