Data Engineer — Public-Data API Ingestion + Postgres/JSONB Storage

Budget: $10.0 - $35.0 HOURLY / FULL_TIME ⭐ 4.98 (9) United States

python, big-data, postgresql, restful-api

We're building the data layer for a content and analytics platform and need an experienced data engineer for ongoing work. Your job is to pull data from external sources via API, store it sensibly in PostgreSQL, and make it available to our headless CMS (Payload) so it can be used downstream in content generation and analytical modeling. This is a steady, build-and-extend relationship, not a single project.

What you'll do

* Build and maintain ingestion pipelines that pull from external APIs and bulk data sources, handling pagination, rate limits, auth, retries, and incremental/idempotent loads.
* Design the storage model in PostgreSQL: decide what gets stored as JSONB (bulk payloads we don't need to query field-by-field) versus what gets promoted to typed columns and a defined schema (fields we filter, join, or call on directly).
* Expose the right fields to our headless CMS (Payload) so they can be referenced in structured content fields and downstream processes.
* Start with our three core data sources, then extend the same patterns to new domains (e.g., health, education, and others) as we grow.
* Document the pipelines and schema so the work is maintainable by the rest of the team.

Our core data sources (familiarity is a strong plus, not required)

* U.S. Census American Community Survey (ACS), including PUMS microdata
* U.S. state voter files
* U.S. survey/crosswalk datasets (e.g., CCES)

If you haven't touched these specific sources but have ingested and modeled other large public datasets, please still apply; the patterns transfer.

You should be strong in

* Python for ingestion and transformation (ETL/ELT patterns, idempotency, incremental loads).
* PostgreSQL in depth: JSONB, GIN indexing, when to normalize vs. store as JSON, query performance.
* Consuming REST APIs at scale: pagination, rate limiting, auth, error handling.
* Working with large, messy public datasets and turning them into something queryable.
* Clean, documented, maintainable code and a Git-based workflow.

Nice to have

* Comfort with TypeScript/Node and headless CMS work (Payload specifically is a bonus).
* Cloud deployment experience (we run on GCP / Cloud Run).
* Experience with census microdata, voter files, or survey research data.

Engagement

Ongoing. Expect a steady stream of well-scoped tasks. Please share your weekly availability.

To apply

Tell us briefly:

(1) a pipeline you built that ingested a large external dataset via API into PostgreSQL: what the source was and how you handled scale and problems that surfaced, and
(2) how you decide whether a given piece of incoming data should live in a JSONB column or be modeled as typed columns, and how you solved related problems.

A link or code sample is a plus.
