
Python/PostgreSQL Engineer for China Patent Data Pipeline Productionization, QA, and Scale-Up

Budget: $800.0 FIXED / ⭐ 5.00 (2) IND

api-integration, python, etl-pipelines, postgresql

We need a Python/PostgreSQL data engineer to implement the China-only patent data integration for an existing patent analytics platform. This is a fixed-price China milestone for IPPH patent data.

NOTE: Japan and Korea loads are excluded based on conversations with Krish and may be handled later as separate follow-on work.

The existing platform already has Python ingestion patterns, PostgreSQL, bronze/silver processing, MinIO/file-ingestion infrastructure, translation infrastructure, assignee standardization, and a dashboard. The goal is to extend the existing system, not rebuild it from scratch.

Total Budget: $800 fixed price

Milestone 1: IPPH File Ingestion, XML Parsing, and Initial Database Load
Budget: $300

Scope:
- Inspect the IPPH sample / initial delivery package structure.
- Use the existing MinIO/file-ingestion pattern.
- Handle package manifests and nested ZIP/XML packages.
- Parse key China patent fields where available:
  - publication identifiers
  - application identifiers
  - claims
  - claim numbers
  - independent/dependent claim indicators
  - claim counts
  - description sections
  - bibliographic metadata
  - legal status metadata
  - current owner / assignee metadata
  - applicant/inventor metadata
  - drawings metadata
  - rich citation fields
- Load raw and parsed data into PostgreSQL following the existing bronze/silver architecture.
- Preserve source traceability: source file, package date, document path, document ID, load timestamp, and processing status.

Acceptance Criteria:
- Provided IPPH sample files can be processed end to end.
- Parsed records are loaded into PostgreSQL or clearly structured for PostgreSQL loading.
- Key source fields are mapped and documented.
- Failed/partial records are logged with useful error messages.

Milestone 2: Delta Handling, Translation, and Assignee Standardization (Full Load)
Budget: $300

Scope:
- Implement CREATE/UPDATE/DELETE handling for the confirmed IPPH package format.
- Track processed packages/documents to avoid duplicate loads on rerun.
- Add retry-safe/idempotent behavior where practical.
- Integrate Chinese-to-English translation using the existing approved model endpoint / infrastructure.
- Store original Chinese text, English translation, translation status, model/prompt/version metadata, and errors.
- Integrate Chinese applicant/current-owner/assignee names into the existing assignee standardization pipeline.
- Preserve raw Chinese names and translated/normalized names.
- Add confidence/status fields or review flags where useful.

Acceptance Criteria:
- Rerunning the job does not duplicate already processed records.
- CREATE/UPDATE/DELETE records are handled according to the confirmed IPPH package semantics.
- Chinese text is routed through the agreed translation endpoint and stored with status metadata.
- Chinese assignee data flows through the existing standardization process.

Milestone 3: Dashboard Integration, QA, Tests, and Handover
Budget: $200

Scope:
- Make China data visible in the existing dashboard.
- Reuse existing dashboard patterns; no dashboard rebuild.
- Ensure China records can be filtered/viewed in relevant existing views.
- Surface key parsed fields and standardized assignee information where supported by the current dashboard.
- Add focused tests using sample files.
- Provide validation counts:
  - files processed
  - documents parsed
  - records loaded
  - translations attempted/succeeded/failed
  - assignee records processed
- Provide runnable setup instructions and short handover documentation.

Acceptance Criteria:
- All ingested China records after assignee standardization and translation are visible in the existing dashboard.
- Basic tests pass against sample data.
- A run summary/log is available for validation.
- Documentation is sufficient for another developer to run, monitor, and validate the pipeline.
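To illustrate the kind of approach we have in mind for nested ZIP/XML handling, source traceability, and rerun-safe loading, here is a minimal sketch. It is not the platform's existing code: the names `SourceDocument`, `iter_xml_documents`, and `select_unprocessed` are hypothetical, the `!` path separator is an arbitrary choice, and the in-memory `seen_hashes` set stands in for whatever tracking table the existing PostgreSQL schema would use. The actual IPPH package layout still needs to be confirmed from the sample files.

```python
import hashlib
import io
import zipfile
from typing import Iterable, Iterator, List, NamedTuple, Set


class SourceDocument(NamedTuple):
    """Hypothetical record carrying traceability fields for one XML file."""
    document_path: str  # path through nested archives, joined with "!"
    sha256: str         # content hash, usable as a rerun/idempotency key
    payload: bytes      # raw XML bytes, e.g. for the bronze layer


def iter_xml_documents(data: bytes, prefix: str = "") -> Iterator[SourceDocument]:
    """Yield every XML member of a ZIP archive, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = f"{prefix}{info.filename}"
            content = archive.read(info)
            lower = info.filename.lower()
            if lower.endswith(".zip"):
                # Recurse into the nested package, extending the path prefix.
                yield from iter_xml_documents(content, prefix=f"{path}!")
            elif lower.endswith(".xml"):
                digest = hashlib.sha256(content).hexdigest()
                yield SourceDocument(path, digest, content)


def select_unprocessed(
    docs: Iterable[SourceDocument], seen_hashes: Set[str]
) -> List[SourceDocument]:
    """Return documents whose hash is not yet recorded, and record them.

    Assumption: seen_hashes stands in for a tracking table in PostgreSQL.
    """
    fresh = []
    for doc in docs:
        if doc.sha256 not in seen_hashes:
            seen_hashes.add(doc.sha256)
            fresh.append(doc)
    return fresh
```

Keying on a content hash means an identical file delivered twice is skipped, while a changed file is treated as new. How that should interact with CREATE/UPDATE/DELETE records depends on the confirmed IPPH package semantics.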