Python/PostgreSQL Engineer for China Patent Data Pipeline Productionization, QA, and Scale-Up
Budget: $800 (fixed price)
⭐ 5.00 (2)
IND
api-integration, python, etl-pipelines, postgresql
We need a Python/PostgreSQL data engineer to implement the China-only patent data integration for an existing patent analytics platform.
This is a fixed-price China milestone for IPPH patent data.
Note: Japan and Korea loads are excluded, based on conversations with Krish, and may be handled later as separate follow-on work.
The existing platform already has Python ingestion patterns, PostgreSQL, bronze/silver processing, MinIO/file-ingestion infrastructure, translation infrastructure, assignee standardization, and a dashboard. The goal is to extend the existing system, not rebuild it from scratch.
Total Budget:
$800 fixed price
Milestone 1: IPPH File Ingestion, XML Parsing, and Initial Database Load
Budget: $300
Scope:
- Inspect the IPPH sample / initial delivery package structure.
- Use the existing MinIO/file-ingestion pattern.
- Handle package manifests and nested ZIP/XML packages.
- Parse key China patent fields where available:
  - publication identifiers
  - application identifiers
  - claims
  - claim numbers
  - independent/dependent claim indicators
  - claim counts
  - description sections
  - bibliographic metadata
  - legal status metadata
  - current owner / assignee metadata
  - applicant/inventor metadata
  - drawings metadata
  - rich citation fields
- Load raw and parsed data into PostgreSQL following the existing bronze/silver architecture.
- Preserve source traceability: source file, package date, document path, document ID, load timestamp, and processing status.
Acceptance Criteria:
- Provided IPPH sample files can be processed end to end.
- Parsed records are loaded into PostgreSQL or clearly structured for PostgreSQL loading.
- Key source fields are mapped and documented.
- Failed/partial records are logged with useful error messages.
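For illustration only, the nested ZIP/XML handling and source traceability described in this milestone could be sketched roughly as follows. This is a minimal sketch, not the platform's existing ingestion code: the names `SourceDocument` and `iter_xml_documents`, and the `!` separator used to join nested archive paths, are hypothetical, and the actual IPPH package layout still has to be confirmed from the sample delivery.

```python
import io
import zipfile
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterator


@dataclass
class SourceDocument:
    """One XML file plus the traceability fields needed for the bronze layer."""
    source_file: str          # name of the outermost package file
    document_path: str        # path inside the package; nested ZIPs joined with "!"
    xml_bytes: bytes          # raw, unparsed XML content
    load_timestamp: datetime  # when this document was read (UTC)


def iter_xml_documents(
    package: bytes, source_file: str, prefix: str = ""
) -> Iterator[SourceDocument]:
    """Yield every XML member of a ZIP package, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = f"{prefix}{info.filename}"
            name = info.filename.lower()
            if name.endswith(".zip"):
                # Nested package: recurse, keeping the full path for traceability.
                yield from iter_xml_documents(
                    archive.read(info), source_file, prefix=f"{path}!"
                )
            elif name.endswith(".xml"):
                yield SourceDocument(
                    source_file=source_file,
                    document_path=path,
                    xml_bytes=archive.read(info),
                    load_timestamp=datetime.now(timezone.utc),
                )
            # Other members (images, checksums, etc.) are skipped in this sketch.
```

Each yielded `SourceDocument` carries enough context to record where a parsed row came from, so a failed or partial record can be logged with its exact package and path.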
Milestone 2: Delta Handling, Translation, and Assignee Standardization (Full Load)
Budget: $300
Scope:
- Implement CREATE/UPDATE/DELETE handling for the confirmed IPPH package format.
- Track processed packages/documents to avoid duplicate loads on rerun.
- Add retry-safe/idempotent behavior where practical.
- Integrate Chinese-to-English translation using the existing approved model endpoint / infrastructure.
- Store original Chinese text, English translation, translation status, model/prompt/version metadata, and errors.
- Integrate Chinese applicant/current-owner/assignee names into the existing assignee standardization pipeline.
- Preserve raw Chinese names and translated/normalized names.
- Add confidence/status fields or review flags where useful.
Acceptance Criteria:
- Rerunning the job does not duplicate already processed records.
- CREATE/UPDATE/DELETE records are handled according to the confirmed IPPH package semantics.
- Chinese text is routed through the agreed translation endpoint and stored with status metadata.
- Chinese assignee data flows through the existing standardization process.
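As a rough illustration of the rerun-safe delta handling we have in mind, the sketch below applies one package of CREATE/UPDATE/DELETE records inside a single transaction and skips packages that were already processed. Several things here are assumptions: the table and column names (`silver_patent`, `processed_package`) are hypothetical, DELETE is treated as a soft delete pending confirmation of the IPPH semantics, and `sqlite3` is used only as a runnable stand-in for PostgreSQL (the `ON CONFLICT ... DO UPDATE` upsert has the same shape in both).

```python
import sqlite3

# Hypothetical schema; the real tables should follow the existing silver layer.
SCHEMA = """
CREATE TABLE IF NOT EXISTS silver_patent (
    document_id  TEXT PRIMARY KEY,
    title        TEXT,
    is_deleted   INTEGER NOT NULL DEFAULT 0,
    last_package TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS processed_package (
    package_name TEXT PRIMARY KEY
);
"""

UPSERT = """
INSERT INTO silver_patent (document_id, title, is_deleted, last_package)
VALUES (?, ?, 0, ?)
ON CONFLICT(document_id) DO UPDATE SET
    title = excluded.title,
    is_deleted = 0,
    last_package = excluded.last_package
"""


def apply_package(conn, package_name, records):
    """Apply one delta package; return False if it was already processed.

    records is an iterable of (action, document_id, title) tuples.
    """
    with conn:  # commits on success, rolls back if any record fails
        seen = conn.execute(
            "SELECT 1 FROM processed_package WHERE package_name = ?",
            (package_name,),
        ).fetchone()
        if seen:
            return False
        for action, document_id, title in records:
            if action in ("CREATE", "UPDATE"):
                conn.execute(UPSERT, (document_id, title, package_name))
            elif action == "DELETE":
                # Assumption: keep the row and flag it instead of removing it.
                conn.execute(
                    "UPDATE silver_patent SET is_deleted = 1, last_package = ? "
                    "WHERE document_id = ?",
                    (package_name, document_id),
                )
            else:
                raise ValueError(f"unknown action: {action!r}")
        # Mark the package as done in the same transaction as its records.
        conn.execute(
            "INSERT INTO processed_package (package_name) VALUES (?)",
            (package_name,),
        )
    return True
```

Because the package marker and the record changes commit together, a job that crashes mid-package leaves nothing behind and can simply be rerun. With concurrent workers the select-then-insert check would need a lock or a conflict-tolerant insert instead.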
Milestone 3: Dashboard Integration, QA, Tests, and Handover
Budget: $200
Scope:
- Make China data visible in the existing dashboard.
- Reuse existing dashboard patterns; no dashboard rebuild.
- Ensure China records can be filtered/viewed in relevant existing views.
- Surface key parsed fields and standardized assignee information where supported by the current dashboard.
- Add focused tests using sample files.
- Provide validation counts:
  - files processed
  - documents parsed
  - records loaded
  - translations attempted/succeeded/failed
  - assignee records processed
- Provide runnable setup instructions and short handover documentation.
Acceptance Criteria:
- After translation and assignee standardization, all ingested China records are visible in the existing dashboard.
- Basic tests pass against sample data.
- A run summary/log is available for validation.
- Documentation is sufficient for another developer to run, monitor, and validate the pipeline.
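The validation counts and run summary requested in this milestone could take a shape like the one below. This is only a sketch under assumptions: the `RunSummary` class and its field names are hypothetical and simply mirror the counts listed in the scope, and the JSON output format is one possible choice for the run log.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RunSummary:
    """Counters collected during one pipeline run."""
    files_processed: int = 0
    documents_parsed: int = 0
    records_loaded: int = 0
    translations_attempted: int = 0
    translations_succeeded: int = 0
    translations_failed: int = 0
    assignee_records_processed: int = 0

    def record_translation(self, ok: bool) -> None:
        """Count one translation attempt and its outcome."""
        self.translations_attempted += 1
        if ok:
            self.translations_succeeded += 1
        else:
            self.translations_failed += 1

    def problems(self) -> List[str]:
        """Return human-readable inconsistencies; an empty list means none found."""
        issues = []
        outcomes = self.translations_succeeded + self.translations_failed
        if self.translations_attempted != outcomes:
            issues.append(
                f"translations attempted ({self.translations_attempted}) "
                f"!= succeeded + failed ({outcomes})"
            )
        return issues

    def to_json(self) -> str:
        """Serialize the counters as one JSON line for the run log."""
        return json.dumps(asdict(self), sort_keys=True)
```

A run would increment these counters as it goes and write `to_json()` to the log at the end, so the numbers can be compared against the sample package contents during validation.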