
Python Developer Needed for Bulk Raw TXT Retrieval and Merging

Budget: $5.0 FIXED / ⭐ 4.99 (59) USA

data-extraction, scripting, api-integration, crawlers, data-scraping, automation, data-processing, file-management, etl

If this task is completed successfully, I will pay and leave a very positive review, and there is potential for future work with a higher budget.

I need a Python developer to run a bulk raw text retrieval workflow against a public archive/API structure. The goal is to collect many raw .txt documents, download their full text contents, and merge them into one or more large combined “mega TXT” files.

The source data comes from the SEC’s public company filing archive. Each company has a unique company ID called a CIK, which is used to access that company’s filing records. The starting ID list comes from the SEC’s ticker.txt, which currently contains about 12,000 ticker-to-CIK entries. The workflow should be able to process this list, starting with a 3-CIK test run before scaling to the full file.

For the initial test, please use these 3 padded company IDs:

0000320187
0000320193
0000789019

For each company ID, request its public submissions JSON, for example:

https://data.sec.gov/submissions/CIK0000320187.json

From that JSON, extract the filing/accession records and generate the related raw TXT archive URLs, such as:

https://www.sec.gov/Archives/edgar/data/320187/000032018726000037/0000320187-26-000037.txt

The job is to:

1. Use the CIK list I will provide
2. Request each CIK’s submissions JSON
3. Generate the matching raw TXT archive URLs
4. Retrieve/download the full raw TXT contents
5. Merge the TXT contents into one or more large “mega TXT” files
6. Keep a separate log for completed, skipped, failed, and duplicate records
7. Handle large files efficiently without loading everything into memory

You would be responsible for running the workflow and delivering the final combined TXT output files, along with the script used. The output does not need to be organized by filing type unless that is easy to add. The main goal is bulk retrieval and merging of raw TXT data into large combined files.
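As a rough illustration of steps 2 and 3, the URL pattern in the examples above can be derived from a padded CIK and an accession number. This is only a sketch: the function names are hypothetical, and it assumes the submissions JSON lists accession numbers under `filings` → `recent` → `accessionNumber`, which should be confirmed against a real response. Older filings may be listed in additional files that the JSON references, which this sketch does not follow.

```python
from __future__ import annotations

# Base path taken from the example archive URL in the posting.
ARCHIVE_BASE = "https://www.sec.gov/Archives/edgar/data"


def submissions_url(padded_cik: str) -> str:
    """Build the submissions JSON URL for a 10-digit, zero-padded CIK."""
    return f"https://data.sec.gov/submissions/CIK{padded_cik}.json"


def raw_txt_urls(padded_cik: str, submissions: dict) -> list[str]:
    """Turn accession numbers from a parsed submissions JSON into raw TXT URLs.

    Assumption: accession numbers sit in submissions["filings"]["recent"]
    ["accessionNumber"] as strings such as "0000320187-26-000037".
    """
    cik = str(int(padded_cik))  # the archive path uses the CIK without leading zeros
    recent = submissions.get("filings", {}).get("recent", {})
    urls = []
    for accession in recent.get("accessionNumber", []):
        folder = accession.replace("-", "")  # folder name drops the dashes
        urls.append(f"{ARCHIVE_BASE}/{cik}/{folder}/{accession}.txt")
    return urls
```

Calling `raw_txt_urls("0000320187", parsed_json)` on a response containing the accession number `0000320187-26-000037` would reproduce the example archive URL given above.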
After this 3-company test is confirmed, the same workflow should be reusable for the larger CIK list (about 12,000 entries). All further details will be provided in direct messages.
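For the merge, logging, and memory requirements (steps 5 to 7), one possible approach is to copy each document into the output file in fixed-size chunks and write one log row per URL. The sketch below is illustrative only: `merge_streams`, `open_stream`, and `done` are hypothetical names, the separator line format is made up, and "skipped" is assumed to mean "already merged in an earlier run", which the posting does not define.

```python
import csv
import io
import shutil


def merge_streams(urls, open_stream, out, log_writer, done=frozenset(),
                  chunk_size=1024 * 1024):
    """Append each document to `out` chunk by chunk and log a status per URL.

    open_stream(url) must return a binary file-like object usable as a
    context manager. `done` holds URLs merged in a previous run (assumption:
    these are the "skipped" records).
    """
    seen = set()
    counts = {"completed": 0, "skipped": 0, "duplicate": 0, "failed": 0}
    for url in urls:
        if url in done:
            status = "skipped"
        elif url in seen:
            status = "duplicate"
        else:
            seen.add(url)
            try:
                with open_stream(url) as src:
                    # Separator so documents can be told apart in the mega file.
                    out.write(f"\n===== {url} =====\n".encode("utf-8"))
                    # Copies chunk_size bytes at a time, never the whole file.
                    shutil.copyfileobj(src, out, chunk_size)
                status = "completed"
            except OSError:
                # A failure mid-copy can leave a partial document in `out`;
                # the log row marks it so it can be retried or cleaned up.
                status = "failed"
        counts[status] += 1
        log_writer.writerow([url, status])
    return counts
```

In a real run, `open_stream` could wrap `urllib.request.urlopen` with a `Request` that carries a descriptive `User-Agent` header, since `urllib` network errors are subclasses of `OSError`. SEC's fair-access guidance asks for a declared User-Agent and limits request rates, so a delay between requests would likely be needed when scaling to the full list.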