
Parse Florida Corporate Registry Files

Budget: $100 fixed / ⭐ 0.00 (0) USA

python, pandas, data-extraction, etl

I need 10 fixed-width Florida corporate registry text files parsed into a clean spreadsheet, with one calculated column added for business age.

ATTACHMENTS, READ BOTH BEFORE APPLYING
Two files are attached to this job. "Corporate File Definitions" is Florida's official record layout PDF; it shows the exact character position and length of every field in the data files. "Data Access & Field Reference" is a PDF document I wrote covering how you'll receive the actual data files and the full extraction and output spec. Review both before writing your proposal.

THE FILES
The data is Florida's public corporate registry (Sunbiz), provided as 10 fixed-width ASCII text files, each approximately 1.7GB, no delimiters. Every field sits at a fixed character position on a 1,440-character record. These files are too large to attach here; the PDF document explains exactly how you'll get them on award, either via a shared Google Drive folder or downloading directly from Florida's public data server yourself. The layout PDF is the only source of truth for field positions. Do not guess.

WHAT TO EXTRACT
Pull only the fields listed in the attached PDF document, using the character positions in the layout PDF: Corporation Number, Corporation Name, Status, Filing Type, Address, City, State, Zip, File Date (formation date), and Officer 1 Name only. Ignore officer fields 2 through 6 entirely.

FILTER
Keep only rows where Status equals "A" (active). Drop everything else.

CALCULATE
Add one new column called business_age_years: today's date minus the File Date, expressed in whole years. If File Date is missing or invalid for a row, leave this blank rather than estimating.

OUTPUT
Deliver one file, CSV or .xlsx, with the columns listed in the attached PDF document. Trim trailing spaces from all text fields. Sort by business_age_years descending, oldest businesses first. Include a short note stating the row count before and after filtering.
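To make the extract, filter, calculate, and sort steps concrete, here is a minimal Python sketch using only the standard library. It is illustrative, not the required implementation. The field offsets in `LAYOUT` are placeholders invented for the example and must be replaced with the positions from the layout PDF; the `MMDDYYYY` date format, the treatment of future dates as invalid, and the function names are likewise assumptions.

```python
from datetime import date, datetime

# HYPOTHETICAL layout: name -> (0-based start, length).
# These numbers are placeholders, NOT the real Sunbiz positions.
# Replace them with the values from the official layout PDF.
LAYOUT = {
    "corp_number": (0, 12),
    "corp_name": (12, 30),
    "status": (42, 1),
    "file_date": (43, 8),
}


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict, trimming trailing spaces."""
    return {
        name: line[start:start + length].rstrip()
        for name, (start, length) in layout.items()
    }


def parse_file_date(text, fmt="%m%d%Y"):
    """Return a date, or None if the field is blank or invalid.

    The MMDDYYYY format is an assumption; confirm it against the layout PDF.
    """
    if len(text) != 8 or not text.isdigit():
        return None
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


def whole_years(start, today):
    """Completed years between two dates."""
    years = today.year - start.year
    if (today.month, today.day) < (start.month, start.day):
        years -= 1
    return years


def extract_active(lines, today):
    """Return (row count before filtering, active rows sorted oldest first)."""
    total = 0
    rows = []
    for line in lines:
        total += 1
        rec = parse_record(line.rstrip("\r\n"))
        if rec["status"] != "A":
            continue
        filed = parse_file_date(rec["file_date"])
        # Assumption: a File Date in the future counts as invalid.
        valid = filed is not None and filed <= today
        rec["business_age_years"] = whole_years(filed, today) if valid else None
        rows.append(rec)
    # Oldest first; rows with a blank age go last.
    rows.sort(
        key=lambda r: (
            r["business_age_years"] is None,
            -(r["business_age_years"] or 0),
        )
    )
    return total, rows
```

Because `extract_active` consumes any iterable of lines, a file object can be passed in directly so a 1.7GB file is streamed rather than loaded into memory. That assumes each record ends with a newline, which should be checked against the actual files. The returned `total` and `len(rows)` give the before and after row counts for the note.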

QUALITY CHECK
Before delivering, pick 10 rows at random from your output and verify them by hand on the free public lookup at search.sunbiz.org. Confirm the corporation name, formation date, and officer name match. Report how many of the 10 were correct. This is part of the deliverable, not optional.

PRICE AND TIMELINE
Fixed price $100. Deliverable: the formatted spreadsheet, your 10-row QC note, and the Python script you used. Target turnaround 3 business days from receipt of files.

TO APPLY
Answer both in your proposal, or it will not be read. First, have you parsed fixed-width or positional flat files in Python before, and what library do you use (pandas, struct, or other)? Second, roughly how long would parsing and filtering about 17GB across 10 files take on your setup?
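
For the quality check described above, one way to draw the 10 random rows reproducibly is sketched below. The function names and the wording of the note are hypothetical; the lookup on search.sunbiz.org is still done by hand, and the code only picks the rows and formats the tally.

```python
import random


def pick_qc_sample(rows, k=10, seed=None):
    """Pick up to k distinct rows at random; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def qc_note(results):
    """Format the QC tally from a list of True/False hand-check outcomes."""
    correct = sum(1 for ok in results if ok)
    return f"QC: {correct} of {len(results)} sampled rows matched search.sunbiz.org"
```

Recording the seed in the QC note lets the same 10 rows be drawn again from the delivered file.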