Python developer for parsing HTML regulatory documents
Budget: $1500.0
Fixed price
⭐ 4.99 (39)
Serbia
python
We turn structured documents from public websites into clean, structured JSON. For each source you'll receive a list of sample URLs and write one HTML parser for that source, mostly by pointing CSS selectors at the page's headings and body so the document comes out split into chapters, sections, and articles.
This is mostly CSS selectors, with some Python. If you're comfortable reading HTML in your browser's dev tools and writing selectors like div.content h2.section-title, you can do this work.
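To give a feel for the work, here is a minimal sketch of a selector-driven parser built on BeautifulSoup. The sample markup, the `parse_document` function name, and the output keys (`title`, `body`) are assumptions made for illustration; the real pipeline's schema and each source's HTML will differ.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical sample page; real sources will use different markup.
SAMPLE_HTML = """
<div class="content">
  <h2 class="section-title">Part 1 Preliminary</h2>
  <p>1. This Act may be cited as the Example Act.</p>
  <p>2. This Act comes into operation on an appointed day.</p>
  <h2 class="section-title">Part 2 General</h2>
  <p>3. A person shall comply with this Act.</p>
</div>
"""


def parse_document(html: str) -> list[dict]:
    """Split a page into sections using a CSS selector for the headings."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    # The selector is the one quoted in the posting; adjust it per source.
    for heading in soup.select("div.content h2.section-title"):
        body = []
        # Collect paragraphs that follow this heading, up to the next heading.
        for sibling in heading.find_next_siblings():
            if sibling.name == "h2":
                break
            if sibling.name == "p":
                body.append(sibling.get_text(" ", strip=True))
        sections.append(
            {"title": heading.get_text(" ", strip=True), "body": body}
        )
    return sections


if __name__ == "__main__":
    print(json.dumps(parse_document(SAMPLE_HTML), indent=2))
```

Most of the per-source effort goes into choosing the selector and deciding where one section ends; the surrounding Python stays small.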
The development loop is fast and fully offline
You capture a copy of each page yourself and work against it offline. No logins, and no access to any client app or system.
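One possible way to capture pages for offline work, using only the Python standard library, is sketched below. The `snapshots` directory, the file-naming scheme, and the function names are assumptions for illustration, not the project's actual tooling.

```python
import hashlib
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

CACHE_DIR = Path("snapshots")  # hypothetical directory name


def snapshot_path(url: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Map a URL to a stable local file name (host plus a short hash)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    host = urlparse(url).netloc or "unknown"
    return cache_dir / f"{host}_{digest}.html"


def load_page(url: str, cache_dir: Path = CACHE_DIR) -> bytes:
    """Return the saved copy if present; otherwise fetch once and save it."""
    path = snapshot_path(url, cache_dir)
    if path.exists():
        return path.read_bytes()
    with urllib.request.urlopen(url, timeout=30) as response:
        data = response.read()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data
```

After the first call for a given URL, later calls read from disk, so parser iterations never touch the network.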
Your work is checked by automated validators plus a quick manual review. When it passes both, it's merged.
How it pays
Fixed price: $25 per validated parser. No hourly tracking.
Billable soon after a parser is approved.
This is high, steady volume - on the order of 200 parsers in June alone, with more after - so there's ongoing work for people who do good work.
Step one is a paid screening task
The first thing you do is write a real document parser.
We hand you one source; you deliver it through the same pipeline everyone uses.
If it passes validation and a quick review, you get paid for it and you're in.
Required skills
Comfort reading HTML and writing CSS selectors (the core skill)
Basic Python
Familiarity with BeautifulSoup
Basic git / pull-request workflow (clone, branch, open a PR)
Working style
Fully remote, asynchronous, your own hours.
Overlap with European business hours (CET) is required for quick back-and-forth.
Example URL: https://www.irishstatutebook.ie/eli/2018/act/25/enacted/en/print.html
A preview of a parsed document is in the attachment.