Python Developer Needed for Legislative Data Pipeline, XML Parsing, and NLP-Based Bill Analysis
Budget: $10.0 - $40.0
HOURLY / FULL_TIME
⭐ 5.00 (4)
United States
python, data-scraping, data-mining
We are seeking an experienced Python developer to build a reproducible workflow for collecting, processing, and analyzing legislative bill data from a recent state legislative session.
The project involves parsing legislative XML files, constructing a bill-level database, enriching records with legislator metadata, and performing initial text-based classification and exploratory analysis.
### Scope of Work
* Parse a master legislative XML index file containing all measures introduced during a legislative session
* Extract and structure bill-level metadata, including:
  * Bill number
  * Title
  * Sponsor information
  * Committee assignments
  * Status/history information
  * Related document links
* Apply filtering and data-cleaning procedures to create a research-ready dataset
* Merge bill records with legislator roster datasets to enrich sponsor information
* Automate retrieval of linked bill-history XML files and associated bill-text documents
* Build a reproducible data-processing pipeline that can be reused for future legislative sessions
* Perform initial NLP-based topic classification and content categorization of bill text
* Generate descriptive summaries and exploratory statistics across bills, sponsors, committees, and policy topics
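
To give applicants a rough sense of the parsing and enrichment steps listed above, here is a minimal sketch. It is illustrative only: the element names (`measure`, `number`, `sponsor`, `historyLink`), the sample XML, and the roster layout are all assumptions, and the real schema will be shared with shortlisted candidates. It uses the standard library's `xml.etree.ElementTree` so it runs without extra dependencies; `lxml.etree` offers a largely compatible API for the same calls.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample index; the real session file will use a different schema.
SAMPLE_INDEX = """<measures>
  <measure>
    <number>HB 101</number>
    <title>Relating to school funding</title>
    <sponsor id="L001"/>
    <committee>Education</committee>
    <historyLink>https://example.org/history/HB101.xml</historyLink>
  </measure>
  <measure>
    <number>SB 7</number>
    <title>Relating to water quality</title>
    <sponsor id="L002"/>
    <committee>Environment</committee>
    <historyLink>https://example.org/history/SB7.xml</historyLink>
  </measure>
</measures>"""

# Hypothetical legislator roster keyed by sponsor id.
ROSTER = {
    "L001": {"name": "A. Example", "party": "X", "district": "12"},
}


def parse_index(xml_text):
    """Turn the master index into a list of bill-level dicts."""
    root = ET.fromstring(xml_text)
    bills = []
    for measure in root.iter("measure"):
        sponsor = measure.find("sponsor")
        bills.append({
            "bill_number": (measure.findtext("number") or "").strip(),
            "title": (measure.findtext("title") or "").strip(),
            "sponsor_id": sponsor.get("id") if sponsor is not None else None,
            "committees": [c.text.strip() for c in measure.findall("committee") if c.text],
            "history_url": measure.findtext("historyLink"),
        })
    return bills


def enrich_with_roster(bills, roster):
    """Left-join roster fields onto each bill and flag unmatched sponsors."""
    enriched = []
    for bill in bills:
        info = roster.get(bill["sponsor_id"], {})
        enriched.append({
            **bill,
            "sponsor_name": info.get("name"),
            "sponsor_party": info.get("party"),
            "sponsor_matched": bool(info),
        })
    return enriched


bills = enrich_with_roster(parse_index(SAMPLE_INDEX), ROSTER)
```

Keeping an explicit `sponsor_matched` flag makes roster mismatches visible during data cleaning instead of silently dropping bills. In the actual project the same join would more likely be done with pandas.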
### Technical Requirements
Required experience:
* Python
* pandas
* requests
* lxml
* BeautifulSoup
* Regular Expressions (regex)
* XML parsing and data extraction
* Data cleaning and transformation workflows
* Relational data merging and normalization
Preferred experience:
* Natural Language Processing (NLP)
* Topic modeling or text classification
* Document processing (PDF/XML)
* Exploratory data analysis and visualization
* Reproducible research workflows and project documentation
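
As a point of reference for the text-classification experience mentioned above, a simple regex-based keyword baseline might look like the sketch below. The topic names and keyword lists are hypothetical placeholders, not the project's actual categories, and a real solution may well use topic modeling or a trained classifier instead.

```python
import re
from collections import Counter

# Hypothetical topic lexicon; real categories would come from the project documentation.
TOPIC_PATTERNS = {
    "education": re.compile(r"\b(?:school|teacher|student|curriculum)s?\b", re.IGNORECASE),
    "health": re.compile(r"\b(?:health|hospital|medicaid|insurance)\b", re.IGNORECASE),
    "environment": re.compile(r"\b(?:water|emission|wildlife|pollution)s?\b", re.IGNORECASE),
}


def classify(text, default="unclassified"):
    """Return the topic with the most keyword hits, or `default` if none match."""
    counts = Counter({
        topic: len(pattern.findall(text))
        for topic, pattern in TOPIC_PATTERNS.items()
    })
    topic, hits = counts.most_common(1)[0]
    return topic if hits > 0 else default


example_topic = classify("An act relating to school funding and teacher pay")
fallback_topic = classify("An act relating to highway maintenance")
```

A baseline like this is mainly useful as a sanity check against which more sophisticated models can be compared.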
### Deliverables
* Fully documented Python workflow
* Clean bill-level analytical dataset
* Automated data collection and processing scripts
* Topic-classified bill dataset
* Summary statistics and exploratory analytical outputs
* Documentation explaining workflow execution and data structure
### Additional Information
To keep the initial posting concise, detailed source materials, sample files, data schemas, and project-specific documentation will be shared only with shortlisted candidates.
Selected candidates will receive access to representative XML files, supporting datasets, and additional project requirements necessary for preparing an accurate implementation plan and estimate.
The solution should be modular, reproducible, and designed so that additional legislative sessions can be processed with minimal modifications.
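
One possible way to meet this requirement, sketched here purely as an illustration, is to isolate everything session-specific in a small configuration object and run the pipeline as a sequence of independent steps. The class name, fields, session identifier, URL, and step functions below are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SessionConfig:
    """Everything that changes between legislative sessions (hypothetical fields)."""
    session_id: str
    index_url: str
    data_root: Path = Path("data")

    @property
    def raw_dir(self):
        return self.data_root / self.session_id / "raw"

    @property
    def processed_dir(self):
        return self.data_root / self.session_id / "processed"


def run_pipeline(config, steps):
    """Run each step in order, passing a shared state dict from one to the next."""
    state = {"config": config, "log": []}
    for step in steps:
        state = step(state)
    return state


# Placeholder steps; real ones would download, parse, merge, and classify.
def fetch_index(state):
    state["log"].append(f"fetch {state['config'].index_url}")
    return state


def parse_bills(state):
    state["log"].append(f"parse into {state['config'].processed_dir}")
    return state


config = SessionConfig(session_id="2023R", index_url="https://example.org/2023R/index.xml")
result = run_pipeline(config, [fetch_index, parse_bills])
```

Under this layout, processing another session would only require constructing a new `SessionConfig`; the step functions stay unchanged.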