Fix Python NER pipeline for anonymising names in Excel files
Budget: $30.0 (fixed price)
⭐ 4.98 (14)
Algeria
python, machine-learning, microsoft-excel, natural-language-processing, jupyter
I have a data privacy project involving Excel files (.xlsx/.xlsb) containing care log entries with people's names mixed into free-text narrative columns. I need help completing and debugging a Python pipeline that anonymises (encodes) these names into placeholder codes, and reverses (decodes) them back to the original names later.
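A minimal sketch of the encode/decode round trip described above, using only the standard library. The function names (`encode`, `decode`, `_replace_all`) and the shape of the two mappings are assumptions for illustration, not taken from the existing notebook. It also assumes the variant-to-code mapping has already been built by the NER and clustering steps.

```python
import re


def _replace_all(text, mapping, ignore_case=False):
    """Replace every whole-word occurrence of a mapping key with its value."""
    if not mapping:
        return text
    # Longest keys first, so "Jane Smith" is tried before "Jane".
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        # Lookarounds instead of \b, so keys ending in "." (e.g. "J.S.") still match.
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)",
        re.IGNORECASE if ignore_case else 0,
    )
    if ignore_case:
        lookup = {k.lower(): v for k, v in mapping.items()}
        return pattern.sub(lambda m: lookup[m.group(0).lower()], text)
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def encode(text, variant_to_code):
    """Swap each known name variant for its placeholder code (case-insensitive)."""
    return _replace_all(text, variant_to_code, ignore_case=True)


def decode(text, code_to_name):
    """Swap each placeholder code back to the stored name for that person."""
    return _replace_all(text, code_to_name)


# Hypothetical sample data.
variants = {"Jane Smith": "CLIENT_001", "Jane": "CLIENT_001", "J.S.": "CLIENT_001"}
encoded = encode("Jane Smith was seen. jane ate lunch; J.S. slept.", variants)
decoded = decode(encoded, {"CLIENT_001": "Jane Smith"})
```

Note that decoding in this sketch restores one stored name per code, so the original spelling of each occurrence (lowercase, initials, typos) is not recovered. If an exact round trip is required, the pipeline would need to record the original text per occurrence as well.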
The challenge: the same person's name appears in many different forms throughout a file — full name, first name only, initials, lowercase, and occasional typos/misspellings. The pipeline needs to recognise all these variants as the same person and assign them a single consistent code (e.g. CLIENT_001, STAFF_002), not a different code for each spelling.
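One possible way to merge such variants is sketched below, assuming the candidate names have already been extracted. It uses `difflib.SequenceMatcher` from the standard library for typo tolerance; the helper names (`normalise`, `same_person`, `assign_codes`) and the 0.8 threshold are assumptions chosen for illustration, not the logic of the existing notebook.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lowercase, treat dots as spaces, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def same_person(variant, canonical, threshold=0.8):
    """Heuristic: is `variant` a plausible form of `canonical`?"""
    v, c = normalise(variant).split(), normalise(canonical).split()
    if not v or not c:
        return False
    # Initials such as "J.S." or "JS" against a multi-token canonical name.
    if len(c) > 1 and "".join(v) == "".join(token[0] for token in c):
        return True
    # Every token of the variant must fuzzily match some canonical token.
    return all(
        any(SequenceMatcher(None, vt, ct).ratio() >= threshold for ct in c)
        for vt in v
    )


def assign_codes(names, prefix="CLIENT"):
    """Cluster name variants and map each variant to one code per cluster."""
    clusters = []  # list of (canonical, members)
    # Fullest names first, so they become the cluster anchors.
    ordered = sorted(
        set(names), key=lambda n: (-len(normalise(n).split()), -len(n), n)
    )
    for name in ordered:
        for canonical, members in clusters:
            if same_person(name, canonical):
                members.append(name)
                break
        else:
            clusters.append((name, [name]))
    return {
        variant: f"{prefix}_{index:03d}"
        for index, (_, members) in enumerate(clusters, start=1)
        for variant in members
    }


# Hypothetical sample data.
codes = assign_codes(
    ["Jane Smith", "jane smith", "Jane", "J.S.", "Jayne Smith", "Bob Jones", "bob"]
)
```

A heuristic like this cannot tell apart two different people who share a first name or the same initials; those cases would need extra context (for example the row's client ID, if the data has one) or manual review.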
What I need help with:
- Debugging name-clustering logic so name variants reliably merge into one identity (currently some real names get split into multiple codes, and occasionally the model picks up garbled/incorrect text as a "name")
- Improving performance (NER currently runs slower than it should on larger files)
- General code review and robustness improvements to the existing Python notebook
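Two of the points above, garbled text being picked up as a "name" and slow NER on larger files, could be approached roughly as sketched here. This is an illustration under assumptions: `extract` is a hypothetical callable standing in for whatever spaCy or Presidio call the notebook uses, and `looks_like_name` and `extract_unique` are made-up helper names. The idea is to run NER once per distinct cell text and to drop spans that do not look like names.

```python
import re

# One name token: letters only, optionally joined by "'" or "-" (O'Brien, Mary-Ann).
_NAME_TOKEN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def looks_like_name(span, max_tokens=4):
    """Cheap plausibility check to reject garbled spans such as 'x7#q'."""
    tokens = span.replace(".", " ").split()
    if not 1 <= len(tokens) <= max_tokens:
        return False
    return all(_NAME_TOKEN.fullmatch(token) for token in tokens)


def extract_unique(cells, extract):
    """Run `extract` once per distinct cell text and reuse the result.

    `extract` is a hypothetical callable: text -> list of candidate name strings.
    """
    cache = {}
    results = []
    for cell in cells:
        text = "" if cell is None else str(cell).strip()
        if text not in cache:
            spans = extract(text) if text else []
            cache[text] = [s for s in spans if looks_like_name(s)]
        results.append(cache[text])
    return results


# Usage with a stand-in extractor that records how often it is called.
calls = []


def fake_extract(text):
    calls.append(text)
    return [w for w in text.split() if w[0].isupper()] + ["x7#q"]


found = extract_unique(["Jane visited", "Jane visited", None, "  "], fake_extract)
```

How much deduplication helps depends on how repetitive the cell texts are. If spaCy is called directly, passing the texts through `nlp.pipe` in batches rather than calling `nlp` once per cell is another commonly used speed-up.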
Tech stack: Python, Jupyter, spaCy, Microsoft Presidio, openpyxl, pywin32 (Excel COM)
Looking for someone experienced with NLP/NER pipelines and Python data processing. Data being used is synthetic/sample data.