Fix Python NER pipeline for anonymising names in Excel files

Budget: $30.0 FIXED / ⭐ 4.98 (14) Algeria

python, machine-learning, microsoft-excel, natural-language-processing, jupyter

I have a data privacy project involving Excel files (.xlsx/.xlsb) containing care log entries with people's names mixed into free-text narrative columns. I need help completing and debugging a Python pipeline that anonymises (encodes) these names into placeholder codes and later reverses (decodes) them back to the original names.

The challenge is that the same person's name appears in many different forms throughout a file: full name, first name only, initials, lowercase, and occasional typos/misspellings. The pipeline needs to recognise all these variants as the same person and assign them a single consistent code (e.g. CLIENT_001, STAFF_002), not a different code for each spelling.

What I need help with:
- Debugging the name-clustering logic so name variants reliably merge into one identity (currently some real names get split into multiple codes, and occasionally the model picks up garbled/incorrect text as a "name")
- Improving performance (NER currently runs slower than it should on larger files)
- General code review and robustness improvements to the existing Python notebook

Tech stack: Python, Jupyter, spaCy, Microsoft Presidio, openpyxl, pywin32 (Excel COM)

Looking for someone experienced with NLP/NER pipelines and Python data processing. The data being used is synthetic/sample data.
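
To make the expected clustering behaviour concrete, here is a minimal sketch of merging name variants into one code. It is not the logic in the existing notebook (which is not shown here), and it assumes the name mentions have already been extracted by the NER step. The function names (normalise, is_variant, assign_codes), the 0.8 similarity threshold, and the sample names are all hypothetical, and the standard library's difflib stands in for whatever fuzzy matcher the notebook actually uses.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lowercase, turn dots into spaces, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def token_matches(token, canon_token, threshold):
    """A single letter matches as an initial; longer tokens match fuzzily."""
    if len(token) == 1:
        return canon_token.startswith(token)
    return SequenceMatcher(None, token, canon_token).ratio() >= threshold


def is_variant(mention, canonical, threshold=0.8):
    """Return True if `mention` looks like a variant of `canonical`."""
    m_tokens = normalise(mention).split()
    c_tokens = normalise(canonical).split()
    if not m_tokens or not c_tokens:
        return False
    # Initials written together or with dots, e.g. "ML" or "M.L."
    initials = "".join(t[0] for t in c_tokens)
    if len(initials) > 1 and "".join(m_tokens) == initials:
        return True
    # Otherwise every token in the mention must match some canonical token.
    return all(
        any(token_matches(t, ct, threshold) for ct in c_tokens)
        for t in m_tokens
    )


def assign_codes(mentions, prefix="CLIENT", threshold=0.8):
    """Map each distinct mention to a code, merging variants of one person."""
    canonicals = []  # position in this list determines the code number
    codes = {}
    # Longest names first, so full names become the canonical forms
    # before first-name-only mentions and initials are matched to them.
    ordered = sorted(
        set(mentions),
        key=lambda s: (-len(normalise(s).split()), -len(s), s),
    )
    for mention in ordered:
        for idx, canon in enumerate(canonicals):
            if is_variant(mention, canon, threshold):
                codes[mention] = f"{prefix}_{idx + 1:03d}"
                break
        else:
            canonicals.append(mention)
            codes[mention] = f"{prefix}_{len(canonicals):03d}"
    return codes
```

With synthetic mentions such as "Maria Lopez", "maria lopez", "Maria", "M. Lopez", "ML" and "Mria Lopez", this sketch assigns one code to all six. It has known gaps: a first name shared by two people is attached to whichever full name was registered first, transposed letters (e.g. "Jonh") can fall below the threshold, and it does nothing to filter garbled text that the NER model wrongly tags as a name.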
