AI-Assisted Data Room File Organizer
Budget: $200.00 (fixed price)
⭐ 3.81 (277)
United States
microsoft-excel, data-entry, python, microsoft-word
Create an AI-Assisted Static Data Room Organizer for Real Estate Development Documents.
I need a contractor to build a practical tool that can take a large master folder of mixed project documents and automatically organize them into a data-room-style folder structure for a real estate development / infrastructure project. The goal is to avoid manually opening and sorting hundreds of files.
The tool should process a batch of documents, classify them by subject matter, copy them into the appropriate folders, generate an index, and create a static searchable retrieval system. Original files must not be altered.
Core functionality:
* Accept one or more master input folders containing many files and subfolders.
* Extract text and metadata from each document.
* OCR scanned PDFs and image-based files where possible.
* Classify each file into the correct data-room category using AI/NLP.
* Copy files into a clean output folder structure.
* Generate a CSV/Excel manifest showing original file name/path, new folder location, document type, assigned category, confidence score, short classification explanation, key terms/entities, duplicate status, and review-needed flag.
* Create a static searchable HTML index that can be opened locally without a server.
* Flag low-confidence or unreadable files for human review.
* Detect exact and near-duplicate files.
* Allow the folder taxonomy to be edited and rerun.
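To illustrate the duplicate-detection item above, here is a minimal Python sketch, not a required design. It assumes exact duplicates are found by SHA-256 content hash and near-duplicates by Jaccard similarity over word shingles of the extracted text. All function names and the shingle size are hypothetical, and a contractor may propose a different method (for example, embeddings).

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by content hash; keep groups with more than one file."""
    groups: dict[str, list[Path]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups.setdefault(sha256_of_file(path), []).append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams (size is an assumed default)."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Shared shingles divided by all shingles; 1.0 means identical shingle sets."""
    a, b = shingles(text_a), shingles(text_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Files whose extracted text scores above some chosen similarity cutoff could be reported as near-duplicates; the cutoff itself would need tuning on sample files.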
Required file types: PDF including scanned PDFs, Word .doc/.docx, Excel .xls/.xlsx, PowerPoint .ppt/.pptx, images .jpg/.png/.tif, text files, and email files .msg/.eml if feasible.
Initial taxonomy should be editable but include: Admin/Index, Project Overview, Land Control/PSA, Title/Survey/ALTA, Zoning/Land Use/Local Approvals, Environmental/RCRA/BRAC/FOSET, Wetlands/Streams/USACE, Floodplain/Drainage/Stormwater, Geotechnical/Soils, Civil Engineering/Site Planning, Power/Utility/AEP/SWEPCO, BTM Generation/BESS/Energy, Natural Gas, Water/Wastewater, Fiber/Telecom, Permitting/FAST-41/Federal/State, Vendors/Proposals/Budgets, Capital Markets/Investor Materials, Correspondence, and Unclassified Review Queue.
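As a rough illustration of what an editable taxonomy file could look like, the Python sketch below loads a small JSON taxonomy and scores a document against it. The JSON layout, the `folder` names, the keyword lists, and the `classify_by_keywords` function are all assumptions for illustration. Simple keyword counting stands in here for the AI/NLP classifier, which the contractor would supply.

```python
import json

# Hypothetical taxonomy file content. Category names come from the list above;
# the folder names and keyword lists are illustrative assumptions.
TAXONOMY_JSON = """
{
  "fallback": "Unclassified Review Queue",
  "categories": [
    {"name": "Title/Survey/ALTA", "folder": "04_Title_Survey_ALTA",
     "keywords": ["title commitment", "alta", "survey"]},
    {"name": "Wetlands/Streams/USACE", "folder": "07_Wetlands_Streams_USACE",
     "keywords": ["wetland", "usace", "jurisdictional determination"]},
    {"name": "Geotechnical/Soils", "folder": "09_Geotechnical_Soils",
     "keywords": ["geotechnical", "soil boring", "bearing capacity"]}
  ]
}
"""


def classify_by_keywords(text: str, taxonomy: dict) -> tuple[str, float]:
    """Return (category name, confidence) from keyword hit counts.

    Confidence is the winning category's share of all keyword hits.
    Text with no hits goes to the fallback category with confidence 0.0.
    """
    lowered = text.lower()
    scores: dict[str, int] = {}
    for category in taxonomy["categories"]:
        hits = sum(lowered.count(keyword.lower()) for keyword in category["keywords"])
        if hits:
            scores[category["name"]] = hits
    if not scores:
        return taxonomy["fallback"], 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


taxonomy = json.loads(TAXONOMY_JSON)
label, confidence = classify_by_keywords(
    "ALTA survey and title commitment for the north parcel", taxonomy
)
```

Keeping the taxonomy in a separate data file like this is one way to let categories be edited and the tool rerun without code changes. Category names containing "/" would need a sanitized folder name, which is why the sketch carries a separate `folder` field.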
Preferred approach is local-first or hybrid: local text extraction/OCR where possible, local duplicate detection, AI/API classification only where needed, ability to run in a controlled local environment, and no permanent upload or storage of confidential documents by the contractor.
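The local-first preference above could be expressed as a simple routing rule, sketched below in Python. The threshold value, the excerpt limit, and the names `RoutingPolicy`, `decide_route`, and `build_external_payload` are hypothetical; the point is only that external calls stay off unless explicitly enabled and that the text sent out is limited and recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingPolicy:
    accept_threshold: float = 0.80    # assumed cutoff for trusting the local result
    allow_external_api: bool = False  # external calls are off unless explicitly enabled
    max_excerpt_chars: int = 2000     # assumed cap on text sent externally


def decide_route(local_confidence: float, policy: RoutingPolicy) -> str:
    """Choose what happens to a file after local classification."""
    if local_confidence >= policy.accept_threshold:
        return "accept_local"
    if policy.allow_external_api:
        return "external_api"
    return "review_queue"


def build_external_payload(text: str, policy: RoutingPolicy) -> dict:
    """Build the only data that would leave the machine: a truncated text excerpt.

    The character count is included so it can be written to a disclosure log.
    """
    excerpt = text[: policy.max_excerpt_chars]
    return {"excerpt": excerpt, "chars_sent": len(excerpt)}
```

With the default policy, a low-confidence file goes to the review queue rather than to any third-party service.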
Possible technologies may include Python, Tesseract OCR, PyMuPDF/pdfplumber, python-docx, openpyxl, python-pptx, sentence-transformers, FAISS/Chroma, OpenAI API or another LLM classifier, and a lightweight local interface such as Streamlit, Flask, or a simple desktop GUI.
Minimum acceptable UI is a command-line tool with clear instructions and config file. Preferred UI is a simple local browser or desktop interface where the user can select input folder, select output folder, choose/edit taxonomy, run classification, view progress, open review queue, export manifest, generate static HTML index, and rerun after corrections.
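For the minimum command-line option, something along these lines would be acceptable. This is a sketch using Python's standard `argparse` module; the program name, flag names, and default values are hypothetical.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Define the command-line options (all names and defaults are assumptions)."""
    parser = argparse.ArgumentParser(
        prog="dataroom-organizer",
        description="Copy and classify documents into a data-room folder structure.",
    )
    parser.add_argument("--input", type=Path, nargs="+", required=True,
                        help="One or more master input folders (never modified).")
    parser.add_argument("--output", type=Path, required=True,
                        help="Folder where the organized copies are written.")
    parser.add_argument("--taxonomy", type=Path, default=Path("taxonomy.json"),
                        help="Editable taxonomy file.")
    parser.add_argument("--review-threshold", type=float, default=0.6,
                        help="Confidence below which a file is flagged for review.")
    parser.add_argument("--allow-external-api", action="store_true",
                        help="Explicitly enable sending text excerpts to an external API.")
    return parser


# Example: parse a sample command line instead of the real sys.argv.
args = build_parser().parse_args(
    ["--input", "master_a", "master_b", "--output", "data_room"]
)
```

A config file could supply the same settings; the flags above would then act as overrides.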
Security requirements:
* Do not modify original files.
* Copy files into output folders.
* Do not upload documents to third-party cloud services unless explicitly enabled.
* If API use is required, clearly disclose what text/metadata is sent externally.
* Do not store API keys in plain text.
* Contractor must not retain client documents.
* Testing should use dummy/sample files unless otherwise approved.
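Two of the security requirements above lend themselves to a short Python sketch: copying without altering originals, and keeping API keys out of plain-text files. The approach shown (hash the source before and after the copy, and read the key from an environment variable) is one possible way to meet them. The function names and the `CLASSIFIER_API_KEY` variable name are hypothetical.

```python
import hashlib
import os
import shutil
from pathlib import Path
from typing import Optional


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes (reads the whole file; fine for a sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copy_verified(source: Path, dest_dir: Path) -> Path:
    """Copy source into dest_dir and confirm source and copy both match the original hash.

    Existing files in dest_dir are never overwritten; a numeric suffix is added instead.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / source.name
    counter = 1
    while target.exists():
        target = dest_dir / f"{source.stem}_{counter}{source.suffix}"
        counter += 1
    before = file_digest(source)
    shutil.copy2(source, target)  # copy2 also preserves timestamps where possible
    if file_digest(source) != before or file_digest(target) != before:
        raise OSError(f"Hash mismatch after copying {source}")
    return target


def load_api_key(env_var: str = "CLASSIFIER_API_KEY") -> Optional[str]:
    """Read the key from the environment so it is not stored in a config file."""
    return os.environ.get(env_var) or None
```

An OS keychain or secrets manager would be another acceptable place for the key; the environment variable is just the simplest option to show.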
Deliverables: working tool/app/script, source code, editable taxonomy file, data-room folder output generator, CSV/Excel manifest, static searchable HTML index, duplicate report, review queue report, error log, installation instructions, user guide, and demo using sample files.
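To make the manifest deliverable concrete, here is a hedged Python sketch of the CSV layout using the standard `csv` module. The column identifiers are hypothetical names for the fields listed under core functionality, and the 0.6 review threshold is an assumed default.

```python
import csv
import io

# Hypothetical column names mirroring the manifest fields requested above.
MANIFEST_COLUMNS = [
    "original_name", "original_path", "new_location", "document_type",
    "category", "confidence", "explanation", "key_terms",
    "duplicate_status", "review_needed",
]


def manifest_row(record: dict, review_threshold: float = 0.6) -> dict:
    """Normalize one record and derive the review flag from its confidence."""
    row = {column: record.get(column, "") for column in MANIFEST_COLUMNS}
    confidence = float(record.get("confidence", 0.0))
    row["confidence"] = f"{confidence:.2f}"
    row["review_needed"] = "yes" if confidence < review_threshold else "no"
    return row


def write_manifest(records: list[dict], handle, review_threshold: float = 0.6) -> None:
    """Write a header line plus one line per record to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=MANIFEST_COLUMNS)
    writer.writeheader()
    for record in records:
        writer.writerow(manifest_row(record, review_threshold))


# Example with one made-up record, written to an in-memory buffer.
buffer = io.StringIO()
write_manifest(
    [{"original_name": "scan_001.pdf", "category": "Geotechnical/Soils", "confidence": 0.42}],
    buffer,
)
```

When writing to a real file, opening it with `newline=""` avoids blank lines on Windows. An Excel version could be produced from the same rows with a library such as openpyxl.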
Acceptance criteria: the project succeeds if the tool can process a mixed master folder, classify documents into the data room taxonomy, copy files without altering originals, generate an audit-friendly manifest, produce a local searchable HTML index, identify duplicates, and flag uncertain files for review.
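One possible approach to the "local searchable HTML index" criterion is a single self-contained page with the records embedded as JSON and a small inline filter script, so it opens from disk without a server. The Python sketch below generates such a page. The record fields (`name`, `category`, `path`, `terms`) and the function name are assumptions, and plain substring filtering stands in for whatever search method the contractor proposes.

```python
import html
import json

PAGE_TEMPLATE = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>__TITLE__</title></head>
<body>
<h1>__TITLE__</h1>
<input id="q" type="search" placeholder="Filter by name, category, or key terms">
<ul id="results"></ul>
<script>
const RECORDS = __RECORDS__;
const list = document.getElementById("results");
function render(query) {
  const needle = query.toLowerCase();
  list.innerHTML = "";
  for (const rec of RECORDS) {
    const haystack = (rec.name + " " + rec.category + " " + rec.terms).toLowerCase();
    if (!haystack.includes(needle)) continue;
    const item = document.createElement("li");
    const link = document.createElement("a");
    link.href = rec.path;
    link.textContent = rec.name + " [" + rec.category + "]";
    item.appendChild(link);
    list.appendChild(item);
  }
}
document.getElementById("q").addEventListener("input", (e) => render(e.target.value));
render("");
</script>
</body>
</html>
"""


def build_index_html(records: list[dict], title: str = "Data Room Index") -> str:
    """Return one HTML page with the records embedded, usable offline."""
    # Escape "</" so document text cannot close the script element early.
    payload = json.dumps(records).replace("</", "<\\/")
    page = PAGE_TEMPLATE.replace("__TITLE__", html.escape(title))
    return page.replace("__RECORDS__", payload)
```

Writing the returned string to `index.html` at the top of the output folder, with relative `path` values, would keep the links working when the folder is moved or shared.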
Contractor proposal should explain recommended architecture, whether solution runs locally/cloud/hybrid, confidentiality protections, OCR approach, classification confidence scoring, duplicate detection method, static HTML search index approach, taxonomy editing, timeline, milestones, and similar projects completed.
This is not intended to be a full enterprise data room SaaS platform. I need a practical, reliable static document organization and retrieval tool that can prepare a data-room-style folder system from a large batch of real estate development project documents. The tool must also generate error reports listing documents that could not be read or parsed for cataloging.