AI Engineer / RAG Pipeline Developer for Compliance Law Management Information System

Budget: $10.0 - $40.0 HOURLY / PART_TIME ⭐ 5.00 (3) United States

database-architecture, python, artificial-intelligence, amazon-web-services

Key Responsibilities

You will be responsible for building an end-to-end pipeline, including:

1. Data Collection & Crawling
- Design and implement web crawling pipelines for legal/compliance sources
- Extract structured and unstructured legal content from websites and portals
- Ensure compliance with robots.txt and legal scraping constraints

2. Document Processing (PDF + Text)
- Build a robust PDF parsing and extraction pipeline using tools like Docling
- Handle complex legal documents (tables, footnotes, multi-column layouts)
- Clean, normalize, and structure extracted content for downstream AI use

3. RAG Pipeline Development
- Design and implement a Retrieval-Augmented Generation architecture
- Chunking strategies optimized for legal/compliance context
- Embedding generation and metadata enrichment
- Query understanding and response synthesis using LLMs

4. Vector Database (Pinecone)
- Set up and optimize a Pinecone vector database
- Design the indexing schema (metadata, filters, namespaces)
- Optimize retrieval speed and accuracy
- Implement hybrid search if needed (keyword + vector)

5. AI/LLM Integration
- Integrate LLMs (OpenAI / open-source models)
- Develop prompts for compliance/legal reasoning
- Ensure traceability and citation-backed responses

Required Skills
- Strong experience building RAG systems in production
- Hands-on experience with Pinecone or other vector databases
- Experience with PDF parsing tools (Docling, PyMuPDF, Unstructured, etc.)
- Strong Python backend development skills
- Experience with web scraping/crawling frameworks (Scrapy, Playwright, etc.)
- Familiarity with LLM APIs (OpenAI, Anthropic, or open-source models)
- Understanding of embeddings, vector search, and semantic retrieval
- Experience handling large-scale document pipelines

Nice to Have
- Experience with legal tech or compliance systems
- Knowledge of information retrieval / NLP
- Experience with LangChain, LlamaIndex, or similar frameworks
- Cloud deployment (AWS/GCP/Azure)
- Docker / Kubernetes experience

Deliverables
- Fully functional ingestion + crawling pipeline
- PDF processing system using Docling or equivalent
- Pinecone vector database setup with optimized schema
- Working RAG system with API endpoints
- Documentation of architecture and setup
- Optional: simple UI for testing queries

Project Type
- Short-term MVP with potential for long-term extension
- Possibility of ongoing development and scaling

How to Apply

Please include:
- Relevant experience building RAG systems
- Examples of similar AI or document intelligence projects
- Your preferred stack for RAG pipelines
- Any experience with legal/compliance data systems

Open job