Voice AI Engineer — LiveKit Agent Infrastructure

Budget: $30.0 - $45.0 HOURLY / FULL_TIME ⭐ 0.00 (0) India

We are building a real-time Voice AI platform for delivering natural, high-quality conversational experiences over telephony. Our stack is built on LiveKit for voice orchestration, and we need an experienced engineer to own the end-to-end pipeline, from STT/TTS integration and LLM dialogue management to SIP/VoIP call quality and self-hosted agent deployment.

Challenges You Will Solve

🎙️ Voice Quality
TTS output sounds robotic or unnatural: wrong intonation, poor pacing, and low conversational fluency across different use cases and languages.

📞 SIP / Call Quality
SIP integration introduces latency spikes, jitter, and inconsistent reliability across regions, degrading the real-time experience.

⚙️ Pipeline Efficiency
The STT → LLM → TTS pipeline needs end-to-end latency optimization to feel truly real-time in production.

Key Responsibilities

- Own the LiveKit voice pipeline: configure rooms, egress/ingress agents, and SIP trunk integration for production telephony
- Evaluate and optimize TTS voice quality across providers; recommend and implement the best fit for target languages and personas
- Tune STT accuracy: custom vocabulary, language models, punctuation, and real-time transcription improvements
- Audit and redesign the full pipeline (STT → LLM → TTS → SIP) with a focus on end-to-end latency and audio fidelity
- Debug SIP call issues: latency, jitter, codec mismatches, DTMF handling, and regional reliability problems
- Deliver reusable architecture: configuration templates, agent scaffolding, and system prompt frameworks for scaling voice bots
- Test and validate voice output: judge pronunciation, tone, turn-taking, and naturalness across real call scenarios
- Document best practices and provide a production-ready handoff for the team to build additional agents

Required Skills & Experience

- 3+ years in Voice AI, TTS/STT systems, or real-time conversational speech pipelines
- Hands-on experience with LiveKit: rooms, Agents SDK, SIP connector, egress/ingress, and self-hosting
- Deep understanding of SIP / VoIP: codecs (PCMU, PCMA, Opus), RTP, jitter buffers, and telephony debugging
- Experience integrating and benchmarking TTS providers (ElevenLabs, Cartesia, Minimax, Azure Neural, Google TTS, or similar)
- Experience with Deepgram or equivalent STT APIs for real-time, multilingual transcription
- Proficiency in Python and/or TypeScript/Node.js for pipeline development
- Proven ability to diagnose and fix real-time audio issues in production environments

Nice to Have

- Experience migrating from Retell, Twilio, or other voice platforms to LiveKit
- Familiarity with prompt engineering for voice-specific LLM behavior: concise responses, turn-taking, filler words, interruption handling
- Experience with WebRTC internals for low-latency audio optimization
- Multilingual voice system experience
- Knowledge of self-hosting LiveKit server, media servers, or TURN/STUN infrastructure

What to Include in Your Proposal

- Relevant past projects, especially voice AI, real-time pipelines, or LiveKit deployments
- Your approach to improving TTS naturalness and STT accuracy (be specific about tools, voices, or configurations)
- How you would structure a reusable, self-hosted LiveKit architecture for multi-persona or multilingual voice agents
- Tools, services, or open-source components you'd recommend
- Availability and preferred engagement type (contract / full-time)
