TTS / Voice Cloning: the cheapest high quality setup for my SaaS (built to scale)

Budget: Hourly / Not sure | Client: ⭐ 5.00 (11), USA

pytorch, machine-learning, python

I'm building a faceless video SaaS, similar to vidrush.ai. The whole video production side is finished and working. The only thing left is TTS.

Right now I have two top-notch open source models (Chatterbox: https://github.com/resemble-ai/chatterbox, and Higgs: https://github.com/boson-ai/higgs-audio), but my server is CPU-only, so the generations run in the background through Modal. With Modal I pay per second of usage, which piles up fast once I have a lot of users.

Here's the thing: surprisingly, almost nobody cares about the video generation. They just want unlimited TTS, so I'll be pricing around that. It's the same way algrow.online grew (about 90% of their users are there only for TTS). So TTS is basically becoming the main product.

What I need: a high-quality TTS setup that's as cheap as possible per generation, with voice cloning, speed control, and expressiveness, so I can promise unlimited generations, keep users happy, and still stay profitable.

Deliverable: recommend the approach and/or provider, with a rough cost at scale. Optionally, help me set it up. If you'd rather just point me to the right provider or approach, that works too. I can pay an agreed referral fee if it pans out.

To apply, tell me the cheapest high-quality TTS stack you'd use for voice cloning at scale, and roughly what it would cost.
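To make "rough cost at scale" concrete, here is a minimal sketch of the per-user math for per-second GPU billing against a flat "unlimited" subscription. Every number and name in it is a hypothetical placeholder, not a real Modal price or a measured speed for Chatterbox or Higgs; plug in your own figures.

```python
from dataclasses import dataclass


@dataclass
class PlanAssumptions:
    """All fields are hypothetical inputs, not measured or quoted values."""
    gpu_price_per_second: float    # USD billed per second of GPU time
    realtime_factor: float         # GPU seconds needed per second of audio produced
    audio_seconds_per_user: float  # audio a typical user generates per month
    subscription_price: float      # USD charged per user per month


def cost_per_audio_second(p: PlanAssumptions) -> float:
    """GPU cost in USD to produce one second of audio."""
    return p.gpu_price_per_second * p.realtime_factor


def monthly_cost_per_user(p: PlanAssumptions) -> float:
    """GPU cost in USD for one user's monthly usage."""
    return cost_per_audio_second(p) * p.audio_seconds_per_user


def break_even_audio_seconds(p: PlanAssumptions) -> float:
    """Monthly audio seconds at which GPU cost equals the subscription price."""
    return p.subscription_price / cost_per_audio_second(p)


# Placeholder figures for illustration only.
example = PlanAssumptions(
    gpu_price_per_second=0.0005,
    realtime_factor=0.5,
    audio_seconds_per_user=36_000,  # 10 hours of audio per month
    subscription_price=20.0,
)

if __name__ == "__main__":
    print(f"Cost per user per month: ${monthly_cost_per_user(example):.2f}")
    print(f"Break-even usage: {break_even_audio_seconds(example) / 3600:.1f} hours of audio")
```

With these placeholder inputs, a user generating 10 hours of audio a month costs about $9 in GPU time, and the plan stops being profitable at roughly 22 hours. The break-even figure is the one that matters for an "unlimited" promise, since it shows how heavy a user can get before a fair-use cap or a cheaper stack is needed.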
