Optimizing Voice Models for Production: Focusing on Runtime and Infrastructure with Orpheus TTS
Optimizing Inference for Voice Models in Production: Strategies for Runtime, Infrastructure, and Client Code to Achieve Fast and Efficient Conversational AI Pipelines
- 1. Philip from Baseten, a model inference platform, will discuss optimizing inference for voice models in production.
- 2. Baseten runs production workloads for a range of AI-native startups and enterprises.
- 3. Philip is based in San Francisco; his favorite part of living there is that the sports teams are better than Chicago's.
- 4. Orpheus TTS, developed by Canopy Labs, is one of Philip's favorite voice models.
- 5. The talk will cover TTS model architecture, performance metrics, optimization techniques, and infrastructure.
- 6. Voice models fall into two broad architectural families: autoregressive transformers (LLMs or LLM-adjacent models) and diffusion-based models like those used for image generation.
- 7. Because many TTS models are architecturally similar to LLMs, the rich ecosystem of LLM tooling can be reused to optimize them.
- 8. Orpheus TTS serves as the running example; it has a Llama 3.2 3B backbone and open-source code from Canopy Labs (see the loading sketch after this list).
- 9. Performance metrics for voice models differ from LLM metrics, focusing on time to first byte, time to first sentence, and throughput (see the TTFB measurement sketch below).
- 10. Voice model optimization aims to minimize latency and GPU costs while maintaining real-time streaming capabilities.
- 11. Serving the backbone with TensorRT-LLM and FP8 quantization improves performance both for LLMs and for smaller models like Orpheus TTS (see the FP8 sketch below).
- 12. SNAC, the neural audio codec that turns Orpheus's audio tokens into a waveform, is accelerated with torch.compile to optimize the audio decoding stage of the pipeline (see the SNAC sketch below).
- 13. Token-level continuous batching is not yet available for voice models, but request-level dynamic batching achieves similar results (see the batcher sketch below).
- 14. With these optimizations, the Orpheus TTS implementation shows much better performance and is now often CPU-bound rather than GPU-bound.
- 15. Baseten's base implementation supports 16 simultaneous streams under variable traffic and 24 under constant traffic on an H100 MIG.
- 16. A properly configured implementation can reduce time to first byte to as low as 150 milliseconds in real-world testing.
- 17. Non-runtime factors, such as infrastructure and client code, can significantly impact overall performance.
- 18. Common client-side mistakes, such as sending requests sequentially and opening a new session for each request, add avoidable latency in production (see the client sketch below).
- 19. When implementing voice pipelines, consider transport frameworks and protocols such as LiveKit, Pipecat, WebSockets, or gRPC (see the WebSocket sketch below).
- 20. A well-designed infrastructure is crucial for minimizing network overhead and meeting SLA requirements.
- 21. Voice agent pipelines consist of listening, thinking, and talking components, and the infrastructure connecting them is critical to end-to-end latency (see the pipeline sketch below).
- 22. Baseten will host an event at Fogo de Chão next week focused on building systems with open-source models.
- 23. Philip can be reached on Twitter or LinkedIn for questions about model performance.
- 24. The talk concluded 8 seconds early, allowing for additional Q&A or closing remarks.
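
Minimal code sketches for the points flagged above follow. They are illustrative, not the speaker's actual implementation; any model id, endpoint URL, or message format labeled a placeholder is an assumption. First, point 8: since Orpheus has a Llama 3.2 3B backbone, it can be loaded and sampled like any causal LM with Hugging Face transformers. The checkpoint id and prompt format here are hypothetical; check Canopy Labs' Hugging Face page for the real ones.

```python
# Sketch: loading an LLM-shaped TTS model (Llama 3.2 3B backbone) with
# standard transformers tooling. MODEL_ID is a placeholder, not a real id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "canopylabs/orpheus-tts"  # hypothetical; verify the actual checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# The backbone generates discrete audio-codec tokens rather than plain text;
# a separate decoder (SNAC, sketched below) turns them into a waveform.
prompt = "tara: Hello there!"  # illustrative speaker-prefixed prompt format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
audio_token_ids = model.generate(
    **inputs, max_new_tokens=256, do_sample=True, temperature=0.6
)
```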
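Point 9: time to first byte (TTFB) is measured client-side as the gap between sending a request and receiving the first audio chunk. A sketch against a placeholder streaming endpoint:

```python
# Sketch: measuring time to first byte (TTFB) on a streaming TTS endpoint.
import time
import requests

def measure_ttfb(url: str, payload: dict) -> float:
    start = time.perf_counter()
    with requests.post(url, json=payload, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=None):
            if chunk:  # the first non-empty audio chunk marks TTFB
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any audio arrived")

# Placeholder URL and payload; substitute your deployed endpoint.
ttfb = measure_ttfb("https://example.com/v1/tts-stream", {"text": "Hello!"})
print(f"time to first byte: {ttfb * 1000:.0f} ms")
```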
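Point 11: in production, FP8 quantization happens inside the serving engine, but the core idea, rescaling weights onto FP8's representable range and storing them in 8 bits, can be shown in plain PyTorch (needs a build with float8 dtypes; illustrative only):

```python
# Sketch: per-tensor FP8 (E4M3) post-training quantization of one weight
# matrix. Real engines add calibration, per-channel scales, and FP8 kernels.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for E4M3

def quantize_fp8(w: torch.Tensor):
    scale = w.abs().max().clamp(min=1e-12) / FP8_MAX  # map max |w| to FP8 range
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)       # 8-bit storage: half of FP16
    return w_fp8, scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_fp8.to(torch.float32) * scale

w = torch.randn(4096, 4096)
w_fp8, scale = quantize_fp8(w)
err = (dequantize_fp8(w_fp8, scale) - w).abs().mean()
print(f"mean abs quantization error: {err:.5f}")
```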
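Point 12: Orpheus's audio tokens are decoded into a waveform by SNAC, and the decode path can be wrapped in torch.compile. This assumes the snac package's published from_pretrained/decode API and a GPU; the code tensor shapes are illustrative.

```python
# Sketch: accelerating SNAC audio decoding with torch.compile (GPU assumed).
import torch
from snac import SNAC  # pip install snac

model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().cuda()

# Compile only the decode path: the first call pays the compilation cost,
# later calls reuse the optimized kernels.
model.decode = torch.compile(model.decode)

# SNAC uses hierarchical codebooks; these lengths (1:2:4) are illustrative.
codes = [
    torch.randint(0, 4096, (1, 64), device="cuda"),
    torch.randint(0, 4096, (1, 128), device="cuda"),
    torch.randint(0, 4096, (1, 256), device="cuda"),
]
with torch.inference_mode():
    waveform = model.decode(codes)  # (1, 1, num_samples) audio tensor
```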
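Point 13: a request-level dynamic batcher collects requests until the batch is full or a short window expires, then runs one batched inference call. A generic asyncio sketch, not Baseten's implementation:

```python
# Sketch: dynamic (request-level) batching — gather requests for up to
# MAX_WAIT seconds or MAX_BATCH items, then run one batched call.
import asyncio

MAX_BATCH = 16
MAX_WAIT = 0.02  # 20 ms collection window
queue: asyncio.Queue = asyncio.Queue()

async def batcher(run_batch):
    while True:
        text, fut = await queue.get()                # wait for the first request
        batch, futures = [text], [fut]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT
        while len(batch) < MAX_BATCH:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                text, fut = await asyncio.wait_for(queue.get(), timeout)
                batch.append(text)
                futures.append(fut)
            except asyncio.TimeoutError:
                break
        for fut, out in zip(futures, await run_batch(batch)):
            fut.set_result(out)

async def infer(text: str):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut

async def demo():
    async def run_batch(texts):                      # stand-in for batched TTS
        return [t.upper() for t in texts]
    asyncio.create_task(batcher(run_batch))
    print(await asyncio.gather(*(infer(t) for t in ["hi", "there"])))

asyncio.run(demo())
```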
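Point 18: both client-side fixes are cheap: share one session so TCP/TLS connections stay warm, and issue independent requests concurrently. A sketch with httpx against a placeholder endpoint:

```python
# Sketch: reuse one HTTP client (shared connection pool, no per-request TLS
# handshake) and send independent requests concurrently, not sequentially.
import asyncio
import httpx

URL = "https://example.com/v1/tts"  # placeholder endpoint

async def synthesize(client: httpx.AsyncClient, text: str) -> bytes:
    resp = await client.post(URL, json={"text": text})
    resp.raise_for_status()
    return resp.content

async def main(texts: list[str]) -> list[bytes]:
    async with httpx.AsyncClient(timeout=30) as client:  # one shared session
        return await asyncio.gather(*(synthesize(client, t) for t in texts))

audio_clips = asyncio.run(main(["Hello!", "How can I help today?"]))
```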
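Point 19: for streaming speech, a persistent WebSocket avoids per-utterance connection setup. A sketch with the websockets library; the URL and JSON message format are placeholders:

```python
# Sketch: streaming TTS audio over one long-lived WebSocket connection.
import asyncio
import json
import websockets  # pip install websockets

async def stream_tts(text: str):
    # Placeholder URL and message format; adapt to your server's contract.
    async with websockets.connect("wss://example.com/v1/tts-ws") as ws:
        await ws.send(json.dumps({"text": text}))
        async for message in ws:        # audio arrives as binary frames
            if isinstance(message, bytes):
                yield message           # play or buffer each chunk as it lands

async def main():
    async for chunk in stream_tts("Hello from a WebSocket stream!"):
        print(f"received {len(chunk)} bytes of audio")

asyncio.run(main())
```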
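Point 21: a voice agent is a listen → think → talk loop, and latency accumulates at every hop between stages. A toy sketch wiring stub stages together with queues, so each stage streams into the next instead of waiting for the previous one to finish:

```python
# Sketch: toy listen -> think -> talk pipeline with streaming hand-offs.
# All three stage bodies are stubs standing in for STT, LLM, and TTS.
import asyncio

async def listen(audio_in: asyncio.Queue, transcripts: asyncio.Queue):
    while True:
        audio = await audio_in.get()
        await transcripts.put(f"<transcript of {len(audio)} bytes>")  # STT stub

async def think(transcripts: asyncio.Queue, sentences: asyncio.Queue):
    while True:
        _ = await transcripts.get()
        # LLM stub: emit the reply sentence by sentence so TTS starts early
        for sentence in ("Sure.", "Here is the answer."):
            await sentences.put(sentence)

async def talk(sentences: asyncio.Queue):
    while True:
        sentence = await sentences.get()
        print(f"TTS speaking: {sentence}")  # TTS stub; would stream audio out

async def main():
    audio_in, transcripts, sentences = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    for coro in (listen(audio_in, transcripts),
                 think(transcripts, sentences),
                 talk(sentences)):
        asyncio.create_task(coro)
    await audio_in.put(b"\x00" * 3200)  # pretend: ~100 ms of 16 kHz 16-bit audio
    await asyncio.sleep(0.1)            # let the pipeline drain

asyncio.run(main())
```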
Source: AI Engineer via YouTube