Optimizing Voice Models for Production: Focusing on Runtime and Infrastructure with Orpheus TTS

Optimizing Inference for Voice Models in Production: Strategies for Runtime, Infrastructure, and Client Code to Achieve Fast and Efficient Conversational AI Pipelines.

  • 1. Philip from Baseten, a model inference platform, discusses optimizing inference for voice models in production.
  • 2. Baseten runs production workloads for a range of AI-native startups and enterprises.
  • 3. Philip is based in San Francisco; his favorite part of living there, he jokes, is that the sports teams are better than Chicago's.
  • 4. Orpheus TTS, developed by Canopy Labs, is one of Philip's favorite voice models.
  • 5. The talk covers TTS model architecture, performance metrics, optimization techniques, and infrastructure.
  • 6. Generative models broadly fall into two families: autoregressive transformers (LLMs and LLM-adjacent models) and diffusion-based models of the kind used for image generation.
  • 7. Autoregressive TTS models are architecturally similar to LLMs, which means the rich ecosystem of LLM tooling can be reused to serve and improve them.
  • 8. Orpheus TTS is the running example; it has a Llama 3.2 3B backbone and open-source code from Canopy Labs (see the serving sketch after this list).
  • 9. Performance metrics for voice models differ from LLM metrics, focusing on time to first byte, time to first sentence, and throughput (a TTFB measurement sketch follows the list).
  • 10. Voice model optimization aims to minimize latency and GPU costs while maintaining real-time streaming capabilities.
  • 11. Running on TensorRT with FP8 quantization improves performance for LLMs, including smaller models like Orpheus TTS (see the FP8 sketch after this list).
  • 12. SNAC, the neural audio codec whose decoder turns Orpheus's tokens into waveform audio, is optimized with torch.compile (see the SNAC sketch after this list).
  • 13. Token-level continuous batching is not yet available for voice models, but request-level dynamic batching achieves similar results (see the batching sketch after this list).
  • 14. With these optimizations, the Orpheus TTS implementation is fast enough that it is often CPU-bound rather than GPU-bound.
  • 15. Baseten's base implementation supports 16 simultaneous streams with variable traffic, and 24 with constant traffic, on an H100 MIG (a fractional Multi-Instance GPU slice).
  • 16. Properly configured implementations can reduce time to first byte to as low as 150 milliseconds in real-world testing.
  • 17. Non-runtime factors, such as infrastructure and client code, can significantly impact overall performance.
  • 18. Common client-side mistakes, such as sending requests sequentially and opening a new connection or session per request, add avoidable latency in production (see the client sketch after this list).
  • 19. When building voice pipelines, consider orchestration frameworks such as LiveKit or Pipecat and streaming transports such as WebSockets or gRPC (a WebSocket sketch follows the list).
  • 20. A well-designed infrastructure is crucial for minimizing network overhead and meeting SLA requirements.
  • 21. Voice agent pipelines consist of listening (speech-to-text), thinking (LLM), and talking (TTS) components, and the infrastructure connecting them is vital to end-to-end latency (see the pipeline-overlap sketch after this list).
  • 22. Baseten will host an event at Fogo de Chão next week focused on building systems with open-source models.
  • 23. Philip can be reached on Twitter or LinkedIn for questions about model performance.
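
Because Orpheus's backbone is a standard Llama 3.2 3B transformer, it can be served with off-the-shelf LLM tooling. The sketch below uses vLLM; the checkpoint name and the plain-text prompt are assumptions for illustration (real Orpheus prompts wrap the text with voice and control tokens per Canopy Labs' documentation), not the exact setup from the talk.

```python
# Sketch: serving an LLM-backbone TTS model with ordinary LLM tooling (vLLM).
# Checkpoint name and prompt format are assumptions for illustration only.
from vllm import LLM, SamplingParams

llm = LLM(model="canopylabs/orpheus-3b-0.1-ft")  # Llama-3.2-3B-based TTS checkpoint
params = SamplingParams(temperature=0.6, max_tokens=1024)

# The backbone emits audio-codec tokens; a codec decoder (SNAC, sketched below)
# turns those tokens into a waveform.
outputs = llm.generate(["Hello from a TTS model served like an LLM."], params)
print(outputs[0].outputs[0].text[:80])
```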
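
Time to first byte is the headline latency metric for streaming speech: how long until the first audio bytes reach the client. A minimal way to measure it against a streaming HTTP endpoint; the URL and payload are placeholders for your deployment.

```python
# Sketch: measuring time to first byte (TTFB) on a streaming TTS endpoint.
import time
import requests

start = time.perf_counter()
resp = requests.post(
    "https://example.com/tts/stream",   # placeholder streaming endpoint
    json={"text": "Hello, world."},
    stream=True,                        # do not buffer the whole response body
    timeout=30,
)
first_chunk = next(resp.iter_content(chunk_size=4096))  # blocks until first audio bytes
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"time to first byte: {elapsed_ms:.0f} ms ({len(first_chunk)} bytes)")
```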
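
FP8 quantization halves weight memory versus FP16 and unlocks faster matmuls on Hopper-class GPUs like the H100. The sketch below shows only the core idea, per-tensor scaling into PyTorch's float8_e4m3fn dtype; a production setup would rely on TensorRT's quantization toolchain rather than hand-rolled casts.

```python
# Sketch: the core idea behind FP8 (E4M3) post-training quantization.
import torch

w = torch.randn(4096, 4096, dtype=torch.float16)

# Per-tensor scale so values fit E4M3's representable range (max magnitude ~448).
scale = w.abs().max().float() / 448.0
w_fp8 = (w.float() / scale).to(torch.float8_e4m3fn)  # 1 byte per weight instead of 2

# Dequantize to inspect the rounding error; FP8-capable GPUs run matmuls on
# w_fp8 directly, which is where the speedup comes from.
w_deq = (w_fp8.float() * scale).to(torch.float16)
print("max abs error:", (w - w_deq).abs().max().item())
```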
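
Audio decoding can become the bottleneck once the LLM side is fast, so the decoder is a natural torch.compile target. A sketch assuming the open-source SNAC package and its published 24 kHz checkpoint:

```python
# Sketch: compiling the SNAC audio decoder for faster repeated decode calls.
import torch
from snac import SNAC  # assumption: the open-source SNAC codec package

# Load the 24 kHz checkpoint that Orpheus-style models target (name per the
# public SNAC release; treat it as an assumption).
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to("cuda")

# Compile just the decoder: the first call pays compilation cost, every
# subsequent decode runs the fused kernels.
codec.decoder = torch.compile(codec.decoder)

# codes: the multi-scale codec tokens produced by the TTS backbone
# (shapes are model-specific), decoded to a 24 kHz waveform tensor:
# audio = codec.decode(codes)
```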
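
Without token-level continuous batching, request-level dynamic batching is the fallback: hold incoming requests for a few milliseconds so they can share one forward pass, trading a little latency for much better GPU utilization. A minimal asyncio sketch, where run_model stands in for a real batched inference call:

```python
# Sketch: request-level dynamic batching with asyncio.
import asyncio

MAX_BATCH, MAX_WAIT_S = 8, 0.02
queue: asyncio.Queue = asyncio.Queue()

async def batcher(run_model):
    """Collect requests until the batch fills or the wait window expires."""
    while True:
        batch = [await queue.get()]                  # block for the first request
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        results = run_model([text for text, _ in batch])  # one forward pass for all
        for (_, fut), audio in zip(batch, results):
            fut.set_result(audio)

async def synthesize(text: str):
    """Client-facing call: enqueue and await this request's slice of the batch."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut
```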
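
On the client side, the two mistakes noted in item 18 invert into two fixes: reuse one connection pool and send independent requests concurrently. A sketch with aiohttp; the endpoint URL is a placeholder.

```python
# Sketch: connection reuse plus concurrent requests on the client.
import asyncio
import aiohttp

async def synthesize_all(texts):
    # One ClientSession = one connection pool; avoids a TCP+TLS handshake per request.
    async with aiohttp.ClientSession() as session:
        async def tts(text):
            async with session.post("https://example.com/tts", json={"text": text}) as resp:
                return await resp.read()
        # Fire independent requests concurrently instead of awaiting them one by one.
        return await asyncio.gather(*(tts(t) for t in texts))

audio_clips = asyncio.run(synthesize_all(["First sentence.", "Second sentence."]))
```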
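
A long-lived WebSocket avoids per-request connection setup entirely and suits streaming audio. A client sketch using the Python websockets library; the URL, message format, and end-of-stream marker are all hypothetical:

```python
# Sketch: streaming synthesis over one long-lived WebSocket connection.
import asyncio
import websockets

def play_or_buffer(chunk: bytes) -> None:
    ...  # application-specific: feed a speaker buffer, jitter buffer, or file

async def stream_tts(text: str) -> None:
    # One connection for the whole conversation; no per-utterance setup cost.
    async with websockets.connect("wss://example.com/tts") as ws:
        await ws.send(text)                 # request synthesis
        async for chunk in ws:              # audio chunks arrive as they are generated
            if chunk == b"<eos>":           # hypothetical end-of-stream marker
                break
            play_or_buffer(chunk)

asyncio.run(stream_tts("Streaming keeps time to first byte low."))
```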
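
Finally, a sketch of why the glue between "thinking" and "talking" matters: forwarding each completed sentence to TTS while the LLM is still generating hides most of the LLM's tail latency behind playback. Here llm_token_stream and tts_stream are stand-ins for real components, not APIs from the talk.

```python
# Sketch: overlapping the "think" (LLM) and "talk" (TTS) stages of a voice agent.
import asyncio

async def speak_while_thinking(llm_token_stream, tts_stream):
    # Forward each completed sentence to TTS immediately so the agent starts
    # talking while the LLM is still generating the rest of its reply.
    sentence = ""
    async for token in llm_token_stream:
        sentence += token
        if sentence.rstrip().endswith((".", "!", "?")):  # naive sentence boundary
            await tts_stream(sentence.strip())
            sentence = ""
    if sentence.strip():
        await tts_stream(sentence.strip())               # flush any trailing text
```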

Source: AI Engineer via YouTube

❓ What do you think? Share your thoughts on the ideas from this video in the comments!