Building an Enterprise-Scale RAG Stack: Navigating the Hidden Costs and Challenges

Join Ofer, Developer Relations at Vectara, as he walks through seven pitfalls to avoid when building your own RAG stack and makes the case for a managed platform like Vectara that provides scalability, security, and accuracy out of the box.

  • 1. The speaker is Ofer, who works in developer relations at Vectara and has experience as a machine learning engineer and software engineer.
  • 2. The topic is the hidden costs of building your own RAG (retrieval-augmented generation) stack.
  • 3. RAG is a way to use large language models (LLMs) by grounding them in your own data rather than relying solely on what they learned during training.
  • 4. A RAG platform that is enterprise scale is much harder to build than most people realize.
  • 5. The RAG stack has two main flows: ingest and query.
  • 6. Ingest flow: Data is extracted from various sources, chunked into smaller pieces, encoded using an embedding model, and stored in a vector database.
  • 7. Query flow: A query is embedded into a vector and run against the vector store to retrieve the relevant facts. The retrieved chunks are then sent to the LLM with a prompt to generate a response.
  • 8. Vectara provides RAG-as-a-service, where all the components of the stack are pre-built and managed for you.
  • 9. There is a significant difference between building a DIY RAG stack and using a platform like Vectara.
  • 10. Seven pitfalls to consider when building your own RAG stack:
  • * Quality of responses/hallucinations
  • * Maintaining low latency
  • * Scaling without high costs
  • * Security and compliance
  • * Vendor chaos
  • * Unsustainable maintenance (unique skills required)
  • * Non-English language support
  • 11. Hallucinations are still a significant issue in RAG, requiring investment in parsing, chunking, hybrid search, and evaluating response quality.
  • 12. Maintaining low latency can be challenging due to multiple components that may not be well-orchestrated.
  • 13. Scaling a RAG stack requires managing costs related to GPUs, CPUs, storage, and external services for tasks like PDF parsing or table understanding.
  • 14. Security and compliance are crucial, as failures in these areas can have large costs for organizations.
  • 15. Vendor chaos can be a significant issue when using multiple components from different vendors, making it difficult to diagnose and resolve problems.
  • 16. Building and maintaining a RAG stack requires unique skills in LLMs, embedding models, hybrid search, data engineering, security, and DevOps.
  • 17. Non-English language support may not be available for all components, requiring changes that could affect quality.
  • 18. Vectara's approach is to provide a plug-and-play RAG-as-a-service platform that includes all the necessary components, which Vectara manages and maintains for you.
  • 19. Users of Vectara can access APIs for uploading data, indexing files, running queries, or building chat and other agentic RAG applications.
  • 20. Vectara focuses on accuracy, good retrieval, security mechanisms, and observability, with citations throughout the flow.
  • 21. Hallucination detection via the Hughes Hallucination Evaluation Model (HHEM) is a crucial aspect of Vectara's system, providing users with a hallucination score for every call.
  • 22. HHEM is open-sourced on Hugging Face and has over 3 million downloads, making it the most popular evaluation model for detecting hallucinations.
  • 23. Vectara built a leaderboard to help users understand how likely various LLMs are to hallucinate on specific datasets.
  • 24. Users can deploy Vectara's RAG-as-a-service in their own VPC or on-premises, as many customers require.
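The ingest and query flows described in points 6 and 7 can be sketched end to end in a few dozen lines. This is a minimal, self-contained illustration, not Vectara's implementation: the hashed bag-of-words `embed`, fixed-size `chunk`, and in-memory `VectorStore` are all toy stand-ins for a real embedding model, structure-aware chunker, and vector database.

```python
import math
import re
import zlib

def embed(text: str, dim: int = 256) -> list[float]:
    """Toy embedding: hashed bag-of-words, L2-normalized.
    A real stack would call a trained embedding model here."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(document: str, size: int = 40) -> list[str]:
    """Fixed-size word chunks; production systems chunk by
    structure (sentences, sections, tables) instead."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

class VectorStore:
    """Minimal in-memory stand-in for a vector database."""
    def __init__(self) -> None:
        self.items: list[tuple[list[float], str]] = []

    def ingest(self, document: str) -> None:
        # Ingest flow: extract -> chunk -> embed -> store.
        for piece in chunk(document):
            self.items.append((embed(piece), piece))

    def query(self, question: str, top_k: int = 2) -> list[str]:
        # Query flow: embed the query, rank chunks by dot product
        # (cosine similarity, since vectors are normalized).
        q = embed(question)
        ranked = sorted(
            self.items,
            key=lambda item: -sum(a * b for a, b in zip(q, item[0])),
        )
        return [text for _, text in ranked[:top_k]]

store = VectorStore()
store.ingest("The ingest flow chunks documents, encodes them, and stores embeddings in a vector database.")
store.ingest("Billing invoices are sent monthly to enterprise customers by email.")
facts = store.query("How does the ingest flow store embeddings?", top_k=1)
# The retrieved facts would then be sent to the LLM inside a prompt:
prompt = "Answer using only these facts:\n" + "\n".join(facts) + "\nQuestion: ..."
```

Even in this toy version, the moving parts the talk warns about are visible: chunking strategy, embedding quality, and retrieval ranking each directly affect what the LLM is asked to answer from.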
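Point 21's per-call hallucination score can act as a guardrail on the query flow. The sketch below uses a crude word-overlap scorer purely for illustration; it is not Vectara's HHEM, which is a trained classifier. The `answer_with_guardrail` helper and the 0.5 threshold are likewise hypothetical.

```python
import re

def hallucination_score(source: str, draft: str) -> float:
    """Crude proxy: fraction of draft words that appear in the source.
    HHEM is a trained model; this stand-in only shows where a
    factual-consistency score plugs into the flow."""
    source_words = set(re.findall(r"[a-z0-9]+", source.lower()))
    draft_words = re.findall(r"[a-z0-9]+", draft.lower())
    if not draft_words:
        return 0.0
    return sum(w in source_words for w in draft_words) / len(draft_words)

def answer_with_guardrail(facts: list[str], draft: str,
                          threshold: float = 0.5) -> str:
    """Return the LLM's draft answer only if it looks grounded
    in the retrieved facts; otherwise refuse."""
    score = hallucination_score(" ".join(facts), draft)
    return draft if score >= threshold else "I could not find a grounded answer."
```

The design point is that the score is computed on every call, so ungrounded responses can be suppressed or flagged before they reach the user.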

Source: AI Engineer via YouTube

❓ What do you think about the ideas shared in this video? Feel free to share your thoughts in the comments!