Title: "Breaking Down Broken RAG Evaluation: Fixing the Corpus-Query Disconnect

Join Ival and Niv from AI21 Labs as they challenge the status quo in RAG (Retrieval-Augmented Generation) evaluation, revealing the limitations of existing benchmarks and proposing solutions to overcome them.

  • 1. RAG pipelines are widely used, but current evaluation methods have significant limitations.
  • 2. Many benchmarks consist of local questions with local answers, assuming the answer can be found in a specific chunk of data.
  • 3. These benchmarks often feature manufactured and unrealistic questions that don't reflect real-world data.
  • 4. Benchmarks usually focus on either retrieval or generation, neglecting critical aspects like chunking and parsing.
  • 5. Real-world data is messy and varies between sources, which makes it difficult to generalize RAG system performance.
  • 6. A vicious cycle occurs when developers build RAG systems for flawed benchmarks, leading to high scores but poor user experience.
  • 7. The primary issue lies in the disconnect between benchmark questions and real-world data.
  • 8. To address this, it's essential to create more holistic and representative benchmarks for RAG evaluation.
  • 9. Financial and regulatory documents such as SEC filings often call for aggregative questions, i.e., questions whose answers span many documents or chunks, which standard RAG systems struggle with.
  • 10. A corpus of 22 historical FIFA World Cup pages from Wikipedia was created to study the performance of RAG systems on complex, non-local questions.
  • 11. Common RAG pipelines perform poorly on this corpus, answering only 5% to 11% of the questions correctly.
  • 12. The solution involves converting unstructured data into a structured form and then querying it directly for specific question types (e.g., counting or calculation-based).
  • 13. The ingestion phase clusters documents, identifies schemas, populates them from each document, and loads the results into a SQL database (a sketch follows this list).
  • 14. At inference time, the system identifies the schema relevant to the query and answers it with standard text-to-SQL over the structured data (a sketch follows this list).
  • 15. This approach is not a one-size-fits-all solution, as not every corpus or query can be transformed into a relational database.
  • 16. Many corpora aren't homogeneous in terms of attributes, making schema creation challenging.
  • 17. Normalization issues arise during ingestion and inference, and handling ambiguity is another challenge.
  • 18. The system needs to balance complexity, granularity, and compute investment during ingestion.
  • 19. Text-to-SQL conversion becomes complex when schemas are intricate.
  • 20. RAG systems aren't one-size-fits-all solutions; they must be tailored for specific clients.
  • 21. Existing benchmarks don't capture many real-world use cases.
  • 22. To tackle the problems, it might be necessary to go beyond standard RAG approaches in certain settings.
  • 23. AI21 Labs has a YAP podcast episode discussing structured RAG and the challenges of RAG evaluation.
  • 24. Realistic benchmarks and understanding specific use cases are crucial for improving RAG system evaluations.
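
To make the ingestion phase described above more concrete, here is a minimal sketch under stated assumptions: one structured record is extracted per document with an LLM and loaded into a SQL table. The schema (modeled on the World Cup corpus mentioned above), the prompt, and the `llm` callable are illustrative assumptions, not the speakers' actual implementation.

```python
import json
import sqlite3

# Hypothetical schema for a cluster of World Cup pages: one row per tournament.
SCHEMA = {
    "year": "INTEGER",
    "host": "TEXT",
    "winner": "TEXT",
    "runner_up": "TEXT",
    "total_attendance": "INTEGER",
}

def extract_record(document_text, llm):
    """Ask the LLM to populate the schema fields from a single document."""
    prompt = (
        "Extract the following fields from the document and return JSON only.\n"
        f"Fields: {', '.join(SCHEMA)}\n\nDocument:\n{document_text}"
    )
    return json.loads(llm(prompt))  # `llm` is assumed to return a JSON string

def ingest(documents, llm, db_path="corpus.db"):
    """Populate a SQL table with one structured row per document."""
    conn = sqlite3.connect(db_path)
    columns = ", ".join(f"{name} {sql_type}" for name, sql_type in SCHEMA.items())
    conn.execute(f"CREATE TABLE IF NOT EXISTS world_cups ({columns})")
    placeholders = ", ".join("?" for _ in SCHEMA)
    for doc in documents:
        record = extract_record(doc, llm)
        conn.execute(
            f"INSERT INTO world_cups ({', '.join(SCHEMA)}) VALUES ({placeholders})",
            [record.get(name) for name in SCHEMA],
        )
    conn.commit()
    conn.close()
```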
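And a matching sketch of the inference step described above: an LLM turns the user's question into SQL over the table built during ingestion, and the query is executed directly. The routing step that picks the relevant schema is stubbed out here; the prompt and the `llm` callable are again assumptions.

```python
import sqlite3

def answer_aggregative_question(question, llm, db_path="corpus.db"):
    """Answer counting/calculation questions by running LLM-generated SQL."""
    # A real system would first route the query to the relevant schema;
    # here we assume the single `world_cups` table from the ingestion sketch.
    table_description = (
        "Table world_cups(year INTEGER, host TEXT, winner TEXT, "
        "runner_up TEXT, total_attendance INTEGER)"
    )
    sql = llm(
        "Write one SQLite query that answers the question. Return SQL only.\n"
        f"{table_description}\nQuestion: {question}"
    )
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

# Example of an aggregative question that chunk-based retrieval tends to miss:
# answer_aggregative_question("How many World Cups has Brazil won?", llm)
```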

Source: AI Engineer via YouTube

❓ What do you think of the ideas shared in this video? Feel free to share your thoughts in the comments!