Exploring Real-World Implementation of LLM Evaluations in Complex Applications
Join Aparna, co-founder of Arize, as she demystifies real-world applications of Large Language Models (LLMs) and explores best practices for evaluating LLMs in complex workflows.
- 1. The speaker, Aparna, is a co-founder of Arize, which focuses on LLM (Large Language Model) evaluations and observability.
- 2. There are different types of evals (evaluations) in the context of LLMs: model evals and task evals.
- 3. Model evals compare and rank models on benchmarks such as MMLU (Massive Multitask Language Understanding), which can be helpful in determining the best model for specific tasks like the needle-in-a-haystack test.
- 4. Task evals assess whether an LLM application is working effectively by defining relevant evaluations to determine its success.
- 5. The talk will focus on task evals and their role in real-world applications, which are typically more complex than a simple API call and response.
- 6. In the real world, many applications involve AI judging AI, where an LLM is used to evaluate another LLM's output based on input and context.
- 7. An example of such an application is a chat-to-purchase system, common in eCommerce, which uses an LLM to determine user intent.
- 8. The key to these applications is ensuring that the LLM correctly identifies the user intent, as incorrect identification can lead to undesired outcomes.
- 9. When evaluating complex AI-based applications, it's important to have evals at various levels of the application to ensure proper functionality and identify issues.
- 10. For a router-based application using function calling, there should be an evaluation at the router level to ensure that the correct function is called based on the user input and context (see the router-eval sketch after this list).
- 11. Developers can build evals iteratively alongside their applications, starting with benchmarking, developing eval templates during the application-building phase, and monitoring in production.
- 12. Using evals with explanations has proven more helpful for real users deploying applications in production environments.
- 13. A single incorrect evaluation can be difficult to address without additional context; explanations help teams pinpoint the specific issue and determine how to fix it (a minimal judge-with-explanation sketch appears after this list).
- 14. While numeric score outputs may seem like a good idea, in practice they are often too coarse or imprecise for many use cases.
- 15. For example, a model might return a score of 10 for both an 80% corrupted document and an 11% corrupted document, offering no granularity to differentiate between the two cases.
- 16. The needle-in-a-haystack test, widely shared on Twitter, explores how the placement of a fact within the context window impacts LLM retrieval performance.
- 17. Research shows that if the fact is placed at the beginning of the context window, it can be challenging for an LLM to retrieve or remember it, especially as the context window size grows (a minimal needle-in-a-haystack sketch appears after this list).
- 18. In addition to retrieval tasks, it's important to evaluate LLMs on their generation capabilities. For example, models like Anthropic's Claude 2.1 may perform better than GPT-4 on common generation tasks.
- 19. The speaker is hosting an event called Arize:Observe on July 11th, where researchers and builders will share their insights on task and model evals.
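
To make the chat-to-purchase example in points 7–8 and the explanation-first approach in points 12–13 concrete, here is a minimal LLM-as-judge sketch. It is not the speaker's exact template: the prompt wording, the `judge_user_intent` helper, and the use of the OpenAI Python client are assumptions.

```python
# A minimal LLM-as-judge sketch (not the speaker's exact template).
# Assumes the OpenAI Python client (openai>=1.0) and OPENAI_API_KEY are available.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are evaluating a shopping assistant.

User message:
{user_message}

Intent the assistant identified:
{predicted_intent}

Did the assistant identify the user's intent correctly?
Answer on the first line with exactly "correct" or "incorrect",
then explain your reasoning on the following lines."""


def judge_user_intent(user_message: str, predicted_intent: str) -> tuple[str, str]:
    """Return a (label, explanation) pair from the judge model."""
    response = client.chat.completions.create(
        model="gpt-4",   # judge model; any capable model works
        temperature=0,   # keep the eval deterministic
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(
                user_message=user_message,
                predicted_intent=predicted_intent,
            ),
        }],
    )
    text = response.choices[0].message.content.strip()
    label, _, explanation = text.partition("\n")
    return label.strip().lower(), explanation.strip()


if __name__ == "__main__":
    label, why = judge_user_intent(
        user_message="Do you have these sneakers in size 10?",
        predicted_intent="cancel_order",
    )
    print(label)  # expected: "incorrect"
    print(why)    # the explanation is what helps teams debug (point 13)
```

Asking for a categorical label instead of a 1–10 score sidesteps the granularity problem described in points 14–15, and the explanation gives teams the context to fix individual failures.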
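
For the router-level eval in point 10, one simple form is to compare the function the router actually selected against a small set of hand-labeled examples. This is a sketch under assumptions, not the speaker's method; `route` and the `GOLDEN_ROUTES` examples are hypothetical stand-ins for the application's own function-calling router and labeled data.

```python
# A minimal router-eval sketch: check that the router picks the expected
# function for a set of hand-labeled user messages. The `route` callable is
# a placeholder for the application's own function-calling router.
from typing import Callable

# Hand-labeled examples: user message -> function the router should call.
GOLDEN_ROUTES = [
    ("Where is my order #1234?", "track_order"),
    ("I want to return these shoes", "start_return"),
    ("Do you have this jacket in medium?", "search_inventory"),
]


def eval_router(route: Callable[[str], str]) -> float:
    """Return the fraction of labeled examples routed to the correct function."""
    correct = 0
    for message, expected_function in GOLDEN_ROUTES:
        chosen = route(message)
        if chosen == expected_function:
            correct += 1
        else:
            # Surfacing mismatches is what makes router-level evals useful.
            print(f"MISMATCH: {message!r} -> {chosen} (expected {expected_function})")
    return correct / len(GOLDEN_ROUTES)


if __name__ == "__main__":
    # Trivial stand-in router for demonstration; a real application would
    # call its LLM-backed function-calling router here.
    def toy_router(message: str) -> str:
        return "track_order" if "order" in message.lower() else "search_inventory"

    print(f"router accuracy: {eval_router(toy_router):.2f}")
```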
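
Points 16–17 describe the needle-in-a-haystack test. The sketch below shows only the basic shape of such a test, not the published implementation: the filler text, the planted fact, and the model call are placeholder assumptions, and a real run would sweep many context lengths and needle depths.

```python
# A minimal needle-in-a-haystack sketch: plant a fact ("needle") at different
# depths of a long filler context and check whether the model can retrieve it.
# Assumes the OpenAI Python client; swap in whatever model you want to test.
from openai import OpenAI

client = OpenAI()

NEEDLE = "The secret passphrase is 'blue-elephant-42'."
QUESTION = "What is the secret passphrase?"
FILLER_SENTENCE = "The quick brown fox jumps over the lazy dog. "


def build_context(needle: str, depth: float, n_sentences: int = 400) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER_SENTENCE] * n_sentences
    sentences.insert(int(depth * n_sentences), needle + " ")
    return "".join(sentences)


def ask_model(context: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"{context}\n\nUsing only the text above, answer: {question}",
        }],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    for depth in (0.0, 0.5, 1.0):  # beginning, middle, end of the context
        answer = ask_model(build_context(NEEDLE, depth), QUESTION)
        retrieved = "blue-elephant-42" in answer
        print(f"needle at depth {depth:.1f}: retrieved={retrieved}")
```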
Source: AI Engineer via YouTube
❓ What do you think? Share your thoughts on the ideas in this video in the comments!