Building an AI Agent for Real Estate: A Systematic Approach to Evaluation Framework Development
Join Emil and Hamel as they share their journey of building an AI-powered real estate agent and reveal the evaluation framework that enabled them to rapidly improve their language model's performance.
- 1. Emil is the CTO at Rechat; he and his partner Hamel discuss the product they built.
- 2. The product is designed for real estate agents and brokers, with features like contact management, email marketing, and social marketing.
- 3. They realized they were sitting on many internal APIs and a lot of data, which led them to build an AI agent for real estate agents.
- 4. They started with a prototype using GPT-3.5 and the ReAct prompting framework, which was slow and error-prone but felt magical when it worked.
- 5. Transitioning to production, they struggled to measure improvements and to avoid breaking existing use cases when making changes.
- 6. Hamel outlines a systematic approach to improving AI consistently, including common traps to avoid and resources for learning more.
- 7. A key part of the approach is creating an evaluation framework, built from unit tests and assertions based on failure modes observed in real data.
- 8. Rechat wrote simple unit tests and assertions, such as checking that agents complete their tasks, emails actually get sent, and invalid placeholders never appear in output (a sketch follows this list).
- 9. Logging results to a database and using existing tools like Metabase for visualization and progress tracking is essential (see the logging sketch below the list).
- 10. Human review of logged traces is also important; building custom data-viewing and annotation tools helps remove friction from looking at data (a minimal review loop is sketched below).
- 11. When starting out, synthetic data generation with language models can bootstrap test cases before real users arrive (see the generation sketch after the list).
- 12. Prompt engineering is a good first lever for improving an AI system: it helps debug issues and delivers quick, satisfying progress.
- 13. An evaluation framework lets you filter passing cases for human review, curate data for fine-tuning, and continuously refresh that fine-tuning data (an export sketch follows the list).
- 14. Using an LLM as a judge is valuable when not everything can be expressed as an assertion or unit test.
- 15. Aligning the LLM judge with a human reviewer is crucial for making the judge reliable and trustworthy (an agreement check is sketched after the list).
- 16. Common mistakes include not looking at your data, focusing on tools instead of process, using generic off-the-shelf evals, and reaching for an LLM judge too early.
- 17. With this evaluation framework, Rechat rapidly increased the success rate of their AI application.
- 18. Emil emphasizes that few-shot prompting will not replace fine-tuning in every case, especially when complex commands or a mix of structured and unstructured output is required.
- 19. Incorporating user feedback and executing complex commands were challenges that led Rechat to adopt fine-tuning.
- 20. Creating a comprehensive eval framework was essential for building an application that could handle complex tasks in minutes instead of hours.
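To make the ideas above concrete, here is a minimal sketch of the kind of unit tests and assertions described in points 7-8. The trace schema, placeholder syntax, and whitelist are illustrative assumptions, not Rechat's actual code:

```python
import re

# Assumed trace shape: {"output": <final text>, "steps": [{"tool": ...}, ...]}.
VALID_PLACEHOLDERS = {"contact_first_name", "agent_name"}  # hypothetical whitelist

def no_invalid_placeholders(trace: dict) -> bool:
    """The model must not invent placeholders the renderer cannot fill."""
    found = set(re.findall(r"\{\{(\w+)\}\}", trace["output"]))
    return found <= VALID_PLACEHOLDERS

def email_was_sent(trace: dict) -> bool:
    """For email tasks, the send_email tool must actually have been called."""
    return any(step["tool"] == "send_email" for step in trace.get("steps", []))

def run_assertions(trace: dict) -> dict:
    """Run every assertion and return pass/fail results for logging."""
    checks = [no_invalid_placeholders, email_was_sent]
    return {check.__name__: check(trace) for check in checks}
```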
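Point 9 recommends logging results to a database so a BI tool like Metabase can chart pass rates over time. A sketch using SQLite, with an assumed table layout:

```python
import sqlite3
from datetime import datetime, timezone

def log_results(db_path: str, trace_id: str, results: dict) -> None:
    """Append one row per assertion so pass rates can be charted over time."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS eval_results (
               trace_id TEXT, assertion TEXT, passed INTEGER, ts TEXT)"""
    )
    ts = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO eval_results VALUES (?, ?, ?, ?)",
        [(trace_id, name, int(ok), ts) for name, ok in results.items()],
    )
    conn.commit()
    conn.close()
```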
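For the human review in point 10, Rechat built custom viewing and annotation tools; the essence is simply removing friction. A bare-bones terminal version, again assuming the trace shape above:

```python
def review_loop(traces: list[dict]) -> list[dict]:
    """Show each trace and record a one-keystroke human verdict."""
    labels = []
    for trace in traces:
        print("\n--- PROMPT ---\n" + trace["prompt"])
        print("--- OUTPUT ---\n" + trace["output"])
        verdict = input("good (g) / bad (b) / skip (anything else)? ").strip().lower()
        if verdict in ("g", "b"):
            labels.append({"trace_id": trace["trace_id"], "good": verdict == "g"})
    return labels
```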
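Point 11's synthetic data generation can be as simple as asking a strong model to role-play your users. A sketch assuming the openai Python client; the model name, feature description, and prompt wording are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthesize_cases(feature: str, n: int = 5) -> list[str]:
    """Ask a model to invent realistic user requests for one feature."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} realistic requests a real estate agent might "
                f"type into an assistant to: {feature}. One per line."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().splitlines()
```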
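Point 13's curation step falls out of the same database: keep only traces where every assertion passed and export them as fine-tuning data. A sketch that assumes a `traces` table alongside the `eval_results` table above, writing OpenAI-style chat JSONL:

```python
import json
import sqlite3

def export_passing_traces(db_path: str, out_path: str) -> None:
    """Write traces with no failed assertions as a chat-format JSONL dataset."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """SELECT t.prompt, t.output FROM traces t
           WHERE NOT EXISTS (
               SELECT 1 FROM eval_results r
               WHERE r.trace_id = t.trace_id AND r.passed = 0)"""
    ).fetchall()
    with open(out_path, "w") as f:
        for prompt, output in rows:
            f.write(json.dumps({"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": output},
            ]}) + "\n")
    conn.close()
```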
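Finally, points 14-15: before trusting an LLM judge, measure how often it agrees with a human on the same sample of traces. One simple way to sketch that check, assuming binary good/bad labels:

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of traces where the LLM judge matches the human verdict."""
    assert len(human) == len(judge), "label lists must cover the same traces"
    return sum(h == j for h, j in zip(human, judge)) / len(human)

# e.g. 3 of 4 verdicts match -> 0.75; iterate on the judge prompt using
# the disagreements until agreement is high enough to trust at scale.
print(judge_agreement([True, True, False, True], [True, False, False, True]))
```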
Source: AI Engineer via YouTube
❓ What do you think? What are some common pitfalls or oversights that organizations making significant investments in AI-powered tools may overlook, potentially leading to limited returns on their investment? Feel free to share your thoughts in the comments!