Building an AI Agent for Real Estate: A Systematic Approach to Evaluation Framework Development
Join Emil and Hamel as they share their journey of building an AI-powered real estate agent and reveal the evaluation framework that enabled them to rapidly improve their language model's performance.
- 1. Emil is the CTO at Rechat; he and his partner Hamel discuss the product they built.
- 2. The product is designed for real estate agents and brokers, with features like contact management, email marketing, and social marketing.
- 3. They realized they were sitting on many internal APIs and a lot of data, which led them to build an AI agent for real estate agents.
- 4. They started with a prototype using GPT-3.5 and the ReAct prompting framework, which was slow and error-prone but felt magical when it worked.
- 5. Transitioning to production, they struggled to measure improvements and to avoid breaking existing use cases when making changes.
- 6. Hamel outlines a systematic approach to improving AI consistently, including common traps to avoid and resources for learning more.
- 7. A key part of the approach is creating an evaluation framework, built from unit tests and assertions based on failure modes observed in real data.
- 8. Rechat wrote simple unit tests and assertions, such as checking that agents complete their tasks, emails actually get sent, and invalid placeholders never appear in output (a sketch follows this list).
- 9. Logging results to a database and using existing tools like Metabase for visualization and progress tracking is essential (see the logging sketch below the list).
- 10. Human review of logged traces is also important; building custom data-viewing and annotation tools helps remove friction from looking at data (a minimal review loop is sketched below).
- 11. When starting out, synthetic data generation with language models can bootstrap test cases before real users arrive (see the generation sketch after the list).
- 12. Prompt engineering is a good first lever for improving an AI system: it helps debug issues and delivers quick, satisfying progress.
- 13. An evaluation framework lets you filter passing cases for human review, curate data for fine-tuning, and continuously refresh that fine-tuning data (an export sketch follows the list).
- 14. Using an LLM as a judge is valuable when not everything can be expressed as an assertion or unit test.
- 15. Aligning the LLM judge with a human reviewer is crucial for making the judge reliable and trustworthy (an agreement check is sketched after the list).
- 16. Common mistakes include not looking at your data, focusing on tools instead of process, using generic off-the-shelf evals, and reaching for an LLM judge too early.
- 17. With this evaluation framework, Rechat rapidly increased the success rate of their AI application.
- 18. Emil emphasizes that few-shot prompting will not replace fine-tuning in every case, especially when complex commands or a mix of structured and unstructured output is required.
- 19. Incorporating user feedback and executing complex commands were challenges that led Rechat to adopt fine-tuning.
- 20. Creating a comprehensive eval framework was essential for building an application that could handle complex tasks in minutes instead of hours.
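To make the ideas above concrete, here is a minimal sketch of the kind of unit tests and assertions described in points 7-8. The trace schema, placeholder syntax, and whitelist are illustrative assumptions, not Rechat's actual code:

```python
import re

# Assumed trace shape: {"output": <final text>, "steps": [{"tool": ...}, ...]}.
VALID_PLACEHOLDERS = {"contact_first_name", "agent_name"}  # hypothetical whitelist

def no_invalid_placeholders(trace: dict) -> bool:
    """The model must not invent placeholders the renderer cannot fill."""
    found = set(re.findall(r"\{\{(\w+)\}\}", trace["output"]))
    return found <= VALID_PLACEHOLDERS

def email_was_sent(trace: dict) -> bool:
    """For email tasks, the send_email tool must actually have been called."""
    return any(step["tool"] == "send_email" for step in trace.get("steps", []))

def run_assertions(trace: dict) -> dict:
    """Run every assertion and return pass/fail results for logging."""
    checks = [no_invalid_placeholders, email_was_sent]
    return {check.__name__: check(trace) for check in checks}
```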
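Point 9 recommends logging results to a database so a BI tool like Metabase can chart pass rates over time. A sketch using SQLite, with an assumed table layout:

```python
import sqlite3
from datetime import datetime, timezone

def log_results(db_path: str, trace_id: str, results: dict) -> None:
    """Append one row per assertion so pass rates can be charted over time."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS eval_results (
               trace_id TEXT, assertion TEXT, passed INTEGER, ts TEXT)"""
    )
    ts = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO eval_results VALUES (?, ?, ?, ?)",
        [(trace_id, name, int(ok), ts) for name, ok in results.items()],
    )
    conn.commit()
    conn.close()
```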
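For the human review in point 10, Rechat built custom viewing and annotation tools; the essence is simply removing friction. A bare-bones terminal version, again assuming the trace shape above:

```python
def review_loop(traces: list[dict]) -> list[dict]:
    """Show each trace and record a one-keystroke human verdict."""
    labels = []
    for trace in traces:
        print("\n--- PROMPT ---\n" + trace["prompt"])
        print("--- OUTPUT ---\n" + trace["output"])
        verdict = input("good (g) / bad (b) / skip (anything else)? ").strip().lower()
        if verdict in ("g", "b"):
            labels.append({"trace_id": trace["trace_id"], "good": verdict == "g"})
    return labels
```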
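Point 11's synthetic data generation can be as simple as asking a strong model to role-play your users. A sketch assuming the openai Python client; the model name, feature description, and prompt wording are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthesize_cases(feature: str, n: int = 5) -> list[str]:
    """Ask a model to invent realistic user requests for one feature."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} realistic requests a real estate agent might "
                f"type into an assistant to: {feature}. One per line."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().splitlines()
```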
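Point 13's curation step falls out of the same database: keep only traces where every assertion passed and export them as fine-tuning data. A sketch that assumes a `traces` table alongside the `eval_results` table above, writing OpenAI-style chat JSONL:

```python
import json
import sqlite3

def export_passing_traces(db_path: str, out_path: str) -> None:
    """Write traces with no failed assertions as a chat-format JSONL dataset."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """SELECT t.prompt, t.output FROM traces t
           WHERE NOT EXISTS (
               SELECT 1 FROM eval_results r
               WHERE r.trace_id = t.trace_id AND r.passed = 0)"""
    ).fetchall()
    with open(out_path, "w") as f:
        for prompt, output in rows:
            f.write(json.dumps({"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": output},
            ]}) + "\n")
    conn.close()
```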
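Finally, points 14-15: before trusting an LLM judge, measure how often it agrees with a human on the same sample of traces. One simple way to sketch that check, assuming binary good/bad labels:

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of traces where the LLM judge matches the human verdict."""
    assert len(human) == len(judge), "label lists must cover the same traces"
    return sum(h == j for h, j in zip(human, judge)) / len(human)

# e.g. 3 of 4 verdicts match -> 0.75; iterate on the judge prompt using
# the disagreements until agreement is high enough to trust at scale.
print(judge_agreement([True, True, False, True], [True, False, False, True]))
```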
Source: AI Engineer via YouTube
❓ What do you think? What are some common pitfalls or oversights that organizations making significant investments in AI-powered tools may overlook, potentially leading to limited returns on their investment? Feel free to share your thoughts in the comments!