Improving Chatbot Performance Without Traditional Prompt Engineering: An Auto-Improving Agent Approach
Join me as I challenge the notion that prompt engineering is dead and share my journey of improving a simple chatbot without doing any prompt engineering myself - instead, leveraging evaluators and agents to optimize the prompts for better results.
- 1. Prompt engineering as a concept is questioned: it often amounts to coaxing a language model into behaving nicely rather than true engineering.
- 2. Story of improving a simple chatbot for a company website without prompt engineering.
- 3. The chatbot's task was to answer questions related to the company's documentation.
- 4. Desired improvements included relevance, usefulness, and reduced mistakes in the answers given.
- 5. The idea of creating an auto-improving agent to research and apply latest prompt engineering techniques was considered.
- 6. To build this agent, an evaluator was needed to assess the chatbot's performance.
- 7. A dataset of questions related to the documentation was created for evaluation purposes (a small example dataset is sketched after this list).
- 8. The evaluator compares the chatbot's responses against expected answers, or uses a model as a judge to score them.
- 9. There are different types of evaluators, such as LLM-based judges and classic NLP metrics.
- 10. Evaluators can assess individual components (e.g., vector database) or the complete execution of the RAG pipeline.
- 11. An example is given of an LLM-based judge that checks each generated answer against provided facts, returning a boolean verdict and a reason for any failure (see the evaluator sketch after this list).
- 12. A score is calculated based on the number of correct facts across all examples evaluated.
- 13. The agent's role is to iteratively improve the prompt used in the RAG pipeline by combining the judge's failure reasons with prompting guides (see the optimization-loop sketch after this list).
- 14. This process resembles classic machine learning training with some modifications.
- 15. An example of using an AI researcher agent to optimize prompts is provided, demonstrating significant improvement in the chatbot's performance.
- 16. The importance of avoiding overfitting when using a limited number of examples is highlighted.
- 17. Ideally, more examples should be used and split into training and testing sets for robust evaluation.
- 18. The presenter acknowledges that they still engaged in prompt engineering to create the agent for optimizing prompts.
- 19. The project is available as a demo on the company's GitHub repository (traceloop/autoprompting-demo).
- 20. Attendees are encouraged to try out the demo, ask questions, and book time with the presenter if needed.
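To make the steps above concrete, here are a few illustrative Python sketches. They are not the presenter's actual code; the data, model names, prompts, and helper functions are assumptions. First, the evaluation dataset from point 7 can be as simple as a list of questions, each paired with the facts a correct answer should contain:

```python
# Illustrative evaluation dataset (point 7): questions about the company's documentation,
# each paired with the facts a correct answer should mention. All entries are made up.
dataset = [
    {
        "question": "How do I install the SDK?",
        "facts": "The SDK is installed with pip and requires Python 3.9 or later.",
    },
    {
        "question": "Which databases are supported?",
        "facts": "Postgres and MySQL are supported; SQLite works for local development.",
    },
]
```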
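Next, a minimal sketch of the LLM-as-judge evaluator from points 11-12: it asks a model whether the generated answer is supported by the provided facts, collects a boolean verdict plus a failure reason, and turns the results into a score. The OpenAI client, the gpt-4o-mini model name, the prompt wording, and the `judge_answer`/`evaluate` helpers are assumptions, not the demo's actual implementation.

```python
# Minimal LLM-as-judge evaluator sketch (points 11-12). Assumes OPENAI_API_KEY is set;
# the model name and judge prompt are illustrative, not the talk's exact code.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a chatbot answer against known facts.
Facts: {facts}
Answer: {answer}
Reply with JSON: {{"correct": true or false, "reason": "<why it failed, or empty>"}}"""

def judge_answer(answer: str, facts: str) -> dict:
    """Ask an LLM whether the generated answer is supported by the provided facts."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model would do
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(facts=facts, answer=answer)}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def evaluate(examples, rag_pipeline) -> tuple[float, list[str]]:
    """Run the RAG pipeline over the eval set; return accuracy and the judge's failure reasons."""
    failures, correct = [], 0
    for example in examples:  # each example: {"question": ..., "facts": ...}
        answer = rag_pipeline(example["question"])
        verdict = judge_answer(answer, example["facts"])
        if verdict["correct"]:
            correct += 1
        else:
            failures.append(verdict["reason"])
    return correct / len(examples), failures
```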
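Finally, a sketch of the auto-improving loop from points 13-17, reusing `client` and `evaluate` from the evaluator sketch: an agent feeds the failure reasons and a prompting guide to a model, asks it to rewrite the RAG prompt, keeps the best-scoring version on a training split, and reports a held-out test score to guard against overfitting. The `rewrite_prompt` and `optimize` helpers, the `make_pipeline` factory, and the guide text are all assumptions for illustration.

```python
# Sketch of the auto-improving agent (points 13-17). Reuses `client` and `evaluate`
# from the evaluator sketch above; names and loop structure are illustrative assumptions.
PROMPTING_GUIDE = "Be specific, cite the documentation, and say 'I don't know' when unsure."  # stand-in text

def rewrite_prompt(current_prompt: str, failures: list[str]) -> str:
    """Ask a model to propose an improved RAG prompt, given why the last one failed."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": (
                "Improve this system prompt for a documentation chatbot.\n\n"
                f"Current prompt:\n{current_prompt}\n\n"
                "Observed failures:\n- " + "\n- ".join(failures) + "\n\n"
                f"Prompting guidelines to apply:\n{PROMPTING_GUIDE}\n\n"
                "Return only the improved prompt."
            ),
        }],
    )
    return resp.choices[0].message.content

def optimize(initial_prompt, train_set, test_set, make_pipeline, iterations=5):
    """Training-style loop (point 14): evaluate on the train split, rewrite the prompt,
    keep the best version, then report a held-out test score (points 16-17)."""
    best_prompt, best_score = initial_prompt, 0.0
    prompt = initial_prompt
    for _ in range(iterations):
        score, failures = evaluate(train_set, make_pipeline(prompt))
        if score > best_score:
            best_prompt, best_score = prompt, score
        if not failures:
            break  # every answer passed; nothing left to fix
        prompt = rewrite_prompt(prompt, failures)
    test_score, _ = evaluate(test_set, make_pipeline(best_prompt))
    return best_prompt, best_score, test_score
```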
Source: AI Engineer via YouTube
❓ What do you think? What are the unintended consequences of relying on AI-generated prompts, and how can we ensure that they align with our original goals and values? Feel free to share your thoughts in the comments!