Building AI Agents: Collecting Actionable Feedback & Understanding User Interaction Patterns

Unlocking the power of AI agents: How to instrument your code, collect actionable feedback, and build evals that drive user satisfaction - lessons from Zapier Agents' journey.

  • 1. Zapier is software for automating business processes, with "agents" as an alternative to Zaps.
  • 2. Building good AI agents is challenging, especially when considering the unpredictability of users.
  • 3. A key lesson learned is that building probabilistic software requires more effort than building traditional software.
  • 4. Initial prototypes are only the beginning; after deployment, understanding usage patterns and failures is crucial for improvement.
  • 5. To collect actionable feedback:
    • a. Instrument your code to record tool calls, errors, pre-/post-processing steps, etc. (see the instrumentation sketch after this list).
    • b. Make runs repeatable for evaluation purposes.
    • c. Use logs to convert production data into evaluable runs.
  • 6. Explicit user feedback is high signal but rare; ask for it in context (e.g., after an agent finishes running).
  • 7. Implicit feedback can be mined from user interactions, such as turning on an agent or copying model responses.
  • 8. Look for implicit signals in conversations, like users asking the agent to stop or rephrasing their questions.
  • 9. Consider using a language model to detect and group frustrations for further analysis (see the frustration-detection sketch after this list).
  • 10. Analyze traditional user metrics for more implicit signal.
  • 11. To effectively analyze data:
    • a. Use LLM-ops software to understand agent runs.
    • b. Turn every notable interaction or failure into an evaluable run.
    • c. Aggregate and cluster feedback, and bucket interactions.
    • d. Identify problematic tools and interactions to improve.
  • 12. Reasoning models can help explain failures by finding root causes from trace outputs, inputs, etc.
  • 13. Create a hierarchy of evaluations (unit-test evals, end-to-end trajectory evals, A/B testing) based on observed failure modes.
  • 14. Unit-test evals predict the n+1 state from the current state and are best suited to specific failure modes (see the unit-test eval sketch after this list).
  • 15. Over-indexing on unit-test evals can make it hard to see the forest for the trees when benchmarking new models.
  • 16. Trajectory evals grade an agent's run through to its end state, considering all tool calls and artifacts generated.
  • 17. An LLM used as a judge can compare evaluation results, but make sure it judges correctly and avoids introducing subtle biases such as position bias (see the LLM-as-judge sketch after this list).
  • 18. Rubric-based scoring with LLMs can provide a high-level overview of system capabilities when benchmarking new models.
  • 19. Don't obsess over metrics; remember that achieving a high score doesn't necessarily mean good performance.
  • 20. Divide the dataset into a regression pool and an aspirational pool, so existing use cases don't break while the model is still challenged (see the dataset-split sketch after this list).
  • 21. The ultimate goal is user satisfaction, making A/B testing the ultimate verification method.
  • 22. Monitor feedback, activation, user retention, and other metrics when conducting an A/B test.
  • 23. Focus on optimizing for user satisfaction rather than achieving high evaluation scores.
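
Below is a minimal sketch of the kind of instrumentation point 5a describes: wrap each tool call, capture its arguments, output, errors, and timing, and serialize the whole run so it can later be replayed as an evaluable case. All names (`AgentTrace`, `ToolCallRecord`, `log_tool_call`) are illustrative assumptions, not Zapier's actual implementation.

```python
# Minimal instrumentation sketch: record each tool call so an agent run can
# later be replayed and turned into an evaluable case. Names are hypothetical.
import json
import time
import traceback
from dataclasses import dataclass, field, asdict

@dataclass
class ToolCallRecord:
    tool_name: str
    arguments: dict
    output: str | None = None
    error: str | None = None
    duration_s: float = 0.0

@dataclass
class AgentTrace:
    run_id: str
    user_message: str
    records: list = field(default_factory=list)

    def log_tool_call(self, tool_name: str, arguments: dict, fn):
        """Run a tool, capture its output or error, and keep the record."""
        start = time.monotonic()
        record = ToolCallRecord(tool_name=tool_name, arguments=arguments)
        try:
            record.output = str(fn(**arguments))
        except Exception:
            record.error = traceback.format_exc()
        record.duration_s = time.monotonic() - start
        self.records.append(record)
        return record

    def to_json(self) -> str:
        """Serialize the trace so it can become an evaluable run later."""
        return json.dumps(asdict(self), default=str)
```

Persisting these traces alongside your logs is what makes point 5c possible: any production run can be pulled back out and re-run as an eval.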
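
For point 9, a language model can label conversation snippets for frustration signals, and the labels can then be grouped into clusters worth fixing first. The sketch below assumes the OpenAI chat-completions API; the prompt, label schema, and model name are placeholders.

```python
# Frustration-detection sketch: ask an LLM to flag and label user frustration
# in conversation logs. Prompt, labels, and model name are assumptions.
import json
from openai import OpenAI

client = OpenAI()

FRUSTRATION_PROMPT = """You will see a user/agent conversation snippet.
Decide whether the user shows frustration (e.g., telling the agent to stop,
rephrasing the same request, complaining). Reply with JSON:
{"frustrated": true/false, "category": "<short label>", "evidence": "<quote>"}"""

def detect_frustration(conversation_snippet: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": FRUSTRATION_PROMPT},
            {"role": "user", "content": conversation_snippet},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Group results by "category" to see which frustrations are most common.
```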
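
A unit-test eval in the sense of point 14 checks the single next step the agent proposes from a given state. Everything in the sketch below is hypothetical: `run_agent_step`, the action format, and the `gmail_search` tool name stand in for whatever your agent framework exposes.

```python
# Unit-test eval sketch: given the current state, assert the agent's proposed
# n+1 step. The wiring and tool names are placeholders.

def run_agent_step(messages: list[dict]) -> dict:
    """Return the agent's next action, e.g. {"tool": "...", "arguments": {...}}."""
    raise NotImplementedError  # wire this to your agent's single-step planner

def test_searches_inbox_before_drafting_reply():
    state = [
        {"role": "user", "content": "Reply to the latest email from my boss."},
    ]
    next_action = run_agent_step(state)
    # The expected next step is to find the email, not to draft a reply blindly.
    assert next_action["tool"] == "gmail_search"
    assert "boss" in str(next_action["arguments"]).lower()
```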
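
Points 17 and 18 combine naturally: a rubric-scoring judge gives absolute scores, and a pairwise comparison run in both orders helps catch position bias. The rubric text, prompts, and judge model below are illustrative assumptions, not the talk's exact setup.

```python
# LLM-as-judge sketch: rubric-based scoring plus a position-swapped pairwise
# comparison. Rubric and model name are placeholders.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the agent run from 1-5 on each criterion and return JSON only:
{"followed_instructions": n, "correct_tool_usage": n, "final_answer_quality": n}"""

def rubric_score(run_transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": run_transcript},
        ],
    )
    return response.choices[0].message.content

def _pick_better(first: str, second: str) -> str:
    prompt = ("Which agent run better satisfies the user? Answer exactly 'A' or 'B'.\n\n"
              f"RUN A:\n{first}\n\nRUN B:\n{second}")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def compare(run_a: str, run_b: str) -> str:
    """Judge in both orders; disagreement between orders suggests position bias."""
    verdict_ab = _pick_better(run_a, run_b)  # "A" here means run_a
    verdict_ba = _pick_better(run_b, run_a)  # "A" here means run_b
    if verdict_ab == "A" and verdict_ba == "B":
        return "run_a"
    if verdict_ab == "B" and verdict_ba == "A":
        return "run_b"
    return "tie_or_position_bias"
```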
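
For point 20, one simple way to operationalize the dataset split is to tag every eval case with a pool and gate releases only on the regression pool, while the aspirational pool is tracked as a stretch target. `run_eval` and the case format below are hypothetical.

```python
# Dataset-split sketch: regressions must stay green; aspirational cases are
# tracked but allowed to fail for now. Case format and run_eval are placeholders.

def run_eval(case: dict) -> bool:
    """Return True if the agent passes this eval case."""
    raise NotImplementedError

def evaluate(cases: list[dict]) -> None:
    regressions = [c for c in cases if c["pool"] == "regression"]
    aspirational = [c for c in cases if c["pool"] == "aspirational"]

    failed = [c["name"] for c in regressions if not run_eval(c)]
    assert not failed, f"Regression pool must stay at 100%: {failed}"

    passed = sum(run_eval(c) for c in aspirational)
    print(f"Aspirational pool: {passed}/{len(aspirational)} passing")
```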

Source: AI Engineer via YouTube

❓ What do you think? What is one thing you would change about your current approach to building AI agents, given the insights and challenges discussed in this video? Feel free to share your thoughts in the comments!