Stabilizing Agent Swarms: A Guide to Model Context Protocol (MCP) Evaluations
As CEO of Root Signals, Ari Heljakka explores how the Model Context Protocol (MCP) can stabilize AI-based agents, workflows, and swarms, enabling more robust evaluations and self-improvement.
- 1. Ari Heljakka is the CEO of Root Signals, a platform for measuring, controlling, and optimizing LLM (large language model) based agents, chatbots, and workflows.
- 2. The question at hand is whether agent evaluation over MCP (Model Context Protocol) could stabilize agents and agent networks/swarms.
- 3. Stable agent swarms could in theory solve any knowledge-work problem, but in practice they tend to become unstable on complex problems: their behavior is hard to observe and cannot be tested comprehensively before deployment.
- 4. The solution involves evaluation, or 'evals', but not merely bolting on an eval stack; systematic use and continuous development of evaluations are what drive improvement.
- 5. Evaluating all aspects of agent behavior and internal representations is complex, covering both how agents represent reality (agent models, discussions, grounding) and behavioral aspects (goal inference, among others).
- 6. To get started, a clear framework for setting up evaluators is needed; the example given was a hotel reservation agent with evaluators such as policy adherence, accuracy of outputs, and appropriateness of responses (a minimal sketch of such an evaluator stack appears after this list).
- 7. Root Signals has implemented this evaluator setup, providing the visibility needed to create, maintain, and systematically improve large stacks of evaluators over time.
- 8. The goal is to create a stabilization loop: the agent's output is scored by an evaluation engine, which returns a numeric score plus an explanation indicating success or what to improve (see the loop sketch after this list).
- 9. Agents can use this information to improve their own performance; the Model Context Protocol (MCP) is the latest method for attaching agents to such an evaluation engine.
- 10. MCP allows users to measure and improve specific text or output from an agent, using judges (collections of evaluators) and the universal evaluators available on the server.
- 11. The MCP server exposes these evaluators to clients such as Cursor, enabling users to optimize, for example, a marketing message by specifying which judge should evaluate it.
- 12. For the marketing message example, the system found issues with the text, tried to improve it, and displayed final scores for persuasiveness, quality of writing, and engagingness.
- 13. The agent can either explicitly choose evaluators and judges or let them be picked automatically; this was demonstrated with a simple hotel reservation agent built on Pydantic AI (see the connection sketch after this list).
- 14. Without MCP, the reservation agent politely mentions another hotel (Akma), which is not desired; with MCP, the agent no longer mentions it because the policy-adherence evaluator flags the violation.
- 15. The MCP server exposes its list of evaluators and handles calls back and forth, enabling agents to improve their own behavior by invoking the relevant evaluators from that list.
- 16. Users should ensure their evaluation platform is powerful enough to support diverse evaluators, their lifecycle maintenance, optimization, and scaling to complex use cases.
- 17. Running MCP manually offline before attaching it to agents helps users understand how it works and gain transparency.
- 18. Attaching the evaluation engine through MCP makes the process more controllable, transparent, and dynamically self-improving/self-correcting.
- 19. The Root Signals MCP server is available for free; other similar platforms are likely to appear, enabling various evaluation frameworks to be implemented as part of an agent stack.
- 20. The presentation emphasizes the importance of agents, evaluations, and the symbiotic relationship between them.
- 21. The use of MCP can help in controlling, optimizing, and improving agent behavior while ensuring consistent performance.
- 22. A stable agent network or swarm can significantly contribute to solving complex knowledge work problems by enabling better decision-making and task completion.
- 23. Root Signals aims to simplify the process of evaluating agents using MCP and related tools, empowering developers to create more efficient and effective AI systems.
- 24. The future of agent evaluation lies in platforms that provide dynamic self-improvement, self-correction, and seamless integration with various agent stacks.
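To make item 6 concrete, here is a minimal sketch of an evaluator stack for the hotel reservation agent. Everything in it is an illustrative assumption, not Root Signals' actual API: the EvalResult shape, the evaluator classes, and the competitor list stand in for real, typically LLM-backed evaluators.

```python
# Illustrative sketch only: the EvalResult shape and the concrete
# evaluators below are hypothetical, not the Root Signals API.
from dataclasses import dataclass

@dataclass
class EvalResult:
    score: float        # 0.0 (fail) .. 1.0 (perfect)
    explanation: str    # why this score was given

class PolicyAdherence:
    """Checks that a reply never promotes competing hotels."""
    FORBIDDEN = ["akma"]  # competitor names from the hotel's policy (assumed)

    def evaluate(self, reply: str) -> EvalResult:
        for name in self.FORBIDDEN:
            if name in reply.lower():
                return EvalResult(0.0, f"Reply mentions competitor '{name}'.")
        return EvalResult(1.0, "No policy violations found.")

class OutputAccuracy:
    """Checks a reply against known booking facts (dates, prices, ...)."""
    def __init__(self, facts: dict[str, str]):
        self.facts = facts

    def evaluate(self, reply: str) -> EvalResult:
        missing = [v for v in self.facts.values() if v not in reply]
        if missing:
            return EvalResult(0.5, f"Reply omits expected facts: {missing}")
        return EvalResult(1.0, "All expected facts present.")

# A "stack" is simply the list of evaluators applied to every output.
evaluator_stack = [PolicyAdherence(), OutputAccuracy({"price": "$120"})]
```

The key design point is the shared evaluate() interface returning a score plus an explanation, which is exactly what the stabilization loop below consumes.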
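Item 8's stabilization loop can then be sketched as a score-and-retry cycle over that stack. The 0.9 threshold and the three-round cap are arbitrary assumptions, and revise stands in for whatever LLM call rewrites a draft from evaluator feedback.

```python
# Minimal sketch of the stabilization loop from item 8, assuming the
# evaluator_stack above; `revise` is any caller-supplied function that
# rewrites a draft given the evaluators' explanations.
from typing import Callable

def stabilize(
    draft: str,
    evaluators: list,
    revise: Callable[[str, list[str]], str],
    threshold: float = 0.9,   # assumed passing score
    max_rounds: int = 3,      # cap to avoid endless self-correction
) -> str:
    for _ in range(max_rounds):
        results = [e.evaluate(draft) for e in evaluators]
        worst = min(results, key=lambda r: r.score)
        if worst.score >= threshold:
            return draft  # every evaluator is satisfied
        # Feed the explanations back so the agent can self-correct.
        draft = revise(draft, [r.explanation for r in results])
    return draft  # best effort after max_rounds
```

A call might look like stabilize(reply, evaluator_stack, revise=my_llm_rewrite); the loop exits either when every evaluator clears the threshold or when the round cap is hit, which is what keeps the agent from oscillating indefinitely.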
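Finally, items 13-15 describe attaching the evaluation engine to a Pydantic AI agent over MCP. The sketch below follows the MCP-client pattern from recent Pydantic AI documentation, but the exact class names and result attributes have changed between releases, and the local server URL, model name, and system prompt are pure assumptions; consult the Root Signals MCP server docs for the real endpoint and tool names.

```python
# Sketch of items 13-15: a Pydantic AI reservation agent with an
# evaluation MCP server attached as a tool source. URL, model name,
# and prompt are assumptions; verify the API against your installed
# Pydantic AI version, as its MCP client classes have evolved.
import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerHTTP

# Hypothetical local endpoint where the evaluation MCP server listens.
eval_server = MCPServerHTTP(url="http://localhost:9090/sse")

agent = Agent(
    "openai:gpt-4o",
    mcp_servers=[eval_server],  # the server's evaluators become callable tools
    system_prompt=(
        "You are our hotel's reservation agent. Before answering, score "
        "your draft reply with the policy-adherence evaluator tool and "
        "revise it until the score is acceptable. Never mention competitors."
    ),
)

async def main() -> None:
    # While this context is open, the MCP server's evaluators are live
    # tools, so the model can call them and self-correct (items 14-15).
    async with agent.run_mcp_servers():
        result = await agent.run("Do you have a room for two this weekend?")
        print(result.output)

asyncio.run(main())
```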
Source: AI Engineer via YouTube
❓ What do you think of the ideas shared in this video? Share your thoughts in the comments!