A case for evaluating MCP

AI agents in ops hold clear promise: more adaptive, context-aware and autonomous operations. Powered by the Model Context Protocol (MCP), these agents can retrieve real-time data — logs, metrics, traces and runbooks — and take action.

But as agent deployments scale, so do the risks: hallucinated outputs, compliance gaps, security vulnerabilities, performance issues and untraceable decision-making.

This blog post introduces a solution: embedding AI evaluations (evals) into MCP-driven AIOps workflows. We’ll explore the limitations of federated context retrieval (where AI agents access and synthesize information across distributed sources), share insights from real-world agent deployments and show how evals can bring trust, safety and accountability to AI-powered operations.

The rise of context-aware AI agents in operations

The use of AI agents in operations (ops) is evolving from automation scripts to autonomous decision engines. This shift is driven by access to live context, made possible through the Model Context Protocol (MCP).

Traditional ops automation relies on static rules, preset API calls and pre-indexed data. But real-world incidents require up-to-date information, like logs, metrics, traces and tickets, pulled as events unfold. MCP allows AI to query systems in real time, correlate signals and suggest or trigger further actions.

These context-aware agents can now investigate incidents, recommend resolutions and assist humans.

MCP: Powerful, but risky

MCP allows AI agents to access live operational data so they can make real-time, context-aware decisions.

However, this capability comes with trade-offs.

Unlike static or pre-indexed data, federated retrieval through MCP is unpredictable. The accuracy and usefulness of results depend on many factors: network conditions, system availability, permissions and how well the agent interprets the request. This often leads to inconsistent or irrelevant responses — especially when scaled across complex environments.

These inconsistencies increase the risk of hallucinations — when agents generate incorrect or misleading outputs. And because agentic workflows often chain decisions together, a single error early in the retrieval process can cascade into larger failures. This makes root causes harder to trace. In ops, this could mean false alerts, misdiagnosed incidents or unsafe automation.

Latency is another issue, and live context injection compounds it. Real-time queries take time, especially under system load, which can delay responses and reduce the agent’s effectiveness in time-sensitive situations.
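One common way to contain this is to give every live retrieval a latency budget and record what happened to each source, so the agent knows when it is reasoning over partial context. The sketch below is illustrative only and does not use a real MCP client: `fetch_with_budget`, `gather_context` and the three demo sources are hypothetical names, and each source is assumed to be an async callable standing in for an MCP tool call.

```python
import asyncio
import time


async def fetch_with_budget(fetch, budget_s):
    """Run one retrieval coroutine under a latency budget and record the outcome."""
    start = time.perf_counter()
    try:
        data = await asyncio.wait_for(fetch(), timeout=budget_s)
        status = "ok"
    except asyncio.TimeoutError:
        # The source was too slow; the agent must proceed without it.
        data, status = None, "timeout"
    except Exception:
        # Permissions, availability or network failures end up here.
        data, status = None, "error"
    return {"status": status, "data": data, "latency_s": time.perf_counter() - start}


async def gather_context(sources, budget_s=2.0):
    """Query all sources concurrently; return a per-source status map."""
    names = list(sources)
    results = await asyncio.gather(
        *(fetch_with_budget(sources[name], budget_s) for name in names)
    )
    return dict(zip(names, results))


# Hypothetical stand-ins for MCP tool calls (assumptions, not a real API).
async def fast_logs():
    await asyncio.sleep(0.01)
    return ["error: connection reset"]


async def slow_metrics():
    await asyncio.sleep(1.0)
    return {"cpu": 0.93}


async def broken_traces():
    raise PermissionError("agent lacks access")
```

Calling `asyncio.run(gather_context({"logs": fast_logs, "metrics": slow_metrics, "traces": broken_traces}, budget_s=0.1))` returns one entry per source, each with a status of `ok`, `timeout` or `error`. An eval layer can log these statuses alongside the agent's answer, which makes it easier to see later whether a wrong conclusion followed from missing context.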

Security is also a growing concern. A Backslash Security report found hundreds of exposed MCP servers at risk of abuse. Without proper controls, agents or attackers may be able to access sensitive data or trigger unintended actions.

In short, MCP is a powerful enabler. However, without real-time evaluation and strong governance, it introduces significant risks in accuracy, performance and security.

AI evals are a critical layer

As AI agents take on more responsibility in operations, continuous evaluation becomes essential.

Without it, agents operate as black boxes. It is inherently difficult to verify whether their outputs are grounded, whether their actions are safe or whether performance is degrading over time. This lack of visibility creates a new layer of operational risk that can undermine trust, traceability and control.

To ensure scalable and trustworthy AI agent performance, evaluation should occur on three levels:

Automated metrics for real-time performance, such as success rates, latency and hallucination frequency.

LLM-as-a-Judge to assess reasoning at scale.

Expert review to ensure domain relevance.
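The first level is the easiest to make concrete. The sketch below aggregates success rate, 95th-percentile latency and hallucination frequency over a batch of agent runs, then decides whether the batch should be escalated to the other two levels. It is a minimal illustration, not the implementation used in any project described here: `AgentRun`, `summarize`, `needs_review` and the threshold values are hypothetical, and the `hallucinated` flag is assumed to come from an upstream grounding check.

```python
import math
from dataclasses import dataclass


@dataclass
class AgentRun:
    succeeded: bool      # did the agent complete its task?
    latency_s: float     # end-to-end response time in seconds
    hallucinated: bool   # set by an upstream grounding check (assumed)


def summarize(runs):
    """Aggregate the three automated metrics over a batch of runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    latencies = sorted(r.latency_s for r in runs)
    # Nearest-rank 95th percentile.
    p95 = latencies[math.ceil(0.95 * n) - 1]
    return {
        "success_rate": sum(r.succeeded for r in runs) / n,
        "p95_latency_s": p95,
        "hallucination_rate": sum(r.hallucinated for r in runs) / n,
    }


def needs_review(summary, max_hallucination_rate=0.05, min_success_rate=0.9):
    """Escalate to LLM-as-a-Judge or expert review when metrics cross a threshold.

    The default thresholds are placeholders, not recommended values.
    """
    return (
        summary["hallucination_rate"] > max_hallucination_rate
        or summary["success_rate"] < min_success_rate
    )
```

Keeping the automated layer this cheap means it can run on every batch, while the slower and more expensive judge and expert reviews are reserved for batches that `needs_review` flags.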

In a recent banking AIOps project, we worked with the AI eval platform Weights & Biases to embed an eval component into our agent architecture. The component measures grounding quality, decision accuracy and response performance in real time. This allowed us to detect hallucinations early, track behavioral drift and improve agent reliability before incidents reached production.

A real-world case

At a leading LATAM-based bank, we’re building an AI-powered solution to transform traditional operations into a more proactive and autonomous model. The solution consists of two integrated components (Figure 1):