Date: 2025-07-25

Modern distributed systems demand proactive and precise incident investigation to minimize downtime and reduce operational burden. At Thoughtworks Managed Services, troubleshooting isn’t just a technical necessity; it’s a business-critical function. Effective root cause analysis (RCA) translates directly into cost savings and operational efficiency across downstream support activities.

To deepen our investigative capabilities, we’ve embraced technologies like eBPF, which offer visibility into kernel-level behavior. Groundcover, a leader in this space, makes these ground-level insights accessible in real time without requiring deep kernel expertise from every engineer.

At the same time, we’ve been embedding AI agents into our operational toolchain to augment and accelerate the work of our human engineers. With the rise of the Model Context Protocol (MCP), a new standard is emerging that allows AI agents to go beyond static Q&A and interact directly with live systems and data sources.

Kernel-level data from Groundcover has become a cornerstone of how we upgrade our AI agents for more intelligent, context-aware incident investigation. In this blog post, we share the background, our approach, the integrated solution we built with Groundcover and the results we’ve seen so far.

Agentic AI: Tackling high-cognitive-load tasks

In our daily support operations, incident investigation often begins with correlating logs to understand the underlying issue. Once a site reliability engineer (SRE) receives an incident from the incident management platform, they typically conduct a multi-dimensional investigation (Diagram 2) that includes:

Metrics — Analyzing time-series data across dashboards to identify anomalies in performance or resource usage.

Logs — Searching distributed log sources for error patterns, exceptions or key terms related to the affected component.

Traces — Following request traces to pinpoint slow or failed spans and locate the origin of degraded user experience.

Events — Reviewing system and infrastructure events (e.g., restarts and deployments) that may correlate with the incident timeline.

Topology — Mapping dependencies between services to understand the potential blast radius and upstream/downstream impact.

Changes — Investigating recent configuration, code or infrastructure changes that could explain behavioral shifts.
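To make the six dimensions above concrete, here is a minimal Python sketch of how findings from each one might be collected and correlated against the incident timeline. It is an illustration only, not our production tooling: the `Dimension`, `Finding`, `correlate` and `uncovered` names, the 30-minute lookback window and the sample findings are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Dimension(Enum):
    """The six investigation dimensions listed above (names are hypothetical)."""
    METRICS = "metrics"
    LOGS = "logs"
    TRACES = "traces"
    EVENTS = "events"
    TOPOLOGY = "topology"
    CHANGES = "changes"


@dataclass(frozen=True)
class Finding:
    """One observation an engineer (or agent) recorded during the investigation."""
    dimension: Dimension
    timestamp: datetime
    summary: str


def correlate(findings, incident_start, lookback=timedelta(minutes=30)):
    """Group findings that fall inside the lookback window before the incident."""
    window_start = incident_start - lookback
    grouped = defaultdict(list)
    for finding in sorted(findings, key=lambda f: f.timestamp):
        if window_start <= finding.timestamp <= incident_start:
            grouped[finding.dimension].append(finding)
    return dict(grouped)


def uncovered(grouped):
    """Dimensions with no finding in the window, i.e. still worth checking."""
    return [d for d in Dimension if d not in grouped]


# Illustrative data only: invented findings around a hypothetical incident.
incident = datetime(2025, 1, 1, 12, 0)
findings = [
    Finding(Dimension.CHANGES, incident - timedelta(minutes=10), "config rollout"),
    Finding(Dimension.LOGS, incident - timedelta(minutes=2), "spike in 5xx errors"),
    Finding(Dimension.EVENTS, incident - timedelta(hours=3), "node restart"),
]
grouped = correlate(findings, incident)
remaining = uncovered(grouped)
```

In this example the node restart falls outside the lookback window, so only the change and log findings are grouped, and `remaining` lists the dimensions that still have no evidence. Keeping the dimensions explicit like this is one way to show how much manual cross-referencing a single incident involves.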

The outcome of these investigative activities typically leads to one of three types of solutions: