2025 was an important year for our team at Thoughtworks. Across more than 16 clients, we delivered 20 PoCs for real IT operations, with 11 reaching production. These experiences have informed the multiple pieces we’ve published on AIOps. Overall, what we’ve seen is that while AIOps has moved beyond concept and into the engineering phase, it has not yet industrialized.
So, what have we learned over the past 12 months? In this blog post, we’ll outline what we’ve discovered and what these lessons reveal about the future of operations as AI systems continue to evolve rapidly.
Learning 1: The PoC-to-production gap remains structural and technical
AIOps has crossed the credibility threshold. More than half of our PoCs reached production. The rest failed for structural and technical reasons. Specifically, failure happens when:
AI governance is missing. Enterprises lack operating models to govern AI systems in production.
Operational knowledge is not AI-ready. Critical context is unstructured, fragmented and costly to engineer.
AIOps requires continuous tuning. Operations teams lack capacity to run and improve intelligent systems.
Enterprises wait for vendor AI. Capability is deferred to SaaS vendors’ roadmaps rather than built in-house.
In practice, enterprise AI maturity is the dominant factor determining whether AIOps can scale. Enterprises need to recognize that AIOps is not a one-off investment; ongoing operating costs force a serious buy-versus-build decision.
Learning 2: The highest-value AIOps use cases are about knowledge, not autonomous actions
In 2025, our strongest use cases were:
Detecting duplicate incidents.
Retrieving operational knowledge.
Assisting with root-cause analysis.
These use cases delivered immediate operational impact. Across deployments, L1/L2 ticket volume was reduced by 35–40% and RCA cycles were shortened from hours to minutes.
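As a rough illustration of the first of these use cases, duplicate incident detection can be framed as similarity search over incident summaries. The sketch below is a minimal, hypothetical example: embed() is a toy stand-in for whatever embedding model an enterprise already runs, and the similarity threshold is an assumption to be tuned against historical incident data.

```python
# Minimal sketch: flag likely duplicate incidents via embedding similarity.
# embed() is a toy stand-in so the sketch runs without external services;
# a real deployment would call the enterprise's embedding model instead.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Bag-of-words "embedding" for illustration only.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def likely_duplicates(new_summary: str, open_incidents: dict[str, str],
                      threshold: float = 0.85) -> list[tuple[str, float]]:
    """Return open incident IDs whose summaries look like duplicates of the new one."""
    new_vec = embed(new_summary)
    scores = [(iid, cosine(new_vec, embed(summary))) for iid, summary in open_incidents.items()]
    return sorted([(iid, s) for iid, s in scores if s >= threshold], key=lambda x: -x[1])

if __name__ == "__main__":
    open_incidents = {
        "INC-101": "checkout service returning 503 errors in eu-west-1",
        "INC-102": "nightly ETL job delayed by upstream schema change",
    }
    # Lower threshold here only because the toy embedding is crude.
    print(likely_duplicates("checkout returning 503 errors eu-west-1", open_incidents, threshold=0.5))
```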
By contrast, autonomous remediation and auto-healing remain constrained by risk, governance and accountability boundaries. They have not scaled beyond controlled environments.
Until enterprises establish a mature AI governance foundation, AIOps will not replace human decision-making. Its role remains cognitive augmentation, not autonomous agency.
Learning 3: MCP and other communication protocols are not yet industrialized
Using AI agents to connect upstream operational sensors and downstream action systems is the right architectural direction. However, current agent communication protocols such as MCP remain immature for production-grade operations.
In complex scenarios, we observed a clear entropy problem:
Context chains grow uncontrollably.
Orchestration becomes highly complex.
Execution paths are opaque.
Latency and cost are volatile.
Agent behavior lacks observability.
These characteristics are unacceptable for mission-critical systems. While new frameworks such as Claude Agent Skills are beginning to address these limitations, agent communication remains a key bottleneck on the path to enterprise-grade AIOps.
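One concrete way to see the entropy problem: in a naive tool-calling loop, every hop appends the agent’s decision and the raw tool output back into the model context, so the chain only grows. The sketch below is purely illustrative; agent_step() and call_tool() are hypothetical stand-ins, not any specific MCP SDK, and it demonstrates the unbounded growth rather than a fix.

```python
# Illustrative only: a naive agent loop where context grows with every tool call.
# agent_step() and call_tool() are hypothetical stand-ins, not a real MCP client API.

def agent_step(context: list[dict]) -> dict:
    # Pretend the LLM always asks for one more tool call.
    return {"tool": "query_metrics", "args": {"window": "5m"}}

def call_tool(name: str, args: dict) -> str:
    # Pretend tool output; real outputs (logs, metric dumps) are far larger.
    return f"{name}({args}) -> 2,000 tokens of raw output"

context: list[dict] = [{"role": "user", "content": "Why is checkout latency spiking?"}]
for hop in range(5):
    decision = agent_step(context)
    result = call_tool(decision["tool"], decision["args"])
    # Each hop appends both the decision and the raw result: the context chain only
    # grows, and latency, cost and opacity grow with it unless something prunes
    # or summarizes it.
    context.append({"role": "assistant", "content": str(decision)})
    context.append({"role": "tool", "content": result})
    print(f"hop {hop}: context now holds {len(context)} messages")
```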
Learning 4: We are entering the era of hybrid deterministic–probabilistic systems
A new class of hybrid systems is emerging, combining deterministic and non-deterministic components. The industry lacks an operating model to run them.
A typical execution pattern now looks like:
LLMs generate structured decision inputs (e.g., JSON).
Deterministic engines execute policy and logic.
Generative systems reinterpret outputs.
In these systems, failures can originate in:
Reasoning
Translation
Execution
Semantic rendering
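To make the pattern concrete, the sketch below wires the three stages together and marks where each failure class can enter. Everything in it is hypothetical: llm_propose() stands in for whatever model produces the structured decision, and the policy table is invented for illustration.

```python
# Hypothetical sketch of the hybrid pattern: probabilistic proposal -> deterministic
# policy execution -> generative rendering, with the four failure origins marked.
import json

def llm_propose(alert: str) -> str:
    # Stand-in for an LLM producing a structured decision input.
    # Failure origin 1 (reasoning): the model may propose the wrong action.
    # Failure origin 2 (translation): the JSON may be malformed or off-schema.
    return json.dumps({"action": "scale_out", "service": "checkout", "replicas": 2})

POLICY = {"scale_out": {"max_replicas": 10}}  # invented policy table

def execute_policy(decision_json: str) -> dict:
    decision = json.loads(decision_json)          # translation failures surface here
    limits = POLICY[decision["action"]]
    if decision["replicas"] > limits["max_replicas"]:
        raise ValueError("policy violation")      # failure origin 3 (execution)
    return {"status": "applied", **decision}

def render_summary(result: dict) -> str:
    # Stand-in for a generative system reinterpreting the output for humans.
    # Failure origin 4 (semantic rendering): the narrative can drift from the facts.
    return f"Scaled {result['service']} by {result['replicas']} replicas."

if __name__ == "__main__":
    print(render_summary(execute_policy(llm_propose("checkout latency breach"))))
```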
However, today’s SRE stacks only understand deterministic execution paths. They cannot trace intent, confidence, reasoning or semantic drift. Operating these systems requires a shift to LLM observability tooling, which introduces additional cost and complexity.
SRE is entering a paradigm shift. Its tooling must evolve to operate deterministic and probabilistic systems together.
Learning 5: Context must be engineered and operated effectively at scale
AIOps performance is bounded not by model intelligence, but by context availability.
Enterprise operational knowledge — system dependencies, team topology, change history, ontologies, incident history and runbooks — remains scattered across systems that are fundamentally not AI-ready.
Indexed search and federated retrieval are insufficient. Enterprises require a purpose-built operational context engineering layer: an AI-readable representation of operations for AI agents, backed by managed short- and long-term memory.
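A minimal sketch of what such a layer might hold is shown below. The field names and the split between long-term records and incident-scoped short-term memory are assumptions for illustration, not a reference schema.

```python
# Hypothetical sketch of an operational context layer: typed, AI-readable records
# plus a store that separates long-term knowledge from short-term working memory.
from dataclasses import dataclass, field

@dataclass
class ServiceContext:
    name: str
    depends_on: list[str]            # system dependencies
    owning_team: str                 # team topology
    recent_changes: list[str]        # change history
    runbooks: list[str]              # runbook identifiers, kept AI-readable

@dataclass
class ContextStore:
    long_term: dict[str, ServiceContext] = field(default_factory=dict)
    short_term: list[str] = field(default_factory=list)   # notes from the current incident

    def context_for(self, service: str) -> dict:
        """Assemble the context an agent would receive for one service."""
        svc = self.long_term.get(service)
        return {
            "service": svc.__dict__ if svc else None,
            "working_memory": self.short_term[-20:],       # bounded short-term memory
        }

store = ContextStore()
store.long_term["checkout"] = ServiceContext(
    name="checkout", depends_on=["payments", "inventory"], owning_team="commerce",
    recent_changes=["2025-11-30: new pricing rule"], runbooks=["RB-017 checkout latency"],
)
store.short_term.append("Latency spike correlates with pricing rule rollout")
print(store.context_for("checkout"))
```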
Context engineering is also where current SRE tooling providers fall short; the winning tools will be those that help enterprises manage their long-term operational context continuously.
Without context engineering, AIOps becomes little more than a conversational interface on top of fragmented data.
Learning 6: Agent-native AIOps tooling is emerging, but operability is still missing
2025 saw the rise of agent-native AIOps platforms built around generative AI and orchestration frameworks. While these tools demonstrate strong reasoning and automation capabilities, they remain incomplete as anything more than tools. Three foundational components are consistently missing:
A reliable agent framework for long-running operational workflows.
A context engineering layer that provides enterprise-specific memory.
An open observability layer that allows operations teams to trace, audit and optimize agent behavior.
As the SaaS industry shifts from selling tools to delivering an AI agent workforce, operability can no longer be optional. AI agents must be treated as production workloads: observable, governable and continuously improvable. Without this, agent-native AIOps will remain powerful as a set of tools but fragile as a workforce.
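One way to make “agents as production workloads” tangible is to emit a structured, auditable event for every agent action so it can be traced, reviewed and optimized. The sketch below is hypothetical: the event fields and emit_audit_event() are assumptions, and a real deployment would route these events into the existing observability pipeline.

```python
# Hypothetical sketch: every agent action emits a structured, auditable event.
import json, time, uuid

def emit_audit_event(agent: str, action: str, inputs: dict, output: str,
                     confidence: float, approved_by: str | None) -> dict:
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "agent": agent,
        "action": action,
        "inputs": inputs,
        "output": output,
        "confidence": confidence,          # so low-confidence actions can be flagged
        "approved_by": approved_by,        # human-in-the-loop accountability
    }
    print(json.dumps(event))               # stand-in for shipping to the observability pipeline
    return event

emit_audit_event(
    agent="rca-assistant", action="propose_root_cause",
    inputs={"incident": "INC-101"}, output="checkout latency caused by pricing rule rollout",
    confidence=0.72, approved_by=None,
)
```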
Learning 7: AIOps is extending to manage other assets
AIOps is no longer confined to application operations.
We observed its expansion into:
Infrastructure operations (AI SRE).
Data operations (ETL reliability, performance optimization, data quality).
AI workload operations.
AI-assisted modernization of legacy systems.
Operations is a high-cognitive-load function. This creates a natural expansion path for AI-based cognitive augmentation across all operational domains. In 2026, we plan to extend AIOps capabilities across our full portfolio of managed services, from data platforms to AI workloads.
AIOps is evolving from a tooling category into a control-plane category — the layer that governs how enterprises observe, reason, decide and intervene across all operational domains.
From learnings to practices
Closing the PoC-to-production gap through AI readiness
To improve AI readiness, we offer an enterprise Knowledge Retrieval Foundation aligned to the client’s AI stack and invest upfront in making operational knowledge AI-ready. For enterprises waiting on vendor roadmaps, we support buy-versus-build decisions and define long-term AIOps roadmaps.
Building an AI Control Plane for proactive operations
AI controls shouldn’t be point-to-point solutions. We are designing an AI Control Plane that centralizes risk control, auditability, AI evals and human-in-the-loop governance, enabling AI-driven proactive operations to run as production workloads.
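The sketch below shows one way such a control plane could gate agent actions: a shared risk policy decides whether an action auto-executes, requires human approval, or is blocked. The risk tiers and the approve() callback are assumptions for illustration, not a description of our implementation.

```python
# Hypothetical sketch of a control-plane gate: one shared policy decides whether an
# agent-proposed action runs automatically, waits for a human, or is rejected.
from typing import Callable

RISK_TIERS = {            # invented mapping of actions to risk levels
    "restart_pod": "low",
    "scale_out": "medium",
    "failover_database": "high",
}

def gate(action: str, execute: Callable[[], str], approve: Callable[[str], bool]) -> str:
    tier = RISK_TIERS.get(action, "high")          # unknown actions default to high risk
    if tier == "low":
        return execute()
    if tier == "medium" and approve(action):       # human-in-the-loop for medium risk
        return execute()
    return f"{action} blocked pending review"      # high risk never auto-executes

result = gate(
    "scale_out",
    execute=lambda: "scale_out applied",
    approve=lambda a: True,                        # stand-in for an approval workflow
)
print(result)
```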
Partnering with the tooling ecosystem on context engineering
We are strengthening collaboration with the AIOps and observability ecosystem, where tooling providers are increasingly recognizing that context engineering is foundational to intelligent operations. Our focus is on shaping platforms that make enterprise context AI-native by design.
Evolving SRE and Managed Services for AI-native systems
As AI assets and hybrid intelligent systems move into production, SRE is entering a paradigm shift. This transformation is redefining the expectations placed on Managed Services. We are upgrading our MSP capabilities to operate AI-native enterprises where intelligence itself becomes a production asset.
The road ahead
Looking back at 2025, we are confident that AIOps remains one of the strategic domains where AI can deliver tangible business outcomes. Today, the impact is still concentrated at the edge — improving productivity through cognitive augmentation and decision acceleration. But the direction is clear and the foundation is being laid.
As intelligent assets move deeper into production systems, enterprises will be forced to operate at a new level of complexity. Operating models built for deterministic software will no longer be sufficient. The industry must reinvent how systems are observed, governed and controlled. AIOps will be a central pillar of that transformation.
We also see a clear inflection point emerging in the SRE and incident management tooling market. Reliability must expand beyond traditional systems to include AI-driven and non-deterministic workloads. This shift will define the next growth wave for infrastructure and operations platforms.
2026 is here — we look forward to sharing what comes next.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.