Brief summary
What if your AI agents could think more like IT operations staff — and less like tools? In this episode, we catch up with Zichuan Xiong to explore the Model Context Protocol (MCP) — a powerful new way to give AI agents deeper awareness of the tools, information and history they need to work effectively in the operations space.
Unlike traditional APIs that just trigger functions, MCP adds a semantic layer of context that helps AI understand what to do, why it matters and how to do it better.
Whether you’re deep in site reliability engineering (SRE) or just curious about the next leap in AIOps, this episode unpacks how MCP could be the missing layer between today’s tools and tomorrow’s autonomous systems.
If you want to find out more, check out this piece co-written by Zichuan.
Ken Mugrage: Hello, everybody. Welcome to another edition of the Thoughtworks Technology Podcast. My name is Ken Mugrage. I am one of your regular hosts. With me today is a longtime Thoughtworker, works a lot on our AI and so forth. I'll let him introduce himself. Zichuan, you want to introduce yourself?
Zichuan Xiong: Hi, everyone. My name is Zichuan. I joined Thoughtworks about 18 years back, and in those 18 years I've moved through a lot of different roles. I got into AI about three or four years back, leading AI solutioning at Thoughtworks with a focus on industry solutions. Then I joined our managed service business at Thoughtworks, called DAMO. Right now, I'm handling AIOps for that managed service business. Usually, we sign three-year deals with customers, and my job is to bring AI solutions into our IT operations, focusing on infrastructure managed services, application managed services, and also data managed services. That's me. I'm based in San Francisco, and MCP is a huge deal for me. Recently, I wrote articles about MCP, and I'm excited about the breakthroughs in AI technology over the past three years.
Ken: I really appreciate you joining, and as you hinted at there, I have a background in DevOps and SRE and that whole movement, and I know that we've interacted that way in the past. You recently did an article with a couple of our partners talking specifically about MCPs and APIs in the SRE-type context. What I wanted to do today is dive into that a little bit for our listeners. First off, just from a context perspective, as an SRE, why do you need more than just APIs? Why MCP? What does it bring? That sort of thing.
Zichuan: When I think about this, SRE means site reliability engineering, and in the past 15 or 20 years, SRE as a term has expanded a lot. If you think about it, it's not only about a site; it can be any type of workload. It can be an AI application, it can be a data product, it can be a pipeline, multiple different things. Reliability has changed a lot too. When people think about reliability, it's not just a non-functional requirement, one of the -ilities. Reliability can mean trust, it can mean continuity, it can be customer experience, it can be performance, multiple different things.
Also, engineering has changed a lot. How we do things is so different compared to 15 years back. To me, context is a huge deal in SRE, because you always need to consider the context, whether you're investigating an incident or trying to figure out how to resolve it. So the backdrop here is a shifted understanding of SRE, plus a lot of technology leaders trying to build AI into their existing SRE practice. Going back two years, when people started talking about agentic AI flows, they were talking about building AI agents into a linear value chain across different workflows.
That's when technology leaders started thinking about bringing AI agents into their SRE practices. About a year ago, Anthropic brought MCP into the game, talking about bringing context to the AI agent so the agent can perform better. That's really the whole story of how things are going: from the changing landscape of SRE, to AI agent usage, to MCP as a context provider that enables AI agents to play better and make better decisions. That's the context we're seeing, and that's why I personally believe MCP is a huge deal for SRE: SRE is changing, AI agents are changing the game, and AI agents require more context in the SRE space.
Ken: Just to be clear, this is doing the site reliability engineering on any application, even if the application isn't AI. It's using AI and MCP and all these things to manage and run and secure and so on, but it's not only for AI apps. Is that correct?
Zichuan: Yes, that's correct. We keep talking about AI for operations and also operations for AI. They're two different things. In this context, I'll mainly talk about AI for operations: operations for any type of workload. It can be an application, it can be a data pipeline, anything beyond AI. An AI application is just one type of workload. That's the context we're bringing here.
Ken: Then just for the benefit of our listeners that may not be aware, what's your short definition of MCP or Model Context Protocol? What do you mean when you say it? Just so they know what we're talking about the rest of the podcast.
Zichuan: MCP means Model Context Protocol. To me, it's a semantic, dynamic context layer that enables AI agents. It helps an AI agent understand the context better, so it can perform better and generate more specific output in a very responsible way, which is really required in SRE practice.
Ken: That's great. In the past, we've talked about APIs in an SRE context. You mentioned the -ilities earlier: observability and scalability and all those. We would have an API that we would use to get information, and often that information is information overload. What's the paradigm shift here? Why MCP and not just the APIs that have worked?
Zichuan: To me, I use this analogy. My wife always gives me a Post-it: "Go to the grocery store to buy stuff." The Post-it is just a list of things. I need to buy milk or chicken wings or something like that. I don't need to ask a question. I just go there, make sure the labels match, and buy it. I don't need to know what she's preparing or what type of dinner these things are for. I just pick up the order and deliver it. To me, that's API. It's a function, it's a script, it's structured. 99% of the time it's not wrong if you follow the instructions, if you follow the schema. Versus MCP is something like, she will say, "Sweetheart, we're going to have an anniversary, our 10th anniversary, next month." What do you do?
That, to me, is MCP. I need to figure out the context. I need to figure out, oh, we're having this anniversary, I need to find out what she likes, and try to plan something. To me, that's really dynamic, that's contextual, that's semantic. You have to put something together with my wife. That's, to me, the difference between MCP and API. API is really good at those strict function-based or schema-based structured inputs and outputs between different systems. MCP is for more dynamic, open-ended requests and generative tasks, where you also have to consider the context. That's the difference between the two things. I would say MCP is not replacing API; they're just for different purposes, different scenarios. There are different jobs to be done, and you need to apply different technologies to them.
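Zichuan's analogy can be sketched in code. Below is a toy illustration, not the real MCP wire protocol or SDK; all names and shapes are hypothetical. The point is the contrast: an API call is a fixed function with a strict schema, while an MCP-style interaction lets the agent gather context first and then plan.

```python
# Toy contrast between a rigid API call and an MCP-style contextual exchange.
# All names here are illustrative; this is not the real MCP protocol.

def grocery_api(order: dict) -> str:
    """API style: strict schema in, deterministic result out."""
    required = {"item", "quantity"}
    if not required.issubset(order):
        raise ValueError(f"missing fields: {required - order.keys()}")
    return f"bought {order['quantity']} x {order['item']}"

def mcp_style_request(goal: str, context_sources: dict) -> str:
    """MCP style: pull in context first, then plan.

    context_sources stands in for MCP servers the agent can query.
    A real agent would hand `goal` plus the gathered context to an
    LLM; here we just assemble the context into a plan string."""
    gathered = {name: fetch() for name, fetch in context_sources.items()}
    facts = "; ".join(f"{k}={v}" for k, v in gathered.items())
    return f"plan for '{goal}' using context [{facts}]"

# API: the Post-it note. No context needed; the schema is everything.
print(grocery_api({"item": "milk", "quantity": 1}))

# MCP: "our anniversary is next month" is open-ended and context-driven.
print(mcp_style_request(
    "plan 10th anniversary",
    {"preferences": lambda: "likes Italian food",
     "calendar": lambda: "free on the 14th"},
))
```

The structured call fails loudly when the schema isn't met, while the contextual call adapts to whatever context sources are plugged in; that is the "different jobs, different technologies" distinction in miniature.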
Ken: Would this be something like, for incident response, for example?
Zichuan: Yes. If you think about incident response, when you have an incident, as an engineer you will connect multiple different contexts to that incident. You have to look at the logs to understand what's going on. From the logs, you have to go to the different impacted systems to understand the impact on those systems. Also, you need to connect the impact to the business metrics to understand the priority: whether this is hurting business continuity, whether this is raising a security issue. You always need to bring that business context in.
Then you have to connect experts to those different contexts. If this is about security, you have to connect to a security expert; if this is about compliance, you have to connect to a compliance expert. In this case, investigating an incident and trying to find a resolution is a context-shifting gig. It's a very dynamic way of doing things, shifting contexts. That's why I personally believe MCP plays a great role here: incident management and resolution is a very dynamic, open-ended, decision-making task. Even though APIs provide support, focused on rule-based support, that dynamic, context-shifting exploration is also important in incident management. That's why MCP plays a critical role in supporting this.
Ken: What would be the scope here? For example, I want to do an incident response, and let's say I'm a retailer, and nobody can place an order on my website. That could be the database, it could be throughput, it could be disk space, it could be a million things. Forgive me if I say something silly from an agentic architecture perspective, but do you have different agents for each of those things that are all sharing a context through MCP? Do you have multiple of these? What does this look like on the ground from an implementation perspective?
Zichuan: From an implementation perspective, there are just two parts, and they're very simple: you have the MCP server and the MCP client. MCP servers are the knowledge systems that provide the context. Let me give you an example. Right now, we're building multiple different MCP clients for our clients' adoption. We build a post-mortem analysis report every time we have an incident, and we're trying to create an agent to generate that report. If you think about that report generation, it will consume the context from past incidents, try to capture the learning, and turn that into a report.
In this case, we're communicating with two MCP servers. One is the Jira MCP server, which provides all the incident details. Then you have the Atlassian Confluence MCP server to create the report. Our agent connects those two servers and brings the two contexts together: one is the incident context, the other is the report format context, and it puts them together to generate the post-mortem analysis. Right now, it's still case by case. Every time we have a solution, we think about whether, rather than building a customized AI agent to do that, writing those instructions into the prompt and building a special knowledge base, we can get help from the tool providers, the MCP server providers, to bring context to the AI agents. Right now, we're still at the stage of building those MCP clients case by case. That's really the progress we're making today: building connections with multiple different MCP servers.
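The post-mortem flow described above can be sketched roughly like this. The fetch functions below are hypothetical stand-ins for calls to the Jira and Confluence MCP servers; the real servers expose tools over the MCP protocol, and the field names and data here are assumptions for illustration only:

```python
# Sketch of an agent that merges two MCP contexts into a post-mortem.
# The fetch_* functions are hypothetical stand-ins for MCP server calls.

def fetch_incident_context(incident_id: str) -> dict:
    """Stand-in for the Jira MCP server: incident details."""
    return {
        "id": incident_id,
        "summary": "Checkout service 5xx spike",
        "root_cause": "connection pool exhaustion",
        "resolution": "raised pool size, added circuit breaker",
    }

def fetch_report_template() -> list[str]:
    """Stand-in for the Confluence MCP server: report sections."""
    return ["Summary", "Root cause", "Resolution", "Lessons learned"]

def build_postmortem(incident_id: str) -> str:
    """The 'agent': join the two contexts into one report.

    A real agent would pass both contexts to an LLM; here we fill
    the template mechanically so the flow is visible."""
    incident = fetch_incident_context(incident_id)
    sections = fetch_report_template()
    body = [f"Post-mortem for {incident['id']}"]
    for section in sections:
        key = section.lower().replace(" ", "_")
        body.append(f"## {section}\n{incident.get(key, 'TBD')}")
    return "\n".join(body)

print(build_postmortem("INC-1042"))
```

The key design point is that the agent itself stays thin: the incident knowledge and the report format each live behind their own context provider, so swapping Jira for another tracker would only change one fetcher.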
Ken: For the whole buy versus build thing, does that imply that at some point, if people have a particular type of infrastructure with a particular type of application, they'll be able to purchase the pieces to do this, or is this always going to be bake-at-home? What's that look like?
Zichuan: My point of view is always buy first; build if necessary. If there's something super important to you, something you think is really your differentiator, then you build. Otherwise, always take a buy-to-accelerate approach. The good news in this space is that most of the mainstream toolchain providers in the SRE space are releasing some sort of MCP server. I talk to at least 10 partners every month, and every single company right now has a roadmap for some sort of MCP server. What you need to do is just build some MCP clients, and that's it.
Also, to build an MCP client, you can use AWS Bedrock or Vertex AI. They support a lot of features, so you can easily build out those MCP clients. To me, building those things is not that hard. Maybe adoption is the harder part: the eval is going to be a problem, the security check is going to be an issue. The industry hasn't really solved that yet. It's not solved yet, but I personally think the whole industry will catch up. I'm trying to build out more solutions focusing on a scaled approach, a scaled implementation, for those MCP workloads in the future.
Ken: If we say that APIs are one side of an equation and full MCP servers with multiple clients and handing off context and then there's the other side, what's in the middle? Is there anything, or is it a jump?
Zichuan: To me, of course, orchestration is super important. Again, we're experimenting here. I don't think the market has a good definition or framework yet for that orchestration layer in between, deciding whether something goes to an API or to MCP. That gateway is something I see multiple technology providers starting to think about. Right now, we've actually created something like a prompt-based routing system to decide whether you should reach out to MCP to get context. If the user query is natural language and we think it's super dynamic, not structured enough, then you should reach out to MCP for help.
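A prompt-based router like the one described could look roughly like this. The real system presumably uses an LLM or prompt classification to make the call; this sketch uses simple heuristics as a stand-in, and every name, pattern, and threshold here is illustrative rather than anything from the actual implementation:

```python
# Toy router: decide whether a query goes to an API or to MCP.
# Regex heuristics stand in for the LLM-based classification a real
# gateway would use; names and patterns are illustrative only.

import re

STRUCTURED_PATTERNS = [
    r"^get \w+",          # e.g. "get cpu_usage"
    r"^restart \w+",      # e.g. "restart checkout-service"
    r"^\w+\([^)]*\)$",    # looks like a direct function call
]

def route(query: str) -> str:
    """Return 'api' for structured commands, 'mcp' for open-ended asks."""
    q = query.strip().lower()
    if any(re.match(p, q) for p in STRUCTURED_PATTERNS):
        return "api"
    # Natural-language, dynamic queries need context: send to MCP.
    return "mcp"

print(route("get cpu_usage"))                     # structured command
print(route("why are checkout orders failing?"))  # open-ended question
```

Structured commands match a known schema and can go straight to an API; anything that reads like an open question gets routed to the MCP side, where context gathering happens before any action.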
Right now, we're also waiting for the industry to develop some really good gateway thinking. We also see some players focused on MCP marketplaces, trying to wrap all the MCP server access, all the security setup, and even the MCP client building into services. Those are really new things. Here's what I tell my clients: at a certain stage, maybe you don't go too fast, because a lot of companies are not even implementing agents into their SRE value chain yet. I would say, try something first: build out your 15 AI agents in SRE.
Then, when you run into the problem of those AI agents not being able to share good context and knowledge, bring in some MCP. The next step, when your MCP ecosystem gets complex, is to bring in some thinking around an MCP gateway or MCP orchestration, to really consider the API ecosystem and the MCP ecosystem together. To me, that's the future step. I see only a small portion of clients who have tried to go that far right now.
Ken: Then, switching gears just a little. There have been technologies that have come out, if we only look at the last 10 or so years, like Docker and Kubernetes and all that stuff, that were open source, had standards, and built fairly good momentum on their own with different people contributing and so forth. Then different cloud providers would pick them up, and the implementations started to drift. It was easy to say, "Oh, I can do my Docker container, I can run it anywhere," or, "I can have my Helm chart for Kubernetes, and I can run it anywhere."
In practice, it actually ran a little differently on cloud provider A versus cloud provider B, or what have you. It wasn't so easy that you could pick up your Kubernetes-based application and shift it, like the dream was. How are they behaving here? Anthropic came out with an open-source thing less than a year ago, and my understanding is OpenAI and Google DeepMind and Microsoft and all these others are saying, "Yes, this is a good idea." How are they behaving, though? Are we diverging? Do we actually have a standard? What's your impression?
Zichuan: I don't think there's a real discussion around a standard yet. As I said, MCP is still at an early stage, to be honest with you, especially in client and enterprise adoption. If you look at, let's say, 60 of our clients' SRE managed services, I would say 30% of them are implementing AI into their SRE, and I myself, my job is to push AI adoption. Maybe only another 10% are considering MCP, because they have implemented multiple different agents and context sharing is becoming the problem for MCP to solve.
I still think the adoption is not there yet, but I personally believe it's a good paradigm to think about. What you just mentioned, maybe that's a happy problem to solve in the next six months. I don't think it's a huge deal right now. People are talking about a standard, talking about the security of MCP, but we're not there yet. That's just my observation of the industry, and maybe that's specific to certain areas; I'm focused on the SRE space, and MCP is being applied in multiple other spaces.
Ken: If I'm on an SRE team or that type of team (I'm trying really hard not to say DevOps team, because I'm a believer there shouldn't be such a thing), if my job is running and managing these things and understanding what to have for incident response and so forth, what should I be learning? What should I be watching? What should I be reading? How do I make sure that I'm not falling behind on this stuff? Because, like you said, it's all still being defined.
Zichuan: My tangible suggestion to all the engineering leaders, the operations leaders in enterprises, is: just create a list of your mainstream, existing toolchain. Whether it's Dynatrace, Datadog, New Relic, whatever, you must have 15 of them. Then reach out to them and ask a direct question: what's your roadmap for MCP? Some will say, "Oh, we already have an MCP server." Then you say, "Let's do a demo." Ask them to provide a demo connecting some of your AI agents to their MCP server and see the difference, because with MCP right now, you're going to rely on those tool providers to help you accelerate.
That's always my genuine suggestion to the IT leaders out there: use your partners to do that. They have a plan to deploy MCP; ask them the question. That's my suggestion. Learning is one thing, but implementing is another. Find tangible use cases and just use it. We use [unintelligible 00:20:58], which we think is promising; it's getting tangible business value for us. Just start implementing and experimenting.
Ken: Admittedly, some of this is hype marketing, but there are a lot of AI-enabled tools and agents and agentic architectures and that kind of thing that have been out for a couple of years now, and much longer if we look beyond LLMs. AI was not actually invented two years ago. There's lots of stuff out there. If someone is already using something that does this, that uses AI and brings in the context another way, whether it's RAG or whatever it is, are there compelling reasons for them to look at changing, or should they wait for this to mature a little more? What do you think about that?
Zichuan: My suggestion: the reason you use MCP is to augment your existing RAG approach, or to improve the efficiency and performance of your AI agents. My opinion is really to just try it, because I think this is tangible. At least from a practitioner's point of view, we're implementing this. We're working with accelerators and with different partners. We believe this is working, because this is our daily job: we're managing our customers' infrastructure and applications, we're trying to do things faster, and there is a business case for us.
Do something in the context of driving a business case, and this area really has a business case. Of all the AI focus areas, I think SRE is definitely a big one, with a great use case and a great business case to deliver. My suggestion is, don't hesitate; work with the community. Of course, there is hype in the market, for sure, but calling something hype is easy, to be honest with you. Going into the hype and trying something out is hard. That's really my suggestion.
Ken: I love that. People are like, "Oh, it's all hype." No, it's not all hype. Hype comes from somewhere.
Zichuan: It's also my job to differentiate what's real and what's not, but always have your own point of view. The only way to have your own point of view is to try something out.
Ken: That's a really good point. If I'm an engineering leader and I'm developing new applications, and I'm going to be creating them, let's assume greenfield, not legacy: are there things that I should be thinking about from an architecture perspective to make this sort of thing easier later? Like, "I want to make my SRE more effective. I want to be able to solve incidents quicker. I want faster mean time to recovery, all the stuff from DORA, et cetera." Are there things that these folks should be thinking about from an architecture perspective, or even team structure? What should people be doing to make sure that they make you more effective?
Zichuan: Because everything we're talking about is making things easy to operate, it follows the same principles as when we're thinking about cloud-native applications. The best practices for building cloud-native applications apply here. I don't have special advice for technology leaders like, "Make the architecture better so that SRE can be better." To me, it's observability first: you always need to make sure your systems are observable by the different tools. And always look at your vendor list. Look at your vendors.
Some tool providers are cloud-native, AI-native, or AI-first. Some providers are established: their tools and capabilities have been around for a long time, but maybe they're not that cloud-native. Those are just two kinds of selections. Try to combine them in your strategy: look at those big, established tool providers operating at large scale, but always bring in some of the AI-native ones. Those new companies, built out only two or three years ago, bring new practices focused on SRE. I always want to take a partnership or ecosystem play: use your ecosystem to evolve your tech stack. It's not only about the technical or architecture decisions you're making; you also use your ecosystem to drive you to improve your tech stack, to improve your tech practice.
That's really the reason you work with Thoughtworks: Thoughtworks always brings some new ideas. Of course, you can work with a large, established company providing services and tools. It's stable, it's predictable. But to drive innovation, sometimes you have to bring those ecosystem players who are AI-native or AI-first into your toolchain. That's really my suggestion.
Ken: All right. I really appreciate that. For the listeners, we'll also put a link to the article that Zichuan wrote with JJ Tang and Rob Skillington in the show notes so you can check it there. Just, thank you for your time, but is there anything in closing that you'd like to add, something I didn't ask you about that you think people should hear?
Zichuan: We're doing multiple things with MCP. If you're implementing MCP, I would suggest you consider the following areas. It's all a matter of context: what context you bring to your incident management. We are trying the following things, which I think are worth recommending and sharing. One is bringing the business context into incident problem-solving, resolution, and investigation. For example, on GCP, we're trying Cloud Run as the MCP technology to bring the business context into incident triaging and resolution. That's one.
The second context you can bring is the security context. We're working with Panther, a security company, bringing their MCP server into security detection, incident detection, and detection-logic creation as a use case. The third interesting context is the observability context. Think about Chronosphere, think about New Relic: they hold all the log information. We're using their MCP servers to connect SRE engineering with the observability context. Those are three areas: business context, security context, observability context. If you're experimenting with this, those are tangible use cases you can work on.
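The three context types above could come together in a triage step roughly like this. The fetchers are hypothetical stand-ins for the business, security, and observability MCP servers mentioned; none of the names, fields, or the priority policy reflect any vendor's real API:

```python
# Toy triage: merge business, security, and observability context
# for one incident. Each fetcher stands in for an MCP server call;
# all names, fields, and thresholds are illustrative only.

def fetch_business_context(incident_id: str) -> dict:
    return {"revenue_impacting": True, "affected_journey": "checkout"}

def fetch_security_context(incident_id: str) -> dict:
    return {"suspicious_activity": False}

def fetch_observability_context(incident_id: str) -> dict:
    return {"error_rate": 0.23, "service": "payments"}

def triage(incident_id: str) -> dict:
    """Combine the three contexts and derive a priority."""
    biz = fetch_business_context(incident_id)
    sec = fetch_security_context(incident_id)
    obs = fetch_observability_context(incident_id)
    # Simple illustrative policy: security or revenue impact means P1.
    if sec["suspicious_activity"] or biz["revenue_impacting"]:
        priority = "P1"
    elif obs["error_rate"] > 0.05:
        priority = "P2"
    else:
        priority = "P3"
    return {"incident": incident_id, "priority": priority,
            "service": obs["service"], "journey": biz["affected_journey"]}

print(triage("INC-2001"))
```

The structure mirrors the advice in the conversation: each context lives behind its own provider, and the triage logic only has to compose them, so adding a fourth context later would not disturb the first three.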
Ken: Great. Again, thank you very much for your time. I appreciate it, and we'll speak to you later.
Zichuan: Yes. No problem. Thank you, Ken. I'll talk to you later.