Brief summary
When working with modern distributed systems, complexity is a given. But how can you make observability a characteristic of your systems, so that your operators get feedback in the event of an outage? In this podcast our co-hosts Rebecca Parsons and Neal Ford talk to Bharani Subramaniam and Prasanna Pendse about monitoring and observability in cloud-based systems.
Podcast transcript
Rebecca Parsons:
Hello everybody and welcome to the Thoughtworks Technology podcast. My name is Rebecca Parsons, the chief technology officer for Thoughtworks. I'm one of your hosts and I'd like to introduce the rest of the panel for today. Then we'll talk a little bit about observability and monitoring. So Neal?
Neal Ford:
Hi, everybody. I'm Neal Ford, director and meme wrangler at Thoughtworks, and another of your regular recurring hosts. And today we have a couple of our colleagues who are quite interested in the subject we're going to talk about today. So I'll let them introduce themselves. Bharani?
Bharani Subramaniam:
Hello, this is Bharani Subramaniam. I'm head of Technology for Thoughtworks India.
Prasanna Pendse:
Hello, this is Prasanna. I'm also head of Technology for Thoughtworks India; we can fight that out later. We actually have three, with two of us here.
Rebecca Parsons:
Okay. So we'd like to talk to you today about observability and monitoring. There wasn't necessarily a whole lot to talk about when everything was running in a JVM in a single process; life was good. But we have encountered multiple situations where we've had to start to tease apart the distinction between these two things in a distributed architecture world. So let's start with the basics. Bharani and Prasanna, how do you define the difference between the activities of monitoring and observability, and what are the outcomes we're trying to achieve?
Bharani Subramaniam:
Right. I would define monitoring as a continuous process of checking the output of the system. It can be anything like: is the process alive? Are we getting the heartbeat? Are the latencies satisfying the SLA? Those are the things that we usually call monitoring. Whereas observability is more about how well you can measure the internal state of the system based on its external output. So when we talk about observability, we usually think about which service calls get made when a user performs a particular action in the system, as opposed to just checking the status of the system.
Neal Ford:
So there's an implication in observability of being able to peer inside. Whereas monitoring is more black box. Is that the distinction?
Bharani Subramaniam:
Yeah, that is the distinction. So...
Prasanna Pendse:
And monitoring is an active action that somebody is doing, whereas observability is a characteristic of a system. Monitoring is a continuous action performed on a system that is observable, but usually monitoring does not try to change the system that is being monitored. You just observe the things that are happening without anything special being done on that system. Whereas for a system to be truly observable, it needs to do something extra to expose its internal state.
Bharani Subramaniam:
Yeah, exactly. Like Rebecca was mentioning before, if you just had a single process running on a single machine, then monitoring is enough. But if you have multiple processes running on a bunch of machines, even if you have the instrumentation coming out of the system, you need something to stitch it all together to give a coherent picture, and hence we need observability.
Rebecca Parsons:
So given the white box/black box distinction that Neal was making, at a high level the objective is the same: you want to understand what's going on in the system. But at the lower level, with a single system or a single process on a single machine, there is less you have to worry about because you can just look at the outputs. Whereas with distributed systems, just looking at the output from the different systems in isolation doesn't give you an end-to-end view of what's going on.
Prasanna Pendse:
Right. When you're in the world of a single system, a monolithic code base and all of that, tools that do instrumentation can actually instrument pretty well, especially in the Java world, and can inspect what is inside the code base in terms of function calls and all that stuff without any real change being made to the system. But they would struggle to tie together why one system behaves in a certain way and another system behaves another way, and what the linkage between the two is, especially when you multiply that by 10,000 servers doing lots of different things. It's very difficult to chase down a single event, a single malfunction, and be able to root cause it without having observability built into the systems.
Rebecca Parsons:
Okay, so then how do I go about achieving observability? What kinds of things do I do? What kinds of changes do I make to the system to end up with something that has this characteristic of observability?
Bharani Subramaniam:
We've had observability for a while now. I think the problem has always been the tools: whatever we choose, there is always lock-in, and what happens to the code and the frameworks and libraries you're using? So until now, it has been a challenge to trace what's going on with the system. But with standards like OpenTracing this is now more approachable than it was before. If I have to observe my distributed system in production, and the team implements OpenTracing in their code base, then even with its heterogeneous nature you can have agents and clients running for your own frameworks and languages, instrument them with open standards, and observe the system in production because everything is talking the standard API.
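As a rough illustration of the kind of instrumentation Bharani describes, here is a minimal sketch using the OpenTracing Java API with a Jaeger client configured from environment variables; the service and operation names are made up for illustration.

```java
import io.jaegertracing.Configuration;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class TracingBootstrap {
    public static void main(String[] args) {
        // Build a Jaeger tracer from the JAEGER_* environment variables
        // (sampler type, agent host, and so on) and register it globally.
        Tracer tracer = Configuration.fromEnv("checkout-service").getTracer();
        GlobalTracer.registerIfAbsent(tracer);

        // Any code in the process can now start spans against the same standard API.
        Span span = GlobalTracer.get().buildSpan("place-order").start();
        try {
            // ... call downstream services; instrumented clients propagate the trace context ...
        } finally {
            span.finish();
        }
    }
}
```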
Neal Ford:
But if observability requires action from developers, how can you ensure that development teams are going to put the observability code in their code base? This seems like a major problem, because if you're relying on this for insight and then somebody forgets to put it in there, you lose the insight. Right?
Bharani Subramaniam:
That is true. So we are talking about a maturity level here. If you have a fairly complex system but you are okay with observing just the boundaries of the system, then there are a lot of turnkey solutions out there, lots of frameworks. There is middleware for OpenTracing, so you don't have to write code, you just have to turn it on, and you can to a fair degree observe what's going on. That will take you to a certain level. But if you have to really understand which parameters, when passed to this API, are slowing down your system, you really have to depend on the developers putting that code in, because there is no way to instrument that part from the outside. But if you just want to trace the boundaries of the system, that can be done in a turnkey fashion.
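A hedged sketch of the developer-added part Bharani is pointing at: only the code knows which parameter values matter, so it tags them onto the span itself. The class, operation, and tag names here are hypothetical.

```java
import io.opentracing.Span;
import io.opentracing.util.GlobalTracer;

public class ReportService {
    // Developer-added instrumentation: tag the interesting parameters onto the span
    // so that slow traces can be correlated with the inputs that caused them.
    public String generate(String region, int pageSize) {
        Span span = GlobalTracer.get().buildSpan("generate-report").start();
        span.setTag("report.region", region);     // hypothetical tag names
        span.setTag("report.page-size", pageSize);
        try {
            return doGenerate(region, pageSize);  // the actual work
        } finally {
            span.finish();
        }
    }

    private String doGenerate(String region, int pageSize) {
        return "report for " + region + " (" + pageSize + " rows per page)";
    }
}
```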
Prasanna Pendse:
So taking a bit of a skeptical view of this, one of my questions, or one of the questions that a client of ours might ask us, is: either way, what you're saying is that I need to buy AppDynamics or Dynatrace or New Relic or one of those things and turn it loose on my ecosystem as the first step.
Prasanna Pendse:
Why does this distinction particularly matter to me, whether these things are running on a single machine or they're running against a cluster with one of these tools and OpenTracing turned on?
Bharani Subramaniam:
Yeah. And that's a really good question. I would answer that by asking a question back to the team. People often say they embrace microservices architecture and they are following DevOps practices, and I usually ask the probing question, "What about OpsDev practices?" We have everything in our systems to commit code faster, to have continuous integration going on; you automated your infrastructure; you do all these things to make your operations life easier. What have you put in the system for the operators to give feedback back to the development teams from production in case of an outage? So if you are running a distributed system, you need to invest in putting this information in your code, so that when you do have the problem of something going down in production, people are able to support it.
Rebecca Parsons:
But isn't that what we used to use logging for?
Bharani Subramaniam:
Good question. But again, back to the same problem: if I just had one API server running and I had this client, then just tailing the log would be enough. I don't think that's the state of things anymore, Rebecca. You have N number of APIs, and with Kubernetes people just change the replica set and you now have ten more instances of the same API. Different hosts could have different times, and to actually stitch all this together and then have a system to view it, you would end up rebuilding what is already out there. So you don't have to reinvent this whole wheel of how to observe a system; you could just adopt OpenTracing for that use. Most of the tools out there now adopt OpenTracing, so you may not even know that your current monitoring tool already has observability in place right now.
Neal Ford:
Yep. Companies like Netflix did a lot of innovation in this space as they were innovating in microservices. I saw a presentation from one of their DevOps folks, and they said that one of the ways they handle the problem of making sure that everything has observability code in it is that they use aspects to automatically wrap every method, so that they know there's a consistent view from an observability standpoint. And it also strikes me, since Rebecca and I are on the call, that I can't resist mentioning that you could also use architectural fitness functions as a way to verify that every one of your functions touches observability and broadcasts something, as a way to automate that.
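Netflix's exact mechanism isn't shown here, but a minimal sketch of the aspect idea Neal mentions might look like this with Spring AOP / AspectJ annotations; the pointcut package and the error-tagging behavior are assumptions.

```java
import io.opentracing.Span;
import io.opentracing.util.GlobalTracer;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// One aspect wraps every matching method in a span, so individual teams
// don't have to remember to add observability code by hand.
@Aspect
public class TracingAspect {

    // Illustrative pointcut: every public method under a (made-up) package.
    @Around("execution(public * com.example.orders..*(..))")
    public Object traceMethod(ProceedingJoinPoint pjp) throws Throwable {
        Span span = GlobalTracer.get()
                .buildSpan(pjp.getSignature().toShortString())
                .start();
        try {
            return pjp.proceed();          // run the wrapped method
        } catch (Throwable t) {
            span.setTag("error", true);    // mark failures on the span
            throw t;
        } finally {
            span.finish();
        }
    }
}
```

The aspect still has to be registered with the container (for example via Spring's AspectJ auto-proxying), which is exactly the kind of thing a fitness function can check.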
Bharani Subramaniam:
Yes, absolutely. I think the same point applies to the earlier question around logging. Even when we were in a single system, you had to log things in a particular way, with a particular format, in order for automated tools to be able to understand what was happening. And as you get into aspects, fitness functions are also a way of ensuring that you are doing that. A simple example that has existed for a long time is making sure you don't log a credit card number by mistake; having a fitness function that tests for that is one way of doing it. Now in a distributed system, one of the first steps that people may move towards is adding some kind of a correlation ID in your logging system, so that you can trace what happened in one place and then have that ID carried throughout the stack for a particular call.
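A minimal sketch of that correlation-ID approach, assuming a Servlet 4+ container and SLF4J on the classpath; the header and MDC key names are made up.

```java
import java.io.IOException;
import java.util.UUID;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

// Attach a correlation ID to every request so the log lines written while
// handling it can be stitched back together later.
public class CorrelationIdFilter implements Filter {
    private static final String HEADER = "X-Correlation-Id"; // assumed header name

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String id = ((HttpServletRequest) req).getHeader(HEADER);
        if (id == null) {
            id = UUID.randomUUID().toString();  // first hop: mint a new ID
        }
        MDC.put("correlationId", id);           // log pattern can print %X{correlationId}
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove("correlationId");        // don't leak onto the next request
        }
    }
}
```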
Bharani Subramaniam:
Right. But one of the challenges with that approach is that it is not something that is portable from one type of system to another. Meaning your database system may not actually know what the correlation ID is that an application server created. And before the request hits the application server, the web layer has done a few things that we have no way of tracing as part of the same transaction, because it could have made 30 other calls, and those potentially get tagged with different correlation IDs. So tying all of this back together becomes, again, very specific to the particular platform that you're using. And then you add messages being tossed onto Kafka and a hundred consumers doing something else with them, and then it goes to some other parallel universe. Tracing all of that stuff becomes a nightmare without a standard way in which all of these different systems can specify which transaction we are actually talking about.
Neal Ford:
Well, it strikes me that the skeptic persona just got answered by the pragmatist persona, because that's a great answer to the question you posed earlier. The universe is different now than it was 10 years ago, when we could rely on internal monitoring and logging. We didn't have Kafka and message queues and microservices and distributed architectures and all these crazy moving parts. So you can't assume that one part of the engineering ecosystem is going to stay static while all the rest of it changes and grows into something completely different. That's yet another rationale for, "Yes, we need more sophisticated tools to watch things, because our architecture is way more sophisticated now."
Rebecca Parsons:
So I'll continue to play skeptic here. Why wouldn't I just use domain events? So if I've got an event driven architecture, I've got all of these domain events running around. Why is that not sufficient?
Bharani Subramaniam:
Yeah. For the same reason that correlation IDs are not sufficient. You still need a system to know which event occurred before which event, and tying this together in a distributed fashion is not easy, because you can't just rely on the timestamps from the systems; you may have systems whose clocks are not synchronized. So if you end up solving all of that, you are basically rebuilding OpenTracing. It's better to just embrace the framework than to reinvent distributed tracing. And domain events are valuable; we're not saying they aren't. They're valuable for tracking what's happening in your business: is your revenue going up, is the number of orders going up, what does your conversion rate look like? All of those business metrics still need to be monitored, and somebody needs to be paying attention to that.
Bharani Subramaniam:
And collecting that, charting it, showing it, all of that is still valuable. But that's not the same thing as the problems that OpenTracing is there to solve, which is trying to debug something where something has gone wrong and you're trying to figure out exactly why this order did not go through, or why this class of orders did not go through. So having an ability to trace through is very important. But with a heterogeneous estate where you have different types of technologies written in different eras, you will have a very difficult time manually trudging through that. Using some of these OpenTracing adapters, you can get deeper insight into even the older tech stacks as well as the newer ones.
Rebecca Parsons:
So Neal and I often get asked, "Okay, you talk about a fitness function, but what do these things look like?" So if I was going to add one of these architectural fitness functions to ensure that I had the right level of observability, how would I go about doing that?
Neal Ford:
I can start answering that question. There are a lot of tools like ArchUnit that will allow you to look at... So for example, let's say you decided to use aspects as a way to enforce observability. You could write an ArchUnit rule to say, "Make sure that every method that I make a change to in this code base is decorated with an aspect that does observability." So you can write those kinds of verifications with ArchUnit in the Java world, and NetArchTest in the .NET world. There are definitely some structural checks like that you could write as fitness functions. Even in a language like JavaScript, most of the linter or static analysis tools, like PMD in Java or ESLint in the JavaScript world, will let you look at the structure of your code and make some decisions about it. It's not quite as clean as something like ArchUnit, but it's definitely possible.
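As a concrete sketch of the ArchUnit idea, a rule like the following could fail the build when a public service method is missing the tracing decoration; the package, the class-name convention, and the @Traced annotation are all hypothetical stand-ins.

```java
import com.tngtech.archunit.core.domain.JavaClasses;
import com.tngtech.archunit.core.importer.ClassFileImporter;
import com.tngtech.archunit.lang.ArchRule;
import org.junit.jupiter.api.Test;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.methods;

public class ObservabilityFitnessFunctionTest {

    // Stand-in for whatever annotation the team's tracing aspect or agent keys off.
    @interface Traced {}

    @Test
    void everyServiceMethodIsTraced() {
        JavaClasses classes = new ClassFileImporter().importPackages("com.example.orders");

        ArchRule rule = methods()
                .that().arePublic()
                .and().areDeclaredInClassesThat().haveSimpleNameEndingWith("Service")
                .should().beAnnotatedWith(Traced.class);

        rule.check(classes); // fails the build if any method slipped through
    }
}
```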
Prasanna Pendse:
And you'll need that to test whether a given code base is actually emitting these events in a particular way or not, especially for the ones that are automatic. But I think there are other aspects you also need to pay attention to. One is that the tech estate is generally quite diverse and not all of it will have something like aspects, so there you will need to do something maybe a little bit more static. But there are also ways of keeping track of this in a dynamic way. The OpenTracing tool set itself will give you that visibility, and you can test against its output. When you run tests on a distributed system, you can send a no-op transaction, or something of that sort, through the system and test that it goes through the system in the way you expected it to. So that's more of a runtime validation, a runtime fitness function.
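A rough sketch of that runtime fitness function, assuming a staging endpoint for the synthetic transaction and a Jaeger-style query API for reading traces back; the URLs, service name, and operation name are assumptions, and a real test would poll to allow for sampling and reporting delay.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Runtime fitness function (sketch): fire a synthetic no-op transaction at the
// system, then ask the tracing backend whether the expected trace showed up.
public class TraceFitnessFunction {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // 1. Drive a synthetic transaction through the front door.
        http.send(HttpRequest.newBuilder(URI.create("https://staging.example.com/orders/no-op"))
                        .POST(HttpRequest.BodyPublishers.noBody())
                        .build(),
                HttpResponse.BodyHandlers.discarding());

        // 2. Ask the tracing backend (here, a Jaeger query service) for recent traces.
        HttpResponse<String> traces = http.send(
                HttpRequest.newBuilder(URI.create(
                        "http://jaeger-query:16686/api/traces?service=checkout-service&operation=place-order"))
                        .GET()
                        .build(),
                HttpResponse.BodyHandlers.ofString());

        // 3. Crude assertion: the synthetic call should appear in the trace store.
        if (!traces.body().contains("place-order")) {
            throw new AssertionError("Expected trace for the synthetic transaction was not found");
        }
    }
}
```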
Bharani Subramaniam:
Yeah. Or I would say, if you can afford a service mesh in your infrastructure, you can also do this at the boundary of the network layer, so you don't have to implement anything in your code but you can trace the calls that happen to an API at the network boundary. Service meshes are pretty good for that.
Neal Ford:
That's another good example of how the engineering practices have grown up to meet the demands of the more complex architectures we're building. So speaking of a service mesh, one of the things that we talk about in the evolutionary architecture book is this idea of what we call Goldilocks Governance. One of the problems you have in microservices is the let-a-thousand-flowers-bloom problem of every artisanal development stack in the world. The problem you run into then is how do you monitor and create consistent observability across that entire stack. And so the idea of Goldilocks Governance is that maybe that becomes the constraint on how many platforms you want to support in your microservices architecture.
Neal Ford:
How well supported are these platforms by the monitoring and observability tools that we've decided to standardize on as an organization? For example, monitoring tools like Nagios will run on a variety of different platforms. The common pattern in microservices is to create a sidecar component per platform that can plug into the monitoring infrastructure, and that way you have a consistent infrastructure operationally, even if you have different implementation platforms for different services.
Rebecca Parsons:
So how broadly available are tools respecting the OpenTracing standard? Is this available on most of the platforms that we see, or is it still relatively limited in its uptake?
Prasanna Pendse:
Well, I think one thing is that OpenTracing is not officially a standard yet. At least according to their website, it looks and behaves like an official standard, but there's not quite an official standards body authorizing it. However, a lot of tools have adopted it as if it were a standard, because it provides a common playground for people to interoperate.
Bharani Subramaniam:
Hmm. Interesting. I think in 2017 OpenTracing was standardized under the CNCF. I don't know if that qualifies it as an open standard.
Prasanna Pendse:
Sorry, according to their website: first, the CNCF is not an official standards body.
Bharani Subramaniam:
Okay.
Prasanna Pendse:
The OpenTracing API project is working towards creating a more standardized API and instrumentation for distributed tracing. So it's not like an IEEE standard or something of that sort; it's a new body that isn't officially one. But to answer Rebecca's question around tooling... Bharani, you seemed like you were saying something and I interrupted earlier.
Bharani Subramaniam:
Yeah, yeah. So I was going to say, to answer Rebecca in terms of tooling, we have Zipkin, we have Jaeger. There are a bunch of tools out there that support most of the popular languages and frameworks, be it Golang, Java, JavaScript or Python. There are adapters for all of these languages and most of the frameworks in these languages.
Neal Ford:
So it sounds like you would say that it would be considered good advice to use something that supports that standard, right? That would be a "best practice", using air quotes here. That would be considered a good idea in this space. So what are some other good ideas in this space? We touched on correlation IDs earlier, right?
Bharani Subramaniam:
Yeah. One other question that we often run into when teams adopt OpenTracing is: look, I have this logging and I need to log, I have OpenTracing, and I can also log as part of a trace. So which one should I use? This is one of the early questions that people seem to ask when they adopt OpenTracing. And the advice has always been: if you want to log something that is attached to a user journey, you're much better off adopting OpenTracing for it than using your normal logger. But if you want to log something for the system, something is not working or something is down, you are logging for the operator, so you're better off using the existing logging framework. Because the thing with OpenTracing is that it's a choice given to the operators to turn tracing on and off.
Bharani Subramaniam:
So if production is seeing a lot of transactions, you can go from constant sampling to, let's say, 20% of the traffic. If you are using OpenTracing to log system events, there are chances that they'll get dropped. So another quote-unquote good practice is: stick with normal logging for system and application events, and stick with OpenTracing for user-journey-related domain events.
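A small sketch of that split, assuming the OpenTracing Java API and SLF4J: span logs stay attached to the (possibly sampled) trace, while system events go to the ordinary logger. The class, event, and cart names are illustrative.

```java
import java.util.Map;
import io.opentracing.Span;
import io.opentracing.util.GlobalTracer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CheckoutHandler {
    private static final Logger log = LoggerFactory.getLogger(CheckoutHandler.class);

    public void checkout(String cartId) {
        // User-journey detail goes on the active span, so it travels with the trace,
        // but it is subject to whatever sampling rate the operators have chosen
        // (for example JAEGER_SAMPLER_TYPE=probabilistic, JAEGER_SAMPLER_PARAM=0.2).
        Span span = GlobalTracer.get().activeSpan();
        if (span != null) {
            span.log(Map.of("event", "checkout-started", "cartId", cartId));
        }

        // System and application events go to the normal logger, which is never sampled away.
        log.info("Checkout request received for cart {}", cartId);
    }
}
```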
Neal Ford:
And it sounds like another best practice was: don't use domain events as your tracing mechanism. That feels to me like the same kind of ickiness as using domain data for keys in relational databases rather than generating keys. So that would be another considered good practice in this space: don't use domain events for tracing.
Prasanna Pendse:
I think, and actually Bharani and I were talking about this earlier, the way it looks is that OpenTracing itself is its own domain. This is a specific concern from an operations standpoint, and it has its own language and its own concerns that are orthogonal to whatever business you're in. So standardizing on that language allows you to get better at managing these distributed systems. It has words like spans and baggage and things of that nature, which may not be in your business domain, but they help in standardizing a conversation from one operations team to another.
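To make that vocabulary concrete, here is a tiny hedged example of a span and a baggage item with the OpenTracing Java API; the operation name and baggage key are made up.

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class BaggageExample {
    public void handleRequest(String tenantId) {
        Tracer tracer = GlobalTracer.get();

        // "Span": one named, timed unit of work inside a trace.
        Span span = tracer.buildSpan("lookup-inventory").start();

        // "Baggage": key/value pairs carried in the span context across process
        // boundaries, so services further down the call chain can read them too.
        span.setBaggageItem("tenant-id", tenantId); // illustrative key

        try {
            // ... do the work and call downstream services ...
        } finally {
            span.finish();
        }
    }
}
```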
Rebecca Parsons:
Well, and also, based on what Bharani was just saying about when to log versus when to use the OpenTracing system, it sounds like now we actually have three domains. We have the pure system domain: gee, I've got a processor that's not answering me, or whatever we might log at the system level. Then we have the business domain, where we're talking about orders and sales. And then we have this OpenTracing domain, which is helping us understand how a given business event has been realized through a series of activities spread across our distributed architecture.
Prasanna Pendse:
Right, right. And I think across all of those domains, as well as the earlier technical advice that we have been giving to people, one of the dangers in the way a lot of tools are approaching this is that they're trying to incentivize people to get locked into their stack. Tracing is one way in which some of this sort of tooling provides its own ability to do monitoring. And what happens is that your entire operations team becomes geared around it: your metrics collection, your dashboarding, all of that gets locked into a given tool. And sometimes, for technical reasons, you may choose to move away from that tool to something else, and then all of that becomes a bottleneck; all of that prevents you from moving on. So OpenTracing separates the concerns between the tool that actually does the work and the tool that does the monitoring. Observability is kind of the layer that enables you to separate those two concerns.
Prasanna Pendse:
So, yeah, one of the best practices would be to avoid getting locked into a tool-specific monitoring solution.
Bharani Subramaniam:
Yeah.
Prasanna Pendse:
And kind of try to separate those concerns.
Bharani Subramaniam:
I was going to say, plus one. Oftentimes we see teams where the choices are constrained by what kind of monitoring tools are out there; I think that is sort of an anti-pattern that we have seen. With OpenTracing you now have the freedom to choose whatever tool you want to collect the traces. You can stick with the standard, so you emit events that conform to OpenTracing, and the standard actually makes the whole thing work.
Rebecca Parsons:
So are there other implications within my infrastructure, within my technology estate that I have to take into account, to be able to support observability?
Prasanna Pendse:
So one of the things that happens in this kind of a world is that you have logging, you have the actual business events that are flowing through, you have these OpenTracing-type events that are happening, and probably other types of events flowing onto the network. There's a lot of data that you're now moving back and forth in your data center. And so one of the challenges people have is that the network infrastructure wasn't actually created for that much data to be moving through it. And as we get to more modern systems, you have, for example, telemetry that is tracking user behavior, and you're not outsourcing that to an analytics company but actually getting those events back into your system, doing your own analytics, and making real-time changes to what you present to the user based on that. With those kinds of systems, again, the network traffic goes up, and so there will be an impact on the kind of switches that you have and all of that.
Prasanna Pendse:
Martin talked about "you must be this tall to do microservices," and I think that rule applies here as well: yes, things are going to get more complicated, but you should venture into this world of distributed systems and microservices only if your need justifies all of the complexity that it brings with it. And this is one other aspect of that complexity: how do you actually trace things across lots of different systems in a way that is efficient on your network, or how do you then change your network to be able to handle this?
Prasanna Pendse:
And that applies largely to people who have their own data center. Obviously, in the world of cloud providers, this becomes a little bit easier, although you may still end up needing to change things; some of the cloud providers offer InfiniBand connections, for example. I'm not suggesting that you need InfiniBand to do observability, but as your network needs grow, cloud providers are probably going to provide better, faster connections than your own data center can in the timeframe that you have to meet the business needs.
Rebecca Parsons:
Well, hopefully now we all understand the difference between observability and monitoring, and why we need to think about observability as its own domain when we are architecting distributed systems, microservices-based or otherwise. So thank you, Bharani, thank you, Prasanna, for joining us, and we hope to see you all next time on the Thoughtworks Technology Podcast.