Brief summary
Many organizations regard code freezes as a way of reducing the risk of downtime during periods of peak demand. But associating outages with changes often masks a wider lack of faith in the deployment process, which is potentially where your focus should be. Here, our podcasters explore the negative impacts of code freezes, as well as the instances where code freezes can be beneficial.
Full transcript
Neal Ford: Welcome, everyone, to The Thoughtworks Technology Podcast. I'm one of your regular hosts, Neal Ford. I'm joined today by another of our regular hosts.
Mike Mason: Yes, this is Mike Mason. This is very exciting because we're actually in person in the New York office, one of the first in-person meetings we're doing this year, where we put together the Radar. As part of the Thoughtworks Technology Radar discussions, we often have some pretty interesting things to talk about that sometimes make it into podcasts.
That's what we're going to be doing today. Neal and I are joined by Cassie Shum. Hi, Cassie.
Cassie Shum: Hello.
Mike: We're also joined by Ian Cartwright, all the way from Hamburg. Hi, Ian. How are you doing?
Ian Cartwright: Hi, there.
Neal: One of the things you may have heard us talk about before on this podcast is the category of things that were too complex to blip on our Radar. We're here putting together our Radar, and a topic that aroused a lot of passion but ultimately, alas, fell into the too-complex-to-blip category is code freezes considered harmful. That's what Ian and Cassie are here to talk about. Let's start by describing what the bad thing is. What is a code freeze?
Cassie: A code freeze is essentially a team- or organization-mandated freeze on all commits from developers and engineers to the codebase. As I've seen it, this can last for many weeks, even up to a month or two. It's usually done because there is fear that any code that goes into the codebase at this time would break something major in production. This is normally done before a big event like Black Friday for e-commerce, or a big banking transaction day, where it would be very detrimental to the organization if anything bad happened in production on those days.
Neal: Everyone knows that when you have a whole bunch of bits and you get them all arranged, you should let them settle for a few days to make sure they all align correctly. Is that where the reasoning for this comes from?
Cassie: Yes, and I think another big reason is that the QA cycle is usually quite long. That is the time when a lot of both automated and manual QA happens, to ensure that when we hit those important days, there's a steady situation.
Neal: Ian, why is this a bad idea?
Ian: I guess we can come on to it; I think there are legitimate reasons why you would want to do a code freeze, but in general it's a bad idea, because two things happen. One is that everybody knows that date is coming. The business stakeholders know, the developers know, the QAs know. You get this mad rush, where business people say, "I really need that feature in, because if I don't get it in now, it's going to be January." Everybody's rushing and trying to get that last-minute piece of work in. They make compromises and they rush to do it.
Ironically, through trying to de-risk the changes, we end up throwing more changes at it and doing everything in a rush, which in itself ends up creating problems. That's one reason it's a bad idea. I think the other is that it betrays a lack of confidence in the deployment processes. Instead of fixing that, we just say, let's just try not to do deployments when we think it will be a risky period for our business. The root cause is we don't trust our own deployment processes.
Mike: I think a lot of organizations have organizational PTSD about some previous event where something went spectacularly wrong, and somebody, usually high up, says, "What the heck happened? How do we make sure that never happens again?" Somehow code freezes become the solution.
Ian: Yes, absolutely. I can think of lots of clients I've worked with who have code freezes because of some traumatic event in the past, where on the worst day of the worst month they had an outage and the business lost a significant amount of revenue. The reaction, if you like, is to fix the symptom. The symptom was that we had an outage on that day due to a deployment, so therefore, don't do deployments.
Neal: Many years ago, I actually named that the frozen cave person antipattern. It's as if this horrible thing happened to somebody, they were flash-frozen, and then awakened last week, and it's the thing they always bring up in every meeting: but what happens if this happens again? It seems like a common antipattern in a lot of organizations.
Cassie: One thing to add is that I think it also creates a very blame-centered culture. Because of some of these traumatic events, I remember at some of my clients it was the QA department that messed up or missed the actual bug that went in and ruined everything, or it was this team or that particular feature team; payments is a good example, because a lot of really important business logic goes into it. This blame culture propagates alongside the trauma, and that fear escalates into: don't do anything, and that'll be the safest way to go about it.
Neal: I think that also helps prevent pushback, because if you end up being the unlucky department that gets blamed for something like that, then you're very gun-shy in the future. We don't want it to happen to us again, because then they're going to take us out and lynch us or something. That creates even more fear of responsibility, which we know is the opposite of effectiveness, once you create that culture.
Mike: I think some of it's not exactly justified, but you can understand where people get to. I feel like a lot of software systems have something I'd call implicit testing in them. If something's been sitting in production and running for a while and nobody's complaining about it, then even if you don't actually have automated testing or a formal QA process for that thing, the fact that it's been sitting there mostly okay must mean it's mostly okay. If you don't have a good testing process, as soon as you change a single line of code in that thing that was previously okay, you lose all that implicit testing. Not that that's a good strategy, but you can see why people fall into it.
Ian: I think related to that, and many people have made this observation so there's nothing new here, what often happens where you have a separate deployment team, which is often the case in the organizations where we tend to see code freezes, is that they come to associate outages with changes. Therefore, what can we do to reduce the number of outages? We'll reduce the number of changes. That becomes the tactic: the outage must have been caused by a change we made, therefore let's not make any changes. The other thing I'd mention as well is the performance testing element of all of this.
In particular, a lot of code freezes are introduced because you are going into a period of peak trading. You're expecting to see heavy demand on the system, and quite often there's no real way to replicate that at any other time or in any other place than the actual event itself. There's a huge amount of nervousness, and what tends to happen is, Mike, just as you were describing, the thinking becomes: the system's been stable for four weeks, therefore we're in a better place to cope with the peak in demand than if we'd just changed a bunch of stuff. Unfortunately, that rarely turns out to be the case.
Mike: Again, it is genuinely difficult to do this kind of testing. I do have a certain amount of sympathy for organizations. I remember working for a large lifestyle electronics retailer who at the time were doing everything with traditional fixed infrastructure, and they had two data centers. In order to do a load test, they would have to take one entire data center out of the production load balancer and then use that half of their infrastructure to do their performance testing, which they still did do.
They could afford to because all of this fixed infrastructure was sized for the absolute peak transaction volume, which they knew they were nowhere near, because their annual sales cycle was not about to peak while they were doing performance testing. It is genuinely difficult, especially when you have goodness knows what downstream integrations as well. If you're talking retail, you've got payment gateways and banks and things that you're talking to, and they all need to be part of the test cycle. What's the alternative to code freezes, then? What should we be doing instead?
Ian: I think there are a couple of things. One is that when you do have outages, firstly, you need to go and find the real root causes. Yes, you made a change, but it's the way in which you made that change, maybe the lack of testing, or something in how you made the change. What you need to do is go all the way back and fix it at the point where you introduced it, not just put more and more gatekeeping in. That's one thing you can do to try and address change freezes. The other is, and again, this is not a new observation, lots of people have said it: the more often you do something, the better you become at it, and the lower risk it becomes.
Change freezes aren't typically associated with organizations doing a release every day who then decide not to release for three months. They're often organizations doing a release once or twice a month, or maybe every three months. These are big, scary events anyway. Then if you add that big, scary event to peak trading, it becomes, "Oh, my goodness, we can't do it." So the other thing is to start deploying much more frequently. Then it's much smaller changes, you become more comfortable, and the risk gets smaller because you're making smaller changes. That's one way in which you can try and address code freezes.
Neal: Ian just finally mentioned the elephant in the room, which is humorous because three of us are sitting in a very small room (imagine an elephant in this room), but that elephant is risk. That's the root cause of why so many organizations implement things like code freezes: they pin it on risk, it's too risky a behavior. Ian's talking about a couple of ways you can mitigate risk, and that's really the question you have to get to: what can we do to ease your mind that something bad isn't going to happen if we started doing this more frequently? Whether it's better testing or--
Like Ian said, the more you do something, the less it's a big special event; it's just the thing we do every day, and that becomes a lot less inherently risky, because statistically it doesn't break most of the time, only every once in a while. The more you build up all this accumulated stuff, the more likely it is to break. That's one of the great lessons from continuous integration: by doing it more often, you end up doing a lot less of that integration work than if you do it all at the end.
Cassie: I would also add, alongside that, yes, definitely mitigate that risk, but there's also the organizational safety aspect of it. Something I've noticed as a result of these code freezes is the blame culture I mentioned before, but on top of that, it propagates. With the blame culture comes a hero mentality as well. I have seen the deployment team, or someone on it, stay up all night before the big event, and it goes off without a hitch. That team gets a nice medal and sometimes even a bonus; I have seen that before.
Then, of course, people are either very scared to deploy and to do things because of the code freeze, or they are one of the people who are excited to be there at 2:00 AM if something goes wrong. Those two attitudes propagate a pretty unhealthy culture going forward. If you're trying to mitigate these risks and deploy often, it's not about congratulating the hero culture; it's about making it a safe environment for folks to deploy and push code and those types of things. That's a big one. We have seen organizations that are able to deploy on a more constant basis really look at things like blameless post-mortems, really asking: why did something happen?
It's not a blame thing; it's actually asking the five whys, going back to what Ian said: what is the root cause of what actually happened, and let's start working on those things. More often than not, we see that the symptoms stem from really complex, big-ball-of-mud monoliths, which are very hard to push code to anyway. What are the strategies we want to put in place to decouple some of those things, as a plan, for example? Those are the things that should be invested in, as opposed to covering up the symptoms.
Ian: Yes, you mentioned the big ball of mud there. Again, thinking about clients I've worked with who did have code freezes, another common factor was large, monolithic, complex architectures where there was no graceful degradation. They had learned from bitter experience that the architecture tended to fail spectacularly in one big go, as opposed to just showing slow performance on certain things. It would suddenly all stop working. I think moving to more frequent deployments absolutely helps, but you may also need to make some changes to your architecture. Maybe introduce some patterns like the circuit breaker.
If we are seeing very high load, maybe we make sure we don't pass that through to the legacy database or mainframe that would then crash completely; you protect those critical components from that load. I think the other thing that's out there as well, since we were talking about peak load: a lot of organizations say the peak load is when we're seeing the most demand from our customers to use the system. That's not quite true. The peak load comes just after you've had an outage on the busiest day.
That's the real peak load, and I think that's why you often hear about these repeated failures: the site comes back up, five minutes later it's gone again, they bring the site back up, everybody piles back in because they want to finish buying what was in their shopping cart or whatever it was, and the site goes down again.
I think the other thing is you have to architect for recovery from failure.
I think some organizations have become so focused on avoiding failure that they never stop to think: it will happen, you will have a crash, and you have to plan for how you come back from that situation. The problem is that if you haven't done that and you have an outage, it just adds to the trauma, so suddenly it's taking two or three days.
In fact, I worked for an airline who actually had to suspend flights for several hours, because what was happening was all the customers would come back onto the site to check the status of their flights and the systems would go down again. In the end, they just put a blanket "There are no flights today" message on the website in order to get the load off the systems, because this mass behavior of users all coming back at the same moment just caused this repeated crash.
Neal: This did end up in too complex to blip. We've been giving mostly one side of this story, but Ian mentioned the nuance early on. What are the situations where it does make sense to do this, and how can you mitigate the downsides when a code freeze is necessary?
Ian: I have one example that's quite niche, but this was a client who was making scientific instruments. At the time, they had no way to do over-the-air updates for their devices. The software was blown onto a physical EPROM, I think that's the right term, and then installed as part of the manufacturing process. They would have a manufacturing freeze date and then make a huge batch of these machines, which would then ship out to the customers, and if there was a mistake, you had to send an engineer to every single customer site with a bunch of new ROMs to plug in.
There, you want a code freeze. You want to know exactly what is in that software at the point it ships, that you've tested it, and that nobody has done a last-minute check-in that suddenly ends up in the code in an unexpected way. I've heard the same from car manufacturers, although that's changing because a lot of them are moving to over-the-air updates now. If you have a manufacturing deadline and the software is going into a physical device that's hard to update, then something like a code freeze makes a lot of sense.
Cassie: Yes. I think another example is that we're not asking you to stop your code freezes today, especially if you have a lot of root problems that you need to uncover first. At some of the clients I've had, we got brought in maybe three or four months before the Black Friday event. In that circumstance, thinking about things in an iterative manner, the question is how do we start, instead of the organization saying this is a massive code freeze during which no one pushes anything.
It's actually saying, "Hey, what are the areas that after you do the 5 Whys of where are the problem areas that we are trying to mitigate those risks essentially? It is okay to have very smaller code freezes on those specific areas as you're trying to fix the root problems, going forward but I would say then the next time that big event comes, hopefully, that code freeze goes away. Again, it's something that I wouldn't say, "Just remove all code freezes today," because there are inherent root problems that needs fixing and so I would say, start figuring out what the long-term strategy is on those things. Iteratively, start making those code freezes much smaller windows into the areas of the code that are quite risky at the moment either for performance or security reasons, and actually start looking at those areas.
Neal: I think that's a super important perspective that too many architects don't take, which is that software engineering is really about code over long periods of time. It's not snapshots of one state to another, and you can't just instantly change from one thing to another, particularly in large organizations. It's really about whether you can make incremental improvements constantly. Even if you never reach this mythical perfect end state, you can get better constantly and find the places where there are genuine risks that you want to isolate, and the places where it's a lot less risky and you can be a lot more aggressive.
Ian: I have another example where, when I describe it, you might think, "Wow, hold on, no one would do that." It's amazing how often I've seen it. When you have organizations with code freezes, you tend to get change windows: periods of time when you are allowed to make a change. The problem is that it's the same change window the operational people have, the network people have, and the infrastructure people have.
I've seen several examples where someone's trying to make a change to the software, and they haven't coordinated with the infrastructure team, who are busy migrating a bunch of stuff to a new set of disks. The deployment runs incredibly slowly because in the background all these RAID arrays are being rebuilt. There's just been no coordination, no one saying we don't want two big changes at the same time. If they'd coordinated a little more, we probably would not have done the deployment that day; it would have been paused. That, unfortunately, seems to happen remarkably often.
As I say, when you have these organizations who get very risk-averse, it tends to push everybody into trying to make the change during the same small window.
Neal: I've seen that with QA in particular: you've got this small window and everybody's rushing to get stuff done. People get caught up in the dogma that every iteration shall last exactly this long, and the QA people get really stressed at the end of it. I've told a lot of people, "Why not just offset that by one week?" That gives them a week, and they're like, "But the rule says it has to be..." You're stressing these people out; stop it. Exactly to Ian's point: thinking about the process of why things go out, rather than dogmatically repeating this manufacturing metaphor we've read over and over again, helps get more effective performance.
Cassie: One thing to add to that: I've actually seen at a client that we stressed out the QA organization so much because of these dogmatic rules that, and these are the people who probably know the domain best in your organization, an entire QA group resigned after a very stressful situation, and the organization lost all of its domain experts as well. That knock-on effect can be quite detrimental if that culture keeps propagating.
Ian: I do want to mention something that is, if you like, the unspoken secret of change freezes. Most of them aren't really change freezes. What they normally mean is that the normal process of deploying software isn't happening. However, there are times when someone will say, "Oh, my goodness, we've got a problem, quick, make a change to the prod system." There's almost always what they call the tactical or emergency deployment process, which in most places seems to be: ignore all the testing and just do the change anyway.
So here's the other irony of change freezes: they're often not really change freezes. Because no code change is supposed to be happening, because no one's meant to be checking anything in, what often happens is that in the first release after the change freeze, you regress all the bugs that you fixed during the freeze, because those fixes never went into source control; you weren't allowed to check in. You get this ironic behavior where, in that first release, not only have you got all the pent-up demand from the people who've not been able to get things done during the code freeze, you also regress all the bug fixes you made using your emergency deployment process.
Mike: Hopefully that gives listeners an idea of why we felt that was too complex to blip, since a blip is only a couple of paragraphs. Thanks very much for listening. Thank you, Ian and Cassie, for being our guests on this podcast. From Neal and me, thanks for listening. Please do give us a rating, however you're listening to this, on whatever platform, and leave us a comment telling us how you liked the episode and what else you'd like to hear from us.
Neal: Thanks everyone, and we'll see you next time.