
How to make your next Black Friday stress-free

An interview with Glauco Oliveira, Thoughtworks Principal Engineer

Glauco recently spent Black Friday working on-call support for one of Thoughtworks’ high-profile European retail clients. He and his team were responsible for the availability and smooth running of the systems underpinning the core customer journey from search to checkout.

 

I took the chance to ask Glauco about his experience and to learn how to keep crucial retail systems up and running during this critical period. Glauco is an experienced technical leader, and his answers are relevant to both engineers and technical leaders working in retail who are likely already looking ahead to Black Friday 2024.

 

If customer-facing systems go down, online shoppers can’t buy what they want and the retailer stops making money. It would be the equivalent of the lights simultaneously going out across an entire chain of high street shops, and yet an outage could be the result of a single misplaced character.

 

As for most retailers, Cyber Week and the holiday period are pivotal to the financial results for the entire year. In 2023 in the U.S. alone, online consumers spent $9.8bn on Black Friday and another $12bn on the following Cyber Monday.

 

This spend comes with challenges. Traffic increases by multiples during Cyber Week as customers seek to take advantage of time-limited bargains. This can stress or even overwhelm systems that functioned perfectly during the rest of the year. To add to the challenge, customers behave differently during sales, so it is difficult to extrapolate from patterns observed during the rest of the year.

 

According to Glauco, good team and technical practices throughout the year are the key to reliable systems in November.

 

What would be the impact if one of the systems you maintain had a problem during Black Friday?

 

There is a range of possible severities. Not all problems are equal. Part of good incident response is differentiating between levels of impact so that you can act proportionately.

 

This client categorizes incidents into three distinct groups according to their customer and business impact (a small classification sketch follows the list):

 

  • SEV1 is a critical issue affecting a significant number of users (1,000+) in a production environment, with many thousands of orders per hour (5,000+) expected to be delayed or lost. This type of incident would usually cost millions of euros.

     

  • SEV2 is a major issue affecting a subset of users (100+) in a production environment, with up to thousands of orders per hour expected to be delayed or lost. This type of incident usually costs hundreds of thousands of euros.

     

  • SEV3 is a moderate incident causing errors or minor problems for a small number of users (fewer than a hundred), with no broader negative customer impact. In many cases these incidents cost nothing or a few thousand euros.
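To make those thresholds concrete, here is a minimal sketch of how such a classification could be expressed in code. The numbers mirror the figures quoted above; the names and the function itself are illustrative, not the client’s actual incident tooling.

```python
from dataclasses import dataclass

@dataclass
class IncidentImpact:
    affected_users: int            # users currently hitting errors in production
    orders_at_risk_per_hour: int   # orders expected to be delayed or lost

def classify_severity(impact: IncidentImpact) -> str:
    """Map customer and business impact to a severity bucket (thresholds as above)."""
    if impact.affected_users >= 1_000 or impact.orders_at_risk_per_hour >= 5_000:
        return "SEV1"  # critical: usually costs millions of euros
    if impact.affected_users >= 100:
        return "SEV2"  # major: usually costs hundreds of thousands of euros
    return "SEV3"      # moderate: little or no direct cost

# Example: a checkout outage hitting 2,500 users and ~6,000 orders/hour is a SEV1.
print(classify_severity(IncidentImpact(affected_users=2_500, orders_at_risk_per_hour=6_000)))
```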

     

Back to the original question: having these systems down for an hour during Cyber Week would have been a SEV1 incident and would have cost millions of euros, as this retailer makes around 7% of its annual gross revenue during this period.
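A rough back-of-the-envelope calculation shows why an hour of downtime lands squarely in SEV1 territory. The annual revenue figure below is purely hypothetical; only the roughly 7% Cyber Week share comes from the interview.

```python
# Hypothetical revenue figure for illustration; the ~7% share is from the interview.
annual_gross_revenue_eur = 5_000_000_000   # assume a €5bn-a-year retailer
cyber_week_share = 0.07                    # ~7% of annual gross revenue in one week
cyber_week_hours = 7 * 24

revenue_per_hour = annual_gross_revenue_eur * cyber_week_share / cyber_week_hours
print(f"~€{revenue_per_hour:,.0f} of revenue flows through per hour of Cyber Week")
# ~€2,083,333 — so a full-hour outage really does cost millions of euros.
```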

 

As an engineer, it’s sobering to realize that the cost to the business from one mistake could be orders of magnitude greater than my annual salary.

 

Does awareness of the financial impact of outages create intense pressure? Developers are used to all sorts of safety nets: rolling back, reverting, CTRL+Z. But you can’t get back sales lost at a peak time.

 

It does create a lot of pressure. The leadership is aware of this and has implemented several mechanisms to prevent catastrophic incidents. Preparations for Cyber Week start around July and there is a dedicated committee to drive the efforts forward.

 

Part of the preparation involves grouping interdependent systems and running load tests every two weeks in live environments, alongside the actual customer load. Every team is expected to state its expected results before the tests run, and if it fails to achieve the desired performance, it comes up with action items to fix the problem.
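As a flavour of what such a recurring load test can look like, here is a minimal sketch using Locust, an open-source load-testing tool. The interview does not name the client’s tooling or endpoints, so the paths, weights and numbers below are assumptions.

```python
# locustfile.py — a minimal load-test sketch with hypothetical shop endpoints.
from locust import HttpUser, task, between

class CyberWeekShopper(HttpUser):
    wait_time = between(1, 5)  # simulated shoppers pause 1–5 seconds between actions

    @task(3)
    def search(self):
        # Search traffic dominates during sales, so it is weighted more heavily.
        self.client.get("/search?q=tv")

    @task(1)
    def view_product_and_add_to_basket(self):
        self.client.get("/product/12345")
        self.client.post("/basket", json={"product_id": "12345", "qty": 1})

# Run, for example:
#   locust -f locustfile.py --host https://staging.example.com \
#          --users 5000 --spawn-rate 100 --run-time 30m
# and compare observed latency and error rates against the results the team
# predicted before the run.
```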

 

If the issue is related to an external dependency, or to a major change that is out of the team’s control, it’s entered into a centralized risk repository managed by the Cyber Week committee, which makes sure all risks are mapped and most of them are mitigated. Some teams adopt the practice of running such load tests every quarter throughout the year, ensuring that their systems are always ready and giving the teams early feedback on system performance long before Cyber Week.

 

In terms of safety nets, they adopt code freezes: as Cyber Week approaches, teams are expected to release fewer major changes. At some point, only fixes for critical bugs can be sent to live environments. A week before Cyber Week, no deployments happen unless there is a strong case to do so.

 

Some engineers are skeptical of the efficacy of code freezes. What’s your opinion?

 

Code freezes are a controversial topic because some organizations adopt them as a substitute for hardening their deployment pipelines and improving their engineering practices. Code freezes can also defer problems to the new year, when a backlog of pent-up changes is released at once and can interact in unpredictable ways.

 

However, I do believe that high-maturity engineering organizations like my client are right to implement a differentiated deployment policy around Cyber Week, given the commercial importance of this intense period. Most outages are triggered by changes, and making fewer changes does reduce risk.

 

In their case, the organization throttles the rate of non-essential, high-risk changes during the critical period, and instead focuses everyone on operational excellence. This has been effective at avoiding preventable, costly interruptions to their business.

 

However, organizations considering code freezes should keep the window short and not use it as an excuse for poor engineering practices. They should also be careful to gradually “defrost” from the code freeze and release accumulated changes in a controlled manner. That way, any issue can be easily traced back to the faulty change that caused it.

 

What advice would you give to a developer who is about to go on-call for their first high-traffic period?

 

If you do all, or most, of these things right, being on-call is going to be extremely boring. My advice is to start by having safety nets in place:

 

  • Practice continuous integration and make sure that every deployment happens through your automated pipeline. That way, each deployment validates your release process and you reduce the possibility of incorrect manual changes made under pressure.

     

  • Write automated tests. Lots of them. They are a proven mechanism for preventing the release of bugs. They also make for a less stressful work environment for engineers. 

     

  • Collect and plot business and technical metrics on a dashboard, with alerting in case anything goes wrong. You should never rely on your customers complaining to discover an issue (a small sketch of this follows the list).

     

  • Cultivate a blameless culture around incidents. Hold post-mortems and take prompt action on the items identified. Incidents are gifts of knowledge about how your systems behave under stress – don’t waste them.
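On the metrics-and-alerting point, here is a minimal sketch of exposing a business metric that a dashboard and alerting system can consume, using the prometheus_client library for Python. The metric names, threshold and downstream call are illustrative, not the client’s actual stack.

```python
# A minimal sketch: expose order counters for scraping by a monitoring system.
from prometheus_client import Counter, start_http_server

ORDERS_PLACED = Counter("orders_placed_total", "Orders successfully placed")
ORDERS_FAILED = Counter("orders_failed_total", "Order attempts that errored")

def submit_to_backend(order):
    ...  # hypothetical downstream call to the order system

def place_order(order):
    try:
        submit_to_backend(order)
        ORDERS_PLACED.inc()
    except Exception:
        ORDERS_FAILED.inc()
        raise

if __name__ == "__main__":
    start_http_server(8000)  # metrics endpoint for the monitoring system to scrape
    # On the monitoring side, an alert might fire when the failure ratio over the
    # last five minutes exceeds a small threshold (say 2%), rather than waiting
    # for customers to complain.
```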

     

What happens if something goes wrong in production? Can you talk me through how a mature organization finds the problem, figures out who should work on it, and so on?

 

There are a few different ways of knowing that something went wrong. When we deploy, we have annotations in our dashboards that mark the time of deployment. If any metric starts to shift too much, it may be a sign that the deployment had a negative impact on the live system. But this type of annotation only helps you when you are actively looking at a graph, which may not always be the case.
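Deployment markers like these are typically pushed from the delivery pipeline. Below is a minimal sketch assuming a Grafana-style annotations HTTP API; the interview does not name the client’s dashboarding tool, and the URL, token and tags are placeholders.

```python
# A minimal sketch of marking a deployment on a dashboard via an annotations API.
import time
import requests

def annotate_deployment(service: str, version: str) -> None:
    response = requests.post(
        "https://grafana.example.com/api/annotations",    # placeholder URL
        headers={"Authorization": "Bearer <api-token>"},   # placeholder token
        json={
            "time": int(time.time() * 1000),               # epoch milliseconds
            "tags": ["deployment", service],
            "text": f"Deployed {service} {version}",
        },
        timeout=5,
    )
    response.raise_for_status()

# Called from the pipeline right after a release goes live, so any metric shift
# on the dashboards can be lined up with the deployment marker:
# annotate_deployment("checkout-service", "2024.11.1")
```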

 

Another way of discovering that a problem exists is through monitoring — production checks that run periodically and alert you when a problem occurs and requires human intervention. Not all problems have the same criticality, so they are grouped into different buckets (red, orange and yellow).

 

Red alerts should be rare, but when they fire it is because there is an impactful problem in the live system which may affect customers. A developer is expected to analyze and fix them immediately. They should always be linked to a playbook that explains how to analyze and fix the error. Orange and yellow alerts should only go to the team chat, with the expectation that these problems are fixed during working hours.
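As an illustration of that routing, here is a minimal sketch of a periodic production check that separates red alerts (page someone now, with a playbook attached) from orange and yellow ones (post to team chat). The endpoints, buckets and notification hooks are assumptions, not the client’s monitoring stack.

```python
# A minimal sketch of periodic production checks with red/orange/yellow routing.
import requests

CHECKS = [
    # (name, url, bucket if failing, playbook for red alerts)
    ("checkout reachable", "https://shop.example.com/health/checkout", "red",
     "https://wiki.example.com/playbooks/checkout-down"),
    ("search latency", "https://shop.example.com/health/search", "orange", None),
]

def page_on_call(name: str, playbook: str) -> None:
    ...  # hypothetical: trigger the paging tool, attaching the playbook link

def post_to_team_chat(name: str) -> None:
    ...  # hypothetical: post to the team channel, to be handled in working hours

def run_checks() -> None:
    for name, url, bucket, playbook in CHECKS:
        try:
            healthy = requests.get(url, timeout=3).status_code == 200
        except requests.RequestException:
            healthy = False
        if healthy:
            continue
        if bucket == "red":
            page_on_call(name, playbook)  # immediate human intervention expected
        else:
            post_to_team_chat(name)       # fix during working hours

# Scheduled by the monitoring system to run every minute or two.
```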

 

Sometimes, even when you have a well-structured and documented deployment procedure and monitoring, you may still introduce unwanted bugs into a live environment. A support chat channel — where colleagues who operate systems dependent on yours can discuss problems or anomalies — is important for diagnosing subtle issues when they happen.

 

What kind of things should you do the rest of the year to make Black Friday stress-free?

 

There are things you can do to keep your systems fit and healthy throughout the year:

 

  • Keep a team technical roadmap up-to-date with known issues and get leadership support to prioritize the most important topics

     

  • Continuously refine your monitoring by either adding, changing or removing checks and alerts

     

  • Improve the delivery pipeline by decreasing its execution time and adding or removing steps that would benefit the team. Minimizing manual testing is key!

     

  • Strengthen the system testing suite through the addition of new tests and the refinement of existing ones. Mutation testing has helped us identify weak tests that could be improved. Property-based testing has helped us identify scenarios where serialization was not working correctly (a small sketch of this follows the list).
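To illustrate the property-based testing point, here is a minimal sketch of a serialization round-trip test using the Hypothesis library for Python. The Order model and JSON codec are illustrative stand-ins, not the client’s code.

```python
# A minimal sketch of a property-based round-trip test with Hypothesis.
import json
from dataclasses import dataclass, asdict
from hypothesis import given, strategies as st

@dataclass
class Order:
    order_id: str
    quantity: int

def encode(order: Order) -> str:
    return json.dumps(asdict(order))

def decode(payload: str) -> Order:
    return Order(**json.loads(payload))

@given(st.builds(Order, order_id=st.text(), quantity=st.integers(min_value=1)))
def test_order_roundtrip(order: Order) -> None:
    # Serialization must be lossless: decode(encode(x)) == x for any generated Order.
    assert decode(encode(order)) == order
```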

     

Thank you to Glauco for his advice on the practical steps teams can take to reduce engineers’ stress and protect the availability of important systems at critical periods in the year.

 

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
