How to take over a large-scale and complex legacy system

Yue Wang, Jianing Zheng and Yaming Huang

Published: May 04, 2022

Project background

Project takeover from a client to a Thoughtworks team is critical and has a lasting impact on subsequent speed and quality of delivery. From the end of October 2020 to the end of December 2020, our team (C Team) took over a major payment gateway system with a nearly 20-year history from the client's team.

C Team officially took over the project’s daily operations and maintenance work in January 2021, becoming responsible for day-to-day operation of the system including 24/7 On-Call, and new feature development.

The entire takeover process, including the challenges we faced and the experiments we carried out, taught us a great deal. By sharing the practices we set up (incremental goal setting, a service takeover template, the C4 model and others), we hope to guide other teams through this process.

Challenges of this takeover

Before we start exploring what we learnt, it might be useful to provide a little more context about the situation our team faced. Given the complexity and long history of this payment gateway, we expected to face several challenges from the beginning.

Complex business domain and outdated engineering practices

In terms of business, the payment domain is always complex because of the many different functions it supports. The lack of clear business and technical documentation made the situation worse.

In terms of technology, there were more than 100 services and over 300 repositories in total. This led to a number of problems and challenges: severe coupling between services; many services with no pipeline, no testing environment or even no source code; problems that could only be solved by manually changing production databases; outdated operating systems and software package versions; and so on.

Aggressive timeline with large number of services

There were 100+ services involved, and the initial timeline to complete the transition was only 30 working days. The client team requested this because its members were about to leave for other projects.

Lack of hands-on experience

Ultimately, true understanding requires hands-on experience. Taking over a long-established internal system is tough: after all, the people with knowledge about it are the internal team. As outsiders, we not only faced a steep learning curve, we also had to build up practical experience as we went.

Remote collaboration

Remote collaboration also proved challenging, as we needed to maintain frequent communication with client team members located in Melbourne, three hours ahead of C Team in Xi’an, China. Different first languages also made communication between the teams harder.

Our practices

Incremental goal setting

How do we generally measure the maintenance quality of a legacy project?

Immediate goal: at the very least, we need to run the business as usual (BAU). This means that our team should have the required knowledge and skills to handle online incidents and daily business work after the departure of the client team.

Long-term goal: start to make small continuous improvements. This means the team has enough business and technical context to build an improvement plan and act on it to deliver greater value.

Based on this, our team divided the project into three stages. This incremental approach allowed us to build the plan and adjust based on feedback.

Figure 1: Three stages of handover, from inception to delivering value

1. Takeover period

Goals: acquire as much experience and knowledge as possible from the client's team.

Activities: reduce On-Call pressure; regularly present interim achievements to the client.

Focus: business knowledge, basic information about system and services, manual process, knowledge and skills based on experience.

2. Practice-by-doing period

Goals: build the team into a cohesive unit that is able to solve common problems.

Activities: start to handle on-call.

Focus: identifying the project focus; pair programming; resolving end-to-end business issues; learning from previous incidents, etc.

3. Continuous Improvement period

Goals: each team member is capable of dealing with on-call independently; deliver more value to the customer.

Activities: team members take turns to perform on-call, with exposure to a variety of problems.

Focus: team sharing; solving specific problems and spreading knowledge one-to-one; delivering value through continuous optimization.

Setting up a baseline through the service takeover template

Our team defined a service takeover template so that every service takeover covered the same details in a standard form, and so that useful information could be gathered quickly. The template covered all the necessary information for a service, such as core functions, code repository links and test coverage, as well as easily omitted content such as technical debts, pitfalls and whether there had been any online issues. It also served as a clear acceptance criterion for the takeover of each service.

Each service also had a separate page on Confluence with clear records and documentation that could be easily referenced by the team.

We ended up creating 109 documents recording the basic service information, which greatly helped in our subsequent maintenance work and allowed the client team’s developers to keep a permanent record. This served as a powerful input for future improvements.
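As a rough illustration of how such a template can double as an acceptance criterion, here is a minimal Python sketch. The field names follow the items listed above, but the class, its methods and the example service are hypothetical and are not the team's actual template. It assumes that `None` means "not yet filled in" while an empty list means "checked, nothing to record".

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ServiceTakeoverRecord:
    """One record per service. Field names are illustrative only."""
    name: str
    core_functions: Optional[list[str]] = None
    repository_links: Optional[list[str]] = None
    test_coverage_percent: Optional[float] = None
    technical_debts: Optional[list[str]] = None
    pitfalls: Optional[list[str]] = None
    past_online_issues: Optional[list[str]] = None

    def missing_fields(self) -> list[str]:
        """Names of the fields nobody has filled in yet (still None)."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def is_accepted(self) -> bool:
        """The takeover of a service counts as done once nothing is missing."""
        return not self.missing_fields()


record = ServiceTakeoverRecord(
    name="example-settlement-service",  # hypothetical service name
    core_functions=["settle daily transactions"],
    repository_links=[],  # checked: no source code could be found
    test_coverage_percent=0.0,
)
print(record.missing_fields())
print(record.is_accepted())
```

Distinguishing "unknown" from "confirmed empty" matters for legacy systems, where a missing repository or an absence of tests is itself a finding worth recording.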

Using the C4 model to clarify system architecture

Ultimately, the service template simply records information. It’s useless if it can’t be used to understand and solve the real problems: business issues.

Therefore, after taking over an independent service or a series of services, we would use the C4 model, drawing the two highest-level diagrams, C1 (system context) and C2 (container), to visualize the inputs, outputs and dependencies of each service. Experience showed that the drawing process itself helped the team better absorb fragmented knowledge. Images were also a more effective means of communication with clients given the language barrier.

Figure 2: System context diagram of the payment process system
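The dependency information behind such diagrams can also be kept as plain data. The sketch below is not a C4 diagram and is not taken from the project. It only shows, with hypothetical service names, how a map of "who calls whom" can be inverted to list both the upstream and the downstream of each service.

```python
from collections import defaultdict

# Hypothetical services: "A": ["B"] means service A calls (depends on) B.
calls = {
    "checkout-api": ["payment-router"],
    "payment-router": ["card-gateway", "fraud-check"],
    "card-gateway": [],
    "fraud-check": [],
}


def callers_of(call_map: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert the call map: for each known service, list the services that call it."""
    inverted = defaultdict(list)
    for caller, callees in call_map.items():
        for callee in callees:
            inverted[callee].append(caller)
    # Only services that appear as keys of call_map are reported.
    return {service: sorted(inverted[service]) for service in call_map}


callers = callers_of(calls)
for service in calls:
    print(f"{service}: called by {callers[service]}, calls {calls[service]}")
```

Keeping the map as data means a diagram can be checked against it whenever a newly taken-over service reveals another dependency.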

The remaining practices describe how we overcame the other challenges of the takeover.

Filling in gaps through internal discussion

Given our limited time and heavy workload, we adopted the “1 plus 1” model: pairing one of our team members with a member of the client's team, then selecting one service that the client team member was relatively familiar with to take over. In an ideal situation, seven services could be handed over in parallel each day, with the basic information collected according to the service takeover template to maximize knowledge gain.

However, this model also brought some problems:

Team members didn't know about the services that others had handed over and the relationship with their own services.
Some aspects would be missed when two people handed over a service.
During the service takeover process, it was still impossible for team members to get a comprehensive understanding of the entire system even with the upstream and downstream of the service listed.

To address these problems, we introduced daily internal discussions. Team members would take turns sharing what they’d learned while other members provided feedback, ensuring that services would be handed over more effectively. We received enough information for 3 hours of daily internal discussions. Due to limited time, we focused on a high-level understanding of the system instead of going down rabbit holes. Sharing knowledge in this way also reduced the risk of a single point of failure, and pairing further improved the availability of people who knew each service.

Visualizing takeover progress

It’s important to show progress during the transition period from different perspectives:

Weekly takeover plan, based on the number of services to be handed over and the duration of the takeover. We used it to show the weekly progress of each iteration and compare it with our initial projections. Risks and issues were also part of its scope.

Figure 3: Weekly takeover plan based on number of services to be handed over

Service architecture map. The team updated it daily to show progress, using green markers for completed services and gray markers for pending services. Piecing the map together visually even added a fun and motivating element to the otherwise tedious work of takeover.

Figure 4: The service architecture map
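The comparison between planned and actual progress in the weekly takeover plan is simple arithmetic. The sketch below assumes an evenly spread plan over six weeks (30 working days) and uses 109 services, the number of service documents mentioned earlier, as the total. The weekly actuals and both function names are made up for illustration.

```python
def weekly_plan(total_services: int, weeks: int) -> list[int]:
    """Cumulative services planned by the end of each week, spread evenly (rounded up)."""
    return [(total_services * week + weeks - 1) // weeks for week in range(1, weeks + 1)]


def progress_report(plan: list[int], actual: list[int]) -> list[str]:
    """Compare cumulative actuals with the plan for the weeks completed so far."""
    lines = []
    for week, done in enumerate(actual, start=1):
        planned = plan[week - 1]
        gap = done - planned
        status = "on track" if gap >= 0 else f"{-gap} behind"
        lines.append(f"week {week}: {done}/{planned} planned ({status})")
    return lines


TOTAL_SERVICES = 109  # assumption: one document per service
WEEKS = 6             # 30 working days at 5 days per week

plan = weekly_plan(TOTAL_SERVICES, WEEKS)
actual = [20, 35, 55]  # hypothetical cumulative counts after three weeks

for line in progress_report(plan, actual):
    print(line)
```

A real plan would rarely be linear, since large or poorly documented services take longer, but even a simple baseline makes a slipping week visible early enough to raise it as a risk.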

Minimizing risks through communication

It was incredibly important to synchronize the takeover issues and risks with the project’s main interface person, so we communicated with the client's Delivery Manager (DM) weekly. Our communication centered on the following:

Takeover progress: weekly updates on what had been taken over and what remained, based on the service list and burndown chart, to keep the client informed.
Obstacles: clarifying the issues encountered in the current stage, such as missing accounts and permissions, so that the DM could help us coordinate resources and eliminate obstacles.
Risks: emphasizing and documenting risks to the DM, such as the risk of delays. For high-risk issues, we invited our China Leadership Team and the client’s Senior Leadership Team to assist us.

We not only communicated frequently with the DM, but also established a good relationship with the client team’s main technicians and their L2/L3 Operations Support Team and Client Success.

Finally, we conducted retrospective sessions with the entire client team in each iteration, so that members of both teams could give feedback and share knowledge.

Increasing confidence through incident drills

Undoubtedly, 24/7 On-Call was a great challenge for the team. Our team felt stressed due to our lack of practical On-Call experience and of in-depth understanding of business implementation details. We found that rehearsing past incidents was an excellent learning tool to assess the impact of online incidents and learn to resolve them quickly.

The organizer would select a representative incident from past online failures for simulation, such as an incident integrating services with other gateways.
The team dedicated 2 hours to simulating an online incident, asking pertinent questions without relying on prior knowledge.
The team was divided into two groups, both of which identified problems and proposed potential solutions.
The organizer then reviewed and clarified the relevant knowledge points.

Adopting the above methods allowed us to quickly adapt to the rhythm of On-Call and allowed each team member to have first-hand experience as Primary On-Call.

Post takeover issues

Customize inception for legacy systems takeover

As we learned from the first stage of takeover, only some of the Inception activities that we generally apply to the launch of new projects were useful for the takeover of a complex legacy system.

Activities like “Hopes and Concerns” and RAIDs Logs were of great help, as they could effectively identify key problems at the beginning of the project so that we could carry out targeted management. They were useful tools in presenting our plan and scope for takeover to the client and let us adjust it based on feedback.

Activities like Trade-off Sliders, Elevator Pitch, Stakeholder Mapping and Empathy Maps provided little value in this context. It’s more important to have deep-dive sessions with the key stakeholders than with everyone.

End-to-end view during takeover

It’s important to define clear roles and responsibilities without relying on assumptions. A takeover is also a good opportunity to review how work is shared with other teams, such as the L2/L3 Operations Support Team, and to identify improvements. It was important for the team to have an end-to-end view of the entire process, including collaboration with other teams, so that team members could:

Understand the whole system
Build good communication and relationship with the support team
Have the opportunity to optimize and improve

Achievements

C Team’s achievements since taking over daily operations and maintenance in January 2021 speak for themselves:

Significant decrease in the number of incidents (down from 11 to 3 per month between February and April) while availability increased to 100%.

Figure 5: Incident count pre- and post-takeover

Capable of providing 24/7 support, not only handling our own incidents, but also supporting other teams. The scope varied from user configuration/onboarding issues to complex performance issues.

Reduction of Mean Time to Recovery to an average of 3 hours.

Significant uplift in knowledge management, including 109 basic service information documents and 30+ architecture/business diagrams.

Operation time improvement, up to 11 hours shorter for some business-critical manual operations.
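For readers less familiar with the metric, Mean Time to Recovery is the total time spent recovering from incidents divided by the number of incidents. The sketch below uses invented timestamps, not the project's incident data, and the function name is only illustrative.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (recovered - detected) over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((recovered - detected for detected, recovered in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents as (detected, recovered) pairs.
incidents = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 11, 0)),    # 2 hours
    (datetime(2021, 3, 9, 22, 30), datetime(2021, 3, 10, 2, 30)),  # 4 hours
    (datetime(2021, 3, 20, 14, 0), datetime(2021, 3, 20, 17, 0)),  # 3 hours
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 3:00:00
```

Whether the clock starts at detection or at the first customer impact changes the number, so the definition should be agreed with the client before the figure is reported.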

Grateful acknowledgement is made to Weibo Wang, Hao Gu, Shuo Gao, Mengyang Sun, Li Yan, Claire Boquet, May Ping Xu, Sichu Zhang and Kaifeng Zhang.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
