Today many companies rely on web analytics data to make the right decisions and drive their business to a better place. When we took ownership of one of our client's systems, which provided such analytics data, we found a case full of architectural complexity. This was a consequence of good and not-so-good decisions, and it took many hours of investigation, analysis, discussion and implementation to get things back onto the right path.
Our client operates a digital marketplace which connects more than 30 million potential buyers with sellers, both professional and private, every month. More than 40,000 professional sellers had partnered with our client at that time, and they were the main users of the system we started to work on. Understandably, a web analytics platform is a really useful tool for these professional sellers, who can see how their businesses are performing online and make decisions for the future.
The purpose of the system we had just inherited was to provide sellers with fresh data about views of their listed articles in searches performed by consumers, visits to their listings' pages, leads sent to them via the marketplace, etc. The ability to show near real-time data was one of the main architectural characteristics of the system when it was designed. That decision had already been made when we took over ownership of the system, and we never properly discussed the business context that led to it. In retrospect, this requirement seems to have been more a question of technical prestige than a real need of the users of the analytics platform.
Our problem with real-time data
What was very real about this requirement was the impact it had on many aspects of the overall system architecture. The original creators of the system had built a very complex solution to continuously update performance statistics. First of all, different sources sent events, and a single entry point captured them. This entry point was an AWS Lambda function which performed some clean-up on the data and passed it on to the next processing node in an ad hoc data pipeline.
Here is where things became tricky: the system had to match each event with the object being sold and with its seller, so that it could calculate statistics aggregated by seller downstream. Unfortunately, such information was not available from every data source. Even the sources which had it faced the problem that the information might not be available when the event was generated (sometimes it became available later, after the event had been sent). As a result, many events were orphaned, and the pipeline could not process them correctly. In those cases, they were stored for later processing, something which happened on a scheduled basis, several times a day.
The process which tried to match orphaned events with sellers was very brittle, and numerous events were not processed at all. When statistics were checked, the totals for a particular item and the aggregate value per seller were always lower than the statistics available in other platforms (like Google Analytics). The differences could be substantial (up to 30% on a bad day), and this created a lot of uncertainty about how reliable the system was.
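To make the matching problem more concrete, here is a minimal sketch of the idea in Python. It is not the actual pipeline code: the `Event` fields, the `seller_by_listing` lookup and the function names are all hypothetical, chosen only to illustrate how an event without seller information ends up parked as an orphan and retried later.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real events are not described in detail.
    event_type: str                  # e.g. "search_view", "listing_visit", "lead"
    listing_id: str
    seller_id: Optional[str] = None  # missing when the source does not know the seller


def enrich(events, seller_by_listing):
    """Split events into (matched, orphaned) using a listing -> seller lookup."""
    matched, orphaned = [], []
    for event in events:
        seller_id = event.seller_id or seller_by_listing.get(event.listing_id)
        if seller_id is None:
            orphaned.append(event)   # parked for a later, scheduled retry
        else:
            matched.append(Event(event.event_type, event.listing_id, seller_id))
    return matched, orphaned


def retry_orphans(orphaned, seller_by_listing):
    """Scheduled job: retry parked events once more lookup data has arrived."""
    return enrich(orphaned, seller_by_listing)
```

Every event that stays in the orphaned list after the last retry is an event missing from the seller's statistics, which is how a brittle retry step can translate directly into undercounted totals.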
Getting into the depth of the problem
What we found when we took ownership of the system was that it had been designed to primarily meet one specific requirement, providing real-time data, but ended up providing figures which were unreliable.
After many interesting conversations in which we asked “why” over and over, we realized the real-time capability was not particularly useful to the users of the system, even though there was an explanation for it. The original builders of the system created it as an improved version which would replace the original statistics collection system, which ran daily. The old system worked by crunching the data available in a relational database to provide the statistics which were relevant to sellers. When they proposed a new version to replace the old, legacy system, they articulated the value of the new one around the possibility of offering near real-time statistics.
However, users looked at the data provided by the system daily, mostly to check historical data and identify trends. They cared about accuracy and consistency in the data and were not much bothered about getting real-time updates of it. After exploring the problem space, our team found that the system might have been architected to satisfy the wrong requirement or, better said, it was prioritizing the requirements in the wrong order.
Now that accuracy was the requirement at the top of the list, the team clarified what the expected level of accuracy (or tolerance to errors) was, and started to build tooling which provided observability on that metric in a reliable and visual way. Everyone in the team could see at any time the current accuracy the system was providing and had a clear idea of whether the design choices the team made were having a positive or negative impact on what really mattered. If at some point during the processing of events there was any leakage causing a drop in accuracy, the measurements we had put in place, defined as an SLO, clearly showed it, and we could start working on a solution.
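As an illustration of what such a measurement could look like, the sketch below computes accuracy as the relative agreement between the count our system produced and a trusted reference count, and checks it against a 99% target. The formula, the function names and the choice of reference are assumptions made for this example; the article's actual tooling is not shown here.

```python
def accuracy(processed_count: int, reference_count: int) -> float:
    """Relative agreement between our count and a trusted reference count.

    Assumed definition: 1 - |processed - reference| / reference, floored at 0.
    """
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    error = abs(processed_count - reference_count) / reference_count
    return max(0.0, 1.0 - error)


def meets_slo(processed_count: int, reference_count: int, target: float = 0.99) -> bool:
    """True when the measured accuracy reaches the SLO target (99% by default)."""
    return accuracy(processed_count, reference_count) >= target
```

With this definition, a day on which 30% of the events are lost yields an accuracy of about 0.7, far below the target, which is exactly the kind of drop a dashboard or alert built on top of the SLO should make visible.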
In parallel, once the list of cross-functional requirements had been clarified and prioritized according to the needs of the users, we started ideating possible solutions to meet them. For this, we organized several design exercises, similar to architecture katas, where the team worked in pairs to come up with alternatives that would fit the requirements. We met regularly to compare the different options and collectively decided what the new architecture should look like. We also ran multiple deep-dive sessions with the few developers from the team which had created the system who were still around. The goal of these sessions was to understand the failure modes they had identified in the system and the parts they considered to be causing the majority of the problems and therefore good candidates for change. This was particularly important because we wanted to leverage as much as possible the knowledge the organization had acquired through several years of operating the system and avoid designing a new system which would fall into the same pitfalls as the current one.
After building the needed observability on the system, the time to implement the changes arrived. As agreed with the stakeholders, we set out to remove the existing real-time flow. We first proposed a proof of concept to show that an alternative processing flow which did not consume events in real time would fix the accuracy problems. We built it and ran it in parallel with the real-time flow, so we could see the differences. After validating our hypothesis, we built a fully fledged processing system and finally replaced the buggy real-time flow with a new and reliable one. Thanks to the observability tooling we had built, we carried out the process with the confidence that we were taking the right steps towards the final goal of 99% accuracy in our data.
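Running two flows side by side is only useful if their outputs can be compared easily. The following sketch shows one simple way to do that, assuming each flow produces a dictionary of totals per seller; the data shape and the `compare_flows` name are hypothetical and only meant to illustrate the comparison.

```python
def compare_flows(realtime_totals, batch_totals):
    """Return {seller_id: batch - realtime} for sellers where the flows disagree."""
    sellers = set(realtime_totals) | set(batch_totals)
    diffs = {}
    for seller in sorted(sellers):
        # A seller missing from one flow counts as zero events in that flow.
        delta = batch_totals.get(seller, 0) - realtime_totals.get(seller, 0)
        if delta != 0:
            diffs[seller] = delta
    return diffs
```

A positive difference means the non-real-time flow counted events that the real-time flow had dropped, which is the pattern one would expect to see if orphaned events were the main source of inaccuracy.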
Architectural characteristics (or cross-functional requirements) are one of the most important points to clarify when deciding on the architecture of any software system. Understanding the ones which your system must support is as important as designing intelligent solutions with cool technology.
Whenever a new cross-functional requirement comes in (such as the need for real-time updates in the system), the design of the system will be affected (in many cases, in a substantial way). It is hard to excel at all the requirements which are critical for a system. A better solution for one of them will impact the others, so, most of the time, you will have to evaluate the trade-offs and make compromises. Also, don't let a new requirement get all your attention just because of its novelty (or appeal). It is always good to think about how its importance compares to the ones that have already been identified (in our case, maybe having data updated more frequently was important, but not at the expense of losing accuracy). In the book “Software Architecture: The Hard Parts”, there’s a very interesting sentence saying:
"Don’t try to find the best design in software architecture, instead, strive for the least worst combination of trade-offs"
For us, having a very clear definition of the requirements was invaluable. Try to link that definition to concrete numbers. For example, we did not just want accurate data; we wanted an accuracy of 99% in our data. With monitors in place to give us visibility of those numbers at any time, we could see how they evolved. When things started going south, we could take immediate action.
Fun fact: some months after our intervention to remediate the problems with the system, the client asked us to start thinking about how it would be possible to provide fresher data to the users. With the understanding we had acquired about what was important for the system and how to measure it, the task looked less daunting. But that's maybe a story for another time.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.