Modern systems are increasingly designed around event-driven architecture concepts. Essentially, you’ll have a bunch of microservices running in the cloud, interconnected via fast asynchronous streams of business events.
There are many benefits of such an approach: resiliency, elasticity, flexibility. And because of these benefits, such an approach is at the core of many of the digital native platforms which have become so successful in the last few years.
Recently we have also seen the emergence of the distributed data mesh concept, which aims to extend the microservices approach to the data space and complete the spectrum of distributed architectures.
But what if you do not have the luxury of building something from scratch? What if you have to deal with legacy monoliths which you can’t afford to re-engineer? What if your legacy is the big ball of mud that nobody dares to touch? Are you prevented from entering the microservices party forever? Are you excluded from data mesh? Maybe not, maybe there’s another way.
The ingredients of the recipe
Figure 1: The ingredients: Cloud, CDC, DevOps and Domain Experts
To move towards an event driven architecture, while continuing to run your legacy — and without having to change a line of code — might sound like a pipe dream, but it’s surprisingly simple. You just need a set of modern technologies, a set of modern practices and good old business domain expertise.
First, we need domain experts who know how the legacy systems work and, more specifically, know the data model of these systems.
Then we’re going to use the cloud, which gives us the elasticity required, and change data capture (CDC) tools that enable us to capture the roots of the events. Then we need “DevOps on steroids”: in essence, we need to automate every step, from build to deployment to rolling update, to be able to manage a complex dynamic infrastructure. We glue all of these ingredients together with a domain-specific language (DSL), which becomes the Esperanto for domain experts, InfraOps people, business users and also IT financial controllers.
Let's make the problem concrete: a fictional requirement
The-Retail-Company is, well …, a retail company. As such, they manage customer orders via their order management system (OMS).
Figure 2: The fictional requirement — improve customer service and accuracy for finance
Once a customer places an order, The-Retail-Company contacts their Providers to understand when the goods listed in the order can be made available at the central Warehouse of The-Retail-Company. Then it contacts the Delivery company that will eventually bring the package to the customer.
Based on the information received both from the Providers and the Delivery company, The-Retail-Company is able to estimate the delivery date for the order. This date is saved in the OMS database. It may happen that either the Providers or the Delivery company have to change their commitments, in which case they communicate these changes to The-Retail-Company which recalculates the delivery date of the order accordingly.
The customer can check the expected delivery date by calling the Call Center, which accesses the OMS system online to provide the most up-to-date information.
To improve the quality of the service, The-Retail-Company decides that in case of changes in the delivery date, it will issue a message in push mode so that the customer can be informed immediately. Here's the problem: how do we do this? Do we have to open up the old stratified OMS? Do we need to touch its code? This could easily become a nightmare.
To make things even worse, there’s an additional requirement coming from the Finance Controller. They want to monitor the turnover rate of the Warehouse twice a day and are therefore interested in knowing about any variation in the forecast stay of items in the Warehouse. That twice-a-day frequency for communicating with the ERP is a very different requirement from the one for the customers, who need to be notified as soon as possible.
It all starts with the domain experts and the CDC
The domain experts know the OMS data model and know which tables need to be monitored to intercept potential changes in the delivery date of an order. For the sake of simplicity, we assume these are just OrderTable and OrderItemTable.
Figure 3: Algorithm to calculate the Order Delivery Date, Tables and Fields required
The algorithm to calculate the Order Delivery Date is simple, and we can see that only a few of the fields are actually important for generating “Delivery Date Changed” events. If the ArrivalDate changes for some of the items of the order, then the delivery date may change. Similarly, if the delivery company communicates a change in the DeliveryTime, i.e. the time it will take for them to deliver the order from the warehouse to the customer, the delivery date of the order may change.
But changes in the PayStatus of the order or in the Shelf on which the items are stored can never generate a change in the delivery date.
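As a rough illustration, the rule described above can be sketched in Python. The exact formula lives in Figure 3, so this sketch makes two assumptions: the order can leave the warehouse only once its last item has arrived, and DeliveryTime is a whole number of days. The function name and the data shapes are hypothetical.

```python
from datetime import date, timedelta


def order_delivery_date(arrival_dates, delivery_time_days):
    """Estimate the delivery date of an order.

    Assumption: the order leaves the warehouse when its last item
    arrives, and DeliveryTime is expressed in whole days.
    """
    if not arrival_dates:
        raise ValueError("an order needs at least one item")
    return max(arrival_dates) + timedelta(days=delivery_time_days)


arrivals = [date(2024, 3, 1), date(2024, 3, 4), date(2024, 3, 9)]
print(order_delivery_date(arrivals, 2))  # 2024-03-11
```

Neither PayStatus nor Shelf appears in the signature, which mirrors the point that changes to those fields cannot move the delivery date.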
Again, for the sake of simplicity, we assume that the same tables and the same fields are the ones that need to be monitored to identify changes in the turnover of items in the warehouse and fire “Turnover Changed” events. The key point is that only some tables of the OMS data model, and some fields within these tables, are important for the generation of “Delivery Date Changed” and “Turnover Changed” events, which are the business events we are interested in. The domain experts have to point us to such tables and we need to put such tables under CDC monitoring.
The first Stage: data replica
Now that we know which tables to put under CDC, we can start implementing the first stage of our pipeline. The streams of events generated by the CDC are connected to some queues. These queues have consumers that just write those changes into a replica database once notified by the CDC.
In this way we’re able to maintain a real time replica of the legacy operational tables.
Figure 4: Stage 1: data replication
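A minimal sketch of a Stage 1 consumer might look like the following. It assumes a simplified, hypothetical CDC event shape (operation, table, key, changed columns), since real CDC tools each have their own format, and it uses an in-memory dictionary as a stand-in for the replica database.

```python
import queue

# Hypothetical CDC event shape: operation, table, primary key, changed columns.
cdc_queue = queue.Queue()
replica = {}  # stands in for the replica database: {(table, key): row}


def apply_change(event):
    """Apply one CDC event to the replica."""
    row_id = (event["table"], event["key"])
    if event["op"] == "delete":
        replica.pop(row_id, None)
    else:  # "insert" and "update" both upsert the changed columns
        replica.setdefault(row_id, {}).update(event["data"])


def drain(q):
    """Consume every event currently waiting on the queue, in order."""
    while True:
        try:
            event = q.get_nowait()
        except queue.Empty:
            break
        apply_change(event)


cdc_queue.put({"op": "insert", "table": "OrderTable", "key": 1,
               "data": {"DeliveryTime": 2, "PayStatus": "PAID"}})
cdc_queue.put({"op": "update", "table": "OrderTable", "key": 1,
               "data": {"DeliveryTime": 3}})
drain(cdc_queue)
print(replica[("OrderTable", 1)])  # {'DeliveryTime': 3, 'PayStatus': 'PAID'}
```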
This is a crucial step, but it is just the first. Too often, it becomes the only step. A simple replica of the legacy database isn’t going to solve the problem we have: the need to generate real business events.
The roots of the business events
Events generated by CDC are low-level, extremely granular data change events. But somehow this noise hides the roots of the business events we are interested in: “Delivery Date Changed” or “Turnover Changed”.
What we need to do is to work on this source to remove the noise and extract only the real business events we’re interested in.
The second Stage: filtering and grouping by business events
We have learnt that only some fields are relevant for the events we’re looking for. So the first thing we do is filter only these fields and ignore any change which affects other fields.
Figure 5: Filter relevant fields
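The field filter can be sketched as a small function. The mapping of fields to tables below is an assumption based on the example (ArrivalDate and Shelf on the order items, DeliveryTime and PayStatus on the order), and the event shape is hypothetical.

```python
# Fields the domain experts flagged as relevant (assumed table assignment).
RELEVANT_FIELDS = {
    "OrderItemTable": {"ArrivalDate"},
    "OrderTable": {"DeliveryTime"},
}


def filter_relevant(event):
    """Return the event reduced to its relevant fields, or None if nothing is left."""
    wanted = RELEVANT_FIELDS.get(event["table"], set())
    kept = {name: value for name, value in event["data"].items() if name in wanted}
    if not kept:
        return None
    return {**event, "data": kept}


events = [
    {"table": "OrderTable", "key": 1, "data": {"PayStatus": "PAID"}},
    {"table": "OrderItemTable", "key": 7,
     "data": {"ArrivalDate": "2024-03-09", "Shelf": "B4"}},
]
relevant = [e for e in map(filter_relevant, events) if e is not None]
print(len(relevant))  # 1
```

The PayStatus update is dropped entirely, and the Shelf change is stripped from the event that survives.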
It may be quickly apparent that many CDC events actually represent just one single business event. For instance, if a Provider communicates new ArrivalDates for many items in one batch, this would turn into a batch of updates in the rows in the OrderItemTable. But all these updates represent just one single “Delivery Date Changed” event. So we should apply a second rule to group the CDC events that represent just one event.
Figure 6: Group by business event
On a stream of events, grouping is possible only if we introduce “time-window” buffering. So the second stage is where we filter and group the CDC messages over a time window, so that only the real seeds of our business events are extracted.
Figure 7: Stage 2: filtering and grouping
The third Stage: running the business logic
What comes out of Stage 2 is an indication of a possible business event, but this indication still needs to be verified with specific business logic. For instance, looking at the example above, the ArrivalDate of the second item can be brought forward with no impact on the delivery date of the order, since the third item would still arrive later and would therefore determine the actual delivery date of the order.
This means that when we receive an event out of Stage 2, we need to execute some business logic to recognize whether we’re facing a real business event or not.
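The verification step can be sketched as a comparison between the delivery date before and after the change. The sketch assumes the delivery date is the last item arrival plus a delivery time in days, and that the previous state of the order is available, for example from the replica database. All names are hypothetical.

```python
from datetime import date, timedelta


def delivery_date(order):
    # Assumed rule: last item arrival plus the carrier's delivery time in days.
    return max(order["arrival_dates"]) + timedelta(days=order["delivery_time_days"])


def verify_candidate(before, after):
    """Return a "Delivery Date Changed" event if the date really moved, else None."""
    old, new = delivery_date(before), delivery_date(after)
    if old == new:
        return None  # noise: the CDC change had no business impact
    return {"type": "DeliveryDateChanged", "order_id": after["order_id"],
            "old": old, "new": new}


before = {"order_id": 42, "delivery_time_days": 2,
          "arrival_dates": [date(2024, 3, 1), date(2024, 3, 4), date(2024, 3, 9)]}
# Second item arrives earlier: the third item still determines the date.
earlier = {**before, "arrival_dates": [date(2024, 3, 1), date(2024, 3, 2), date(2024, 3, 9)]}
# Third item is delayed: the delivery date really moves.
delayed = {**before, "arrival_dates": [date(2024, 3, 1), date(2024, 3, 4), date(2024, 3, 12)]}

print(verify_candidate(before, earlier))         # None
print(verify_candidate(before, delayed)["new"])  # 2024-03-14
```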
In case the business logic tells us we are facing a real business event, such an event can be made available for consumers on a queue. At the same time, we can use the event and the data read by the business logic to update a view on the Orders, simplified with respect to the OMS data model for Orders, which is optimized for those consumers external to OMS. This can be seen as a step towards building a data product for orders, in line with the data mesh principles.
Figure 8: Stage 3: create the business event
Running some business logic (maybe accessing the data stored in the replica database) will be necessary, since we’re dealing with legacy systems, which we can’t afford to change. Modern event-oriented systems don’t need a similar step, since they generate fully fledged business events at the source. But in the case of legacy systems this is often not so, and this is why we need to run some business logic and have a replica database from which to read the data needed to build a real business event.
Compose stages to create new pipelines
We also have a second requirement. We need to generate “Turnover Changed” events when the forecast stay of an item in the warehouse changes. The tables and the fields that we have to monitor to identify the “Turnover Changed” events are the same as in the case of “Delivery Date Changed” events. The difference is the business logic that we apply in Stage 3.
The modular nature of the pipeline allows us to easily plug in a new Stage 3 component, integrated into the pipeline via its own queues.
Figure 9: Components as plug-ins
With the same mechanism we can integrate different components in all stages of the pipeline.
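The plug-in idea can be illustrated with a small registry. This is an in-process simplification: in the pipeline described here each component would be wired in through its own queues. The decorator, the function names and the candidate fields are all hypothetical.

```python
STAGE3_PLUGINS = {}


def stage3(name):
    """Register a function as a Stage 3 business-logic plug-in."""
    def register(func):
        STAGE3_PLUGINS[name] = func
        return func
    return register


@stage3("DeliveryDateChanged")
def check_delivery_date(candidate):
    return candidate["old_delivery"] != candidate["new_delivery"]


@stage3("TurnoverChanged")
def check_turnover(candidate):
    return candidate["old_stay_days"] != candidate["new_stay_days"]


def run_stage3(candidate):
    """Return the names of the business events confirmed for a candidate."""
    return [name for name, check in STAGE3_PLUGINS.items() if check(candidate)]


# An item arrives earlier: the delivery date is unchanged, the warehouse stay is not.
candidate = {"old_delivery": "2024-03-11", "new_delivery": "2024-03-11",
             "old_stay_days": 5, "new_stay_days": 7}
print(run_stage3(candidate))  # ['TurnoverChanged']
```

Both plug-ins receive the same Stage 2 candidate, which reflects the fact that the two requirements share tables and fields and differ only in business logic.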
Building business events with the business logic
We said that Stage 3 runs the business logic to finally build the business event. But what is this business logic? Well, it can actually be different things depending on the context, but it eventually boils down to two main families of implementations: libraries of functions and implementations in the language of the replica database.
Business logic can be expressed in the form of functions which implement the required algorithms and access, when required, the data in the replica database. Such functions are grouped in libraries and identified with specific naming conventions.
Sometimes the business logic is simple enough to be expressed directly in the language of the replica database, perhaps standard SQL or the specific query language supported by NoSQL databases.
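As an example of logic expressed directly in the replica's query language, the sketch below finds the last arrival date of an order with a single SQL query. It uses an in-memory SQLite database as a stand-in for the replica, and the column names `OrderId` and `ItemId` are assumptions. Dates are stored as ISO strings so that `MAX` orders them correctly.

```python
import sqlite3

# In-memory stand-in for the replica database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OrderItemTable (OrderId INTEGER, ItemId INTEGER, ArrivalDate TEXT)"
)
conn.executemany(
    "INSERT INTO OrderItemTable VALUES (?, ?, ?)",
    [(42, 1, "2024-03-01"), (42, 2, "2024-03-04"), (42, 3, "2024-03-09")],
)


def last_arrival(order_id):
    """Return the latest ArrivalDate among the items of an order, or None."""
    row = conn.execute(
        "SELECT MAX(ArrivalDate) FROM OrderItemTable WHERE OrderId = ?",
        (order_id,),
    ).fetchone()
    return row[0]


print(last_arrival(42))  # 2024-03-09
```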
Pipelines as configuration
If we look carefully, we may notice that the components of the pipelines and the relationships among them can be described in terms of relatively few elements: the tables and the fields impacted; the time window to use; the filtering and grouping rules; the business logic to apply; the sequence of the components in the pipeline.
This means that we can define a configuration protocol to describe the system of pipelines that we need to build. This is the first building block of our DSL (domain-specific language) and a key concept of the entire solution as we will see down the line.
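To give a feel for what such a configuration might contain, here is a sketch using Python dataclasses. The field names and values are illustrative assumptions, not the actual grammar of the DSL.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineConfig:
    name: str
    monitored_fields: dict   # table name -> list of field names
    window_seconds: int      # time window used for grouping
    group_by: str            # field used to group CDC events
    business_logic: str      # name of the Stage 3 function to run
    stages: tuple = ("replicate", "filter_and_group", "business_logic")

    def validate(self):
        if self.window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        if not self.monitored_fields:
            raise ValueError("at least one table must be monitored")
        return self


delivery = PipelineConfig(
    name="delivery-date-changed",
    monitored_fields={"OrderTable": ["DeliveryTime"],
                      "OrderItemTable": ["ArrivalDate"]},
    window_seconds=5,
    group_by="OrderId",
    business_logic="check_delivery_date",
).validate()

# Same tables and fields, different window and business logic.
turnover = replace(delivery, name="turnover-changed",
                   window_seconds=12 * 60 * 60,
                   business_logic="check_turnover").validate()
print(turnover.window_seconds)  # 43200
```

The second pipeline is derived from the first by overriding three values, which is the kind of reuse a declarative description makes cheap.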
An elastic infrastructure
All the modules of the pipelines are parametric software components.
In a containerized world, each stage could be implemented as a Docker container. For each pipeline, such containers are instantiated with the parameters required by the specific pipeline logic.
Figure 10: A containerized implementation
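One common way to parameterize a container is through environment variables. The sketch below shows how a generic stage component might read its parameters at start-up. The variable names are hypothetical and not part of any specific platform.

```python
import os


def load_stage_params(env=os.environ):
    """Read the parameters of one stage instance from environment variables."""
    return {
        "pipeline": env["PIPELINE_NAME"],
        "input_queue": env["INPUT_QUEUE"],
        "output_queue": env["OUTPUT_QUEUE"],
        # Optional parameter with a default, converted to an integer.
        "window_seconds": int(env.get("WINDOW_SECONDS", "5")),
    }


# The same image can serve different pipelines by changing only the environment.
params = load_stage_params({
    "PIPELINE_NAME": "delivery-date-changed",
    "INPUT_QUEUE": "cdc.orders",
    "OUTPUT_QUEUE": "candidates.delivery",
})
print(params["window_seconds"])  # 5
```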
These components need an infrastructure to execute. Our pipelines may have very different non-functional requirements (NFRs) and SLAs. One pipeline, for instance, may need to publish its business events in real time; another may have more relaxed requirements. A pipeline may have to process massive loads in certain time windows, e.g. during the night because of batch processing, and have a much lower load during daytime. The cloud therefore is the ideal place to run such infrastructure. The cloud, if properly managed, can guarantee the flexibility and elasticity required to meet the SLAs with optimal efficiency in the use of resources.
By clustering containers and queues on the cloud, we can assign the resources required by each pipeline and we can dynamically change such assignments to best respond to different situations, e.g. load profiles and SLAs varying over time, considering also the cost constraints for the cloud.
Figure 11: Clusters and elasticity
We can configure the resources initially assigned to the clusters as well as their thresholds. Similarly, via configurations we can specify SLA targets and cost profiles as constraints. This configuration grammar is the second building block of our DSL.
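A toy version of such a resource configuration, together with a scaling rule driven by it, could look like this. The thresholds, the field names and the use of queue backlog as the load signal are assumptions made for illustration. A real setup would delegate this to the autoscaling features of the container platform.

```python
from dataclasses import dataclass


@dataclass
class ClusterConfig:
    min_instances: int
    max_instances: int        # acts as the cost ceiling in this sketch
    scale_up_backlog: int     # queued messages per instance that trigger scale-up
    scale_down_backlog: int   # queued messages per instance that allow scale-down


def desired_instances(cfg, current, backlog):
    """Return the instance count for the observed queue backlog."""
    per_instance = backlog / max(current, 1)
    if per_instance > cfg.scale_up_backlog:
        target = current + 1
    elif per_instance < cfg.scale_down_backlog:
        target = current - 1
    else:
        target = current
    # Never leave the configured bounds, whatever the load says.
    return max(cfg.min_instances, min(cfg.max_instances, target))


cfg = ClusterConfig(min_instances=1, max_instances=4,
                    scale_up_backlog=1000, scale_down_backlog=100)
print(desired_instances(cfg, current=2, backlog=5000))   # 3
print(desired_instances(cfg, current=4, backlog=10000))  # 4
```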
A DSL for business events
Let’s recap the key points.
The business expresses the functional requirements: which business events they want to generate, how they should look, who should consume them. The business also specifies the desired SLAs (non-functional requirements). Budgets dictate the cloud cost constraints that need to be respected.
Domain experts from the application teams define, via configuration, a series of pipelines that generate the required high-level business events out of CDC streams. These pipelines are the implementation of the functional requirements expressed by the business and constitute their executable documentation.
Infra Ops experts define, via configuration, the shape of the infrastructure required to run such pipelines, including the flexibility we want each component to have and the cost constraints the entire solution must respect at run time. This is the implementation of the non-functional requirements (SLAs and cost profiles) expressed again by the business and by the controllers.
Figure 12: A DSL for Business Events
These configurations therefore constitute a language which describes the entirety of the requirements that are coming from the business. This language is what we call the DSL for “business events to be generated via CDC streams”.
In other words we have translated the complete set of requirements in a configuration governed by a formalized language. This configuration removes any ambiguity, is totally transparent and can be inspected at any time to understand what the system is supposed to do. It can also be versioned together with the rest of the code.
Everything as code — DevOps on steroids
If the entire system can be defined with the DSL, then it can be automatically built, automatically tested, automatically launched and dynamically adjusted according to the DSL configuration specification. All these automations are implemented by the DevOps tools which are an integral part of the platform.
Figure 13: Everything as Code — DevOps on steroids
When adopting such a model in real business scenarios, the complexity can quickly grow. We can easily have to deal with hundreds of instances of components running in parallel. The DevOps tooling becomes the cockpit through which we can control this complex car. And once the car is well controlled, the returns it gives are impressive.
From a business standpoint we’re able to free up value from legacy systems without having to enter their internals and touch their stratified logic.
From a technology standpoint, we have an elastic system able to extract the most value from cloud technologies.
An organizational model as per Inverse Conway’s Maneuver
So far, we have built a data infrastructure platform which application teams can use to create business events and, to a certain extent, data products via the platform DSL.
The platform offers plugins for business logic modules, services to define the composition of data pipelines, services to control dynamically the resources assigned and the costs, services to monitor the execution of the entire system.
Figure 14: Recommended organization
The platform does not own any application logic but offers services to application teams to implement the full set of requirements coming from the business, both functional and non-functional.
Conway’s law tells us that the design of a system reflects the structure of the organization that builds it. The Inverse Conway Maneuver is a technique that adopts an organizational structure reflecting the desired architecture. Following this technique, if you want to adopt the approach outlined so far, we recommend having a central, cross-cutting team responsible for the data-infrastructure DevOps platform. The services are then used in a self-serve mode by the application teams leveraging the DSL of the platform and its APIs.
The application teams are the only ones responsible for defining the desired business events and data products. The platform team offers support but stays away from business requirements.
We started with a legacy monolith unable to integrate in the highly distributed world of microservices and data products and we ended with a data infrastructure platform that enables this integration to happen, with no need to touch the stratified, barely known, barely tested code base of the monolith. We just use the platform services and its DSL.
To reach this goal, we use consolidated technologies such as CDC and leverage the cloud extensively, using DevOps as the glue.
We still need domain expertise, but the platform offers self-serve capabilities to the domain experts and promotes a clear separation of responsibilities among teams. This is a way to unlock new business value out of the legacy world, with limited cost and limited risk, and provides an entry door to the new space of distributed systems to all who have to deal with the burden of 20, 30, or even 40 years of stratified logic.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.