Tackling the challenges of using event-driven architecture in a billing system

Hongxing Chen

Published: January 12, 2024

Event-driven architecture (EDA) is a powerful software design pattern in modern application development. In it, a system is built around events (as the name suggests) — these refer to significant changes in a system's state. Various components in an application (such as services, processors and applications) can subscribe to events they're interested in; when those events occur, the components are notified and can respond accordingly.

EDA has a wide range of applications, from real-time analytics to e-commerce platforms. In my current project, we've adopted EDA to build a billing system where a high level of accuracy is paramount. However, we've encountered several challenges, particularly from the perspective of our event consumers. In this blog post, I'll discuss some of these challenges and how we addressed them.

The business context for event-driven architecture

Let’s first consider the context in which we might use EDA. In a traditional billing system, charge price is perhaps the most important element. The charge price that clients ultimately pay is influenced by a number of different subscription operations, such as subscription creation, product upgrades and changes to subscription discounts. The billing system will generate and deliver bills for active subscriptions every day based on the subscription events that are in place.

We adopted EDA to build a new subscription and billing service separate from the legacy system. The legacy system was extremely outdated and incapable of supporting new business activities. It was struggling to meet the demands of the business and was therefore restricting its potential growth.

Inside the architecture

So, we’ve now set the business context in which we’re using EDA. Now let’s take a look at the architecture. It’s worth highlighting the following components to give a better illustration of EDA:

Subscription Service is the producer of subscription events. It receives data changes from the legacy CRM system and produces new subscription events after comparing and identifying data changes. The event payload includes the event ID, start time, creation timestamp, event type (such as subscription start, update, cancellation) and additional information such as discounts.

Event message broker is the component that routes events from the producer to consumers. We use AWS SNS (Simple Notification Service) as the broker.

Subscription Event Handler is the consumer of subscription events, using AWS Lambda to subscribe to the SNS topic for subscription events. Upon receipt of a subscription event, the event handler applies business rules to calculate the price and effective date of the new subscription charge and the end date of the previous subscription charge. It then saves the information to the billing core service.

Billing Core Service is the final recipient of subscription charges. It records all relevant charges, then generates and delivers bills based on the charges at fixed times each day.
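To make the event payload produced by the Subscription Service more concrete, here is a minimal sketch of how it could be modelled. The field names, the enum values and the `extra` dictionary are assumptions made for illustration; the article only lists which pieces of information the payload carries, not the actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any


class EventType(Enum):
    # Hypothetical names for the event types mentioned in the text.
    SUBSCRIPTION_START = "subscription_start"
    SUBSCRIPTION_UPDATE = "subscription_update"
    SUBSCRIPTION_CANCELLATION = "subscription_cancellation"


@dataclass(frozen=True)
class SubscriptionEvent:
    event_id: str            # unique ID of the event
    event_type: EventType    # start, update or cancellation
    start_time: datetime     # start time carried by the payload
    created_at: datetime     # timestamp at which the event was created
    # Additional information such as discounts (assumed free-form here).
    extra: dict[str, Any] = field(default_factory=dict)


event = SubscriptionEvent(
    event_id="evt-1",
    event_type=EventType.SUBSCRIPTION_START,
    start_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
    created_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    extra={"discount": "10%"},
)
```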

Event-driven architecture challenges — and how they can be solved

As with any architecture pattern, EDA poses some challenges. Fortunately, there are ways of overcoming them, as we found when using it on our subscription and billing project.

Message idempotency

There are two delivery guarantees that can be used to make message delivery reliable in event-driven architectures:

At-least-once delivery
Exactly-once delivery

Exactly-once delivery requires an event producer to deliver each message only once and maintain distributed transactional consistency. While AWS SNS usually only delivers each message once, duplicate messages may occur on the subscriber side due to changing network conditions. Consumers must therefore consider how to guarantee message idempotency in their designs.

The two common options are:

Ensuring data consistency through repeated processing: this requires business components to produce the same result every time they process the same event input.

Discarding processed events: each event is assigned a unique ID and recorded by the system, so that duplicate events can be detected and discarded.

The subscription event handler is a stateful service with side effects in the billing system context, as it must end the previous subscription charge upon receipt of a new event. As a result, the event handler relies on the subscription charge status of the billing core service for charging. The first option often results in inconsistent processing, because the same input processed against different subscription charge states of the billing core service produces different outputs. We chose the second option, discarding processed events, and took into account additional factors such as out-of-order messages and business instability before implementing it.
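The "discarding processed events" option can be sketched as follows. This is a minimal illustration rather than our actual handler: the class name is hypothetical, and the in-memory set stands in for a durable store, which a real consumer would need so that the record of processed event IDs survives restarts.

```python
from typing import Callable


class IdempotentHandler:
    """Processes each event ID at most once and discards duplicates."""

    def __init__(self, process: Callable[[dict], None]) -> None:
        self._process = process
        # Assumption: an in-memory set replaces a durable store of event IDs.
        self._seen: set[str] = set()

    def handle(self, event_id: str, payload: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        if event_id in self._seen:
            return False  # already processed, so discard the duplicate
        self._process(payload)
        # Record the ID only after processing succeeded, so that a failed
        # attempt can still be retried later.
        self._seen.add(event_id)
        return True


processed: list[dict] = []
handler = IdempotentHandler(processed.append)
first = handler.handle("evt-1", {"price": 10})
second = handler.handle("evt-1", {"price": 10})  # duplicate delivery
```

In a real system the side effect and the recording of the event ID should happen atomically (for example in the same database transaction); otherwise a crash between the two steps could still lead to double processing.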

Event processing order

Generally, the event broker will try to deliver events in chronological order. However, network problems can cause event consumers to receive events out of order. This means event consumers must also consider handling events out of order.

Whether you should ensure strict order between events will ultimately be determined by business requirements. In the billing context, since the previous subscription needs to be ended when the latest subscription is received and the difference in price between two subscriptions affects the processing result, a strict order of events is necessary. This means that events should be processed in sequential order for the same user.

So, to handle the order of events in our EDA implementation, we took the following measures:

In the event schema, each event carries a version number, or the ID of the previous event for the same user, so that the consumer can determine the order.
The consumer always processes events in that order: it discards events that arrive out of order and waits for the event that matches the expected position to arrive.

In the case of two events of the same user being out of order, we rely on AWS Lambda's asynchronous retry mechanism, which supports two retries. The handler throws an exception if it is asked to process event two before event one has arrived. Lambda then retries event two after about one minute. If event one is processed successfully during this period, event two can be consumed normally on the retry.
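The ordering check described above can be sketched like this. It assumes, purely for illustration, that each user's events carry an integer version starting at 1 and increasing by 1; the class and exception names are hypothetical. Raising an exception is what lets the platform's retry mechanism redeliver the event later.

```python
class OutOfOrderEvent(Exception):
    """Raised when an event arrives before its predecessor."""


class OrderedConsumer:
    def __init__(self) -> None:
        # Last applied version per user (assumed to live in a durable store).
        self._last_version: dict[str, int] = {}
        self.applied: list[tuple[str, int]] = []

    def handle(self, user_id: str, version: int) -> bool:
        expected = self._last_version.get(user_id, 0) + 1
        if version < expected:
            return False  # stale or duplicate event, so discard it
        if version > expected:
            # The predecessor has not arrived yet: fail so the event is retried.
            raise OutOfOrderEvent(
                f"user {user_id}: got version {version}, expected {expected}"
            )
        self.applied.append((user_id, version))
        self._last_version[user_id] = version
        return True


consumer = OrderedConsumer()
try:
    consumer.handle("user-1", 2)   # event two arrives first
    raised = False
except OutOfOrderEvent:
    raised = True
consumer.handle("user-1", 1)       # event one arrives
consumer.handle("user-1", 2)       # the retry of event two now succeeds
```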

Let's explore more challenges and how to overcome them.

Handling late events

When the event producer behaves abnormally or an event occurs at a boundary time, events can arrive late. For example, if the CRM (Customer Relationship Management) system is used to cancel a subscription at 23:55, the subscription service will experience a delay in receiving the cancellation event, and the billing service may only receive it at 00:05 the following day. This obviously affects the accuracy of the data for a billing system, which is why it's important to think about how to handle late events.

To do this in our project, we adopted a strategy of adding an ‘event tolerance period’. Doing this ensures that late events can still be processed correctly as long as they fall within a certain tolerated range.

Again, it’s important to return to business requirements when making decisions like this. For us, the billing system needs to send out invoices periodically every day, which means the billing system's data processing window is twenty-four hours. Taking into account the business's requirements for the billing time, we delayed the time window by three hours, processing the events of the past twenty-four hours (03:00 the day before to 03:00 the current day) at 03:00 every day. When a user cancels a subscription at 23:55, the event can be received and processed correctly by the system at 03:00. By delaying the time window by three hours, a three-hour tolerance period is obtained within the acceptable scope of business tolerance.
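The shifted processing window can be expressed as a small calculation. The sketch below is an illustration under a few assumptions: the function names are hypothetical, timestamps are naive local times (time zone handling is omitted) and the window is treated as half-open, so an event received at exactly 03:00 belongs to the next run.

```python
from datetime import date, datetime, time, timedelta

TOLERANCE = timedelta(hours=3)     # the three-hour tolerance period
WINDOW = timedelta(hours=24)       # the daily processing window


def billing_window(run_date: date) -> tuple[datetime, datetime]:
    """Return the [start, end) window processed by the run on run_date."""
    end = datetime.combine(run_date, time(0, 0)) + TOLERANCE
    return end - WINDOW, end


def in_window(received_at: datetime, run_date: date) -> bool:
    start, end = billing_window(run_date)
    return start <= received_at < end


# A cancellation made at 23:55 on 1 January but received at 00:05 on
# 2 January is still picked up by the 03:00 run on 2 January.
late_event_received = datetime(2024, 1, 2, 0, 5)
handled = in_window(late_event_received, date(2024, 1, 2))
```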

Republishing events

When new systems are created or legacy systems migrated, there can sometimes be a discrepancy between the progress of event producers and consumers. This means that our architecture design also needs to consider the way replayed events will be processed from the perspective of the event consumer. For example, if the billing service is not yet able to consume events for a new product, but the subscription service already produces them, the event producer must resend those events once the billing service has implemented the consumption capabilities it needs.

Republishing raises two problems, which we addressed as follows:

Not all consumers need to consume republished events, so we leverage the message attribute feature of AWS SNS to add a replay attribute to republished events. Consumers can then choose whether or not to receive these events based on their needs.

Republishing all events at once can cause performance issues. To address this, the event handler relies on the concurrency settings of the Lambda service, which can be adjusted to control the rate of consumption. We adjusted the concurrency of the Lambda function based on the performance of the billing core service database, so that all republished events are consumed quickly while keeping the load within an acceptable range.
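As a rough sketch of the replay attribute, the helpers below build the request parameters for publishing a republished event and a subscription filter policy that skips such events. The attribute name `replay`, the helper names and the example topic ARN are assumptions for illustration. The dictionaries follow the shape expected by the boto3 SNS client, but nothing is sent to AWS here.

```python
import json

REPLAY_ATTRIBUTE = "replay"  # hypothetical attribute name


def replay_publish_kwargs(topic_arn: str, event: dict) -> dict:
    """Keyword arguments for publishing an event marked as a replay."""
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(event),
        "MessageAttributes": {
            REPLAY_ATTRIBUTE: {"DataType": "String", "StringValue": "true"},
        },
    }


def skip_replays_filter_policy() -> str:
    """Filter policy for consumers that do not want republished events."""
    # Matches only messages that do NOT carry the replay attribute.
    return json.dumps({REPLAY_ATTRIBUTE: [{"exists": False}]})


kwargs = replay_publish_kwargs(
    "arn:aws:sns:us-east-1:123456789012:example-topic",  # placeholder ARN
    {"event_id": "evt-1"},
)
policy = skip_replays_filter_policy()
```

With boto3, these values could be passed as `sns.publish(**kwargs)` and as the `FilterPolicy` subscription attribute. The consumption rate itself could be capped with the Lambda client's `put_function_concurrency` call, which sets reserved concurrency for a function; the right limit depends on what the downstream database can sustain.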

Testing

Integrating multiple components into the event-driven architecture of the billing system introduced new challenges for testing from the consumer's perspective. It required significant focus on ensuring the accuracy and integrity of the data. Currently, we use two types of testing methodologies:

Unit testing is the main focus. The event handler is the core logic module for event processing in the billing system, so its unit tests cover known scenarios. These scenarios are simulated from the user's perspective and include both single-operation and multi-operation events. Because of significant data dependencies, we use fake producer and data builder components in these tests: they simulate the behavior of the producer in a semantic manner to generate data that corresponds to the scenario, which reduces the difficulty of building tests.
Verifying the eventual consistency of the data is secondary. Unit tests cannot cover unknown events, and because the legacy system is the data source for the subscription service, additional verification is needed. To ensure the eventual result is accurate, we conduct eventual consistency validation for events that involve the same user. This confirms the accuracy and integrity of subscription charges, as all subscription charges are stored within the billing core service.

Doing this should ensure that:

Each subscription event has a record of the corresponding subscription charge.
At any given time, there is only one valid subscription charge record per user, and the latest subscription charge reflects the most recent product subscription event.
We have the ability to predict future bills based on data and verify accuracy for a specified date.
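The first two checks in this list could be sketched as follows. The `Charge` record and both function names are hypothetical, and the validity interval is assumed to be half-open (a charge is valid from `start` up to, but not including, `end`).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Charge:
    user_id: str
    event_id: str               # the subscription event that produced the charge
    start: datetime
    end: Optional[datetime]     # None means the charge is still open-ended


def missing_charges(event_ids: Iterable[str], charges: Iterable[Charge]) -> set[str]:
    """Event IDs that have no corresponding subscription charge."""
    return set(event_ids) - {c.event_id for c in charges}


def users_with_multiple_valid_charges(charges: Iterable[Charge], at: datetime) -> set[str]:
    """Users that have more than one valid charge at the given time."""
    counts: dict[str, int] = {}
    for c in charges:
        if c.start <= at and (c.end is None or at < c.end):
            counts[c.user_id] = counts.get(c.user_id, 0) + 1
    return {user for user, n in counts.items() if n > 1}


charges = [
    Charge("user-1", "evt-1", datetime(2024, 1, 1), datetime(2024, 1, 10)),
    Charge("user-1", "evt-2", datetime(2024, 1, 10), None),
    Charge("user-2", "evt-3", datetime(2024, 1, 1), None),
    Charge("user-2", "evt-4", datetime(2024, 1, 5), None),  # previous charge never ended
]
```

Here `user-2` would be flagged, because the charge created by `evt-3` was never ended when `evt-4` arrived.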

The benefits and limitations of an event-driven architecture for consumers

Using EDA ultimately allowed us to better fulfill business requirements. The billing system consumed the subscription events, stored the full set of subscription fees corresponding to the events in the order established in the requirements, and was able to absorb multiple adjustments during the construction process.

Having gone through this process, the advantages and disadvantages of EDA have become clear to me. Although it’s true that these will be shaped by the context in which you are using EDA, working on this project has highlighted a number of different things that I think others should consider.

Here are the advantages:

High performance: The billing system now saves the subscription charge for each subscription event. This saves network interaction time during daily full-user billing.

Traceability: With a timestamp on each subscription event and its corresponding charge, it becomes much easier to identify issues in the rare case of a chargeback.

Time traveling capability: The system is able to retrieve the relevant data as of a specific point in time, based on the event timestamp. Since events and their related subscription charges are read-only, the billing task can select the dataset for a specified timestamp and retrieve the corresponding billing results. For billing prediction, the charging task date can be set to a future date, and a task that has already been performed can be run again.
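To illustrate the time traveling capability, here is a minimal sketch of selecting each user's charge as of a given timestamp. The `ChargeRecord` fields and the function name are assumptions; the point is only that read-only records with validity periods can be queried for any past or future date.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Iterable, Optional


@dataclass(frozen=True)
class ChargeRecord:
    user_id: str
    price: Decimal
    effective_from: datetime
    effective_to: Optional[datetime]   # None means still in effect


def charges_as_of(records: Iterable[ChargeRecord], as_of: datetime) -> dict[str, Decimal]:
    """Price per user for the charges in effect at the given timestamp."""
    result: dict[str, Decimal] = {}
    for r in records:
        if r.effective_from <= as_of and (r.effective_to is None or as_of < r.effective_to):
            result[r.user_id] = r.price
    return result


records = [
    ChargeRecord("user-1", Decimal("10.00"), datetime(2024, 1, 1), datetime(2024, 2, 1)),
    ChargeRecord("user-1", Decimal("15.00"), datetime(2024, 2, 1), None),
]
past_bill = charges_as_of(records, datetime(2024, 1, 15))
predicted_bill = charges_as_of(records, datetime(2024, 6, 1))  # a future date
```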

Finally, there is one significant disadvantage:

Unnecessary complexity: The event source of the subscription service, which is a legacy system, allows more user operations. This results in events with more complex business logic, spreading the complexity across multiple business areas. Although the billing system only focuses on the final state of customer subscriptions, it still needs to consume and store all events to maintain consistency between the subscription service and the billing service.

Despite this additional complexity, EDA was beneficial for this specific use case. It made working through the complexity worthwhile. Of course, that won’t always be the case — what matters is paying close attention to business requirements and doing a thorough and considered evaluation of the best approach.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
