Some time ago, Steve Upton and I wrote an article about Feature Toggles. Since then, through many discussions and through applying the ideas from the article, I have encountered many new challenges and questions around toggles, which I would like to address in parts 2 and 3.
Again, this article uses the term “Feature Toggle” mostly to describe “Release Toggles”: short-lived feature toggles that should ideally live for no more than about one sprint before being cleaned up.
The two key issues that people are facing with toggles can be summed up by:
- When a toggled feature is deployed to production, it is not always fully “hidden”.
- Creating and testing a toggle for a new feature creates more overhead and risks than the feature itself.
Feature toggles influence performance
The idea of Feature Toggles is to deploy new features to the production environment without actually releasing them. Releasing a new, unfinished feature might create problems, so we want to hide it. At the same time, we want to be ready to release it at any time. To achieve that, the state of the toggle is checked every time the code path containing the new feature is executed.
If the toggle is dynamic and stored in a database, every check is a database call. If the check sits in the main flow of the software, this can add up to thousands or even millions of database queries for a value that rarely changes. This way, performance in production can suffer from having too many toggles (or any toggles at all) even before any feature is released.
One way around that is to optimize your toggle database for read performance, which can come with additional costs. To keep those costs down, clean up toggles quickly and never have more than a handful of them in your code at a time.
If additional costs are not acceptable, consider caching the toggle for a short time or switching to static toggles (which require a deployment to flip). Be aware, however, that this cuts off key benefits of using toggles in the first place. You cannot flip the toggle quickly as an emergency switch if anything goes south after releasing; you have to wait for the cache to expire or for the deployment pipeline to finish. You cannot flip the toggle quickly to showcase the new feature to your stakeholders in a presentation. You cannot flip the toggle quickly to release a feature at an exact point in time.
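To make the caching trade-off concrete, here is a minimal sketch of a short-lived toggle cache. The class `CachedToggles`, the `fetch` callback (standing in for the database query) and the toggle name in the usage note are assumptions made for illustration, not part of any particular toggle library.

```python
import time
from typing import Callable, Dict, Tuple


class CachedToggles:
    """Caches toggle states briefly to avoid one lookup per check."""

    def __init__(
        self,
        fetch: Callable[[str], bool],   # hypothetical lookup, e.g. a database query
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # toggle name -> (cached state, time at which the entry expires)
        self._cache: Dict[str, Tuple[bool, float]] = {}

    def is_enabled(self, name: str) -> bool:
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]             # still fresh: no lookup needed
        state = self._fetch(name)       # unknown or expired: exactly one lookup
        self._cache[name] = (state, now + self._ttl)
        return state
```

With `ttl_seconds=5.0`, a toggle such as `"new-checkout"` that is flipped in the database can take up to five seconds to become visible to this process. That delay is exactly the price described above: the shorter the TTL, the faster the emergency switch, and the more queries you pay for.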
If your application has a frontend and your feature mostly affects the frontend, consider hiding the feature behind a header or query parameter. This way, the toggle does not need to be fetched from a database, so it adds no database traffic. Be mindful that these “toggles” might be discovered by malicious users. They should only hide unfinished features that cannot cause any harm if discovered.
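As a rough sketch of such a request-scoped “toggle”, the function below checks a query parameter and a header. The names `preview`, `X-Feature-Preview` and `preview_requested` are made up for this example, and the same check could just as well live in frontend code.

```python
from typing import Mapping
from urllib.parse import parse_qs, urlsplit

PREVIEW_PARAM = "preview"              # hypothetical query parameter
PREVIEW_HEADER = "X-Feature-Preview"   # hypothetical request header


def preview_requested(url: str, headers: Mapping[str, str], feature: str) -> bool:
    """Return True if the request opts in to `feature` via parameter or header."""
    query = parse_qs(urlsplit(url).query)
    if feature in query.get(PREVIEW_PARAM, []):
        return True
    # HTTP header names are case-insensitive, so normalise before comparing.
    normalised = {key.lower(): value for key, value in headers.items()}
    return normalised.get(PREVIEW_HEADER.lower(), "") == feature
```

A request to `https://example.com/cart?preview=new-checkout` would opt in without any database lookup. Anyone who guesses the parameter can do the same, which is why this only suits features that are harmless if discovered.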
Feature toggles are difficult to implement for infrastructure code
When implementing code that depends on infrastructure, the business logic we want to toggle is usually written in a different language and deployed at a different time than the corresponding infrastructure code. Therefore, we have to make sure that all the required infrastructure (for both the enabled and the disabled state of the toggle) is present whenever the toggle can be flipped. The toggle can then be used to select which infrastructure the software uses.
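In code, that selection can be as small as the sketch below. The queue names, URLs and the toggle name `use-orders-v2` are hypothetical, and the sketch assumes both queues have already been provisioned by the infrastructure code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class QueueConfig:
    name: str
    url: str


# Both queues must exist before the toggle may be flipped in either direction.
OLD_QUEUE = QueueConfig("orders-v1", "https://queue.example.com/orders-v1")  # hypothetical
NEW_QUEUE = QueueConfig("orders-v2", "https://queue.example.com/orders-v2")  # hypothetical


def select_queue(is_enabled: Callable[[str], bool]) -> QueueConfig:
    """Pick the target queue based on the toggle state at call time."""
    return NEW_QUEUE if is_enabled("use-orders-v2") else OLD_QUEUE
```

The business code stays simple, but only because the hard part, keeping both queues available and correctly configured, has been pushed into the infrastructure setup.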
How to solve this problem depends on the concrete infrastructure that the code requires: Are there shared resources? Can duplicate infrastructure lead to duplication of events (e.g. when introducing a new event queue with a different configuration)? Can you decide from within your software which infrastructure is used (e.g. switching from one Kubernetes cluster to another cannot be done with a simple toggle)?
Sometimes we have to take a step back and use a static feature toggle instead of a dynamic one, to ensure that changes are applied only to non-production environments until the feature has been proven to work fully. This slows down our development process and carries the risk of a growing delta between the environments. In return, we can make sure that the toggle used in the business code has the same state as the configuration in our infrastructure code.
If switching to a new infrastructure setup is the main feature of your story, consider duplicating the infrastructure and routing users only to the old version until you change the routing in your infra setup. This can be considered a feature toggle as well, even though the implementation may differ from what you have in mind when toggling backend or frontend code. An advantage of doing this is that you can test how your new infrastructure integrates with the production environment without releasing it to customers yet. As with business-code changes, the release itself can then happen with zero downtime and without any delay.
Kief Morris’ book “Infrastructure as Code” covers feature hiding and toggling of infrastructure in greater detail.
Feature toggles can be more complex than the story
I have often faced the situation where I wanted to toggle a feature that was bound to Java or Spring Boot annotations. Since these annotations are often used as syntactic sugar that hides a lot of otherwise verbose code, introducing a feature toggle that changes the behavior of an annotation (or which annotations are used) requires writing out this verbose code explicitly and then injecting the feature toggle into it. This can be very frustrating and cumbersome for small changes, especially once tests and later cleanup are included. To avoid this, figure out early on whether a feature toggle for the annotation makes sense or whether the toggle can be moved to a different location instead. Also, when writing new code, consider designing it so that it is easy to read, maintain and extend, and at the same time easy to add new toggles to.
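The effect can be illustrated with a rough Python analogue, where decorators play a role similar to annotations. This is not Spring code: `retry`, `load_profile` and the toggle name `retry-profile-load` are invented for this sketch.

```python
import functools


def retry(times):
    """Decorator sugar: call the wrapped function up to `times` times."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == times:
                        raise   # out of attempts: surface the last error
        return wrapper
    return decorate


# Before: one line, but the behaviour is fixed when the decorator is applied.
#
#   @retry(times=3)
#   def load_profile(client, user_id):
#       return client(user_id)


def load_profile(client, user_id, is_enabled):
    """After: the loop is written out so the toggle is read on every call."""
    times = 3 if is_enabled("retry-profile-load") else 1
    for attempt in range(1, times + 1):
        try:
            return client(user_id)
        except Exception:
            if attempt == times:
                raise
```

A one-line decorator has turned into a hand-written loop that needs its own tests for both toggle states and has to be folded back into the decorator once the toggle is cleaned up.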
Similarly, when fixing bugs, the question sometimes comes up whether a feature toggle should be created for the ticket. On the one hand, a toggle can help prevent the fix from accidentally making the situation worse. On the other hand, the bug is already a problem that needs to be addressed as quickly as possible, and creating a toggle around buggy behavior in the code can be more complicated than the fix itself. Therefore, I feel it is sensible to decide case by case whether a bugfix needs a toggle instead of dogmatically always creating one.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.