Eight things that slowed down our backend workflow and what we did to improve

Chen Chen

Published: April 28, 2023

The more competitive we want our product to be, the more changes we need to make and the more often we have to redeploy our backend services. As these changes happen, most developers are comfortable following their daily routine, while the tech aficionados will always chase the latest tools and discard old decisions.

Our project was to implement new features in the cloud (based on Amazon Elastic Kubernetes Service) and migrate legacy services to the cloud piece by piece. We encountered several challenges throughout this project. In this article, I would like to list eight things that slowed down our backend workflow and what we did to improve.

1. Set up the environment every time we want to have a new service node

When working on a legacy project, people commonly rely on IT operations to help set up different machines for different services when there’s a demand to scale up services. There are magic scripts for bootstrap that can help, but they might not work if the machine is not set up properly.

So, instead of doing routine work again and again, we dockerized our service.

Docker is a containerization technology that packages an entire service into deliverable units called images. We only needed one Dockerfile as the configuration file, and our services could then run on any operating system that has a Docker runtime.

2. Tedious steps to deploy new features

Many projects still push binaries and JARs to an instance and then trigger a magic script to restart the workload. After we dockerized our application, deploying a new feature only meant publishing the image and running a simple docker or docker-compose command to restart the container.

Then we had the option to use Kubernetes to manage our containers. GitOps is also a good way to deploy services.

To keep rollouts stable, we also tried implementing graceful shutdown for running pods.
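As a rough illustration of what graceful shutdown can look like inside a service: Kubernetes sends SIGTERM to a container when its pod is being terminated, so the process can stop picking up new work and let in-flight work finish. The sketch below is in Python, and `GracefulWorker`, `request_shutdown` and `install_handlers` are hypothetical names for illustration, not code from our project.

```python
import signal
import threading


class GracefulWorker:
    """Hypothetical worker: finishes the current job, then stops on shutdown."""

    def __init__(self):
        self._stop = threading.Event()
        self.completed = []

    def request_shutdown(self, signum=None, frame=None):
        # Matches the (signum, frame) signature expected by signal.signal.
        self._stop.set()

    def run(self, jobs):
        for job in jobs:
            if self._stop.is_set():
                break  # do not pick up new jobs once shutdown was requested
            self.completed.append(job())  # the in-flight job always completes
        return self.completed


def install_handlers(worker):
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, worker.request_shutdown)
```

In a real service the loop would be the request-handling or message-consuming loop, and the pod's termination grace period needs to be long enough for in-flight work to finish.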

3. Duplicate implementation of authentication in every service

Every time we wanted to extend our functionality, we had to create and connect to another service. Whether the architecture is microservices or a monolith, we still need to build mutual authentication between services.

There are different ways of implementing authentication. In my projects, the most common way has been to use a signed JWT and perform a public-key signature check in each service, along with checks on the embedded payload, such as the audience and other custom properties.

One idea that worked well for us was to use our own universal login system together with a well-encapsulated shared library that handles JWT verification.
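To make the shared-library idea concrete, here is a minimal Python sketch of the two checks described above: a signature check followed by payload checks such as audience and expiry. It is an assumption-laden stand-in, not our library: it uses HMAC-SHA256 (HS256) with a shared secret because that needs only the standard library, whereas a real setup like ours would verify a public-key signature (for example RS256) with a maintained JWT library. The function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(part: str) -> bytes:
    # JWT segments omit base64 padding, so restore it before decoding.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def _signature(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_token(claims: dict, secret: bytes) -> str:
    """Issue an HS256 token (demo only; stands in for the login system)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{_b64url_encode(_signature(signing_input, secret))}"


def verify_token(token: str, secret: bytes, expected_audience: str, now=None) -> dict:
    """Return the claims if signature, audience and expiry all check out."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, sig_b64 = parts
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = _signature(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if expected_audience not in audiences:
        raise ValueError("wrong audience")
    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

Each service would then call one function, for example `verify_token(token, key, "orders")`, instead of re-implementing the checks.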

Furthermore, since we were already using Kubernetes, a service mesh was very useful. We also tried some of the existing features that Istio provides, such as Authentication Policy, to delegate our token verification work.

4. Complicated logic for service failure retries

Any outgoing request from a backend service may fail, and not just because the IO exception is a checked exception on some platforms.

Most of the time, failing downstream requests simply surface as error responses, sometimes hundreds of them. But sometimes we want to improve the odds that our service succeeds by retrying the failed request.

Defining a retry pattern for the system isn’t easy, and it gets even more complicated when we add backoff logic for requests that fail again.

For projects where we use modern programming languages, such as Kotlin, things become easier because syntactic sugar simplifies the logic. We can write runIO and retry helper functions to support this.
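The retry-with-backoff idea can be sketched in a few lines. The example below is in Python rather than Kotlin and is not the helper from our project; the `retry` function, its parameters and the injected `sleep` are assumptions made for illustration. It retries an operation on I/O-style errors, doubling the wait between attempts up to a cap.

```python
import time


def retry(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
          retry_on=(OSError,), sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on selected errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
```

A caller wraps the outgoing request, for example `retry(lambda: fetch_order(order_id))`, where `fetch_order` is a hypothetical downstream call. In production it is common to add random jitter to the delay so that many clients do not retry at the same moment.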

For our projects using Kubernetes, a service mesh saved us time. Network concerns were already handled by the service mesh infrastructure layer. With Istio, we could easily define the retry policy outside the program, which made things look much neater and more decoupled.

5. Messy management for system properties and feature toggles

Each service will have a lot of configurable variables for system properties and feature toggles rather than hard coded values within your application. The delivery of those variables can be really frustrating.

Putting the configuration file in each repo is a quick win, but we don’t want to rebuild our application every time only some configuration changes. Some projects use infrastructure automation tools, such as Puppet, to manage properties in one place, but that still involves too much IT operations work.

Since our services are dockerized, we take advantage of volumes to manage our configuration, so that it is well decoupled from the application image.

For services running on Kubernetes, we moved all properties and toggles into the environment variables of the Kubernetes Deployment. Applying a new configuration is then a normal GitOps deploy.
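On the application side, reading properties and toggles from environment variables can stay very small. The following Python sketch assumes two hypothetical variables, `REQUEST_TIMEOUT_SECONDS` and `FEATURE_NEW_CHECKOUT`; the names and defaults are made up for illustration and are not taken from our services.

```python
import os
from dataclasses import dataclass

_TRUE_VALUES = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    request_timeout_seconds: float
    new_checkout_enabled: bool


def load_settings(env=None) -> Settings:
    """Build settings from environment variables, with safe defaults."""
    env = os.environ if env is None else env
    return Settings(
        request_timeout_seconds=float(env.get("REQUEST_TIMEOUT_SECONDS", "5")),
        # Feature toggle: off unless explicitly set to a truthy value.
        new_checkout_enabled=env.get("FEATURE_NEW_CHECKOUT", "false").strip().lower()
        in _TRUE_VALUES,
    )
```

Passing a plain dictionary as `env` keeps the loader easy to test, while the running pod simply calls `load_settings()` and picks up whatever the deployment defines.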

6. Difficulties of managing cloud infrastructures

Many of our projects have already been migrated to the cloud. With some big projects, we have even built a huge empire of infrastructure in the cloud.

It’s impossible to manage all the computing resources through the web console alone.

We started with some platform-native tools to manage our resources, then switched to Terraform as our first choice for managing infrastructure.

For projects where people prefer imperative code over declarative code, Pulumi or the Terraform CDK can be good options.

7. Hard to do API and contract tests across different microservices

We couldn't tell when an upstream or downstream contract would change. Someone suggested Pact contract tests, but we gave up early on, at the point of booting our own broker server…

Instead, we defined our contract with OpenAPI and generated the client and server code on demand. The OpenAPI definition can also be hosted remotely, so that client and server share the same contract and we can write tests against it.

We also customized the OpenAPI generator logic by modifying its Mustache templates.

8. Follow too many best practices and get stuck in the beginning

At the beginning, I mentioned that some developers always like to try the latest and what is called best practice.

But should we tear down legacy and embrace new things all the time? No.

New things might be fancy but we must recognize the learning curve and effort and risk of migration. We must compromise at points and take the most valuable changes and advance step by step.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
