Kubernetes is fast becoming the operating system for the Cloud, and brings a ubiquity which has the potential for massive benefits for technology organisations and developers. Kubernetes has seen a dramatic increase in visibility and adoption throughout 2017, with all major cloud providers now offering their own native Kubernetes service, and several container orchestration platforms rebuilding with Kubernetes as an underpinning. Here, we’ll take you through some of the things which we’ve found interesting, challenging, or exciting about using Kubernetes over the past year — we hope you’ll find it useful.
Starting with Clusters
I’ve heard about this Kubernetes thing, but how and where do I get started?
For most people starting out on Kubernetes, creating a Cluster is the first thing you’ll be doing. In early 2017, if you wanted a Kubernetes Cluster your options were to use Google’s managed Kubernetes service or to build your own. Thankfully, since then several new tools and scripts have emerged to make your life easier. For us, the introduction of Kops was a game changer. Kops has made spinning up a Kubernetes Cluster in AWS a breeze.
With only a few commands, Kops was able to create an entire auto-scaling Cluster which we’ve been comfortably nurturing for the past year. Kops even generated a set of Terraform templates for our Cluster, which we were able to drop into our existing infrastructure automation codebase and manage it the same as any other piece of our infrastructure.
In the time we’ve had a Kubernetes Cluster we’ve upgraded it from version 1.4 up to the current version. Kops managed all the changes with its rolling update feature, which migrated our Cluster server-by-server, gradually introducing the new version until everything was reporting back healthy. Thanks to Kops we’ve been able to keep up-to-date with Kubernetes releases, which has allowed us to use all the latest features as they’ve become available. We’ve seen the rise and fall of PetSets, the introduction of CronJobs, Node affinities, init-containers, and many other interesting features without breaking a sweat from upgrading.
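As a rough illustration of the server-by-server idea, here is a minimal Python sketch of a rolling update loop. It is not how Kops is implemented; the `rolling_update` function, the node dictionaries, and the version numbers are all hypothetical, and the "upgrade" step is a simple assignment standing in for draining and replacing a server.

```python
from typing import Callable, List


def rolling_update(
    nodes: List[dict],
    new_version: str,
    is_healthy: Callable[[dict], bool],
) -> bool:
    """Upgrade nodes one at a time, stopping at the first failed health check.

    Returns True if every node ends up healthy on new_version.
    """
    for node in nodes:
        if node["version"] == new_version:
            continue  # already upgraded, nothing to do
        # Stand-in for draining the node and replacing it with a new server.
        node["version"] = new_version
        if not is_healthy(node):
            # Stopping here leaves the cluster running mixed versions.
            return False
    return True


if __name__ == "__main__":
    # Hypothetical three-node cluster and version numbers.
    cluster = [{"name": n, "version": "1.7"} for n in ("a", "b", "c")]
    finished = rolling_update(cluster, "1.8", lambda node: True)
    print(finished, [node["version"] for node in cluster])
```

The early return is the interesting part: an update that stops partway leaves some Nodes on the old version, so a real tool needs a way to resume or roll back.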
Despite the assistance of Kops, setting up our Cluster wasn’t all roses. We encountered a number of issues, many self-inflicted as we learned, but also a few from Kops and Kubernetes themselves. For example, Kops’ rolling update feature is fantastic but still in its infancy. We had several problems with Cluster upgrades ending prematurely, which left us with Nodes running multiple versions. On several occasions Pods were redistributed to other Nodes while an upgrade was occurring and weren’t rebalanced after it completed, which left some Nodes stressed and others underutilized. Many of these problems could have been mitigated by recent improvements in Kops and by careful capacity planning.
With Azure, Google Cloud, and AWS all now offering or having announced a managed Kubernetes service, creating a Cluster should soon have a much lower barrier to entry than it has historically, and be as trivial as creating any other managed cloud service. Kops will still be a compelling tool for organizations which need a self-managed Cluster, but increasingly that’ll become a niche market.
Developing for and on Kubernetes
Now I’ve got a Cluster, how does my team start using it and deploying their applications?
The Kubernetes team and community have done a great job of making the developer experience of working with Kubernetes smooth and painless. The project has an ethos of community engagement and rigorous documentation, meaning that as a newcomer you’ve not only got a wealth of information, tutorials, and workshops to consume, you’re also embraced by a friendly and helpful community. It’s difficult not to quickly understand and embrace Kubernetes and everything which comes with it. We both feel Kubernetes’ documentation and approach to managing change is one of its big strengths.
People wanting to try Kubernetes can use Minikube to create a tiny self-contained Cluster on their local machine and deploy to it like any other Kubernetes Cluster. Kubernetes and Minikube can have significant impacts on the way developers work. We’re seeing near-universal adoption of Docker everywhere, and with Minikube being a combination of local Kubernetes on top of Docker, your development workflow can mirror your production workflow even more closely than before. Docker images designed for running in your Kubernetes environment can be run locally in your Minikube without any changes. There’s a whole ecosystem developing around local workflows, with tools like Draft from Azure tackling some interesting challenges. Even Docker itself has added Kubernetes to its stack.
Once you’ve got your head around what it means to deploy something to a Cluster, all the Pods and Deployments and Services and more, you’ll be wondering how you can apply Continuous Delivery practices to your Kubernetes workflow. How do you frequently deliver, with idempotent releases, zero downtime, and with as much reuse and commonality between environments as possible?
Much like Kops, Helm was another game changer for us. Helm is a package manager, a deployment tool, and a repository of pre-configured applications. With Helm you can write your Kubernetes manifests as templated code which can be reused across environments by tweaking settings at deploy time with variables.
This is akin to infrastructure as code for Kubernetes, and gives us the repeatability and reuse we’ve come to expect from a well-managed infrastructure project.
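To make the idea concrete, here is a minimal Python sketch of one manifest template rendered with different settings per environment. It uses Python’s `string.Template` purely as a stand-in: Helm itself uses Go templates and values files, and the application name `web`, the image `example/web`, and the environment settings below are all hypothetical.

```python
import json
from string import Template

# A Deployment manifest with placeholders. Helm would express these with
# Go template syntax; string.Template is only a stand-in for illustration.
DEPLOYMENT_TEMPLATE = Template("""{
  "apiVersion": "apps/v1",
  "kind": "Deployment",
  "metadata": {"name": "$name"},
  "spec": {
    "replicas": $replicas,
    "selector": {"matchLabels": {"app": "$name"}},
    "template": {
      "metadata": {"labels": {"app": "$name"}},
      "spec": {"containers": [{"name": "$name", "image": "$image:$tag"}]}
    }
  }
}""")

# Hypothetical per-environment settings, analogous to Helm values.
ENVIRONMENTS = {
    "test": {"replicas": 1, "tag": "latest"},
    "production": {"replicas": 3, "tag": "1.4.2"},
}


def render(environment: str) -> dict:
    """Render the shared template with the settings for one environment."""
    values = {"name": "web", "image": "example/web", **ENVIRONMENTS[environment]}
    return json.loads(DEPLOYMENT_TEMPLATE.substitute(values))


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        manifest = render(env)
        print(env, manifest["spec"]["replicas"])
```

The same template produces both environments; only the values differ, which is where the repeatability comes from.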
We’re using Helm extensively to automate our deployments to Kubernetes, covering both real in-use systems and transient testing environments, all with the same code. We also rely on and contribute to several of the open-source Helm charts for common applications, such as Concourse, Grafana, InfluxDB, and several others.
Cloud integration and agnosticism
I’ve created my Cluster, and I have an application running in there, but I’m concerned I now can’t leverage the full potential of my cloud. Can my Kubernetes applications still use native AWS/GCP services?
Kubernetes can sound like an island, out there on its own. You’re building software and putting it in a Kubernetes Cluster, but aren’t you isolating yourself from your cloud provider?
Kubernetes occupies quite an interesting space in the cloud infrastructure ecosystem, where it’s becoming an agnostic layer between the cloud platform you’re running on and your applications. This can sound like isolation, but Kubernetes can be tightly integrated with cloud providers while also enabling you to be much more cloud independent than you would be without it. This idea sounds a bit contradictory, so we’ll elaborate...
You have an application which needs a Load Balancer, as many do. Traditionally, you’d build and package your application, and then have CloudFormation or Terraform templates describing the environment to deploy the application into, which would contain a specific request for an AWS Elastic Load Balancer, for example. With Kubernetes, you specify a Load Balancer in its manifest but without explicitly requesting an AWS one. When you deploy the application to your Cluster, Kubernetes interprets your request for a Load Balancer differently, depending on which cloud provider your Cluster is deployed in. When your Cluster is in AWS an AWS Elastic Load Balancer is provisioned and automatically connected up to your container, but when you’re in Google Cloud you get a GCP Load Balancer instead, all seamlessly and without any changes to your deployment configuration.
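As a small illustration, the following Python sketch builds a Service manifest that requests a load balancer without naming a cloud provider. The `type: LoadBalancer` field is the standard Kubernetes mechanism; the helper function, the application name `web`, and the port numbers are hypothetical.

```python
import json


def load_balancer_service(name: str, port: int, target_port: int) -> dict:
    """Build a Service manifest that asks for a load balancer.

    Nothing here names a provider: the Cluster decides what to provision.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "LoadBalancer",
            # Route traffic to Pods carrying this label.
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(load_balancer_service("web", 80, 8080), indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, which accepts JSON as well as YAML.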
Load Balancers are just one example. The same pattern applies to everything from external DNS records and long-lived persistent storage volumes through to Cluster networking. Interestingly, this agnosticism opens up opportunities for running Clusters federated across multiple cloud providers for (super) high availability, and even running Kubernetes on bare metal servers.
Running systems in Kubernetes
Now I have applications running in my Cluster, how do I make sure they’re running ok?
With the amount of activity happening in a Kubernetes Cluster — Pod creation and destruction, scheduled cron jobs, frequent deployments — adequate monitoring becomes the key to understanding the current health of your Cluster. In addition, when you’re collecting an appropriate set of metrics, you can alert your teams when the system is unhealthy or requires attention. Similarly, due to the dynamic nature of Kubernetes, a centralized logging system is essential in order to help diagnose issues when they arise.
When we were first getting started, we found the Kubernetes Dashboard and Heapster metrics collection setup to be a great place to dive into observing the behavior of our Cluster. Most Kubernetes Clusters will have the Kubernetes Dashboard installed by default, or it can be installed afterwards with minimal effort. The Dashboard gives you a high-level overview of the health of your Cluster with a series of metrics about how much CPU and memory Pods are consuming, and how Node limits are being stressed. The Dashboard also has a limited set of administration capabilities, such as scaling Pods manually, viewing logs, and executing commands in a container, which can get you surprisingly far for observing your Cluster. The Kubernetes Dashboard is another great example of how the project and community (the Dashboard is community driven after all) embrace newcomers, by making potentially complicated information readily accessible with minimal effort.
Spending time just exploring the dashboard really helped us understand how things hang together and what was happening in our Cluster.
Once you outgrow the Dashboard, things really start to get interesting. Moving beyond the Dashboard to a great monitoring system really helped us understand how our systems were behaving and what things we could try to improve the health of the Cluster. Our Cluster evolved from Heapster and the Dashboard to a combination of a set of Grafana dashboards and Prometheus (both helpfully installed from Helm).
All scheduling in Kubernetes is managed via a central API, which turns out to be quite a compelling feature. Different subsystems can listen for activity in the Cluster and perform work on events, such as when a new Pod is started. We use this capability to have our Prometheus monitoring system automatically detect when new applications are deployed into our Cluster. This enables us to start collecting metrics and visualizing their behavior immediately.
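The event-driven pattern can be sketched in a few lines of Python. The Kubernetes watch API delivers events with a `type` (`ADDED`, `MODIFIED`, or `DELETED`) and the affected `object`; the code below processes events of that shape to keep a list of scrape targets up to date. This is an illustrative sketch rather than how Prometheus is implemented: the `update_targets` and `make_event` helpers are hypothetical, the events are hand-built instead of read from a live Cluster, and the `prometheus.io/scrape` annotation is a common convention that we assume here.

```python
from typing import Dict, Iterable

# Assumed convention: Pods opt in to scraping with this annotation.
SCRAPE_ANNOTATION = "prometheus.io/scrape"


def update_targets(targets: Dict[str, str], events: Iterable[dict]) -> Dict[str, str]:
    """Apply watch-style Pod events to a map of Pod name -> Pod IP."""
    for event in events:
        pod = event["object"]
        name = pod["metadata"]["name"]
        annotations = pod["metadata"].get("annotations", {})
        wants_scrape = annotations.get(SCRAPE_ANNOTATION) == "true"
        if event["type"] == "DELETED" or not wants_scrape:
            targets.pop(name, None)  # stop scraping this Pod
        else:
            # ADDED or MODIFIED: record the Pod once it has an IP.
            ip = pod.get("status", {}).get("podIP")
            if ip:
                targets[name] = ip
    return targets


def make_event(kind: str, name: str, ip: str, scrape: bool) -> dict:
    """Build a hand-made event with the same shape as a watch event."""
    annotations = {SCRAPE_ANNOTATION: "true"} if scrape else {}
    return {
        "type": kind,
        "object": {
            "metadata": {"name": name, "annotations": annotations},
            "status": {"podIP": ip},
        },
    }


if __name__ == "__main__":
    targets: Dict[str, str] = {}
    update_targets(targets, [
        make_event("ADDED", "web-1", "10.0.0.5", True),
        make_event("ADDED", "batch-1", "10.0.0.6", False),
    ])
    print(targets)
```

Against a real Cluster, the events would come from a watch on the Pods endpoint of the API server instead of `make_event`.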
We combined this with some simple configurations of alerting to keep us notified of when any systems started to misbehave, all without the need for human intervention. For example, we were able to use this setup to diagnose when heavy workload Pods were starving other Pods of resources, and were then able to set more appropriate resource limits and configure Node affinity rules to schedule heavy work on powerful dedicated Nodes. Similar to metrics, we also had Fluentd automatically scraping logs from every Pod and forwarding them onto our ElasticSearch Cluster.
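For readers who haven’t seen them, resource limits and Node affinity rules are both plain fields in a Pod spec. The Python sketch below builds such a spec; the field names follow the Kubernetes Pod API, while the helper function, the Node label `workload=heavy`, the image name, and the CPU and memory figures are hypothetical examples rather than the values we used.

```python
import json


def heavy_workload_pod(name: str, image: str, label_key: str, label_value: str) -> dict:
    """Build a Pod manifest with resource limits and a required Node affinity."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Only schedule onto Nodes carrying the given label.
            "affinity": {
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [{
                            "matchExpressions": [{
                                "key": label_key,
                                "operator": "In",
                                "values": [label_value],
                            }],
                        }],
                    },
                },
            },
            "containers": [{
                "name": name,
                "image": image,
                # Requests guide scheduling; limits cap what the container may use.
                "resources": {
                    "requests": {"cpu": "2", "memory": "4Gi"},
                    "limits": {"cpu": "4", "memory": "8Gi"},
                },
            }],
        },
    }


if __name__ == "__main__":
    pod = heavy_workload_pod("report-builder", "example/report-builder:1.0", "workload", "heavy")
    print(json.dumps(pod, indent=2))
```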
That said, with automatic monitoring and logging comes some cost. There’s actual monetary cost: you’re likely to now be forwarding a whole lot more logs and metrics than you ever were before, which tends to add up. Then there’s a cognitive cost of collecting all this data, and a serious danger of over-alerting and causing fatigue or complacency in your teams responding to the alerts (HumanOps is an important consideration here). We advise being very careful about which alerts you choose to send as notifications; err on the side of under-alerting to begin with. Our practice is to send all alerts to a Slack channel which is checked periodically but remains muted for most people. Once we’ve tuned the alerts which are going to that channel sufficiently they get promoted to the “real” alerts channel and to other notification tools.
These tools being auto-wired is an incredibly powerful concept. It has the potential to revolutionize the way we structure and deploy applications. Kubernetes uses the Sidecar terminology to refer to a collection of supporting containers which can be co-located with the applications. Instead of each application having to explicitly request or include its own set of monitoring libraries, sidecars can be automatically hitched to your application and start adding value without adding complexity to your development workflow. Some examples of sidecars which we’ve used have already been mentioned: Prometheus monitoring and Fluentd logging. But we also use containers which act as reverse proxies, others which automatically generate TLS certificates, and others which do periodic tasks like running database migrations. Sidecars are a super interesting concept and can have significant impacts on how you design your system.
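Mechanically, a sidecar is just an extra container in the same Pod, often sharing a volume with the application. The Python sketch below adds a log-forwarding sidecar to an existing Pod manifest without touching the application container’s image or code. The `add_log_sidecar` helper, the container and image names, and the log directory are hypothetical; `emptyDir` volumes and `volumeMounts` are standard Pod fields.

```python
import copy


def add_log_sidecar(pod: dict, sidecar_image: str, log_dir: str = "/var/log/app") -> dict:
    """Return a copy of the Pod manifest with a log-forwarding sidecar added.

    Every container shares an emptyDir volume mounted at log_dir, so the
    sidecar can read what the application writes there.
    """
    result = copy.deepcopy(pod)  # leave the caller's manifest untouched
    spec = result["spec"]
    spec.setdefault("volumes", []).append({"name": "logs", "emptyDir": {}})
    mount = {"name": "logs", "mountPath": log_dir}
    for container in spec["containers"]:
        container.setdefault("volumeMounts", []).append(dict(mount))
    spec["containers"].append({
        "name": "log-forwarder",
        "image": sidecar_image,
        "volumeMounts": [dict(mount)],
    })
    return result


if __name__ == "__main__":
    app_pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "web"},
        "spec": {"containers": [{"name": "web", "image": "example/web:1.0"}]},
    }
    with_sidecar = add_log_sidecar(app_pod, "example/log-forwarder:1.0")
    print([c["name"] for c in with_sidecar["spec"]["containers"]])
```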
The future: is it serverless?
Ok, this is all very interesting, but isn’t it all a waste of time given Lambda and serverless are the future?
Where do we go from here? There’s growing excitement in the industry about “serverless” and Lambda-based architectures, and some would argue that Kubernetes is moving in the opposite direction.
Kubernetes may sound like it’s at odds with serverless approaches to system design; big Cluster of servers versus “no servers”. However, Lambdas do get scheduled somewhere, and it’s within a Cluster of servers, just not a Cluster of servers you control. Not every organization’s going to be comfortable with giving up that control, and not every workload’s going to be compatible with the constraints of Lambdas.
Kubernetes has the opportunity to be the servers of the serverless world. We’re already seeing this with tools like Kubeless and Fission, which provide equivalents to functions-as-a-service but run within Kubernetes. These tools aren’t a drop-in replacement for Lambda yet, as they lack a lot of the native integrations which make Lambda so powerful. But they’re an interesting example of how it doesn’t have to be an either-or situation, and how you can potentially avoid some of the vendor lock-in which currently comes with Google Cloud Functions and AWS Lambda.
The most exciting advancement in this space, in our opinion, is AWS Fargate. Details are still scarce on exactly how it will work, but what we know so far is that Fargate is to containers what Lambda was to functions. Instead of having to squash or break up your codebases to work within the constraints of a Lambda (which isn’t always a bad thing but can be if you’re only doing it to satisfy Lambda!), you can deploy a container into Fargate and have it be managed for you with a similar cost model to Lambda. Why is this so interesting? In early 2018 we’ll be seeing Fargate running on top of Kubernetes in AWS EKS. That promises the flexibility and low overhead of serverless applications, optionally running on a managed or self-managed Kubernetes Cluster, presumably with all the extra perks.
If you haven’t already guessed, we’re pretty excited about Kubernetes. In the year or so we’ve been running our Cluster we’ve seen it go from strength to strength, with new versions being released frequently containing significant improvements and exciting new features, and meanwhile more and more vendors are adopting it. We think in the next year, Kubernetes is going to be everywhere, and we’ll start seeing even more exciting technology being built on top of it thanks to everyone having extra capacity from not trying to re-invent the basics. Mesh networks, multi-region Clusters, multi-cloud Clusters, serverless reimagined, it’s all very exciting. Watch this space.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.