Setting up ‘production-issue alerts’ made my life easier. Here's how.

Ashwitha B G

Published: February 23, 2021

Large companies can have hundreds of microservices - some on Virtual Machines (VM) and others on Kubernetes, where new features are built and deployed every day. And, without efficient monitoring, it becomes hard to identify any specific application or critical API that’s failing.

Let’s take the complaints automation system in a ride-hailing application, for instance. This system automatically resolves complaints raised by drivers and customers, such as belongings lost in the cab or a driver not picking up the customer.

Complaints automation system

When not monitored, bad things get worse

The above-mentioned ride-hailing app could be going through deployments involving database migrations and major code rewrites. If not every scenario has been tested, a production deployment could break.

This could put automation on pause for several hours and all complaint tickets will have to be redirected for manual verification. Without appropriate monitoring, it falls to agents to report the production issue to the company.

Automated monitoring

By capturing events like the one described above in the automation service, we could:

Keep track of the issues being automated
Capture statistics for how many complaints were successfully resolved and how many failed (with 4XX or 5XX errors)
Keep track of response-time for database queries and API requests
Carefully monitor critical APIs by adding alerts for every single failure
Follow system-related metrics like disk space, memory, CPU usage, etc.
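
As a rough illustration of the second point, assuming each automation attempt ends with an HTTP status code, the outcomes could be tallied as in the sketch below. The function and bucket names are hypothetical, and a real setup would emit these counts as metrics rather than keep them in memory.

```python
from collections import Counter


def classify(status_code):
    """Map an HTTP status code to an outcome bucket (bucket names are made up)."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client_error"  # 4XX
    if 500 <= status_code < 600:
        return "server_error"  # 5XX
    return "other"


def summarize(status_codes):
    """Count how many automation attempts ended in each bucket."""
    return Counter(classify(code) for code in status_codes)


# Hypothetical status codes from five automation attempts.
print(summarize([200, 201, 404, 500, 503]))
```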

With the necessary monitoring in place, production issues can be fixed much more quickly, and we will also know how well new features are performing, both at deployment time and afterwards in production.

Complaints automation system with monitoring setup

Building effective monitoring systems

A monitoring system consists of metrics, monitoring and alerting.

To build one's own monitoring system, one would need a collector to collect metrics, a store to store metrics, a visualizer to set up dashboards, and an alerter to alert when something goes wrong. There are multiple ways we can monitor applications and systems.

This blog discusses the use of the TIG (Telegraf, InfluxDB and Grafana) stack, an end-to-end open-source solution for monitoring applications. It has three components: Telegraf for collecting metrics, InfluxDB for storing data, and Grafana for visualization and alerting.

Why TIG stack?

There are many other monitoring systems on the market, such as Prometheus and Datadog. Ideally, the choice between them should depend on the scale of the task at hand, whether the system needs to be open source, and whether push- or pull-based monitoring is required.

For instance, both Prometheus and the TIG stack are open source. Prometheus is a pull-based system and is known to work well at large scale, while InfluxDB is a push-based system and supports multiple data types.

In this article, I’m not comparing monitoring systems; I’m just picking one, the TIG stack, to explain how monitoring systems work.

Telegraf

Telegraf is a metrics-collecting agent that is optimized to write to InfluxDB. It can run on a VM, or as a pod or a sidecar in a Kubernetes cluster, alongside the systems whose metrics it collects. It is written in Go and compiles into a single binary with no external dependencies.

It’s plugin-driven and can collect metrics from 100+ popular services through plugins. There are four types of plugins:

Input plugins collect metrics from systems, services and third-party APIs. For example, the PostgreSQL plugin gets metrics from a Postgres database
Output plugins write the metrics to various destinations. For example, the InfluxDB output plugin sends metrics to InfluxDB
Aggregator plugins create aggregate metrics. For example, Merge combines metrics that share a name, tag set and timestamp into a single multi-field metric
Processor plugins transform, decorate and filter metrics. For example, the Regex plugin transforms data based on regular expressions

It’s extremely easy to add a plugin in Telegraf. Here’s the configuration needed to add the 'mem' input plugin, which collects memory-usage metrics. It goes in Telegraf's configuration file.

# Read metrics about memory usage
[[inputs.mem]]

Telegraf can work in both pull- and push-based models and has plugins for pulling and pushing metrics.

In the pull-based model, the monitoring agent pulls metrics from systems periodically. It pulls data from its targets, formats the metrics into InfluxDB line protocol, and sends them off to InfluxDB.

In the push-based model, metrics are pushed to the monitoring agent. Telegraf passes on the metrics from a system such as a database running in a VM, and also pulls data from the VM itself using plugins like cpu and mem. To receive metrics from an application, we use the statsD plugin, which follows the push-based model.

StatsD

StatsD is a simple daemon that collects and aggregates application metrics. A statsD setup consists of a client, a server and a backend.

In our application code, we invoke the statsD client to send metrics to the statsD daemon, which runs inside Telegraf as its statsD plugin. Language-specific statsD client libraries are available. For example, Ruby has a statsD client called statsd-instrument.

By default, a statsD server aggregates metrics for 10 seconds and then flushes them to a backend such as InfluxDB.
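
The aggregate-then-flush cycle can be pictured with a toy in-memory aggregator. This is only a sketch of the idea, not how the statsD server or Telegraf is implemented; the class name, the statistics chosen for timers and the metric names are assumptions made for illustration.

```python
from collections import defaultdict


class Aggregator:
    """Toy aggregation of metrics between two flushes (hypothetical design)."""

    def __init__(self):
        self.counters = defaultdict(float)
        self.gauges = {}
        self.timers = defaultdict(list)

    def record(self, name, value, kind):
        if kind == "counter":
            self.counters[name] += value    # counters are summed
        elif kind == "gauge":
            self.gauges[name] = value       # gauges keep the latest value
        elif kind == "timer":
            self.timers[name].append(value) # timers keep every sample
        else:
            raise ValueError(f"unsupported metric kind: {kind}")

    def flush(self):
        """Return the aggregated snapshot, then reset counters and timers."""
        snapshot = {
            "counters": dict(self.counters),
            "gauges": dict(self.gauges),
            "timers": {
                name: {"count": len(v), "mean": sum(v) / len(v), "max": max(v)}
                for name, v in self.timers.items()
            },
        }
        self.counters.clear()
        self.timers.clear()
        return snapshot


agg = Aggregator()
agg.record("ticket.automation.failed", 1, "counter")
agg.record("ticket.automation.failed", 1, "counter")
agg.record("ticket.automation.time", 100, "timer")
agg.record("ticket.automation.time", 300, "timer")
print(agg.flush())
```

A real server would call something like flush() on a timer at every flush interval and hand the snapshot to the backend.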

The statsD client communicates with the statsD server over UDP, which is fire-and-forget: our code does not wait for a response, so sending a metric adds very little overhead. The statsD server then pushes the metrics to the backend chosen by the project.
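
To make the fire-and-forget behaviour concrete, here is a minimal sketch using only Python's standard library. It is not the API of statsd-instrument or of any other client library; the function name send_line is hypothetical, and the host and port are assumptions about where the statsD listener runs (8125 is the conventional statsD port).

```python
import socket


def send_line(line, host="127.0.0.1", port=8125):
    """Send one metric line as a UDP datagram and return the bytes sent.

    host and port are assumptions; adjust them to your statsD listener.
    """
    payload = line.encode("ascii")
    # UDP is connectionless: sendto() hands the datagram to the OS and
    # returns without waiting for any acknowledgement from the server.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


# The payload is the example metric line used in this article.
send_line("ticket.automation.time:100|ms")
```

Because nothing is acknowledged, the call succeeds even when no server is listening, which is the trade-off that keeps the application code fast.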

<metrics_name>:<metrics_value>|<metrics_type>

example:
ticket.automation.time:100|ms

The metric name is also called a bucket. The metric value is the number associated with the metric name. The metric type can be one of the following:

Timers measure the amount of time taken to complete a task
Counters track how often an event happens. A counter can be incremented or decremented
Gauges hold an arbitrary value at a point in time. For example, the number of active database connections
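
The three types map onto short type codes in the metric line: ms for timers, c for counters and g for gauges. A small helper along these lines could build the lines; the helper name and every metric name except ticket.automation.time are made up for illustration.

```python
# Type codes used in the statsD line format.
TYPE_CODES = {"timer": "ms", "counter": "c", "gauge": "g"}


def format_metric(name, value, kind):
    """Build a statsD line such as 'ticket.automation.time:100|ms'."""
    if kind not in TYPE_CODES:
        raise ValueError(f"unknown metric kind: {kind}")
    return f"{name}:{value}|{TYPE_CODES[kind]}"


print(format_metric("ticket.automation.time", 100, "timer"))
print(format_metric("ticket.automation.failed", 1, "counter"))
print(format_metric("db.connections.active", 12, "gauge"))
```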

Influx database

InfluxDB is a time-series database with a retention policy feature that automatically deletes data older than a configured duration. It’s also easy to learn because of its SQL-like query language, InfluxQL.

InfluxDB line protocol is the text-based format for writing points (single data records) to the database. Each line provides a measurement, a tag set, a field set and a timestamp.

InfluxDB line protocol

In InfluxDB, the equivalent of a table name is called a measurement, indexed data is stored in the tag set, and non-indexed data is stored in fields.
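
To show how these parts fit together, the sketch below renders one point as a line of the form "measurement,tag_set field_set timestamp". It is a simplified illustration rather than a complete implementation of the protocol, and the measurement, tag and field names are hypothetical.

```python
def _escape(text, chars):
    """Backslash-escape each character listed in chars."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _render_field(value):
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'   # strings are double-quoted


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one point; the timestamp is in nanoseconds since the epoch."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    for key, value in sorted(tags.items()):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(value), ",= ")
    field_set = ",".join(
        _escape(key, ",= ") + "=" + _render_field(value)
        for key, value in fields.items()
    )
    return f"{head} {field_set} {timestamp_ns}"


print(to_line_protocol(
    "ticket_automation",
    {"status": "success", "host": "vm-1"},
    {"duration_ms": 100},
    1614038400000000000,
))
```

Tags are sorted by key here because that keeps the output deterministic; the tags are the indexed part of the point, while duration_ms is a non-indexed field.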

Grafana

Grafana is used to visualize metrics in dashboards and to set up alerts. We can create dashboards and graphs for metrics from data sources like InfluxDB, Prometheus, Elasticsearch, etc.

We can also set thresholds for alerts, which can be sent to Slack, email, pagers, etc.
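
Grafana evaluates alert rules itself, so nothing needs to be coded by hand. Purely to illustrate what a threshold rule boils down to, here is a hedged sketch that averages the values in an evaluation window and compares the result with a threshold. The function name, the sample values and the decision to stay quiet on an empty window are all assumptions.

```python
def should_alert(values, threshold):
    """Return True when the average of the window exceeds the threshold."""
    if not values:
        return False  # no data in the window: this sketch stays quiet
    return sum(values) / len(values) > threshold


# Hypothetical response times (ms) from the last evaluation window.
window = [120, 480, 950, 1100]
print(should_alert(window, threshold=500))
```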

Dashboard in Grafana

The simplicity of a monitoring system that leverages the TIG stack lies in its ‘plug and play’ nature. Grafana can be replaced by Chronograf and Kapacitor. Similarly, we can use Prometheus instead of InfluxDB.

Also, Telegraf can collect metrics from different sources. What’s more, all the components are open source and are easy to install.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
