How much can you trust your data?

Ellen König

Published: July 23, 2020

Data is the fuel for intelligent decision making for both humans and machines. Just like high quality fuel ensures that jet engines run efficiently and reliably in the long run, high quality data fuels effective and reliable decision making.

Whether it is for decisions taken by corporate executives, frontline staff or intelligent machine learning models, any intelligent enterprise needs high quality data to operate. But unfortunately, data quality issues are very widespread. In a survey conducted by O’Reilly Media in November 2019, only 10 percent of responding companies stated that they do not face data quality problems.

Why does data quality matter so much?

Let’s have a look at three typical data case studies from different Thoughtworks engagements:

Corporación Favorita, a large Ecuador-based grocery retailer, needs to predict how much of a given product will sell in the future, based on historical data. (Thoughtworkers participated in the linked Kaggle competition.)
A large German automotive company, Client Two, needs a product information system that allows their clients to configure the car they want to buy.
A large online retailer, Client Three, needs dashboards to track sales and logistics KPIs for their products.

Each of these cases depends on the data involved being of high quality. In the first case study, incomplete or unreliable data will lead to untrustworthy sales predictions, resulting in poor stocking and pricing decisions.

For Client Two, a mismatch between the data in the product information system and what the factories can currently build can result in desired car configurations mistakenly not being offered, or in cars being purchased that cannot be produced. Either outcome leads to lost sales, customer frustration and possibly legal claims.

And for Client Three, poor data quality will lead company executives, sales managers and logistics managers to draw incorrect conclusions about the state of the company’s operations. This could result in reduced customer satisfaction, loss of revenue, increased costs, or misdirected investments.

In all of these cases, low data quality leads to poor business decisions being taken, resulting in undesirable business outcomes such as decreased revenue, customer dissatisfaction and increased costs. Gartner reported in 2018 that surveyed organizations believed they, on average, lost $15 million per year due to data quality issues.

Efforts to address data quality can therefore directly help make companies more effective and profitable.

How good is your company’s data quality?

In a modern business, everyone works with data in one way or another, be it producing, managing or using it. Yet like fish with water, we often fail to notice data because it is all around us. And just as fish suffer from bad water quality, we suffer when our data quality decreases.

Unlike fish, though, we can all contribute to addressing data quality issues. That process starts with assessing the current state of our data quality.

Making data quality measurable

Loosely following David Garvin’s widely referenced definition of quality in “Managing Quality” (1988), we can distinguish between three perspectives on data quality:

Data consumers: Usage perspective

Does our data meet our consumers’ expectations?
Does our data satisfy the requirements of its usage?

Business: Value perspective

How much value are we getting out of our data?
How much are we willing to invest into our data?

Engineering: Standards-based perspective

To what degree does our data fulfill specifications?
How accurate, complete, and timely is our data?

To make these perspectives more tangible, we can define data quality dimensions for each of these perspectives. A data quality dimension can be understood as “a set of data quality attributes that represent a single aspect or construct of data quality”. For example, a dimension associated with the usage perspective could be the “relevance” of the data, for the value perspective the “value added” by a data product and for the standards-based perspective the “completeness” of data points.

Based on the dimensions, we can create specific metrics to measure the quality for our chosen dimensions. Once we know how good our data quality is for those dimensions, we can design specific improvement strategies for each dimension.
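As an illustration of turning dimensions into metrics, the following minimal sketch computes two simple standards-based metrics in plain Scala: the completeness of a field and the uniqueness of a key. The `Product` record, its fields and the `QualityMetrics` object are hypothetical names chosen for this example, and the metric definitions are one reasonable choice rather than the only one.

```scala
// Hypothetical record type, used only for this illustration.
final case class Product(id: String, description: Option[String])

object QualityMetrics {
  // Completeness: share of records whose field is present and non-blank.
  // An empty dataset is treated as fully complete (an assumption of this sketch).
  def completeness[A](rows: Seq[A])(field: A => Option[String]): Double =
    if (rows.isEmpty) 1.0
    else rows.count(r => field(r).exists(_.trim.nonEmpty)).toDouble / rows.size

  // Uniqueness: share of records whose key occurs exactly once in the dataset.
  def uniqueness[A, K](rows: Seq[A])(key: A => K): Double =
    if (rows.isEmpty) 1.0
    else {
      val counts = rows.groupBy(key).map { case (k, group) => k -> group.size }
      rows.count(r => counts(key(r)) == 1).toDouble / rows.size
    }
}
```

Calling `QualityMetrics.completeness(products)(_.description)` yields a value between 0 and 1 that can be tracked over time or compared against a threshold.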

Automating the assessment of data quality

Assessing data quality can be a labor-intensive and costly process. Some data quality dimensions used in practice can only be assessed with expert human judgement, but the assessment of many others can be automated with a little effort. An early investment in automating data quality monitoring can pay continuing dividends over time.

Dimensions that can be measured at data point level include the accuracy of values and the completeness of field values. At dataset level, they include completeness of the data set, uniqueness of data points, and the timeliness of data.

Dimensions that require human judgement usually need additional context or a subjective value judgement for their assessment. Examples of such dimensions are interpretability, ease of understanding, and security.

For those dimensions that we can assess automatically, we can make use of two different validation strategies: Rule-based checks and anomaly detection.

Rule-based checks work well whenever we can define absolute reference points for quality. They are used for conditions that must be met in any case for data to be valid. If these constraints are violated, we know we have a data quality issue.

Examples at the data point level are:

Part description must not be empty
Opening hours per day must be between 0 and 24
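The two data point rules above could be expressed as simple predicates. This is a minimal sketch in plain Scala, not tied to any validation library; the `Rule`, `Part` and `Shop` types and their fields are hypothetical.

```scala
// A rule pairs a human-readable name with a predicate on a single record.
final case class Rule[A](name: String, holds: A => Boolean)

// Hypothetical record types, used only for this illustration.
final case class Part(description: String)
final case class Shop(name: String, openingHoursPerDay: Double)

object PointRules {
  val partRules: Seq[Rule[Part]] = Seq(
    Rule("part description must not be empty", (p: Part) => p.description.trim.nonEmpty)
  )

  val shopRules: Seq[Rule[Shop]] = Seq(
    Rule("opening hours per day must be between 0 and 24",
      (s: Shop) => s.openingHoursPerDay >= 0 && s.openingHoursPerDay <= 24)
  )

  // Returns the names of all rules that the record violates.
  def violations[A](record: A, rules: Seq[Rule[A]]): Seq[String] =
    rules.filterNot(_.holds(record)).map(_.name)
}
```

An empty result from `PointRules.violations` means the record passed every rule. Any returned name identifies a definite data quality issue.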

Examples on the dataset level are:

There must be exactly 85 unique shops in the dataset
All categories must be unique
There must be at least 700,000 data points in the dataset
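Dataset-level rules like these look at the whole collection rather than at a single record. The sketch below shows one way to write them in plain Scala. The `Sale` record and the `DatasetRules` object are hypothetical, and the thresholds are parameters so that values such as 85 shops or 700,000 data points can be supplied by the caller.

```scala
// Hypothetical record type, used only for this illustration.
final case class Sale(shopId: Int, category: String)

object DatasetRules {
  // Evaluates three dataset-level rules and returns the names of those that fail.
  def failedRules(sales: Seq[Sale], categories: Seq[String],
                  expectedShops: Int, minRows: Int): Seq[String] = {
    val checks = Seq(
      s"exactly $expectedShops unique shops" -> (sales.map(_.shopId).distinct.size == expectedShops),
      "all categories are unique" -> (categories.distinct.size == categories.size),
      s"at least $minRows data points" -> (sales.size >= minRows)
    )
    checks.collect { case (name, passed) if !passed => name }
  }
}
```

For the rules in the text, the call would be `DatasetRules.failedRules(sales, categories, expectedShops = 85, minRows = 700000)`.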

Anomaly detection works well whenever we can only define data quality relative to other data points. It is defined as "the identification of rare items, events or observations which raise suspicions" (Wikipedia) and is often used for detecting spikes and drops in time series of metrics data.

An identified anomaly only tells us that there might be something wrong with the data. It might arise from a data quality issue, or it might reflect a genuine outlier event recorded in the dataset. A detected anomaly should therefore be used as a starting point for investigating what happened.

Examples of anomaly-based validation constraints are:

The number of transactions should not change more than 20% for each day
The number of car parts on offer should only be increasing over time
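Both constraints compare each value with its predecessor in a time series. The following minimal sketch in plain Scala illustrates the idea. The `AnomalyRules` object and its function names are hypothetical, and treating equal consecutive values as acceptable for the "only increasing" rule is an assumption of this sketch.

```scala
object AnomalyRules {
  // Indices of days whose value changed by more than `maxRelativeChange`
  // (for example 0.2 for 20%) compared with the previous day.
  // Days following a value of 0 are skipped, because the relative change is undefined.
  def largeDailyChanges(dailyCounts: Seq[Double], maxRelativeChange: Double): Seq[Int] =
    dailyCounts.zip(dailyCounts.drop(1)).zipWithIndex.collect {
      case ((prev, cur), i) if prev != 0 && math.abs(cur - prev) / math.abs(prev) > maxRelativeChange => i + 1
    }

  // Indices at which a series that should only grow has decreased.
  def decreases(series: Seq[Long]): Seq[Int] =
    series.zip(series.drop(1)).zipWithIndex.collect {
      case ((prev, cur), i) if cur < prev => i + 1
    }
}
```

The returned indices point at the days worth investigating. Whether they indicate a data quality issue or a real outlier event still needs a human look.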

With our data quality dimensions and derived metrics and the two different strategies for automated data quality validation, we now have all the pieces we need to implement validation with a data quality monitoring tool.

Case study: Assessing data quality with deequ

Deequ is a Scala library for data quality validation on large datasets with Spark. It is developed by AWS Labs. Based on our experiences, we recommend the library on the Thoughtworks Tech Radar for organizations to “assess”.

We recently used deequ at the online retailer introduced as Client Three. The data quality gates implemented with deequ prevent bad data from feeding forward to external stakeholders.

The library provides both rule-based checks and anomaly detection. Validation can be implemented with a few lines of code. Here is an example for a rule-based check:

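import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

// `data` is the Spark DataFrame to validate
val verificationResult = VerificationSuite()
  .onData(data)
  .addCheck(Check(CheckLevel.Error, "Testing our data")
    .isUnique("date")) // should not contain duplicates
  .run()

if (verificationResult.status != CheckStatus.Success) {
  println("We found errors in the data:\n")
}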

What is happening here:

We create an instance of the core validation class VerificationSuite. We can chain all operations needed to define our validation as method calls to this object.
We configure the data set we want to run our validation on
We add a uniqueness check as the validation we want to use
We run the validation
We check whether the validation succeeded. If not, we can raise an alert on this failure. In the example, we are just printing an error message, but we could also log a message, trigger our monitoring system, trigger a notification etc.

A validation using anomaly detection can be implemented with just a few extra lines of code:

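import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.analyzers.Size
import com.amazon.deequ.anomalydetection.RelativeRateOfChangeStrategy
import com.amazon.deequ.checks.CheckStatus
import com.amazon.deequ.repository.ResultKey

val verificationResult = VerificationSuite()
  .onData(todaysDataset)
  .useRepository(metricsRepository) // holds the metrics from previous runs
  .saveOrAppendResult(ResultKey(System.currentTimeMillis()))
  .addAnomalyCheck(
    RelativeRateOfChangeStrategy(maxRateDecrease = Some(0.0)),
    Size())
  .run()

if (verificationResult.status != CheckStatus.Success) {
  println("Anomaly detected in the Size() metric!")
}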

Here, the main difference is the addition of a repository. Because anomaly detection compares the current metrics to a previous state, we need to store and access that previous state; the repository handles this. The anomaly detection itself is configured very similarly to the static rule check, by calling “addAnomalyCheck”.
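To make the role of the repository more concrete, here is a conceptual sketch in plain Scala of a metric history and a check on the relative rate of change. It is not deequ's implementation. `MetricHistory`, `SizeAnomaly` and the `minRatio` parameter are hypothetical names, and the in-memory map stands in for whatever storage a real repository would use.

```scala
import scala.collection.immutable.SortedMap

// Keeps one metric value per run, keyed by the timestamp of the run.
final case class MetricHistory(values: SortedMap[Long, Double] = SortedMap.empty[Long, Double]) {
  def saveOrAppend(timestamp: Long, value: Double): MetricHistory =
    copy(values = values + (timestamp -> value))

  // Most recent value recorded strictly before the given timestamp, if any.
  def previous(timestamp: Long): Option[Double] =
    values.filter { case (t, _) => t < timestamp }.lastOption.map(_._2)
}

object SizeAnomaly {
  // Flags the current value when current / previous falls below `minRatio`.
  // Without an earlier value there is nothing to compare against,
  // so no anomaly is reported in that case.
  def isAnomalous(history: MetricHistory, timestamp: Long,
                  current: Double, minRatio: Double): Boolean =
    history.previous(timestamp).exists(prev => prev != 0 && current / prev < minRatio)
}
```

The sketch makes one point visible: the check only becomes meaningful once the history holds at least one earlier run. This is why the metrics of every run need to be saved.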

Deequ provides a lot of different metric analyzers that can be used to assess data quality. They operate on columns of the dataset or the entire dataset itself and can be used for both rule- and anomaly-based validation.

For example, for the completeness dimension, there are analyzers for the completeness of fields and the size of the dataset. For the accuracy dimension, we could use the various statistical analyzers deequ provides to describe the data properties we need.

In our project, we found deequ to be worth exploring further. Some of its strengths are:

Fast execution of rule check and anomaly detection steps
Validation can be implemented with very little code
Lots of metric analyzers to choose from
The library code is fairly easy to understand when you need to dig deeper than the documented examples
Code and documentation are under very active development

However, as of this date, it is not yet a fully mature project ready for any production use case. We found that the documentation is still incomplete for concepts and examples beyond basic usage. This hinders the implementation of more complex data validation. Misunderstandings could even lead to a faulty implementation of your validation, and with it to incorrect data quality assessments. An example we encountered during our work for Client Three was implementing uniqueness checks with composite primary keys. One subkey had a low cardinality, which resulted in issues.

As the documentation seems to be under active development, deequ looks like a promising project to us overall.
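The details of that issue are not reproduced here, but a general pitfall with composite keys can be sketched in plain Scala. Checking each part of the key for uniqueness separately is a different check from checking the combination, and a low-cardinality part will fail the former even on valid data. The `OrderLine` record and the `CompositeKey` object are hypothetical and do not show deequ's behavior.

```scala
// Hypothetical record type whose primary key is (orderId, warehouse).
final case class OrderLine(orderId: Int, warehouse: String)

object CompositeKey {
  // True when every value produced by `key` occurs exactly once.
  def isUnique[A, K](rows: Seq[A])(key: A => K): Boolean =
    rows.map(key).distinct.size == rows.size

  val rows: Seq[OrderLine] = Seq(
    OrderLine(1, "north"),
    OrderLine(1, "south"),
    OrderLine(2, "north")
  )

  // Each part of the key repeats, so per-column checks fail on valid data.
  val warehouseUnique: Boolean = isUnique(rows)(_.warehouse)
  val orderIdUnique: Boolean = isUnique(rows)(_.orderId)

  // The combination of both columns is what actually has to be unique.
  val compositeUnique: Boolean = isUnique(rows)(r => (r.orderId, r.warehouse))
}
```

In this example only the composite check passes, which is the behavior a composite primary key requires.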

Challenges in modern data quality assessment

Tooling, however, is only one of the challenges for effective data quality assessment. I see three other areas as big challenges:

Detecting data quality issues as close to their source as possible. As with software defects, the earlier we can detect data quality issues, the easier and cheaper it is to fix them. In a typical data pipeline, a data point will be combined, aggregated and otherwise transformed several times. Each transformation step multiplies the effort required to detect and trace quality issues. Coordinated data quality gates should therefore be implemented along the entire production pipeline of a data product. Ownership of data quality needs to reside with each data product owner along the pipeline.

Identifying the most impactful data quality issues with relevant validation scenarios. The most impactful data quality issues are those with the biggest effect on the business. Quality gates therefore need to be defined less by what is technically easy to validate, and more by the usage scenarios for the data product. Defining those quality scenarios requires not only a good understanding of the data but, most importantly, a strong understanding of the business domain.

Complementing automated validation with manual validation efficiently. As mentioned above, only some of the desired data quality dimensions can be assessed with automated validation. Depending on the quality scenarios, we might need additional manual validation. Manual validation usually involves more effort and is not as easily repeatable. Therefore, we need to figure out in which cases manual validation is really required and how to integrate it efficiently into the release process for a data product.
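For the first of these challenges, the idea of quality gates along a pipeline can be sketched in plain Scala. Each transformation step is followed by a gate, and the pipeline stops at the first stage whose output fails, so the failure is reported close to its source. All names here (`Pipeline`, `gated`, the stage names and the toy rules) are hypothetical.

```scala
object Pipeline {
  // Runs a transformation step and then its gate. The gate returns a
  // description of each problem found in the step's output.
  def gated[A, B](stage: String, step: A => B, gate: B => Seq[String])(input: A): Either[String, B] = {
    val output = step(input)
    val problems = gate(output)
    if (problems.isEmpty) Right(output)
    else Left(s"$stage: ${problems.mkString("; ")}")
  }

  // A toy two-stage pipeline: drop zero quantities, then sum them up.
  def run(raw: Seq[Int]): Either[String, Int] =
    for {
      cleaned <- gated("clean", (xs: Seq[Int]) => xs.filter(_ != 0),
                   (xs: Seq[Int]) => if (xs.exists(_ < 0)) Seq("negative quantity") else Seq.empty[String])(raw)
      total <- gated("aggregate", (xs: Seq[Int]) => xs.sum,
                   (t: Int) => if (t > 0) Seq.empty[String] else Seq("total must be positive"))(cleaned)
    } yield total
}
```

Because the `Left` value carries the stage name, a failure can be routed to the owner of that stage rather than to the consumer of the final data product.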

Where should you start assessing your data?

In typical organizations with lots of data sets, assessing all of your data products will be overwhelming. To define priorities, you could ask yourself:

Which KPIs are most sensitive to data quality concerns?
Which data that we provide to customers or partners is essential in core business processes?
Which intelligent services are embedded in core business processes?

The data products associated with each answer are what you need to look at first. To figure out how trustworthy these data products are, start by assessing them in their most refined form (right before they are used). This will give you a high level picture of your organization’s most relevant data quality issues. Armed with these insights, you can decide in which areas to focus your data quality improvement approaches.

Overall, data quality assessments are an effective but often overlooked way to make your company’s data products more trustworthy. Detecting and fixing data quality issues could help you reduce costs, increase customer satisfaction, and improve revenue, which will ultimately contribute to your company’s overall performance.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
