
Data quality: the Achilles heel of data products

Data quality is a challenge that needs to be addressed in most data products; failing to do so can have potentially serious consequences. Missing values, for example, can lead to failures in production systems, while incorrect data can lead to the wrong business decisions being made. In machine learning, changes in data distribution can undermine the performance of your models; in the context of recommender systems, this could lead to a poor customer experience and a hit to your revenue. In sectors such as healthcare, the consequences can be far more significant: poor data can lead to misdiagnoses and mistreatment. Prescription errors, for example, are not only costly (the Network for Excellence in Health Innovation estimates $21 billion a year), but are also believed to cause more than 7,000 deaths annually in the US.


Fortunately, data quality frameworks can help us minimize the risks of poor quality data. They not only help us identify issues early, they can also do so in an automated and repeatable way. Although it’s possible to develop your own data quality framework, it can be complex and time-consuming. And with a wealth of open source and commercial frameworks that do a good job of addressing most needs, it makes sense to use what is already available. However, there are a number of different data quality frameworks to choose from. Here, we’ll look at how you can choose a framework that suits your project.


Figuring out which framework will fit the needs of your data product

 

Choosing the data quality framework that best fits your project largely depends on context — where you are now and what you’re trying to accomplish. Below are a few common scenarios which will help demonstrate how you might go about selecting a framework.


Scenario 1: Assessing an inherited legacy data product quickly

 

You’ve inherited a large legacy data product; however, there are no test cases yet and you have little understanding of its quality. You need to assess it quickly and then evolve the tests together with the domain experts.

 

The most important features for this scenario are profiling, automated test creation and a user interface (UI) that enables collaboration with domain experts, speeding up your understanding of both the gaps in testing and the quality of the current data itself.
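To make the profiling step concrete, here is a minimal, framework-agnostic sketch of the kind of per-column summary a profiler produces automatically (tools like Great Expectations and Deequ generate far richer versions). The column names and data are hypothetical.

```python
# Minimal sketch of data profiling: per-column null counts, distinct counts
# and min/max. Column names ("order_id", "amount") are hypothetical.

def profile(rows, columns):
    """Return a per-column summary of null counts, distinct counts and min/max."""
    summary = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v is not None]
        summary[col] = {
            "null_count": len(values) - len(non_null),
            "distinct_count": len(set(non_null)),
            "min": min(non_null) if non_null else None,
            "max": max(non_null) if non_null else None,
        }
    return summary

rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 250.0},
]
print(profile(rows, ["order_id", "amount"]))
```

A profile like this, reviewed with domain experts, is often the fastest way to surface which columns need tests first.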


Scenario 2: Monitoring spikes or drops in time-series data volume

 

You’re building a time-series-based data product — say for example, user analytics events from a mobile application. You need to identify suspicious spikes or drops in the volume of your events data that might indicate data quality issues.

 

The most important feature for this scenario is anomaly detection on data volumes over time.
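As an illustration of the underlying idea, here is a simple rolling z-score sketch that flags a day whose event count deviates sharply from the trailing window. Frameworks such as Deequ ship more robust anomaly detection strategies out of the box; the window, threshold and counts below are illustrative assumptions.

```python
# Sketch of volume anomaly detection: flag day i if its event count deviates
# from the trailing-window mean by more than `threshold` standard deviations.
from statistics import mean, stdev

def volume_anomalies(counts, window=7, threshold=3.0):
    """Return the indices of days whose volume looks anomalous."""
    anomalies = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(counts[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies

daily_counts = [1000, 1020, 990, 1010, 1005, 995, 1015, 4800, 1000]
print(volume_anomalies(daily_counts))  # the spike on day 7 is flagged
```

In practice you would also want seasonality handling (weekday vs. weekend volumes), which is exactly where a framework's built-in detectors earn their keep.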


Scenario 3: Daily alerts that data quality standards are being violated

 

You have a data product that contains complex data points with lots of attributes that change every day (a daily incremental import from a CRM, for example). This means you need to check every day that your predefined data quality standards are being met — that required attributes are not null and that all values fall within valid ranges.

 

The most important features for this scenario are constraint tests (rule-based tests) and a Slack integration that can provide an alert when standards are being violated. 
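As a sketch of what such rule-based checks look like, here is a small set of constraints in SodaCL (Soda's checks language). The table name `orders` and its columns are hypothetical, and the Slack alert routing itself is configured separately in Soda Cloud rather than in the checks file.

```yaml
# SodaCL checks for a hypothetical "orders" table: not-null and
# valid-range constraints of the kind Scenario 3 calls for.
checks for orders:
  - missing_count(customer_id) = 0
  - invalid_count(status) = 0:
      valid values: [open, shipped, closed]
  - invalid_count(amount) = 0:
      valid min: 0
```

Great Expectations expresses equivalent constraints as expectations (e.g. not-null and between-range checks on a column), so this scenario is well covered by either framework.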


Scenario 4: Domain-specific quality standards

 

In some domains, you need your data product to meet specific business rules. For example, in a data product dealing with sales data, there might be a requirement that a sale date should never be a future date.

 

The most important feature for this scenario is the ability to write custom constraints for your quality checks. 
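To show the idea framework-free, here is a minimal sketch of the "no future sale dates" rule as a custom constraint; frameworks expose hooks for exactly this kind of rule (custom expectations in Great Expectations, `Check.satisfies` in Deequ). The column names and dates are hypothetical.

```python
# Sketch of a custom, domain-specific constraint: a sale date must never
# be in the future. Returns the violating rows so they can be reported.
from datetime import date

def no_future_sale_dates(rows, today=None):
    """Return the rows that violate the 'no future sale date' rule."""
    today = today or date.today()
    return [r for r in rows if r["sale_date"] > today]

rows = [
    {"sale_id": 1, "sale_date": date(2023, 1, 5)},
    {"sale_id": 2, "sale_date": date(2099, 1, 1)},  # violates the rule
]
violations = no_future_sale_dates(rows, today=date(2023, 6, 1))
print([r["sale_id"] for r in violations])  # → [2]
```

The key feature to look for is that the framework lets you plug in arbitrary predicates like this one, not just its built-in constraint catalogue.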


Scenario 5: Discoverable data quality for a large data organization

 

You are in an organization that has multiple data teams organized around data products. You want to make sure consumers of each data product are informed about its quality before they start using it for their analytics use cases, so that downstream consumers can easily determine whether current quality levels are sufficient for what they want to build. Different types of use cases may demand different levels of quality.

 

The most important feature for this scenario is ease of data catalog integration to enable automated publication of data quality check results into the catalog.


Feature matrix

 

There are many open-source data quality frameworks that are very good – we’ve had positive experiences in client projects with Great Expectations, Deequ and Soda Core. They can all help you implement data quality tests through a range of features. Depending on the level of integration you require, some key features to consider are captured below:

 

| Feature | Great Expectations | Deequ | Soda |
| --- | --- | --- | --- |
| **Core Features** | | | |
| Licensing | Open-source | Open-source | Soda Core: open-source; Soda Cloud: commercial |
| Out-of-the-box constraint-based tests for common constraints | Yes | Yes | Yes |
| Primary language of development | Python | Python/Scala | Soda Checks Language (SodaCL), in Python |
| Hard dependency on any frameworks | No | Yes: hard dependency on Spark, as it cannot be executed outside of a Spark cluster | No |
| Quality metrics visualization | Partial | No | Yes: in-built integration with popular dashboarding tools, plus a Reporting API on Soda Cloud |
| Support for data validation of incremental data | Yes | Yes | Yes, via the Soda Cloud metrics store |
| Stateful metrics computation | No | Yes (see "Stateful Metrics Computation" in awslabs/deequ on GitHub) | Yes, via the Soda Cloud metrics store |
| **Automation Features** | | | |
| Data profiling | Yes | Yes: richer in-built profiling functions than the other two, and customizable | Yes |
| Test creation | Yes | Partial | No |
| Constraint suggestion | No | Yes | No |
| Anomaly detection | No | Yes | Yes, via Soda Cloud |
| **In-built Integrations** | | | |
| Orchestration solutions | Yes (GreatExpectationsOperator) | No | Yes, via in-built integrations |
| Apache Spark | Yes | Yes | Yes, via Soda Spark |
| Slack alerting | Yes | No | Yes |
| Ticketing | No | No | Yes, via in-built integrations |
| CI/CD | Yes; integrates with GitHub Actions | No; you need to write a custom step within a CI/CD pipeline and use the test results | No; you need to write a custom step (a Soda scan) using the Soda CLI within a CI/CD pipeline and use the result of the scan |
| Data catalog | No | No | Yes, via in-built integrations |

What is the right framework for your project?

 

The scenarios you define for your use case (including the ones discussed above) will help you identify the features to compare across data quality frameworks.

 

For example, in scenario 3, we need to create an alert whenever data quality standards are violated. We identified that this means we require constraint-based tests and a Slack integration. Looking at the feature matrix, we can see that all three evaluated frameworks provide constraint-based tests; Great Expectations and Soda also provide integrated Slack alerting.

 

In scenario 2, where we need to monitor volume spikes and drops in analytics event data, we identified that anomaly detection is particularly important. In our feature matrix, we can see that Deequ provides solid support for anomaly detection, as does the commercial version of Soda. 

 

In summary, by starting from your data product's quality criteria, you can use the feature matrix above to decide which framework is best suited to your context and needs.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
