Your last Big Data investment has most probably run into a data quality wall, yet you've managed to declare it a success, all while knowing it could have added far more value. Don't worry; you're not alone in this (mis)adventure.
Most companies manage to utilize only about 3% of their data when investing in Big Data analytics that lean heavily on data integration technologies. Let's take a moment to discuss the biggest cause of this abysmal utilization: data quality issues. What is the nature of these issues, and how many types are there? What challenges do they cause? And can these data quality issues be prevented from arising in the first place?
We begin with the fundamental nature of Big Data: the diversity of data sources, especially in an enterprise context, is phenomenal. Data arrives in different types, with varying levels of complexity and structure, which more often than not complicates processes and practices down the line. Data integration is one such process, and it in turn affects data analytics and therefore the quality of downstream applications.
Ensuring data quality is one of the most powerful ways to get the most out of these Big Data investments, and for the best results it has to be maintained consistently throughout every stage of the data journey:
5 V's of big data
Evaluating data accuracy as a starting point requires an understanding of where data lives and a way of combining it that is consistent across different (and often silo-ed) data sources.
Organizations such as GlaxoSmithKline, GE, and Toyota are known to follow these steps when capturing data:
Automating data entry
Selective manual entry options
Great user interface and design principles
Instant data validation (for entered data)
Verification of source-to-target mapping
Rule-guided ingestion of only good data
Monitoring bad or noisy data
Fixing data issues close to the source
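The "instant data validation" step above can be sketched in a few lines. This is a minimal illustration, not any particular vendor's implementation; the field names and rules are assumptions chosen for the example.

```python
import re

# Hypothetical rule set for entered records; real deployments would load
# these rules from configuration rather than hard-coding them.
RULES = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "age": lambda v: v.isdigit() and 0 < int(v) < 120,
    "country": lambda v: v.strip() != "",
}

def validate_record(record: dict) -> list:
    """Return (field, value) pairs that fail their rule; empty list means OK."""
    errors = []
    for field, rule in RULES.items():
        value = record.get(field, "")
        if not rule(value):
            errors.append((field, value))
    return errors
```

Running the check at entry time, before the record is persisted, is what keeps the fix "close to the source": the person (or system) that produced the value is still available to correct it.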
For the errors that still persist, here are a few suggestions on handling such corrupt data:
Accept the error if the value falls within an acceptable standard. For example, accept the response 'Men's Salon' or 'Unisex Salon' to the question 'Where do you work?' instead of insisting on 'Salon.'
Reject the error, particularly during data imports, when the information is so severely damaged or incorrect that it makes more sense to delete the entry than to try to correct it. Transcripts of call-center interactions, which are usually highly unstructured, are one example.
Correct the error in cases of misspellings (of names, for instance) or similar issues. If there are variations of a name, you can set one as the 'Master' and use the consolidated record to correct the name across all datasets.
Create a default value when you don't know the value. It's better to store something like 'unknown' or 'N/A' than nothing at all.
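The four strategies above can be combined into a single cleaning step. The sketch below is illustrative only; the variant table, the allowed set, and the default token are all assumptions for the example.

```python
# Hypothetical master table mapping known variations to one canonical value
# (the "correct" strategy from the list above).
KNOWN_VARIANTS = {"Jon Smith": "John Smith", "J. Smith": "John Smith"}

def handle_value(value, allowed=None):
    """Return a cleaned value, or None to signal the record should be rejected."""
    if value is None or str(value).strip() == "":
        return "unknown"                  # create a default for missing values
    value = str(value).strip()
    if value in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[value]      # correct a known variation
    if allowed is not None and value not in allowed:
        return None                       # reject values outside the standard
    return value                          # accept as-is
```

Note the asymmetry: "reject" returns a sentinel so the caller can drop the whole record, while the other three strategies always yield a usable value.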
For large organizations, traditional approaches are unsuitable for handling massive data volumes and variety. When dealing with errors in enterprise-level data, members of the global data community commonly rely on this checklist of the six primary dimensions for data quality assessment:
Six dimensions for data quality assessment
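Some of these dimensions can be scored mechanically. The sketch below computes three of them (completeness, uniqueness, validity) over a list of records; the field names and the validity rule are assumptions for illustration, and the remaining dimensions (accuracy, consistency, timeliness) typically need reference data or timestamps that a snapshot alone doesn't carry.

```python
def quality_scores(records, key_field, checked_field, is_valid):
    """Score completeness, uniqueness, and validity as fractions in [0, 1]."""
    n = len(records)
    # Completeness: share of records where the checked field is populated.
    non_missing = sum(1 for r in records if r.get(checked_field) not in (None, ""))
    # Uniqueness: share of records with a distinct key.
    distinct_keys = len({r.get(key_field) for r in records})
    # Validity: share of records whose checked field passes the supplied rule.
    valid = sum(1 for r in records if is_valid(r.get(checked_field)))
    return {
        "completeness": non_missing / n,
        "uniqueness": distinct_keys / n,
        "validity": valid / n,
    }
```

Tracking these fractions over time, rather than as a one-off audit, is what turns the checklist into an ongoing assessment.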
Finally, armed with data from various sources, we need to curate the data before we combine it. The ICPSR states that “Through the curation process, data are organized, described, cleaned, enhanced, and preserved for public use, much like the work done on paintings or rare books to make the works accessible to the public now and in the future.”
At the data curation stage, organizations may want to reduce their dependency on human intervention. To do so, they can apply ML to better understand consumers, and use AI and deep learning to recognize engagement and buying patterns, feeding what is learned back into the algorithms so that they keep improving.
The data curation space has a few popular tools. One is Tamr, which takes a bottom-up, ML-driven approach to unifying disparate, dirty datasets; the platform's algorithms are claimed to automate as much as 90% of the decisions taken. Another is drunken-data-quality, a small library that checks constraints on Spark data structures and can assure a certain level of data quality, especially for continuous data imports.
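To make the constraint-checking idea concrete: drunken-data-quality itself is a Scala library for Spark, so the snippet below is a loose plain-Python analogue of the concept, not the library's actual API. The constraint names and the fluent style are assumptions for illustration.

```python
class Check:
    """Collects constraint violations over a list of row dicts."""

    def __init__(self, rows):
        self.rows = rows
        self.failures = []

    def has_unique_key(self, column):
        values = [r.get(column) for r in self.rows]
        if len(values) != len(set(values)):
            self.failures.append(f"{column} is not unique")
        return self

    def is_never_null(self, column):
        if any(r.get(column) is None for r in self.rows):
            self.failures.append(f"{column} contains nulls")
        return self

    def run(self):
        return self.failures
```

Wiring such a check into every scheduled import means a constraint violation surfaces on the day it appears, rather than months later during analysis.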
Alongside these tools, here are a few helpful hints that should be observed at this curation stage:
Focus on data usage, i.e., try to make data producers consume their own data (for dashboards or automated KPIs), because producers who use their own data are inclined to notice bad quality. Furthermore, a good practice is to automatically delete data (with a warning) that no one has used for an extended period, reducing the amount of data that requires quality checks.
Appoint a clear data owner for each data stream or data set. This person or team vouches for the data's correctness. If a data set has no owner, delete it.
Do not curate all data into a central place; instead, consider local data lake 'shores' or data marts that provide use-case-specific curation.
In conclusion, while the amount of time and energy spent cleaning up 'dirty and disconnected' data is excessive, it's of paramount importance because it affects the analysis that actually gives organizations their actionable insights. To make better use of data analysts' time, data quality needs to be addressed right at the beginning. This builds an atmosphere of trust in organizational data, its analysis, and the insights that follow.