Data science & engineering | Technology

How much can you trust your data?

Ellen König

Published: Jul 23, 2020

Data is the fuel for intelligent decision making for both humans and machines. Just like high quality fuel ensures that jet engines run efficiently and reliably in the long run, high quality data fuels effective and reliable decision making. 

Whether it is for decisions taken by corporate executives, frontline staff or intelligent machine learning models, any intelligent enterprise needs high quality data to operate. But unfortunately, data quality issues are very widespread. In a survey conducted by O’Reilly Media in November 2019, only 10 percent of responding companies stated that they do not face data quality problems.

Why does data quality matter so much?

Let’s have a look at three typical data case studies from different ThoughtWorks engagements: 
  • Corporación Favorita, a large Ecuadorian-based grocery retailer, needs to predict how much of a given product will sell in the future, based on historical data. (ThoughtWorkers participated in the linked Kaggle competition.)
  • A large German automotive company, Client Two, needs a product information system that allows their clients to configure the car they want to buy.
  • A large online retailer, Client Three, needs dashboards to track sales and logistics KPIs for their products.
Each of these cases depends on the data involved being of high quality. In the first case study, incomplete or unreliable data will lead to untrustworthy sales predictions, resulting in poor stocking and pricing decisions. 

For Client Two, a mismatch between the data in the product information system and what the factories can currently build can result in desired car configurations mistakenly not being offered, or in customers ordering cars that cannot be produced. That leads to lost sales, customer frustration, and possibly legal claims. 

And for Client Three, poor data quality will lead to company executives, sales managers, and logistics managers drawing incorrect conclusions about the state of the company’s operations. This could result in reduced customer satisfaction, loss of revenue, increased costs, or misdirected investments.

In all of these cases, low data quality leads to poor business decisions being taken, resulting in undesirable business outcomes such as decreased revenue, customer dissatisfaction and increased costs. Gartner reported in 2018 that surveyed organizations believed they, on average, lost $15 million per year due to data quality issues.

Efforts to address data quality can therefore directly help make companies more effective and profitable.

How good is your company’s data quality?

In a modern business, everyone works with data in one way or another, be it producing, managing or using it. Yet, like water for fish, we often fail to notice data because it is all around us; and just as fish suffer from bad water quality, we suffer when our data quality decreases. 

Unlike the fish, though, we can all contribute to addressing data quality issues, and that process starts with assessing the current state of our data quality.

Making data quality measurable

Loosely following David Garvin’s widely referenced definition of quality in “Managing Quality” (1988), we can distinguish between three perspectives on data quality: 

Data consumers: Usage perspective
  • Does our data meet our consumers’ expectations?
  • Does our data satisfy the requirements of its usage?
Business: Value perspective 
  • How much value are we getting out of our data?
  • How much are we willing to invest into our data?
Engineering: Standards-based perspective
  • To which degree does our data fulfill specifications?
  • How accurate, complete, and timely is our data? 
To make these perspectives more tangible, we can define data quality dimensions for each of these perspectives. A data quality dimension can be understood as “a set of data quality attributes that represent a single aspect or construct of data quality”. For example, a dimension associated with the usage perspective could be the “relevance” of the data, for the value perspective the “value added” by a data product and for the standards-based perspective the “completeness” of data points.

[Diagram: data quality perspectives, dimensions, and metrics]
Based on the dimensions, we can create specific metrics to measure the quality for our chosen dimensions. Once we know how good our data quality is for those dimensions, we can design specific improvement strategies for each dimension.
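To make this concrete, here is a minimal, tool-agnostic sketch in Python of turning the “completeness” dimension into a measurable metric; the record structure and field names are illustrative assumptions, not from any specific project:

```python
# Sketch: deriving a measurable metric from the "completeness" dimension.
# The records and the field name are illustrative assumptions.

def completeness(records, field):
    """Fraction of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

parts = [
    {"id": 1, "part_description": "brake pad"},
    {"id": 2, "part_description": ""},          # missing description
    {"id": 3, "part_description": "spark plug"},
]

print(completeness(parts, "part_description"))  # 2 of 3 records are complete
```

A metric like this can then be tracked over time and compared against a threshold that defines “good enough” for the usage scenario at hand.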

Automating the assessment of data quality 

Assessing data quality can be a labor intensive and costly process. Some data quality dimensions used in practice can only be assessed with expert human judgement, but many others can be automated with a little effort. An early investment in automating data quality monitoring can pay continuing dividends over time.

Dimensions that can be measured at data point level include the accuracy of values and the completeness of field values. At dataset level, they include completeness of the data set, uniqueness of data points, and the timeliness of data.

Dimensions that require human judgement usually need additional context or a subjective value judgement to assess. Examples of such dimensions are interpretability, ease of understanding, and security.

For those dimensions that we can assess automatically, we can make use of two different validation strategies: rule-based checks and anomaly detection.

Rule-based checks work well whenever we can define absolute reference points for quality. They are used for conditions that must be met in any case for data to be valid. If these constraints are violated, we know we have a data quality issue. 

Examples at the data point level are: 
  • Part description must not be empty
  • Opening hours per day must be between 0 and 24
Examples on the dataset level are:
  • There must be exactly 85 unique shops in the dataset
  • All categories must be unique
  • There must be at least 700,000 data points in the dataset
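As a tool-agnostic illustration, rules like the ones above can be expressed as simple predicates over the data. This Python sketch assumes an in-memory list of records; the field names and the expected shop count follow the examples in the text:

```python
# Sketch of rule-based checks: hard constraints that valid data must satisfy.
# Field names and the expected shop count follow the examples in the text.

def check_rules(dataset):
    """Return a list of violation messages; an empty list means all rules pass."""
    violations = []
    for row in dataset:
        if not row["part_description"]:
            violations.append(f"row {row['id']}: part description must not be empty")
        if not 0 <= row["opening_hours"] <= 24:
            violations.append(f"row {row['id']}: opening hours must be between 0 and 24")
    unique_shops = {row["shop"] for row in dataset}
    if len(unique_shops) != 85:
        violations.append(f"expected exactly 85 unique shops, found {len(unique_shops)}")
    return violations

rows = [
    {"id": 1, "part_description": "brake pad", "opening_hours": 8, "shop": "A"},
    {"id": 2, "part_description": "", "opening_hours": 25, "shop": "A"},
]
for violation in check_rules(rows):
    print(violation)
```

Any non-empty result signals a definite data quality issue, which is what makes rule-based checks suitable as hard quality gates.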

Anomaly detection, defined as “the identification of rare items, events or observations which raise suspicions” (Wikipedia), works well whenever we can define data quality relative to other data points. It is often used for detecting spikes and drops in time series of metrics data. 

An identified anomaly only tells us that there might be something wrong with the data: it might arise from a data quality issue, or it might reflect a genuine outlier event recorded in the dataset. A detected anomaly should therefore be treated as a starting point for investigating what happened.

Examples of anomaly-based validation constraints are:
  • The number of transactions should not change more than 20% for each day
  • The number of car parts on offer should only be increasing over time
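The first constraint can be sketched as a relative-rate-of-change check over a daily time series. This Python sketch mirrors the idea rather than any particular library’s API; the 20% threshold and the counts are illustrative:

```python
# Sketch of anomaly detection via relative rate of change:
# flag any day whose metric changed by more than `max_change` vs. the previous day.

def rate_of_change_anomalies(daily_counts, max_change=0.2):
    anomalies = []
    for i in range(1, len(daily_counts)):
        prev, curr = daily_counts[i - 1], daily_counts[i]
        if prev == 0:
            continue  # relative change is undefined for a zero baseline
        if abs(curr - prev) / prev > max_change:
            anomalies.append(i)  # index of the suspicious day
    return anomalies

# The drop on day 2 and the rebound on day 3 both exceed the 20% threshold.
counts = [980, 1000, 400, 990]
print(rate_of_change_anomalies(counts))  # [2, 3]
```

Each flagged index is an investigation point, not a verdict: the drop could be a pipeline failure or a genuine one-off event.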
With our data quality dimensions, the metrics derived from them, and the two strategies for automated validation, we now have all the pieces we need to implement validation with a data quality monitoring tool.

Case study: Assessing data quality with deequ

Deequ is a Scala library for data quality validation on large datasets with Spark, developed by AWS Labs. Based on our experience, the ThoughtWorks Tech Radar recommends organizations “assess” the library.

We recently used deequ at the online retailer introduced as Client Three. The data quality gates implemented with deequ prevent bad data from feeding forward to external stakeholders.

The library provides both rule-based checks and anomaly detection. Validation can be implemented in a few lines of code. Here is an example of a rule-based check:

val verificationResult = VerificationSuite()
  .onData(data)
  .addCheck(Check(CheckLevel.Error, "Testing our data")
    .isUnique("date")) // should not contain duplicates
  .run()

if (verificationResult.status != CheckStatus.Success) {
  println("We found errors in the data:\n")
}

What is happening here: 
  1. We create an instance of the core validation class VerificationSuite. We can chain all operations needed to define our validation as method calls on this object. 
  2. We configure the dataset we want to run our validation on.
  3. We add a uniqueness check as the validation we want to use.
  4. We run the validation.
  5. We check whether the validation succeeded. If not, we can raise an alert on the failure. In the example, we just print an error message, but we could also log a message, trigger our monitoring system, send a notification, etc.

A validation using anomaly detection can be implemented with just a few extra lines of code:

val verificationResult = VerificationSuite()
  .onData(todaysDataset)
  .useRepository(metricsRepository)
  .saveOrAppendResult(ResultKey(System.currentTimeMillis()))
  .addAnomalyCheck(
    RelativeRateOfChangeStrategy(maxRateDecrease = Some(0)),
    Size())
  .run()

if (verificationResult.status != CheckStatus.Success) {
  println("Anomaly detected in the Size() metric!")
}

Here, the main difference is the addition of a repository. As anomaly detection involves comparing the current metrics to a previous state, we need to store and access that previous state; this is handled by the repository. The anomaly detection itself is configured very similarly to the static rule check, by calling addAnomalyCheck.

Deequ provides a lot of different metric analyzers that can be used to assess data quality. They operate on columns of the dataset or the entire dataset itself and can be used for both rule- and anomaly-based validation.

For example, for the completeness dimension, there are analyzers to analyze the completeness of fields and the size of the dataset. For the accuracy dimension, we could use the various statistical analyzers deequ provides and describe the data properties we need.

In our project, we found deequ to be worth exploring further. Some of its strengths are: 
  • Fast execution of rule checks and anomaly detection steps
  • Validation can be implemented with very little code
  • Lots of metric analyzers to choose from
  • The library code is fairly easy to understand when you need to dig deeper than the documented examples
  • Code and documentation are under very active development
However, as of this writing, it is not yet a fully mature project ready for every production use case. We found that the documentation is still incomplete for concepts and examples beyond basic usage, which hinders the implementation of more complex data validation. It can even lead to a faulty implementation of your validation due to misunderstandings, resulting in incorrect data quality assessments. One example we encountered in our work for Client Three was implementing uniqueness checks with composite primary keys, where a subkey with low cardinality caused issues.

As the documentation is under active development, deequ nevertheless looks to us like a promising project overall.

Challenges in modern data quality assessment

Tooling, however, is only one of the challenges for effective data quality assessment. I see three other areas as big challenges:

Detecting data quality issues as close to their source as possible. As with software defects, the earlier we detect data quality issues, the easier and cheaper they are to fix. In a typical data pipeline, a data point will be combined, aggregated and otherwise transformed several times, and each transformation step multiplies the effort required to detect and trace quality issues. Coordinated data quality gates should therefore be implemented along the entire production pipeline of a data product, and ownership of data quality needs to reside with each data product owner along the pipeline. 

Identifying the most impactful data quality issues with relevant validation scenarios. The most impactful data quality issues are those with the biggest effect on the business. Quality gates therefore need to be defined less by what is technically easy to validate and more by the usage scenarios for the data product. Defining those quality scenarios requires not only a good understanding of the data but, most importantly, a strong understanding of the business domain. 

Complementing automated validation with manual validation efficiently. As mentioned above, only some of the desired data quality dimensions can be assessed with automated validation. Depending on the quality scenarios, we might need additional manual validation. Manual validation usually involves more effort and is not as easily repeatable. Therefore, we need to figure out in which cases manual validation is really required and how to integrate it efficiently into the release process for a data product.

Where should you start assessing your data?

In typical organizations with lots of data sets, assessing all of your data products will be overwhelming. To define priorities, you could ask yourself: 
  • Which KPIs are most sensitive to data quality concerns? 
  • Which data that we provide to customers or partners is essential in core business processes?
  • Which intelligent services are embedded in core business processes?
The data products associated with each answer are the ones to look at first. To figure out how trustworthy these data products are, start by assessing them in their most refined form (right before they are used). This will give you a high-level picture of your organization’s most relevant data quality issues. Armed with these insights, you can decide where to focus your data quality improvement efforts. 

Overall, data quality assessments are an effective, but often overlooked way to make your company’s data products more trustworthy. Detecting and fixing data quality issues could help you reduce costs, increase customer satisfaction, and improve revenue, which will ultimately contribute to your company’s overall performance.
