The curse of the data lake monster

Kiran Prakash
Lucy Chambers

Published: Feb 11, 2019

Artificial intelligence and machine learning are currently all the rage. Every organization is trying to jump on this bandwagon and cash in on their data reserves. At ThoughtWorks, we’d agree that this tech has huge potential — but as with all things, realizing value depends on understanding how best to use it.

We’re often approached by clients who want to jumpstart their AI initiatives by building a data lake. Often, this plan is seen purely as an infrastructure effort, without clearly defined use cases. The assumption is: “If we build a robust data infrastructure, use cases will present themselves later.”

In this post, we argue that software is best developed in thin, vertical slices that emphasize use cases and user outcomes, and data-intensive projects are no exception. When it comes to use cases that rely on rich, multi-format data, it can be tempting to start by creating a horizontal data platform layer, sometimes called a data lake. We are going to explore examples of how product thinking can apply to a project where a data lake is being considered as a solution.

What are data lakes anyway?

When we hear the term data lake, it usually implies:
  • A centralized repository of data (operational, customer related, event streams, etc.) with proper documentation and fine-grained access control.
  • Something built and maintained by data engineers so that data scientists can consume data and focus on developing ML use cases etc.
The term itself is often used in a very broad sense. Sometimes a distinction is drawn between data lakes, which are meant to hold only raw data, and lake shore marts, which hold processed representations specific to a bounded context (a business function) and are used for further analysis. Sometimes “data lake” is used as a catch-all term to describe both (and maybe other things too).

This imprecise definition often leads to unplanned expansion in scope, budget overruns, and over-engineering. An Amazon S3 bucket with properly configured access permissions may be all you need as data lake infrastructure for storing raw data. It’s important to define and establish a shared understanding of what “data lake” means for your organization at the outset.
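To make the “just a bucket with permissions” idea concrete, here is a minimal sketch of what such a setup might involve: a bucket policy that gives one role read-only access to one raw-data prefix. The bucket name, prefix, account ID, and role name are all hypothetical, and a real setup would need to be checked against your own security requirements.

```python
import json


def raw_zone_read_policy(bucket: str, prefix: str, reader_role_arn: str) -> dict:
    """Build an S3 bucket policy granting one role read-only access to one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing, but only for keys under the given prefix.
                "Sid": "ListRawPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": reader_role_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Allow reading objects under the prefix; no write or delete.
                "Sid": "ReadRawObjects",
                "Effect": "Allow",
                "Principal": {"AWS": reader_role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical names, for illustration only.
policy = raw_zone_read_policy(
    bucket="acme-raw-data",
    prefix="claims",
    reader_role_arn="arn:aws:iam::123456789012:role/fraud-data-science",
)
print(json.dumps(policy, indent=2))
```

The point of the sketch is scale: a raw-data store with access control can be a few dozen lines of configuration rather than a multi-month platform project.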

Most big organizations would benefit from limiting the scope of their data lakes to storing only raw data, and from setting up cross-functional product teams to develop ML applications using their own representations (lake shore marts) specific to their use case. Quoting from Martin Fowler’s post on data lakes:

“A single unified data model is impractical for anything but the smallest organizations. To model even a slightly complex domain you need multiple bounded contexts, each with its own data model.”

We’ll see an example of how such a setup could look a bit later in this post.

Build it; they will come

Data lakes seem particularly prone to “build it; they will come” mentalities. There are several possible reasons for this.

Often, data scientists and data engineers are part of different teams. Data scientists are commonly aligned more closely to the business and data engineers closer to infrastructure and IT. This can give rise to siloed thinking and the tendency to consider a data lake purely as an infrastructure problem. Conway’s Law strikes again.

Next, pinning down the specific use cases and value stream that the data lake will address is a hard problem. It involves talking to users and aligning multiple parties on a common goal. It’s tempting to substitute this hard problem with a more tangible one that is purely technical.

Lastly, the common argument we hear for the “build it; they will come” approach is that data lakes should support a wide array of use cases and shouldn’t be constrained by a particular one. We agree with the premise that a data lake should support more than one use case. We’re just arguing against too much up-front architecture and design happening before any use cases are considered.

Pitfalls

Designing a data lake in a top-down fashion, without an eye on the end use cases, will almost inevitably result in a poor problem/solution fit.

Without actually using the data to develop models and seeing them work in the real world, without learning and iterating on the feedback, it’s very hard to tell what the optimal representation of your solution is.

In reality, about 70–80% of the effort in building an ML application goes into cleaning the data and representing it in a format specific to the use case. It makes little sense to put a lot of effort into data preprocessing without knowing how the data will be used to build ML models. The people building the model will likely have to invest the effort again to do similar work.

Let’s illustrate this with a hypothetical example:

An insurance firm plans to build a data lake which will revolutionize how its data scientists and BI analysts access and analyze data and generate insights for the company. The firm has grown over the years through mergers and acquisitions of many smaller insurance companies, so it doesn’t have a single, consolidated view of its customers. Instead, its customer details are fragmented among maybe 20 different subsystems based on the product line (health insurance, vehicle insurance, pet insurance, etc.).

Now, as part of its data lake initiative, it wants to create a consolidated view of the customer. The company spends months coming up with a single definition of the customer which works for all possible future use cases, and is only moderately happy with the result at the end. You can see why this is a hard thing to do.
  1. There probably is no single definition of the customer which works equally well for all future use cases.
  2. Fetching customer information from 20 disparate systems, then matching and deduplicating the records, is a non-trivial task. This involves massive coordination among different product teams.
Not knowing how this consolidated view of the customer will be used will make this task even harder and open-ended. It’s not possible for the company to prioritize its work and make informed decisions when trade-offs are involved.
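A toy sketch can illustrate why no single definition of “the same customer” falls out of the data by itself. The records, field names, and matching rules below are hypothetical; real entity resolution is considerably more involved. The same four records are grouped under two plausible matching rules, and each rule merges a different set of records.

```python
from collections import defaultdict

# Hypothetical customer records from three of the product-line systems.
RECORDS = [
    {"system": "health", "name": "Ana Diaz", "email": "ana@example.com", "postcode": "10115"},
    {"system": "vehicle", "name": "Ana Diaz", "email": "a.diaz@example.com", "postcode": "10115"},
    {"system": "pet", "name": "A. Diaz", "email": "ana@example.com", "postcode": "20095"},
    {"system": "vehicle", "name": "Ben Okoye", "email": "ben@example.com", "postcode": "10115"},
]


def group_by(records, key_fn):
    """Group records by a matching key; return {key: [source systems]}."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record["system"])
    return dict(groups)


# Rule 1: two records are the same customer if the email matches.
by_email = group_by(RECORDS, lambda r: r["email"].lower())

# Rule 2: two records are the same customer if name and postcode match.
by_name_postcode = group_by(RECORDS, lambda r: (r["name"].lower(), r["postcode"]))

print(by_email)
print(by_name_postcode)
```

Under the email rule, the health and pet records are merged; under the name-and-postcode rule, the health and vehicle records are merged instead. Which rule is “right” depends on what the consolidated view will be used for, which is exactly the information missing when the work is done up front.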

Some data lake initiatives are even vaguer and more complex than the above example. They aim to consolidate all business entities and events into a central data infrastructure.

If users don’t take care early in the process to ensure that the data in the data lake is actually used, there’s a real risk that the lake becomes a data swamp: basically a dumping ground for data of varying quality. Data swamps cost a lot to maintain and deliver little value to the organization.

Product thinking for data lakes

We propose a more bottom-up approach to realizing the data lake — one that builds one vertical slice at a time. Let’s see how this could look with the above insurance company:

The company starts with the following initial set of use cases:
  • Identifying fraudulent claims so that they can select claims for deeper manual investigation; they have a business goal of reducing fraud by 5% this year.
  • Predicting weather patterns so that they can advise customers to protect their vehicles by bringing them inside when there’s a high chance of storms — thereby reducing vehicle damage claims by 2%.
  • Upselling other insurance products to the customer based on the products they already have. The goal is to increase the conversion rate for online upselling by 3%.
The insurance company sets about the project as follows.

Before the start

There’s a high-level architecture in place and a governance structure that covers documentation standards, guidelines for the required specificity of data, backward compatibility, versioning, discoverability, etc.

A few upfront technical decisions are made at this point: for instance, whether to go on-premises or into the cloud, which cloud provider to use, and which data store to use. These decisions are harder to reverse later in the project, so they should be few in number.

There’s an architectural team in place which ensures the fidelity of the architecture and observance of the governance structures while the platform evolves and applications are built on top of it.

There’s a cross-functional delivery team, made up of product owners from the business, data engineers, and data scientists, to productionize the use cases.

Working through use-cases

The project team knows it has to start small, so it picks the fraud detection use case as the first vertical slice. The team knows that health and vehicle insurance account for the majority of claims, so it decides to focus on just these verticals initially. Raw customer and claims data from these two verticals is pulled into the data lake. It’s cleaned, aggregated, and represented in a data mart specific to this use case. The next step is to build fraud detection models using this data mart and put them into production.
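As a rough sketch of this first slice, the step from raw data to a use-case-specific mart might look like the following. The field names, sample rows, and chosen aggregates are hypothetical; a real fraud model would need far richer features.

```python
from collections import defaultdict

# Hypothetical raw claims, stored as received from the two source verticals.
RAW_CLAIMS = [
    {"vertical": "health", "customer_id": "c1", "amount": "120.50"},
    {"vertical": "vehicle", "customer_id": "c1", "amount": "980.00"},
    {"vertical": "vehicle", "customer_id": "c2", "amount": ""},  # missing amount
    {"vertical": "health", "customer_id": "c2", "amount": "60.00"},
]


def build_fraud_mart(raw_claims):
    """Clean raw claims and aggregate them per customer for the fraud use case."""
    mart = defaultdict(lambda: {"claim_count": 0, "total_amount": 0.0})
    for claim in raw_claims:
        if not claim["amount"]:
            continue  # cleaning rule chosen for this use case: skip unusable rows
        row = mart[claim["customer_id"]]
        row["claim_count"] += 1
        row["total_amount"] += float(claim["amount"])
    return dict(mart)


fraud_mart = build_fraud_mart(RAW_CLAIMS)
print(fraud_mart)
```

The raw data stays untouched in the lake; the cleaning rule and the aggregates live with the use case, so they can change as the team learns what the model actually needs.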



Now that the first version of a fraud detection model is in production, the team observes that it could improve the model with additional fields which aren’t currently collected. The data scientists who uncovered this are working closely with the data engineers so that this feedback can be acted on quickly. Together they swiftly figure out how to collect these new fields and adapt the model. The new model is significantly better than the first.

This approach uses just two of the 20 data sources, saving the team a lot of potentially wasted effort and delivering more effectively towards the 5% fraud-reduction target.

Now that the fraud detection model is up and running and generating value, the company can focus on leveraging customer data for the upselling use case. For this, it adds product data and house-insurance customer data to the lake, and builds a new representation of the existing customer data within the upselling use case’s own bounded context. Again, this provides a more effective way to create models that deliver towards the target of a 3% increase in the online upselling conversion rate.
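A second, separate representation built from the same kind of raw data might be sketched as follows. The product catalogue, policy rows, and field names are hypothetical. Where a fraud mart would care about claim amounts, this one only cares about which products each customer holds and which they could still be offered.

```python
from collections import defaultdict

# Hypothetical product catalogue and raw policy records in the lake.
ALL_PRODUCTS = {"health", "vehicle", "house", "pet"}
RAW_POLICIES = [
    {"customer_id": "c1", "product": "health"},
    {"customer_id": "c1", "product": "vehicle"},
    {"customer_id": "c2", "product": "house"},
]


def build_upsell_mart(raw_policies):
    """Represent each customer by products held and products not yet held."""
    held = defaultdict(set)
    for policy in raw_policies:
        held[policy["customer_id"]].add(policy["product"])
    return {
        customer_id: {
            "held": sorted(products),
            "candidates": sorted(ALL_PRODUCTS - products),
        }
        for customer_id, products in held.items()
    }


upsell_mart = build_upsell_mart(RAW_POLICIES)
print(upsell_mart["c1"])
```

Neither representation has to serve the other use case, which is what keeps each one small.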



Now the company can move on to the weather alerts use case. This is a more complex model: it involves leveraging all the raw data used so far and adding one more source, real-time weather data.

The team has to rework some of the raw customer data so that it’s appropriate for the new use case. The benefit of doing this as rework rather than upfront work is that they now understand the requirements better, having delivered the first two use cases. This allows them to do just enough work to create value and to avoid speculative work that may never be used.

To be clear: this process can be parallelized to some extent. We’re arguing for working on articulated use cases — not that all work on other use cases must stop until a use case is complete.

In summary:
  • There’s no single, one-size-fits-all definition of a data lake. To guarantee you get what you want, be specific about the problem you’re trying to solve.
  • Work on articulated use-cases and measurable business goals. Test them and get feedback. Treating data projects as products and not merely as infrastructure will save a lot of wasted effort.
  • Allow your data scientists to work as closely with your data engineers as possible. Chances are you will achieve results faster, the results will be more closely aligned with the problem they’re meant to solve, and the joint ownership will make the maintenance effort easier to coordinate.