ThoughtWorks
  • Contacto
  • Português
  • Deutsch
  • English
  • 中文
Visión general
  • Cultura de ingeniería, mentalidad de entrega

    Adopta un enfoque moderno para el desarrollo de software y entrega valor más rápido.

    Toma de desiciones Inteligentes

    Aprovecha los resultados de tus datos para descubrir nuevas fuentes de valor

  • Modelo Operativo sin Frincción

    Mejora la habilidad de tu empresa para adaptarse al cambio

    Estrategia de Plataforma

    Crea plataformas tecnológicas que de adapten a su estrategia de negocio

  • Experience Design y Product Capabilities

    Diseña, entrega y evoluciona rápidamente productos y experiencias excepcionales

    Asociaciones

    Aprovechar nuestra red de socios confiables para ampliar los resultados que brindamos a nuestros clientes

Visión general
  • Sector Automotriz
  • Cleantech y Servicios Públicos
  • Servicios Financieros y Aseguradoras
  • Servicios Médicos
  • Medios y Publicidad
  • Organizaciones sin fines de lucro
  • Sector Público
  • E-commerce y Retail
  • Viajes y Transporte
Visión general

Destacados

  • Tecnología

    Una exploración profunda de tecnología empresarial y excelencia en ingeniería

  • Negocios

    Mantenerse actualizado con los últimos insights empresariales y de industria para líderes digitales

  • Cultura

    El espacio para encontrar contenido y tips de desarrollo profesional, y nuestra visión sobre la justicia social y la inclusión

Publicaciones Digitales y Herramientas

  • Radar Tecnológico

    Nuestra guía de tendencias tecnológicas actuales

  • Opiniones

    Una publicación para líderes digitales

  • Digital Fluency Model

    Un modelo para priorizar las capacidad digitales necesarias para navegar en la incertidumbre

  • Decoder

    Una guía tecnológica de la A a la Z para líderes de negocio

Todos los Insights

  • Artículos

    Opiniones profesionales que ayudarán al crecimiento de tu negocio

  • Blogs

    Opiniones personales de ThoughtWorkers de todo del mundo

  • Libros

    Navega a través de nuestra extensa biblioteca

  • Podcasts

    Emocionantes charlas sobre las últimas tendencias en negocios y tecnología

Visión general
  • Proceso de Aplicación

    Descubre lo que te espera durante nuestro proceso de selección

  • Graduados y cambio de carreras

    Empieza tu carrera en tecnología con el pie derecho

  • Ofertas de trabajo

    Encuentra puestos vacantes en tu región

  • Mantente conectado

    Suscríbete a nuestro boletín mensual

Visión general
  • Conferencias y eventos
  • Diversidad e Inclusión
  • Noticias
  • Código abierto
  • Nuestros líderes
  • Cambio Social
  • Português
  • Deutsch
  • English
  • 中文
ThoughtWorksMenú
  • Cerrar   ✕
  • Nuestros Servicios
  • ¿Con quién trabajamos?
  • Insights
  • Carreras
  • Acerca de Nosotros
  • Contacto
  • Regresar
  • Cerrar   ✕
  • Visión general
  • Cultura de ingeniería, mentalidad de entrega

    Adopta un enfoque moderno para el desarrollo de software y entrega valor más rápido.

  • Experience Design y Product Capabilities

    Diseña, entrega y evoluciona rápidamente productos y experiencias excepcionales

  • Modelo Operativo sin Frincción

    Mejora la habilidad de tu empresa para adaptarse al cambio

  • Toma de desiciones Inteligentes

    Aprovecha los resultados de tus datos para descubrir nuevas fuentes de valor

  • Asociaciones

    Aprovechar nuestra red de socios confiables para ampliar los resultados que brindamos a nuestros clientes

  • Estrategia de Plataforma

    Crea plataformas tecnológicas que de adapten a su estrategia de negocio

  • Regresar
  • Cerrar   ✕
  • Visión general
  • Sector Automotriz
  • Cleantech y Servicios Públicos
  • Servicios Financieros y Aseguradoras
  • Servicios Médicos
  • Medios y Publicidad
  • Organizaciones sin fines de lucro
  • Sector Público
  • E-commerce y Retail
  • Viajes y Transporte
  • Regresar
  • Cerrar   ✕
  • Visión general
  • Destacados

  • Tecnología

    Una exploración profunda de tecnología empresarial y excelencia en ingeniería

  • Negocios

    Mantenerse actualizado con los últimos insights empresariales y de industria para líderes digitales

  • Cultura

    El espacio para encontrar contenido y tips de desarrollo profesional, y nuestra visión sobre la justicia social y la inclusión

  • Publicaciones Digitales y Herramientas

  • Radar Tecnológico

    Nuestra guía de tendencias tecnológicas actuales

  • Opiniones

    Una publicación para líderes digitales

  • Digital Fluency Model

    Un modelo para priorizar las capacidad digitales necesarias para navegar en la incertidumbre

  • Decoder

    Una guía tecnológica de la A a la Z para líderes de negocio

  • Todos los Insights

  • Artículos

    Opiniones profesionales que ayudarán al crecimiento de tu negocio

  • Blogs

    Opiniones personales de ThoughtWorkers de todo del mundo

  • Libros

    Navega a través de nuestra extensa biblioteca

  • Podcasts

    Emocionantes charlas sobre las últimas tendencias en negocios y tecnología

  • Regresar
  • Cerrar   ✕
  • Visión general
  • Proceso de Aplicación

    Descubre lo que te espera durante nuestro proceso de selección

  • Graduados y cambio de carreras

    Empieza tu carrera en tecnología con el pie derecho

  • Ofertas de trabajo

    Encuentra puestos vacantes en tu región

  • Mantente conectado

    Suscríbete a nuestro boletín mensual

  • Regresar
  • Cerrar   ✕
  • Visión general
  • Conferencias y eventos
  • Diversidad e Inclusión
  • Noticias
  • Código abierto
  • Nuestros líderes
  • Cambio Social
Blogs
Selecciona un tema
Ver todos los temasCerrar
Tecnología 
Gestión de Proyectos Agiles La Nube Entrega Continua Ciencia e Ingenieria de Datos Defendiendo el Internet Libre Arquitectura Evolutiva Experiencia de Usuario IoT  Lenguajes, Herramientas y Frameworks Modernización de sistemas heredados Machine Learning & Artificial Intelligence Microservicios Plataformas Seguridad Pruebas de Software Estrategia Digital 
Negocio 
Servicios Financieros Salud Global Innovación Ventas  Transformación 
Carreras 
Hacks Para Tu Carrera Diversidad e Inclusión Cambio Social 
Blogs

Temas

Elegir tema
  • Tecnología
    Tecnología
  • Tecnología Visión General
  • Gestión de Proyectos Agiles
  • La Nube
  • Entrega Continua
  • Ciencia e Ingenieria de Datos
  • Defendiendo el Internet Libre
  • Arquitectura Evolutiva
  • Experiencia de Usuario
  • IoT
  • Lenguajes, Herramientas y Frameworks
  • Modernización de sistemas heredados
  • Machine Learning & Artificial Intelligence
  • Microservicios
  • Plataformas
  • Seguridad
  • Pruebas de Software
  • Estrategia Digital
  • Negocio
    Negocio
  • Negocio Visión General
  • Servicios Financieros
  • Salud Global
  • Innovación
  • Ventas
  • Transformación
  • Carreras
    Carreras
  • Carreras Visión General
  • Hacks Para Tu Carrera
  • Diversidad e Inclusión
  • Cambio Social
Ciencia e Ingenieria de DatosMachine Learning & Artificial IntelligenceTecnología

The curse of the data lake monster

Kiran Prakash Kiran Prakash
Lucy Chambers Lucy Chambers

Published: Feb 11, 2019

Artificial intelligence and machine learning are currently all the rage. Every organization is trying to jump on this bandwagon and cash in on their data reserves. At ThoughtWorks, we’d agree that this tech has huge potential — but as with all things, realizing value depends on understanding how best to use it.

We’re often approached by clients who want to jumpstart their AI initiatives by building a data lake. Often, this plan is seen purely as an infrastructure effort — without clearly defined use cases. The assumption is “if we build a robust data infrastructure, use-cases will present themselves later.”  

In this post, we argue that software is best developed in thin, vertical slices that emphasize use cases and user outcomes, and data-intensive projects are no exception. When it comes to use cases that rely on rich, multi-format data, it can be tempting to start by creating a horizontal data platform layer, sometimes called a data lake. We are going to explore examples of how product thinking can apply to a project where a data lake is being considered as a solution.

What are data lakes anyway?

When we hear the term data lake, it usually implies:
  • A centralized repository of data (operational, customer related, event streams, etc.) with proper documentation and fine-grained access control.
  • Something built and maintained by data engineers so that data scientists can consume data and focus on developing ML use cases etc.
The term itself is often used in a very broad sense. Sometimes there’s a distinction between data lakes which are meant to hold only raw data and lake shore marts which hold processed bounded context (business function) specific representations used for further analysis. Sometimes data lake is used as a catch-all term to describe both (and maybe other things too).

This imprecise definition often leads to an unplanned expansion in scope, budget overruns, and over-engineering. It may be that an Amazon S3 bucket with proper access permission setup could be all you need as your data lake infrastructure for storing raw data. It’s important to define and establish a shared understanding of what data lake means for your organization at the outset.

Most big organizations would benefit from limiting the scope of the data lakes to store only raw data, and from setting up cross-functional product teams to develop ML applications using their own representations (lake shore marts) specific to their use case. Quoting from Martin’s post:
 
A single unified data model is impractical for anything but the smallest organizations. To model even a slightly complex domain you need multiple bounded contexts, each with its own data model.

We’ll see an example of how such a setup could look a bit later in this post.

Build it; they will come

Data lakes seem particularly prone to “build it; they will come” mentalities. There could be multiple possible reasons for this.

Often, data scientists and data engineers are part of different teams. Data scientists are commonly aligned more closely to the business and data engineers closer to infrastructure and IT. This can give rise to siloed thinking and the tendency to consider a data lake purely as an infrastructure problem. Conway’s Law strikes again.

Next, pinning down the specific use-cases and value stream that the data lake will address is a hard problem. It involves talking to users and aligning multiple parties on a common goal. It’s tempting to substitute this hard problem with a material one which is purely technical.

Lastly, the common argument we hear for the “build it they will come” approach is that data lakes should support a wide array of use-cases and shouldn’t be constrained by a particular one. We agree with the premise that a data lake should support more than one use case. We’re just arguing against too much up-front architecture and design that happens before use cases are considered. 

Pitfalls

Designing a data lake in a top-down fashion, without an eye on the end use cases, will almost inevitably result in a poor problem/solution fit.

Without actually using the data to develop models and seeing them work in the real world, without learning and iterating on the feedback, it’s very hard to tell what the optimal representation of your solution is.

In reality about 70–80% of the effort in building an ML application is cleaning and representing the data in a format specific to the use case. It makes little sense to put a lot of effort in data preprocessing without knowing how it will be used to build ML models. The people building the model are likely going to have to reinvest the effort to do similar work.    

Let’s illustrate this with a hypothetical example:

An insurance firm has plans to build a data lake which will revolutionize how its data scientists or BI analysts access and analyze data and generate insights for the company. The firm has grown over the years through merger and acquisition of many smaller insurance companies —hence, it doesn’t have a single, consolidated view of their customer. Instead, its customer details are fragmented among maybe 20 different subsystems based on the product line (health insurance, vehicle insurance, pet insurance, etc.).

Now as part of its data lake initiative it wants to create a consolidated view of the customer. The company spends months coming up with a single definition of the customer which works for all possible future use-cases and are only medium-happy with the result at the end. You can see why this is a hard thing to do.
  1. There probably is no single definition of the customer which works for all the future use-cases equally well.
  2. Fetching customer information, matching and deduplicating them from 20 disparate system is a non-trivial task. This involves massive coordination among different product teams.
Not knowing how this consolidated view of the customer will be used will make this task even harder and open-ended. It’s not possible for the company to prioritize its work and make informed decisions when trade-offs are involved.

Some data lake initiatives are even vaguer and complex than the above example. They aim to consolidate all business entities and events into a central data infrastructure.

If users don’t take care early in the process to ensure that the data contained in the data lake is used, there’s a real risk that the data lake becomes a data swamp — basically a dumping ground for data of varying quality. These cost a lot to maintain and deliver little value to the organization. 

Product thinking for data lakes

We propose a more bottom-up approach to realizing the data lake — one that builds one vertical slice at a time. Let’s see how this could look with the above insurance company:

It starts with the following initial set of use cases:
  • Identifying fraudulent claims so that they can select claims for deeper manual investigation; they have a business goal of reducing fraud by 5% this year.
  • Predicting weather patterns so that they can advise customers to protect their vehicles by bringing them inside when there’s a high chance of storms — thereby reducing vehicle damage claims by 2%.
  • Upselling other insurance products to the customer based on the products they already have. The goal is to increase the conversion rate for online upselling by 3%.
The insurance company sets about the project as follows.

Before the start

There’s a high-level architecture in place and a governance structure that covers documentation standards, guidelines for the required specificity of data, backward compatibility, versioning, discoverability, etc.

A few upfront technical decisions are made at this point, for instance, the decision to go on-premise or in-cloud, which cloud provider to use and which data store to use. These are decisions that are harder to reverse later in the project, hence must be few in number.

There’s an architectural team in place which ensures the fidelity of the architecture and observance of the governance structures while the platform evolves and applications are built on top of it.

There’s a cross-functional delivery team containing both product owners from the business, data engineers and data scientists to productionize the use-cases.

Working through use-cases

The project team knows it has to start small, so it picks the fraud detection use-case as the first vertical slice. The team knows that health and vehicle insurance contribute towards the majority of the claims, so it decides to focus on just these verticals initially. Raw customer and claims data from these two verticals is pulled into the data lake. It’s cleaned and aggregated and represented in a data mart specific to this use case. The next step is to build fraud detection models using these and productionize them.



Now that the first version of a fraud detection model is in production, the team observes that it could improve the model with additional fields which aren’t currently collected. The data scientists who uncovered this are working closely with the data engineers so that this feedback can be acted on quickly. Together they swiftly figure out how to collect these new fields and adapt the model. The new model is significantly better than the first.

This approach would entail using two out of 20 data sources saving themselves a lot of potentially wasted effort — and delivering more effectively towards the 5% target.






Now the fraud detection model is up and running and generating value for the company; the company can focus on leveraging customer data for the upselling use-case. For this, it adds product data and house insurance related customer data to the lake and uses existing customer data to have new representation in their own bounded context. Again, this provides a more effective way to create models that will deliver towards the target of a 2% increase in upselling.



Now the company can move on to the alerts use-case. This is a more complex model and involves leveraging all the raw data they have used so far and adding an additional source: real-time weather data.

They have to rework some of the raw customer data so that it’s appropriate for the new use-case. The benefit of doing this as rework rather than upfront work is that they now understand more about the requirements for the first two use cases. This allows them to do just enough work to create value and not waste effort on speculative work which may be wasted.

To be clear: this process can be parallelized to some extent. We’re arguing for working on articulated use cases — not that all work on other use cases must stop until a use case is complete.

In summary:
  • There’s no single, one-size-fits-all definition of a data lake. To guarantee you get what you want, be specific about the problem you’re trying to solve.
  • Work on articulated use-cases and measurable business goals. Test them and get feedback. Treating data projects as products and not merely as infrastructure will save a lot of wasted effort.
  • Allow your data scientists to work as closely with your data engineers as possible. Chances are you will achieve results faster, the results will be more closely aligned with the purpose they’re trying to solve, and the joint ownership will mean the maintenance effort will be easier to coordinate.
Blogs relacionados
Ciencia e Ingenieria de Datos

Put Data Science Before Data Infrastructure

David Johnston
Aprende más
Ciencia e Ingenieria de Datos

Data Science and Big Data: Two Very Different Beasts

Sean McClure
Aprende más
Experiencia de Usuario

Will Big Data Make me a Designosaur?

Kate Linton
Aprende más
  • Nuestros Servicios
  • ¿Con quién trabajamos?
  • Insights
  • Carreras
  • Acerca de Nosotros
  • Contacto

WeChat

×
QR code to ThoughtWorks China WeChat subscription account

Relaciones con medios de comunicación y analistas | Política de Privacidad | Modern Slavery statement ThoughtWorks| Accesibilidad | © 2021 ThoughtWorks, Inc.