A data lake is a repository — typically a large one — for storing data of many types.
Data lakes are systems that store vast quantities of data. Typically, they’re built with the aim of improving corporate decision making.
Data lakes are more flexible and faster than traditional data warehouses.
What is it?
A data lake is a repository for storing large quantities of raw data. That data could come from all corners of the enterprise, ranging from structured data in the operational and transactional systems that run the business to unstructured external data on things like customer preferences.
Data lakes were initially seen as an improvement on traditional data warehouses. Warehouses typically needed data to be processed before it was stored, and trying new types of analysis was slow because new data first had to be prepared and fed into the warehouse.
Data lakes solved those problems by stressing the need to capture data first, in its raw state, and analyze it later.
Unfortunately, data lakes solved some of the problems with data warehouses but not the most critical one: extracting value from the data.
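As a rough illustration of "capture first, analyze later", the minimal Python sketch below appends raw records untouched and only imposes structure when a question is asked. The record layout, the in-memory list standing in for the lake, and the names `land` and `revenue_by_customer` are all assumptions made for this example, not part of any particular product.

```python
import json
from collections import Counter

# Hypothetical raw events, stored exactly as they arrived (no upfront schema).
raw_lake = [
    '{"source": "orders", "customer": "c1", "amount": 30.0}',
    '{"source": "web", "customer": "c1", "page": "/shoes"}',
    '{"source": "orders", "customer": "c2", "amount": 12.5}',
]

def land(lake, record):
    """Capture step: append the raw text without parsing or reshaping it."""
    lake.append(record)

def revenue_by_customer(lake):
    """Analysis step, written later: apply structure at read time."""
    totals = Counter()
    for line in lake:
        event = json.loads(line)
        # Only order events carry an amount; other sources are skipped here.
        if event.get("source") == "orders":
            totals[event["customer"]] += event["amount"]
    return dict(totals)
```

Note that the web event sits in the lake alongside the orders even though this particular analysis ignores it; a later question could read it without any change to how data was captured.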
Capturing data and storing it in a lake doesn’t really address the challenge of getting value from that data. Many organizations have been disappointed with their data lakes because of data quality issues: when nothing curates the data going into the lake, duplicate and unreliable records accumulate.
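To make the duplication and quality problem concrete, here is a small sketch of the kind of check that uncurated ingestion skips. It assumes records arrive as JSON lines and that `customer` and `amount` are the fields an analysis would need; both the `audit` function and those field names are hypothetical.

```python
import json

def audit(lines, required=("customer", "amount")):
    """Count exact duplicates and records missing required fields."""
    seen = set()
    report = {"total": 0, "duplicates": 0, "incomplete": 0}
    for line in lines:
        report["total"] += 1
        record = json.loads(line)
        # Canonical form, so a different key order does not hide a duplicate.
        key = json.dumps(record, sort_keys=True)
        if key in seen:
            report["duplicates"] += 1
            continue
        seen.add(key)
        if any(record.get(field) is None for field in required):
            report["incomplete"] += 1
    return report
```

A check like this only catches exact repeats and missing fields; near-duplicates and wrong values would need rules specific to the data.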
What’s in it for you?
Because data is stored raw rather than reshaped before loading, data lakes are more flexible than traditional data warehouses and quicker to put to work on new kinds of analysis. Done well, data lakes provide a way of storing big data, which can then be analyzed, enabling you to gain new insights, perhaps into business performance or new customer trends.
Data lakes have also been useful in enabling companies to add large public data sets to their analyses, perhaps using weather data to see the impact of good weather on their retail business or mapping data to optimize transportation routes for their supply chain.
What are the trade-offs?
There is an old rule of thumb that says, “data that isn’t used will go bad, just like ripe bananas.” Whether you are building a data warehouse, a data lake, or a data mesh, building it without identifying how the data will be used is risky. If the source data is a mess when fed into a data lake, it will still be a mess when you try to work with it.
Done right, with appropriate emphasis on data use, a data lake can be a useful technology in your data plans.
Many organizations have been disappointed with their data lake investments because they didn’t plan up front how the data in their lake would be used. If you identify a valuable use case up front, you’ll find the investment in building a data lake generates a return sooner.