Organizations seeking more flexible use of their data have driven the evolution of data architectures from data warehouses to data lakes and, more recently, "delta lakes". The relationship between storage, metadata and compute has evolved too. In a data warehouse, all three of these aspects (storage, metadata and compute) sit within the same layer. With the advent of distributed computing frameworks such as Hadoop, Hive, Spark and Presto used within data lakes, these components started to separate: processing engines (compute) are kept apart from the underlying storage, while metadata stays close to the storage so that different processing engines can read and update it (each update creating a new version of the metadata). This helps organizations address data governance concerns within their data lake architecture. It also helps unify machine learning with the data platform by making it easy to reproduce both models and the historical data they were built on.
ACID transactions, schema evolution, upserts, time travel and incremental consumption are some of the key features missing when tables are managed directly as traditional file formats such as Parquet. The sketch below illustrates the missing upsert capability.
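As a non-authoritative illustration, the following sketch corrects a single row in a plain Parquet dataset with PySpark. Without an upsert primitive, the affected data has to be read, merged in memory and rewritten; the paths and column names used here are illustrative assumptions.

```python
# Sketch: correcting one row in a plain Parquet dataset with PySpark.
# Parquet offers no upsert primitive, so the data is read, merged in
# memory and rewritten. Paths and columns (/tmp/events, id, value) are
# illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-upsert-sketch").getOrCreate()

existing = spark.read.parquet("/tmp/events")
corrections = spark.createDataFrame([(42, "corrected value")], ["id", "value"])

# Keep every existing row that is not being corrected, then add the corrections.
merged = existing.join(corrections, "id", "left_anti").unionByName(corrections)

# The whole dataset is rewritten to a new location; there is no transactional
# in-place update, and no record of previous versions to travel back to.
merged.write.mode("overwrite").parquet("/tmp/events_corrected")
```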
Equally, traditional storage poses challenges around consistency and data integrity. Consistency becomes a problem when, for example, the network connection fails or the driver process crashes while files are being moved: the target location then contains only part of the dataset.
Data integrity is a challenge when Spark's Overwrite mode is used to write data. An overwrite involves two operations, a delete followed by a write; if the driver fails between them, the previous data has already been deleted but the new data is never written. The fact that the data is not always persisted in a valid state can cause consistency problems.
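A minimal sketch of that failure mode, assuming a running PySpark session and an illustrative path; the overwrite itself is just a standard DataFrame write:

```python
# Sketch: Spark's Overwrite save mode against a plain Parquet path.
# Conceptually this is two steps: delete the existing files at the target,
# then write the new ones. If the driver dies between the two steps, the old
# data is already gone and the new data never arrives. The path and DataFrame
# below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("overwrite-sketch").getOrCreate()

new_snapshot = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Step 1 (implicit): existing files under /tmp/events are removed.
# Step 2: the new snapshot is written. A crash in between leaves neither.
new_snapshot.write.mode("overwrite").parquet("/tmp/events")
```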
Recently, a set of modern table formats such as Delta Lake, Hudi and Iceberg has emerged. Along with the Hive Metastore, these table formats aim to solve the problems with traditional data lakes discussed above. They provide all or some combination of: concurrent reads and writes, time travel, support for both batch and streaming use cases, data mutation/correction (merging) for late-arriving data, schema evolution and schema enforcement. Delta Lake (https://delta.io) is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Iceberg (https://iceberg.apache.org) originated at Netflix, was open-sourced in 2018 as an Apache incubator project, and offers very similar features to Delta. Hudi (https://hudi.apache.org) is yet another data lake storage layer, with more of a focus on streaming workloads.
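As a rough illustration of what these formats add on top of plain Parquet, the sketch below uses the documented Delta Lake Python API (the delta-spark package); the table path, column names and merge condition are illustrative assumptions, and Hudi and Iceberg expose broadly similar capabilities through their own Spark integrations.

```python
# Sketch: ACID writes, upserts (MERGE) and time travel with Delta Lake.
# Requires the delta-spark package; paths and columns are illustrative.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Initial write: an atomic, transactional overwrite of the table.
events = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Upsert late-arriving / corrected records with MERGE.
updates = spark.createDataFrame([(2, "b-corrected"), (3, "c")], ["id", "value"])
table = DeltaTable.forPath(spark, "/tmp/delta/events")
(table.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Time travel: read the table as it was at an earlier version.
first_version = (spark.read.format("delta")
                 .option("versionAsOf", 0)
                 .load("/tmp/delta/events"))
first_version.show()
```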
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.