Delta Lake

Published: Nov 20, 2019
Last Updated: Apr 13, 2021
Apr 2021

Delta Lake is an open-source storage layer, implemented by Databricks, that attempts to bring ACID transactions to big data processing. In our Databricks-enabled data lake or data mesh projects, our teams continue to prefer Delta Lake over the direct use of file storage such as S3 or ADLS. This is, of course, limited to projects that use the Parquet file format on storage platforms that support Delta Lake. Delta Lake facilitates concurrent data read/write use cases where file-level transactionality is required. We find Delta Lake's seamless integration with Apache Spark batch and micro-batch APIs very helpful, particularly features such as time travel (accessing data as of a particular point in time, or reverting a commit) and schema evolution support on write, though these features have some limitations.
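To make the time travel and schema evolution features concrete, here is a minimal Python sketch of how they are typically reached through the Spark DataFrame API. It assumes a SparkSession that has already been configured with the Delta Lake package and is passed in by the caller. The helper names (time_travel_options, read_snapshot, append_with_schema_evolution) are hypothetical and exist only for illustration; the option keys versionAsOf, timestampAsOf and mergeSchema are the ones Delta Lake documents for its Spark reader and writer.

```python
from typing import Any, Dict, Optional


def time_travel_options(version: Optional[int] = None,
                        timestamp: Optional[str] = None) -> Dict[str, str]:
    """Build reader options for a time-travel read (hypothetical helper).

    Exactly one of `version` or `timestamp` must be given.
    """
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return {"versionAsOf": str(version)}
    return {"timestampAsOf": timestamp}


def read_snapshot(spark: Any, path: str, **kwargs: Any) -> Any:
    """Read the Delta table at `path` as of an earlier version or timestamp.

    `spark` is assumed to be a SparkSession with Delta Lake enabled.
    """
    reader = spark.read.format("delta")
    for key, value in time_travel_options(**kwargs).items():
        reader = reader.option(key, value)
    return reader.load(path)


def append_with_schema_evolution(df: Any, path: str) -> None:
    """Append `df` to the table, letting new columns merge into its schema."""
    (df.write.format("delta")
       .mode("append")
       .option("mergeSchema", "true")
       .save(path))
```

With a real session, a call such as read_snapshot(spark, "/some/table/path", version=0) would return the table as it stood at its first commit; the path here is a placeholder.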

Nov 2019

Delta Lake is an open-source storage layer by Databricks that attempts to bring transactions to big data processing. One of the problems we often encounter when using Apache Spark is the lack of ACID transactions. Delta Lake integrates with the Spark API and addresses this problem through a transaction log and versioned Parquet files. Its serializable isolation allows concurrent readers and writers to operate on the same Parquet files. Other welcome features include schema enforcement on write and versioning, which allows us to query older versions of the data and revert to them if necessary. We've started to use it in some of our projects and quite like it.
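The idea of a transaction log over immutable, versioned data files can be illustrated with a small toy model in plain Python. This is a sketch for intuition only, not Delta Lake's actual log format or implementation: the class ToyTable and all of its methods are invented for this example, data "files" are in-memory lists, and concurrency and isolation are not modelled at all. It shows how an append-only log of add/remove entries gives every commit a version number, how replaying a prefix of the log reads an older version, and how a revert is simply another commit.

```python
from typing import Dict, List, Optional, Set

Row = Dict[str, object]


class ToyTable:
    """Toy model: immutable data files plus an append-only commit log."""

    def __init__(self, schema: Dict[str, type]) -> None:
        self.schema = schema
        self.files: Dict[int, List[Row]] = {}      # data files, never modified
        self.log: List[Dict[str, Set[int]]] = []   # one add/remove entry per commit

    def _commit(self, add: Set[int], remove: Set[int]) -> int:
        self.log.append({"add": add, "remove": remove})
        return len(self.log) - 1  # version number of the new commit

    def _live_files(self, version: Optional[int] = None) -> Set[int]:
        # Replay the log up to and including `version` (default: latest).
        end = len(self.log) if version is None else version + 1
        live: Set[int] = set()
        for entry in self.log[:end]:
            live = (live - entry["remove"]) | entry["add"]
        return live

    def append(self, rows: List[Row]) -> int:
        # Schema enforcement on write: reject rows that do not match.
        for row in rows:
            if set(row) != set(self.schema) or not all(
                isinstance(row[col], typ) for col, typ in self.schema.items()
            ):
                raise ValueError(f"row does not match schema: {row}")
        file_id = len(self.files)
        self.files[file_id] = list(rows)
        return self._commit(add={file_id}, remove=set())

    def read(self, version: Optional[int] = None) -> List[Row]:
        # Querying an older version is just replaying a shorter log prefix.
        live = sorted(self._live_files(version))
        return [row for file_id in live for row in self.files[file_id]]

    def restore(self, version: int) -> int:
        # Reverting is itself a new commit, so history is never rewritten.
        current, target = self._live_files(), self._live_files(version)
        return self._commit(add=target - current, remove=current - target)
```

In this model, appending two batches produces versions 0 and 1; calling restore(0) then creates version 2, whose contents match version 0, while read(1) still returns both batches.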