Imagine a large storage system where your company keeps all its data, with multiple teams writing to it at the same time. Then one team's job crashes halfway through, and suddenly nobody knows what data is actually there or what's accurate. Some reports show one number while others show a different one. Finance can't close the books, and the CEO is frustrated by the conflicting information.
This is, unfortunately, not that uncommon. That's because data lake storage systems often lack a built-in way to coordinate multiple writers working at the same time. However, there's a solution: Apache Iceberg. In this post, I'll explain some of the issues with data lakes, how Iceberg can fix them, and whether your company actually needs it.
A practical example of data lake challenges
Before we go further, let's consider a practical example. Imagine an online grocery store: it's likely the company will have multiple teams writing to the same data lake:
The order team writes every order placed.
The inventory team updates stock levels.
The delivery team records deliveries.
This system might work fine for a while, but then something breaks. Perhaps the order team's job crashed overnight: some orders were written to storage, but the run never completed, so the data lake holds only a partial record. Discrepancies appear, everyone has different numbers, and it becomes incredibly difficult to establish the truth.
Iceberg can solve this problem by acting as a coordinating layer; it helps keep careful track of:
What files are actually in the data lake at any given moment.
What changed and when.
Who’s allowed to write where.
What the old state was (for looking back in time).
In short, Iceberg sits between your data storage and the programs that read and write the data.
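To make that concrete, here's a minimal sketch of what wiring Iceberg into Apache Spark might look like. The catalog name (demo), namespace, table, warehouse path, and library version are illustrative assumptions, not requirements; the general shape is what most setups share.

```python
from pyspark.sql import SparkSession

# A minimal, illustrative setup: a filesystem-backed Iceberg catalog
# named "demo". Paths, names, and versions are assumptions.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Every reader and writer goes through the catalog;
# Iceberg coordinates what they see and when.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.shop.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE
    ) USING iceberg
""")
```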
How Apache Iceberg works
How does it do this, though? There are three core parts to Iceberg that allow it to act as a data lake coordinator.
Metadata files. These are instruction files that say "this is what the table looks like right now." Every time something is written, a new metadata file is created. Think of it like a snapshot of your data lake at that moment. Old snapshots are kept so you can look back.
Manifest files. When you have millions of files, you can't check each one every time someone reads data. Manifest files are like an index in a book. They say "files 1-1000 have customer names starting with A-M, files 1001-2000 have names starting with N-Z." This makes searches much faster.
Delete files. Instead of rewriting data files to remove rows (which is slow), Iceberg keeps a list of "rows to ignore." It's faster than rewriting everything.
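You can inspect all three layers yourself: Iceberg exposes them as metadata tables that can be queried like any other table. A short sketch, reusing the illustrative demo.shop.orders table from the setup above:

```python
# Snapshots: one row per "what the table looked like" state.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM demo.shop.orders.snapshots
""").show()

# Manifests: the index-like files readers use to skip irrelevant data.
spark.sql("""
    SELECT path, added_data_files_count
    FROM demo.shop.orders.manifests
""").show()

# Files: every data file (and delete file) currently in the table.
spark.sql("""
    SELECT file_path, record_count
    FROM demo.shop.orders.files
""").show()
```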
Together, these elements provide a layer of organization across your data lake. This makes it easier to understand how things may have changed and, ultimately, to determine what's true and accurate. Because of the way it's built, it's also relatively resource-efficient (that's not to say there aren't performance trade-offs, but we'll come to that later).
What problems does Iceberg solve?
At a high level, then, Iceberg solves coordination challenges. But it's worth exploring these challenges in more detail; doing so gives a clearer picture of how Iceberg actually works and where it can help.
Atomic writes
With Iceberg, multiple systems can write at the same time without corrupting data.
Imagine system A and system B each write 50 files to the same table, but system A's job crashes before all of its files have been recorded; the table no longer has a clear picture of what data belongs there. With Iceberg, when system A retries after the crash, it first checks whether anything has changed. If it has, it automatically reapplies its work on top of the new state; if system B already finished, system A doesn't overwrite it.
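Under the hood this is optimistic concurrency: each writer commits against the snapshot it started from and, if another writer got there first, retries on top of the latest state. How persistent writers should be is tunable; the properties below are real Iceberg table properties, but the values are illustrative:

```python
# Iceberg commits optimistically and retries on conflict.
spark.sql("""
    ALTER TABLE demo.shop.orders SET TBLPROPERTIES (
        'commit.retry.num-retries' = '10',
        'commit.retry.min-wait-ms' = '100'
    )
""")
```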
Looking up old data
Without Iceberg, looking up old data means digging through backup files. This can take hours, and it opens up opportunities for errors, like restoring the wrong version. With Iceberg, though, every change creates a snapshot that can be queried whenever required.
It’s important to note that keeping old snapshots uses additional storage; however, this is usually worth it for compliance and debugging.
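With Spark (version 3.3 or later for the SQL syntax below), time travel is a one-line query, and cleaning up snapshots you no longer need is a built-in procedure. A sketch against the illustrative table from earlier, with made-up timestamps:

```python
# Read the table as it looked at a point in time.
spark.sql(
    "SELECT * FROM demo.shop.orders TIMESTAMP AS OF '2024-01-15 00:00:00'"
).show()

# Or pin an exact snapshot ID taken from the snapshots metadata table.
snapshot_id = spark.sql(
    "SELECT snapshot_id FROM demo.shop.orders.snapshots "
    "ORDER BY committed_at LIMIT 1"
).first()["snapshot_id"]
spark.sql(f"SELECT * FROM demo.shop.orders VERSION AS OF {snapshot_id}").show()

# When old snapshots are no longer needed, reclaim their storage.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table => 'shop.orders',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")
```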
Adding columns without breaking things
Sometimes you add a column, but your old files still have the old structure. Without a coordinating layer, programs that expect the new structure crash on old files, and programs that expect the old structure fail on new ones. With Iceberg, old files keep their old structure and new files have the new column. When you read the old files, the system fills in the new column with a "null" (empty) value.
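In Spark, this is an ordinary ALTER TABLE; no data files are rewritten. A sketch, again using the illustrative table (the loyalty_tier column is made up):

```python
# Add a column; existing data files are left untouched.
spark.sql("ALTER TABLE demo.shop.orders ADD COLUMN loyalty_tier STRING")

# Rows written before the change simply read as NULL for the new column.
spark.sql("SELECT order_id, loyalty_tier FROM demo.shop.orders").show()
```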
Fast deletions
With the introduction of data privacy regulations around the world, it’s not uncommon for deletion requests to be made. Aside from compliance, though, being able to delete data easily is an important facet of good data stewardship.
Without Iceberg this process can be tricky: you may have to read every file, filter out the relevant rows, and rewrite everything, which can take hours. With Iceberg, you create a "delete file" that says "ignore all rows where customer_id = 123" and it's done almost instantly. The actual cleanup happens later, when the system has time.
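In Spark this is a plain DELETE statement. Configuring the table for merge-on-read (a real Iceberg table property) tells Iceberg to write a small delete file rather than rewrite data files; the table itself is still our illustrative example:

```python
# Tell Iceberg to handle deletes with delete files (merge-on-read)
# rather than rewriting whole data files (copy-on-write).
spark.sql("""
    ALTER TABLE demo.shop.orders SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read'
    )
""")

# Commits almost instantly: a delete file marks the rows to ignore.
spark.sql("DELETE FROM demo.shop.orders WHERE customer_id = 123")
```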
Hidden partitioning
Normally, you have to manually organize your data into folder structures like /year=2024/month=01/day=15/. But with Iceberg's hidden partitioning, you don't need to create these folders yourself. You tell Iceberg once: "organize data by year, month, and day," and it does it automatically behind the scenes. This means teams can query data without knowing the folder structure, and you can change how data is organized without breaking queries or moving files around. It's simpler for users and more flexible for your organization.
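As a sketch, here's what declaring and querying a hidden-partitioned table might look like (the deliveries table and its columns are made up for illustration):

```python
# Declare partitioning once, as a transform on an ordinary column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.shop.deliveries (
        delivery_id  BIGINT,
        delivered_at TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(delivered_at))
""")

# Queries filter on the column itself; no /year=/month=/day= paths needed.
# Iceberg maps the filter onto partitions behind the scenes.
spark.sql("""
    SELECT * FROM demo.shop.deliveries
    WHERE delivered_at >= TIMESTAMP '2024-01-15 00:00:00'
""").show()
```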
When do you need Iceberg? And when do you not?
Apache Iceberg is an incredibly effective solution to data lake coordination issues. However, it’s not always the right solution.
Use it when:
You have multiple systems that write to the same table at the same time.
You frequently add columns for things like new features and requirements.
Debugging data issues is painful and you struggle to locate root causes.
Different teams use different tools to read the data.
You probably don’t need Iceberg if:
You load data once per day. Your daily batch job writes to a new partition. Nothing changes after that. Iceberg adds unnecessary complexity.
Your data never overlaps. Each day's data goes to a separate folder. Teams never write to the same place at the same time. Simple partitioning is enough.
Your team doesn't understand data infrastructure yet. Iceberg requires managing old snapshots, cleaning up deleted files, and monitoring metadata. If your team struggles with basic ETL, this will distract you.
Most queries use only a tiny part of the data. If 95% of queries access only 5% of files, Iceberg's smart file selection doesn't help much.
Essentially, as useful as Iceberg is, there are a number of costs and trade-offs that teams need to consider before using it.
Old snapshots and metadata use 5-15% extra space. For a 1 TB data lake, that's 50-150 GB.
The first time you query a table, it's 2-5% slower because Iceberg has to read the metadata. After that, the metadata is cached, so queries are fast.
Someone has to clean up old snapshots and manage deleted files; this is roughly 5-10% extra operational work (see the maintenance sketch after this list).
Your tools need to support Iceberg; some BI tools may need connectors.
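To give a feel for that operational work, here's a hedged sketch of routine maintenance using Iceberg's built-in Spark procedures; how often you run them, and on which tables, is up to you:

```python
# Compact many small files into fewer, larger ones.
spark.sql("CALL demo.system.rewrite_data_files(table => 'shop.orders')")

# Remove files that no snapshot references any more
# (for example, leftovers from failed jobs).
spark.sql("CALL demo.system.remove_orphan_files(table => 'shop.orders')")
```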
In many instances these trade-offs will be more than worthwhile. However, if Iceberg is unnecessary for the way your organization manages data, the additional complexity and overhead aren't helpful.
Solving real problems
Apache Iceberg solves a range of real and common data lake problems. It helps ensure:
Concurrent writes don't corrupt data.
You can look at old data instantly.
Adding columns doesn't break things.
You can delete things quickly.
While it isn't a silver bullet and isn't necessarily relevant to every organization, in many contexts and use cases it can go a long way toward helping organizations get more from their data lakes.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.