Comparative study of Apache Iceberg, Open Delta, Apache CarbonData and Hudi
1. Background:

We have seen a lot of interest in an efficient and reliable solution that brings mutation and transaction capabilities to data lakes. In a data lake it is very common for users to generate reports from a single set of data, yet as various types of data flow in, that data cannot stay immutable. Use cases that require mutating data include data that changes over time, late-arriving data, balancing real-time availability with backfilling, state-changing data such as CDC (change data capture), data snapshotting and data cleansing. While reports are being generated, these use cases end up writing to or updating the same set of tables.
Because the Hadoop Distributed File System (HDFS) and object stores behave like file systems, they are not designed to provide transactional support. Implementing transactions in distributed processing environments is a challenging problem: for example, an implementation typically has to lock access to the storage system, which comes at the cost of overall throughput. Storage solutions such as Apache CarbonData, Apache Iceberg, Open Delta and Apache Hudi meet these ACID requirements of data lakes efficiently by pushing the transactional semantics and rules into the file formats themselves, or into a combination of metadata and file formats.
Many users looking at these four major solutions face a dilemma: under which circumstances should they choose which one? In this article we compare the four major open-source solutions and help users choose the one that best fits their own scenarios.
2. Apache Hudi

Apache Hudi was originally designed at Uber to meet the needs of its internal data analytics. Its fast upsert/delete and compaction capabilities address many real-time use cases. The project is active in the Apache community and achieved top-level project status in April 2020.
Hudi's design goal is reflected in its name, Hadoop Upserts Deletes and Incrementals: it mainly supports upserts, deletes and incremental data processing. Some key features include:
2.1 File management
Hudi organizes a table into a directory structure under a base path on DFS. A table is broken up into partitions, which are folders containing the data files for that partition, very similar to Hive tables.
2.2 Indexing
Hudi provides efficient upserts by consistently mapping a given hoodie key (record key + partition path) to a file id via an indexing mechanism.
2.3 Table Types
Hudi supports the following table types (a minimal write sketch follows the list).
Copy On Write: stores data exclusively in columnar file formats (e.g. Parquet). Updates simply version and rewrite the files by performing a synchronous merge during write.
Merge On Read: stores data using a combination of columnar (e.g. Parquet) and row-based (e.g. Avro) file formats. Updates are logged to delta files and later compacted to produce new versions of the columnar files, synchronously or asynchronously.
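To make the two table types concrete, below is a minimal write sketch using the Hudi Spark datasource. It assumes a Spark session with the Hudi bundle on the classpath; the table name, columns (order_id, region, updated_at) and paths are illustrative, not part of the original article.

```scala
import org.apache.spark.sql.SaveMode

// Illustrative input; any DataFrame with the referenced columns would do.
val df = spark.read.json("/tmp/input/orders.json")

df.write.format("hudi")
  .option("hoodie.table.name", "orders")
  .option("hoodie.datasource.write.recordkey.field", "order_id")     // record-key half of the hoodie key
  .option("hoodie.datasource.write.partitionpath.field", "region")   // partition-path half of the hoodie key
  .option("hoodie.datasource.write.precombine.field", "updated_at")  // latest record wins on upsert
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")     // or "COPY_ON_WRITE"
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/orders")
```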
2.4 Query types
Hudi supports three query types (read sketches follow the list):
Snapshot Queries: queries see a "snapshot" of the table as of a given commit or compaction action. For copy-on-write tables, snapshot queries expose only the base/columnar files in the latest file slices and guarantee the same columnar query performance.
Incremental Queries: for copy-on-write tables, incremental queries provide only the new data written to the table since a given commit or compaction, providing change streams that enable incremental data pipelines.
Read Optimized Queries: queries see the latest snapshot of a table as of a given commit/compaction action. They expose only the base/columnar files in the latest file versions and guarantee the same columnar query performance as a non-Hudi columnar table. Supported only on merge-on-read tables.
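Below is a minimal read sketch for the three query types, under the same assumptions as the write sketch above (the base path and the commit time passed to the incremental read are placeholders):

```scala
val basePath = "/tmp/hudi/orders"

// Snapshot query (the default): the latest view of the table.
val snapshotDF = spark.read.format("hudi").load(basePath)

// Incremental query: only records committed after the given instant time.
val incrementalDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20210401000000") // placeholder commit time
  .load(basePath)

// Read-optimized query (merge-on-read tables only): base/columnar files only.
val readOptimizedDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load(basePath)
```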
2.5 Hudi Tools
Hudi ships with several tools for quickly ingesting data from different sources onto HDFS as a Hudi-modeled table and for keeping it in sync with the Hive metastore. The tools include DeltaStreamer, the Hudi Spark datasource API, HiveSyncTool and the HiveIncrementalPuller.
3. Apache CarbonData

Apache CarbonData is the oldest of the four and was contributed by Huawei to the community; it powers the data platform and data lake solutions of Huawei Cloud products, handling petabyte-scale workloads. It is an ambitious project that brings the goodness of many capabilities under one roof. Along with supporting update, delete and merge operations and streaming ingestion, it has many advanced features such as time series support, datamaps for materialized views and secondary indexes, and it is also integrated into AI platforms such as TensorFlow.
CarbonData does not have a HoodieKey design and does not emphasize the primary key. Operations such as update/delete/merge are implemented through optimized granular joins. CarbonData is tightly integrated with Spark, getting all the goodness of Spark, and adds many optimizations in the CarbonData layer such as data skipping and pushdown. In terms of queries, CarbonData supports Spark, Hive, Flink, TensorFlow, PyTorch and Presto. Some key features include:
3.1 Query acceleration
Optimizations such as multi-level indexing, compression and encoding techniques are targeted at improving the performance of analytical queries that include filters, aggregations and distinct counts, where users expect sub-second response times for point queries on PB-scale data. Advanced push-down optimizations from the deep integration with Spark ensure that computing is performed close to the data, minimizing the amount of data read, processed, converted and transmitted (shuffled).
3.2 ACID: Data consistency
No intermediate data is left behind on failures; the design works by snapshot isolation, separating readers and writers. ACID compliance applies to every operation on data (query, IUD [insert/update/delete], indexes, datamaps, streaming). It supports near-real-time analysis using columnar and row-based formats to balance analysis performance and streaming ingestion, along with automatic handoff.
3.3 One Data
By integrating multiple engines such as Spark, Hive [read], Presto [read], Flink [streaming], TensorFlow and PyTorch, data lake solutions can now keep a single copy of the data.
3.4 Various Index for Optimizations
Additional indexes such as secondary indexes, Bloom, Lucene, geo-spatial and materialized views speed up point, text, aggregate, time-series and geo-spatial queries. CarbonData supports geo-spatial data models through a Polygon UDF.
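As a rough illustration, the DDL below sketches how such indexes might be created through Spark SQL on a CarbonData-enabled session. The table, column and index names are made up, and the exact syntax depends on the CarbonData version (older releases expose these features through CREATE DATAMAP), so treat this as a sketch rather than a definitive reference.

```scala
// Illustrative CarbonData table; assumes the CarbonData integration is on the classpath
// and the Spark session is created with CarbonData's extensions enabled.
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales (
    order_id STRING, city STRING, amount DOUBLE, ts TIMESTAMP
  ) STORED AS carbondata
""")

// Secondary index to speed up point/filter queries on a non-sort column.
spark.sql("CREATE INDEX city_idx ON TABLE sales (city) AS 'carbondata'")

// Bloom filter index for high-cardinality point lookups.
spark.sql("CREATE INDEX order_bloom ON TABLE sales (order_id) AS 'bloomfilter'")

// Materialized view to accelerate a recurring aggregation.
spark.sql("CREATE MATERIALIZED VIEW city_sales AS SELECT city, sum(amount) FROM sales GROUP BY city")
```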
3.5 Upserts and deletes
Supports merge, update and delete operations to enable complex use cases such as change data capture and slowly changing dimension (SCD Type 2) operations.
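A hedged sketch of update and delete statements on the illustrative sales table from the previous example; CarbonData's UPDATE syntax expects the column list and value list in parentheses, and merge is exposed separately (omitted here). The predicates and values are placeholders.

```scala
// Assumes the illustrative `sales` table defined in the previous sketch.
spark.sql("UPDATE sales SET (amount) = (amount * 1.1) WHERE city = 'Shenzhen'")
spark.sql("DELETE FROM sales WHERE ts < '2020-01-01 00:00:00'")
```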
3.6 High Scalability
Storage and processing are separated for scale, which also suits cloud architectures. A distributed index server can be launched alongside a query engine (such as Spark or Presto) to avoid reloading indexes across runs and to provide faster, more scalable lookups.
4. Apache Iceberg
The project was originally developed at Netflix. It was open-sourced in 2018 and became a top-level Apache project in May 2020. Iceberg does not emphasize the primary key. Without a primary key, operations such as update/delete/merge must be implemented through joins, and joins require an execution engine. Iceberg is not bound to any particular engine, nor does it have its own. It supports Apache Spark, Trino (formerly PrestoSQL) and Apache Flink for both reads and writes, and it also offers read support for Apache Hive.
4.1 ACID:
Iceberg brings ACID transactions to your data lakes. The transaction model is snapshot based: a snapshot is a complete list of the files that make up the table at a point in time. Table state is maintained in metadata files; every change to the table state creates a new metadata file and replaces the old one with an atomic swap. Iceberg applies optimistic concurrency control, and users can run time-travel queries by snapshot id or timestamp.
4.2 Table Evolution
Iceberg supports in-place table evolution. You can evolve a table schema just as you would in SQL, even in nested structures, or change the partition layout as data volumes change. Iceberg does not require costly distractions such as rewriting table data or migrating to a new table.
4.3 Schema Evolution
Iceberg has built-in support for schema evolution that guards against committing breaking changes to your table. It guarantees that schema changes are independent and free of side effects: Iceberg tracks each field in a schema by a unique ID and maps field names to those IDs.
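A brief sketch of in-place schema changes through Spark SQL DDL; it assumes an Iceberg catalog named prod is configured in the Spark session, and the table and column names are illustrative.

```scala
spark.sql("ALTER TABLE prod.db.events ADD COLUMN device_type STRING")
spark.sql("ALTER TABLE prod.db.events RENAME COLUMN device_type TO device")
spark.sql("ALTER TABLE prod.db.events ALTER COLUMN event_count TYPE BIGINT") // safe int -> bigint widening
spark.sql("ALTER TABLE prod.db.events DROP COLUMN device")
```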
4.4 Partition Evolution
Because Iceberg implements hidden partitioning, it can also offer partition spec evolution as a feature. This means you can change the granularity or the column you partition by without breaking the table. Partition evolution is a metadata operation and does not eagerly rewrite files, so old data can coexist in the table with new data.
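A sketch of partition-spec evolution, assuming the Iceberg Spark SQL extensions are enabled; it reuses the illustrative prod.db.events table and an assumed event_ts column.

```scala
// Change partition granularity from daily to hourly without rewriting existing files.
spark.sql("ALTER TABLE prod.db.events ADD PARTITION FIELD days(event_ts)")
spark.sql("ALTER TABLE prod.db.events REPLACE PARTITION FIELD days(event_ts) WITH hours(event_ts)")
```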
4.5 Time travel
Data in the data lake is versioned and snapshots are provided so that you can query them as if each snapshot were the current state of the system. To select a specific table snapshot, or the snapshot at some point in time, Iceberg supports two Spark read options: snapshot-id selects a specific table snapshot, and as-of-timestamp selects the snapshot that was current at a timestamp in milliseconds. Time travel is not yet supported through Spark SQL syntax.
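A minimal sketch of the two read options named above; the snapshot id, timestamp and table identifier are placeholders.

```scala
// Read the table as of a specific snapshot id.
val bySnapshot = spark.read
  .option("snapshot-id", 1234567890123L)        // placeholder snapshot id
  .format("iceberg")
  .load("prod.db.events")

// Read the snapshot that was current at a timestamp (milliseconds since epoch).
val byTimestamp = spark.read
  .option("as-of-timestamp", "1617235200000")   // placeholder timestamp
  .format("iceberg")
  .load("prod.db.events")
```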
4.6 Open Format
All data in Iceberg is stored in immutable file formats such as Avro, Parquet and ORC to leverage their efficient compression and encoding schemes.
5. Delta[Open]

The Delta Lake project was open sourced in 2019 under the Apache License and is an important part of the Databricks solution. Delta is positioned as a data lake storage layer that unifies streaming and batch processing and supports update/delete/merge. It provides ACID transaction capabilities for Apache Spark and big data workloads. Some of the key features include:
5.1 ACID transactions
Delta Lake brings ACID transactions to your data lakes. It maintains a transaction log that tracks all the commits made to the table directory in order to provide ACID guarantees. It provides a serializable isolation level to keep the data consistent across multiple users.
5.2 Schema Management & enforcement
Delta Lake leverages Spark's distributed processing power to handle all the metadata, and it helps keep bad data out of your data lakes by letting you specify a schema and enforce it. It prevents data corruption by rejecting bad data before it is ingested into the data lake, with sensible error messages.
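A small sketch of enforcement versus intentional evolution, assuming a Delta table already exists at the illustrative path; newDF is a stand-in for a DataFrame whose schema adds a column the table does not have.

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit

// Stand-in DataFrame carrying a column the existing table does not have.
val newDF = spark.range(5).toDF("id").withColumn("source", lit("backfill"))

// Without explicit schema evolution, the mismatched append is rejected
// (Delta raises an AnalysisException) instead of silently corrupting the table.
// newDF.write.format("delta").mode(SaveMode.Append).save("/tmp/delta/events")

// Evolving the schema has to be requested explicitly.
newDF.write.format("delta")
  .option("mergeSchema", "true")
  .mode(SaveMode.Append)
  .save("/tmp/delta/events")
```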
5.3 Data versioning and Time travel
Data in the data lake is versioned and snapshots are provided so that you can query them as if each snapshot were the current state of the system. This lets us revert to older versions of the data lake for audits, rollbacks and similar needs.
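A minimal sketch of Delta time travel reads; the version number, timestamp and path are illustrative.

```scala
// Read an earlier version of the table by version number.
val v0 = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/tmp/delta/events")

// Read the table as it was at a point in time.
val lastWeek = spark.read.format("delta")
  .option("timestampAsOf", "2021-04-01")
  .load("/tmp/delta/events")
```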
5.4 Open Format
All data in Delta Lake is stored in the Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
5.5 Unified batch and streaming sink
Near-real-time analytics: a table in Delta Lake is both a batch table and a streaming source and sink, making it a solution for a Lambda architecture, but going one step further since both batch and real-time data land in the same sink.
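A minimal sketch of the same Delta storage acting as a streaming source and sink; the paths and checkpoint location are illustrative.

```scala
// Stream changes out of one Delta table and append them into another.
val source = spark.readStream.format("delta").load("/tmp/delta/raw_events")

val query = source.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/delta/_checkpoints/events")
  .outputMode("append")
  .start("/tmp/delta/events")
```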
Similar to CarbonData, Delta does not emphasize primary keys, so the update/delete/merge implementations are all based on Spark's join functionality. In terms of data writing, Delta and Spark are strongly bound. The deep integration with Spark is probably its best feature; in fact, it is the only one of the four with Spark SQL-specific commands (e.g. MERGE), and it also introduces useful DML such as "UPDATE ... WHERE" and "DELETE ... WHERE" directly in Spark. Delta Lake does not support true data lineage (the ability to track when and how data has been copied within the Delta Lake), but it does have auditing and versioning (storing old schemas in the metadata).
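A short sketch of those Spark SQL commands against Delta tables; the table names, join condition and predicates are illustrative.

```scala
// Upsert a batch of changes into the target table.
spark.sql("""
  MERGE INTO events t
  USING updates s
  ON t.event_id = s.event_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")

// Delta-specific DML directly in Spark SQL.
spark.sql("UPDATE events SET status = 'closed' WHERE event_date < '2020-01-01'")
spark.sql("DELETE FROM events WHERE status = 'invalid'")
```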
6. Wrapping up
Hudi has a competitive advantage in features such as IUD performance together with merge-on-read. It supports a variety of query engines such as Flink and Spark. The Hudi DeltaStreamer supports "streaming" data ingestion, although the "streaming" here is actually a continuous cycle of batch processing, so in essence it is still not pure stream ingestion. The community is powered by Uber and has open sourced all its features.
Iceberg is not bound to any specific engine and has the best degree of abstraction in terms of engine reads, engine writes, storage adoption and file format. It has native optimizations such as predicate push-down and a native vectorized reader. Iceberg relieves performance issues related to S3 object listing and Hive metastore partition enumeration. On the other hand, support for deletions and mutations is still preliminary, and there is operational overhead involved in data retention. The community is powered by Netflix and has open sourced all its features.
CarbonData is the oldest on the market, so it has some competitive advantage due to advanced datamaps such as materialized views and secondary indexes, and it has been integrated with a variety of streaming/AI engines such as Flink and TensorFlow in addition to Spark, Presto [read path] and Hive [read path]. CarbonData is well integrated with Spark, so it enjoys and shares the benefits of Spark's performance optimizations such as vectorization, predicate push-down and data skipping via statistics, and it has built useful commands such as CLEAN FILES to clean marked-for-delete and compacted segments, with the ability to recover files using the trash mechanism. The community is powered by Huawei and has open sourced all its features.
Delta's ability to integrate with Spark, especially its stream-batch unified design, is a major advantage. Delta inherits Spark's performance optimizations such as vectorization and data skipping via Parquet statistics, and it has built useful commands such as VACUUM for cleanup. Delta has a good user API and documentation. It is important to understand that Delta Lake, while open source, will likely always lag behind Delta Engine, which acts as a product differentiator. The community is powered by Databricks, which owns a commercial version with additional features.
With new releases, the four are constantly filling in their missing abilities and may converge with each other in the future or encroach on each other's territory. Of course, it is also possible that each will focus on its own scenarios and build barriers around its advantages. Therefore, it is still unknown who wins and who loses. The tables below summarize the four across dimensions such as:
1. ACID and isolation level
2. Schema Evolution and support for Data mutations
3. Degree of interface abstraction
4. Query optimization features
5. Major capability
6. Stream and Batch engine integration
7. Community status quo
It should be noted that the capabilities listed in these tables only reflect the state as of April 2021, as understood from the various reference materials.
7. Feature comparison Table
8. Status quo of the community (as of 2021–04)

9. Reference
1. https://github.com/apache/carbondata
2. https://github.com/apache/iceberg
3. https://github.com/apache/hudi
4. https://cwiki.apache.org/confluence/display/CARBONDATA/
5. https://cwiki.apache.org/confluence/display/HUDI
6. https://carbondata.apache.org/
7. https://projects.apache.org/
8. https://iceberg.apache.org/spec/
9. https://github.com/delta-io/delta
Disclaimer: This article is based on research of the reference links above; the content may contain errors, and readers are welcome to give feedback and help improve it!