Comparative study of Apache CarbonData, Hudi and Delta Lake

Brijoobopanna
Oct 7, 2020


1. Background

We have seen a lot of interest in an efficient and reliable solution that brings mutation and transaction capabilities to data lakes. In a data lake, it is very common for users to generate reports from a single set of data, yet as various types of data flow in, the state of that data cannot stay immutable. Use cases that require mutating data include data that changes over time, late-arriving data, balancing real-time availability with backfilling, state-changing data such as CDC, data snapshotting, data cleansing, and so on. While reports are being generated, all of these result in writes and updates against the same set of tables.

Because the Hadoop Distributed File System (HDFS) and object stores behave like file systems, they are not designed to provide transactional support. Implementing transactions in distributed processing environments is a challenging problem; for example, an implementation typically has to consider locking access to the storage system, which comes at the cost of overall throughput. Storage solutions such as Apache CarbonData, open-source Delta Lake, and Apache Hudi meet these ACID requirements of data lakes efficiently by pushing the transactional semantics and rules into the file formats themselves, or into a combination of metadata and file formats.

Many users looking at the three major solutions face a dilemma: under which circumstances should they choose which one? Here we compare the three products to help users pick the solution that best fits their own scenarios.

2. Apache Hudi

Apache Hudi is a project designed by Uber to meet the needs of its internal data analysis. Its fast upsert/delete and compaction functions address many real-time use cases. The project is active in the Apache community and achieved top-level project status in April 2020.

Hudi's design goal is spelled out in its name, Hadoop Upserts Deletes and Incrementals: it primarily supports upserts, deletes, and incremental data processing. Some key features include:

2.1 File management

Hudi organizes a table into a directory structure under a base path on DFS. A table is broken up into partitions, which are folders containing the data files for that partition, very similar to Hive tables.

2.2 Indexing

Hudi provides efficient upserts by consistently mapping a given hoodie key (record key + partition path) to a file id via an indexing mechanism.
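As a minimal PySpark sketch of what an upsert through the Hudi Spark datasource looks like (the table name, base path, and the uuid/ts/region/value columns are invented for illustration):

```python
# Minimal Hudi upsert sketch; assumes the Hudi Spark bundle is on the classpath.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-upsert-sketch")
         # Hudi requires Kryo serialization.
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

df = spark.createDataFrame(
    [("id-1", "2020-10-07 00:00:00", "EU", 42)],
    ["uuid", "ts", "region", "value"])

(df.write.format("hudi")
   # record key + partition path form the hoodie key the index maps to a file id
   .option("hoodie.datasource.write.recordkey.field", "uuid")
   .option("hoodie.datasource.write.partitionpath.field", "region")
   # on key collisions, the record with the larger precombine value wins
   .option("hoodie.datasource.write.precombine.field", "ts")
   .option("hoodie.datasource.write.operation", "upsert")
   .option("hoodie.table.name", "events")
   .mode("append")
   .save("/tmp/hudi/events"))
```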

2.3 Table Types

Hudi supports the following table types.

Copy On Write: Stores data exclusively in columnar file formats (e.g., Parquet). Updates simply version and rewrite the files by performing a synchronous merge during write.

Merge On Read: Stores data using a combination of columnar (e.g., Parquet) and row-based (e.g., Avro) file formats. Updates are logged to delta files and later compacted, synchronously or asynchronously, to produce new versions of the columnar files.
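The table type is fixed when the table is first written. Continuing the hedged sketch above, it is selected with a single datasource option (COPY_ON_WRITE is the default):

```python
# Writing the illustrative table as merge-on-read instead of copy-on-write.
(df.write.format("hudi")
   .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
   .option("hoodie.datasource.write.recordkey.field", "uuid")
   .option("hoodie.datasource.write.partitionpath.field", "region")
   .option("hoodie.datasource.write.precombine.field", "ts")
   .option("hoodie.table.name", "events_mor")
   .mode("append")
   .save("/tmp/hudi/events_mor"))
```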

2.4 Query types

Hudi supports three query types:

Snapshot Queries: Queries see a “snapshot” of the table as of a given commit or compaction action. For copy-on-write tables, snapshot queries expose only the base/columnar files in the latest file slices and guarantee the same columnar query performance.

Incremental Queries: For copy-on-write tables, incremental queries provide new data written to the table since a given commit or compaction, providing change streams to enable incremental data pipelines.

Read Optimized Queries: Queries see the latest snapshot of the table as of a given commit/compaction action, exposing only the base/columnar files in the latest file versions and guaranteeing the same columnar query performance as a non-Hudi columnar table. Supported only on merge-on-read tables.
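At read time, the query type is also picked through the datasource. A minimal sketch against the illustrative paths above (the begin instant time for the incremental read is a made-up commit timestamp):

```python
# Snapshot query (the default if no query type is given).
snap = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "snapshot")
        .load("/tmp/hudi/events_mor"))

# Incremental query: only records written after the given commit time.
inc = (spark.read.format("hudi")
       .option("hoodie.datasource.query.type", "incremental")
       .option("hoodie.datasource.read.begin.instanttime", "20201007000000")
       .load("/tmp/hudi/events_mor"))

# Read-optimized query (merge-on-read only): base columnar files, no log merging.
ro = (spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "read_optimized")
      .load("/tmp/hudi/events_mor"))
```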

2.5 Hudi Tools

Hudi ships with several tools for fast ingestion of data from different sources into HDFS as a Hudi-modeled table, and for further syncing with the Hive metastore. The tools include DeltaStreamer, the Hudi Spark datasource API, HiveSyncTool, and HiveIncrementalPuller.

3. Apache CarbonData

Apache CarbonData is the oldest of the three. It was contributed by Huawei to the community and powers the data platform and data lake solutions of Huawei Cloud products, handling petabyte-scale workloads. It is an ambitious project that brings many capabilities under one roof. Along with supporting update, delete, and merge operations and streaming ingestion, it has loads of advanced features such as time series support, datamaps for materialized views, and secondary indexes, and it is also integrated into multiple AI platforms such as TensorFlow.

CarbonData does not have a HoodieKey-style design and does not emphasize a primary key; operations such as update/delete/merge are implemented through optimized granular joins. CarbonData is tightly integrated with Spark, inheriting Spark's strengths while adding many optimizations in the CarbonData layer, such as data skipping and pushdown. In terms of query engines, CarbonData supports Spark, Hive, Flink, TensorFlow, PyTorch, and Presto. A quick sketch of the Spark integration follows, after which some key features are described.
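A minimal sketch, assuming the CarbonData jars are on the classpath and, for CarbonData 2.x, the CarbonExtensions session extension; the sales table and its columns are invented:

```python
from pyspark.sql import SparkSession

# Enable CarbonData's SQL extensions (CarbonData 2.x style integration).
spark = (SparkSession.builder
         .appName("carbondata-sketch")
         .config("spark.sql.extensions", "org.apache.spark.sql.CarbonExtensions")
         .getOrCreate())

spark.sql("""
  CREATE TABLE IF NOT EXISTS sales (
    id INT, region STRING, amount DOUBLE
  ) STORED AS carbondata
""")

spark.sql("INSERT INTO sales VALUES (1, 'EU', 10.5)")
spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()
```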

3.1 Query acceleration

Optimizations such as multi-level indexing, compression, and encoding techniques target analytical queries involving filters, aggregation, and distinct counts, where users expect sub-second response times for point queries on PB-scale data. Advanced pushdown optimizations from the deep integration with Spark ensure that computing is performed close to the data, minimizing the amount of data read, processed, converted, and transmitted (shuffled).

3.2 ACID: Data consistency

No intermediate data is left behind on failures; snapshot isolation separates readers from writers. Every operation on data (query, IUD [insert/update/delete], indexes, datamaps, streaming) is ACID compliant. Near-real-time analysis is supported by combining columnar and row-based formats to balance analysis performance against streaming ingestion, along with automatic handoffs.

3.3 One Data

By integrating multiple engines such as Spark, Hive, Presto, Flink, TensorFlow, and PyTorch, data lake solutions can keep just one copy of the data.

3.4 Various indexes for optimization

Additional indexes such as secondary index, bloom, Lucene, geo-spatial, and materialized views speed up point, text, aggregate, time-series, and geo-spatial queries. CarbonData supports geo-spatial data models through a Polygon UDF.
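As a hedged illustration of the DDL (the syntax below follows the CarbonData 2.x index documentation as best recalled; the index names and property values against the hypothetical sales table above are invented):

```python
# Secondary index on a frequently filtered column.
spark.sql("CREATE INDEX region_idx ON TABLE sales (region) AS 'carbondata'")

# Bloom filter index for point lookups on a high-cardinality column.
spark.sql("""
  CREATE INDEX id_bloom ON TABLE sales (id)
  AS 'bloomfilter'
  PROPERTIES ('BLOOM_SIZE'='640000')
""")
```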

3.5 Upserts and deletes

Supports merge, update, and delete operations, enabling complex use cases like change data capture and slowly changing dimension (SCD-2) operations.
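These run as ordinary Spark SQL against the hypothetical sales table from above; the predicate values are invented:

```python
# CarbonData's UPDATE uses a (column) = (expression) form.
spark.sql("UPDATE sales SET (amount) = (amount * 1.1) WHERE region = 'EU'")
spark.sql("DELETE FROM sales WHERE id = 1")
# Merge for CDC / SCD-2 flows is exposed through CarbonData's merge API;
# see the CarbonData docs for the exact upsert/merge syntax.
```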

3.6 High Scalability

Storage and processing are separated for scale, making it suitable for cloud architectures too. A distributed index server can be launched alongside a query engine (like Spark or Presto) to avoid reloading indexes across runs and to provide faster, more scalable lookups.

4. Delta Lake [Open]

The Delta Lake project was open sourced in 2019 under the Apache License and is an important part of the Databricks solution. Delta is positioned as a data lake storage layer that integrates streaming and batch and supports update/delete/merge. It provides ACID transaction capabilities for Apache Spark and big data workloads. Some of the key features include:

4.1 ACID transactions

Delta Lake brings ACID transactions to your data lakes. It stores a transaction log that keeps track of all commits made to the table directory in order to provide ACID transactions, and it offers a serializable isolation level to ensure data stays consistent across multiple users.
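A minimal sketch from Spark, assuming the delta-core package is on the classpath (and, on Spark 3, the Delta SQL extension is configured); the path is illustrative. Every successful commit appends an ordered JSON entry under the table's _delta_log/ directory:

```python
# Writing a Delta table; the commit is recorded in /tmp/delta/events/_delta_log/
# as a file such as 00000000000000000000.json.
df = spark.range(0, 5)
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Readers always see a consistent snapshot of the latest committed version.
print(spark.read.format("delta").load("/tmp/delta/events").count())
```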

4.2 Schema Management & enforcement

Delta Lake leverages Spark's distributed processing power to handle all the metadata. It helps keep bad data out of your data lakes by letting you specify a schema and enforce it: records that violate the schema are rejected with sensible error messages before they are ever ingested, preventing data corruption.
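Continuing the illustrative table above: an append whose schema does not match is rejected with an AnalysisException, while deliberate schema evolution has to be opted into explicitly:

```python
from pyspark.sql.functions import lit
from pyspark.sql.utils import AnalysisException

bad = spark.range(0, 2).withColumn("extra", lit("x"))
try:
    # Rejected: the table has no 'extra' column.
    bad.write.format("delta").mode("append").save("/tmp/delta/events")
except AnalysisException as e:
    print("rejected:", e)

# Opt-in schema evolution merges the new column into the table schema.
(bad.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/delta/events"))
```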

4.3 Data versioning and time travel

Data in the data lake is versioned and snapshots are provided, so you can query a snapshot as if it were the current state of the system. This makes it possible to revert to older versions of the data lake for audits, rollbacks, and similar needs.
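A minimal sketch against the illustrative path above; an older snapshot can be selected either by version number or by timestamp:

```python
# Read the table as of its first commit.
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/tmp/delta/events"))

# Or as of a wall-clock time (must fall within the retained history).
old = (spark.read.format("delta")
       .option("timestampAsOf", "2020-10-07 00:00:00")
       .load("/tmp/delta/events"))
```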

4.4 Open Format

All data in Delta Lake is stored in the Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.

4.5 Unified batch and streaming sink

Near-real-time analytics. A table in Delta Lake is both a batch table and a streaming source and sink, making it a solution for a Lambda architecture, but going one step further, since both batch and real-time data land in the same sink.
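A hedged sketch of the same illustrative table being used simultaneously as a streaming source and sink; the checkpoint location and target path are invented:

```python
# Tail the Delta table as a stream and continuously copy new data
# into a second Delta table.
stream = (spark.readStream.format("delta")
          .load("/tmp/delta/events")
          .writeStream.format("delta")
          .option("checkpointLocation", "/tmp/delta/_checkpoints/events_copy")
          .start("/tmp/delta/events_copy"))
```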

Similar to CarbonData, Delta does not emphasize primary keys, so the update/delete/merge implementations are all based on Spark's joins. In terms of data writing, Delta and Spark are strongly bound. The deep integration with Spark is probably its best feature; in fact, it is the only one of the three with Spark SQL-specific commands (e.g., MERGE), and it also introduces useful DML such as “UPDATE WHERE” and “DELETE WHERE” directly in Spark. Delta Lake does not support true data lineage (the ability to track when and how data has been copied within the Delta Lake), but it does offer auditing and versioning (old schemas are stored in the metadata).
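As a hedged sketch of that DML through the DeltaTable Python API (the equivalent UPDATE/DELETE/MERGE statements also exist in Spark SQL); the table, paths, and condition values are invented:

```python
from delta.tables import DeltaTable

# An illustrative two-column table.
spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c")], ["id", "name"]
).write.format("delta").mode("overwrite").save("/tmp/delta/users")

users = DeltaTable.forPath(spark, "/tmp/delta/users")

# UPDATE ... WHERE and DELETE ... WHERE equivalents.
users.update(condition="id = 2", set={"name": "'b2'"})
users.delete("id = 3")

# MERGE: update matching rows, insert the rest (a join under the hood).
changes = spark.createDataFrame([(1, "a2"), (4, "d")], ["id", "name"])
(users.alias("t")
   .merge(changes.alias("s"), "t.id = s.id")
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())
```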

5. Wrapping up

Hudi has a competitive advantage in features such as IUD performance together with merge-on-read. If you are wondering whether to use it with Flink streams, however, it is not currently designed for such a use case. Hudi's DeltaStreamer can support “streaming” data ingestion, but the “streaming” here is actually a continuous batch-processing cycle; in essence, it is still not pure stream ingestion. The community is powered by Uber and has open sourced all of its features.

One of the major advantages of Delta is its integration with Spark, especially its stream-batch unified design. Delta has a good user API and documentation. The community is powered by Databricks, which owns a commercial version with additional features.

CarbonData is the oldest on the market, so it has a competitive advantage in advanced indexing features such as materialized views and secondary indexes, and it has been integrated with a variety of streaming/AI engines such as Flink and TensorFlow, in addition to Spark, Presto, and Hive. The community is powered by Huawei and has open sourced all of its features.

With new releases, the three are constantly filling in their missing abilities, and in the future they may converge or invade each other's territory. Of course, each may also focus on its own scenarios and build barriers around its advantages. A performance comparison of these solutions would help in understanding their offerings better; for now, it is still unknown who wins and who loses.

The tables below summarize the three from multiple dimensions. Note that the capabilities listed only reflect the state at the end of August 2020.

6. Feature comparison Table

7. Status quo of the community (as of 2020–08)

8. Reference

1. https://github.com/apache/carbondata

2. https://github.com/delta-io/delta

3. https://github.com/apache/hudi

4. https://cwiki.apache.org/confluence/display/CARBONDATA/

5. https://cwiki.apache.org/confluence/display/HUDI

6. https://hudi.apache.org/

7. https://carbondata.apache.org/

8. https://delta.io/

Disclaimer: This is based on research across the reference links above and individual analysis; readers are welcome to give feedback and help improve it!
