Revisiting Big Data Formats: Apache Iceberg, Delta Lake and Apache Hudi

Big Data formats such as Apache Parquet, Apache ORC and Apache Avro were game changers years ago for processing massive amounts of data efficiently, as I wrote in a previous blog post (aside from the Big Data platforms leveraging them).

Nowadays we see the emergence of new Big Data formats, such as Apache Iceberg, Delta Lake or Apache Hudi. I will explain in this blog post what new features they have and how they compare to the „traditional“ Big Data formats.

I start by quickly recapping „traditional“ Big Data formats. Afterwards, I will present the new features of Apache Iceberg, Delta Lake and Apache Hudi and explain operational and migration aspects. Then, I will investigate the need for data catalogs to store the metadata around these formats and the support by Big Data tools. Finally, I will discuss novel use cases for those formats, provide some criteria to choose between them and give a future outlook.

Recapping Big Data Formats

Big Data formats, such as Apache Parquet or Apache ORC, were introduced in 2013 to improve the processing of large data volumes. Apache Avro was introduced in 2009, but its main purpose is the efficient exchange of Big Data between systems („a data serialization system“).

Apache Parquet, Apache ORC and partially Apache Avro introduced novel features, which significantly accelerated Big Data Processing:

  • Columnar storage – the data is stored column by column instead of row by row. This is very suitable for analytics on large data, as only the columns that are actually used need to be read and all other data can be skipped (see the sketch after this list)
  • Binary representation of data instead of text (e.g. CSV, JSON, XML are text formats) – this makes data much more compact and much more efficient to process
  • Columnar compression using various compression algorithms – this reduces IO significantly, which benefits large data volumes
  • Built-in indexes (e.g. min-max statistics, dictionaries, bloom filters) that allow filtering data without reading it (if you use the right data types)
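
As a small illustration of why this matters, here is a minimal PySpark sketch (paths and dataset are illustrative): only the selected column is read, and the filter can be pushed down to the Parquet statistics.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-example").getOrCreate()

# Write a small example dataset as compressed Parquet (path is illustrative)
df = spark.range(0, 1_000_000).withColumnRenamed("id", "order_id")
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/orders")

# Column pruning and predicate pushdown: only the 'order_id' column is read, and the
# min/max statistics allow skipping row groups that cannot match the filter
spark.read.parquet("/tmp/orders").select("order_id").where("order_id > 999000").show()
```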

They have now been used for years and are suitable for data volumes in the petabyte range and beyond. However, not everything was solved efficiently and some features were missing:

  • One cannot update/delete single rows/cells – except by rewriting a full partition
  • The partitions are fixed, and a large number of partitions (more than a couple of hundred) becomes inefficient due to how they are represented in the metastore database. Queries also have to reference partitions explicitly to get the performance improvements, and many users do not know this, resulting in poor query performance
  • Big Data processing may produce a lot of small files (especially when processing is highly parallel). This requires writing custom logic to combine the many small files into a few large ones
  • Concurrent writing of multiple independent jobs to the same data is not possible
  • Large datasets cannot fully benefit from the scaling of object stores

New Big Data formats have come to change this.

New Big Data Formats

Find here some of the more popular new Big Data Formats:

  • Apache Iceberg: Targets huge analytical datasets supported by a lot of compute engines
  • Apache Hudi: Targets incremental processing of large datasets
  • Delta Lake: Targets a „lakehouse“ architecture

Although all of them sound very different, we will see that they have most of the same features in common, with different nuances and different levels of software support.

One characteristic of those new Big Data formats is that you need to specify a storage backend. This is often Apache Parquet. However, you cannot write those Parquet files directly, as specific metadata is required. I will explain this later, but for now:

Always use the official Big Data format connectors. Never write data to the storage backend directly – it will corrupt your data.
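
To make this concrete, here is a minimal sketch of writing a table through the Apache Iceberg connector in PySpark; the catalog and table names are illustrative, and Hudi and Delta Lake provide their own connectors.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Iceberg runtime configured and a catalog named "demo"
# (catalog and table names are illustrative)
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "sensor-a", 21.5), (2, "sensor-b", 19.8)],
    ["id", "device", "temperature"],
)

# The connector writes the data files (e.g. Parquet) *and* the required metadata
df.writeTo("demo.db.measurements").using("iceberg").createOrReplace()

# Never do this instead - it bypasses the table metadata and corrupts the table:
# df.write.parquet("s3://bucket/warehouse/db/measurements/")
```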

Note: I excluded Apache Kudu here as it is not a pure file format, but requires setting up a custom cluster.

New features of Big Data Formats

The new Big Data Formats mentioned in the introduction introduce new features to address those aspects (state: 03.08.2023 – this may change). The following table describes them and provides links to the relevant documentation.

| Feature | Description | Supported by |
| --- | --- | --- |
| Individual row insert/update/delete and ACID transactions | Individual rows can be inserted, updated or deleted without rewriting a full partition, and ACID transactions are supported. This comes in two flavors: Copy on Write (CoW), where files are synchronously merged during writes, and Merge on Read (MoR), where updates are written to delta files and handled during queries; they can be compacted (merged) at regular intervals | Apache Iceberg (Merge on Read, Copy on Write), Delta Lake (Merge on Read), Apache Hudi (Merge on Read, Copy on Write; note: Merge on Read only accesses compacted data, i.e. new data is not visible until it has been compacted) |
| Avoid rewriting full data on insert/update/delete | Improves writing/reading of tables with individual changes by avoiding a rewrite of the full data | Apache Hudi, Apache Iceberg, Delta Lake |
| In-place schema evolution | Add, drop, rename or update columns of the data schema in-place without a costly reorganization | Apache Iceberg, Apache Hudi, Delta Lake (experimental) |
| Large number of partitions | Supports millions of partitions, contrary to the standard Big Data formats, because the newer formats do not depend on the Hive metadata database | Apache Iceberg, Apache Hudi, Delta Lake (unclear, uses Hive metadata store) |
| Automated rewriting of queries for partitions | Automatically optimizes queries to use partitions without the need to explicitly mention the partition when filtering data | Apache Iceberg, Apache Hudi (unclear, probably not), Delta Lake (unclear, probably not) |
| Partition evolution | Change the partition scheme in-place without an immediate rewrite | Apache Iceberg |
| Time travel | Access previous versions of the data by providing a point in time and/or a snapshot (see the example below the table) | Apache Iceberg, Apache Hudi, Delta Lake |
| Data versioning | Extension of time travel that allows branching and tagging of data | Apache Iceberg |
| Optimized for object stores | Takes object stores, such as S3 or Swift, into account and avoids renaming/moving data once written as well as listing files. Additionally takes object store parallel access patterns (e.g. S3 prefixes) into account | Apache Iceberg (see also the custom ObjectStoreLocationProvider, e.g. for S3), Apache Hudi, Delta Lake |
| Concurrency control: Multi-Version Concurrency Control/Snapshot Isolation | Supports one writer with multiple concurrent readers | Apache Iceberg, Apache Hudi, Delta Lake |
| Concurrency control: Optimistic | Supports multiple concurrent writers/readers. Usually requires additional components for lock management (e.g. DynamoDB on AWS) | Apache Iceberg, Apache Hudi, Delta Lake |
| Sort during writing for optimized performance | Sorting data during writes on columns which are used for filtering during reads significantly reduces the files/data to be processed | Apache Iceberg, Apache Hudi, Delta Lake (unclear, probably not) |
| Z-ordering (Morton ordering) of data | If you want to preserve data locality for certain columns (e.g. a grouping such as industry sector and revenue group (low, medium, high)), then z-ordering can ensure this by collapsing multiple dimensions into one while preserving data locality. This speeds up queries significantly, as all relevant data for a query is stored next to each other. This is also useful for Internet of Things (IoT) use cases where data with different distributions (i.e. not „naturally“ correlated) needs to be colocated | Apache Iceberg, Apache Hudi, Delta Lake |

Features of new Big Data file formats
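
To illustrate some of the features in the table, the following minimal sketch shows row-level changes and time travel with Apache Iceberg on Spark; the catalog/table names are illustrative, a session with the Iceberg connector and SQL extensions is assumed, and Hudi and Delta Lake use comparable syntax.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Iceberg connector and SQL extensions configured
spark = SparkSession.builder.getOrCreate()

# Row-level changes without rewriting a full partition
spark.sql("UPDATE demo.db.customers SET segment = 'premium' WHERE revenue > 100000")
spark.sql("DELETE FROM demo.db.customers WHERE customer_id = 42")

# Time travel: query the table as it looked at an earlier point in time (Spark 3.3+ syntax)
spark.sql("SELECT * FROM demo.db.customers TIMESTAMP AS OF '2023-08-01 00:00:00'").show()
```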

Even if a format does not support certain features, you should not discard it, as these formats evolve quickly and new/different features might be supported soon.

These features also allow you to tune the format according to your data access patterns. For example, if you update/delete individual records frequently at irregular intervals, then you may choose merge-on-read. This avoids immediate compaction of data and thus saves time during writing, but may impose a small performance penalty during reading. You can schedule compaction jobs at times of low activity to remove the reading penalty. This can also be very useful for data streaming jobs.

On the other hand, if you update/delete individual records rarely or mostly in batches, then you can choose copy-on-write, which compacts data directly and therefore takes longer during writing. However, during reading you save time, as compacted data is faster to read.

In both modes, it is recommended to batch individual inserts, updates and deletes to get maximum performance.
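
As an illustration, this is a minimal sketch of how that choice could be expressed for an Apache Iceberg table (format version 2) via table properties; the table name is illustrative, and Hudi and Delta Lake configure this differently.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Choose merge-on-read per operation for frequent, irregular row-level changes;
# 'copy-on-write' would be the alternative for rare or batched changes
spark.sql("""
  ALTER TABLE demo.db.events SET TBLPROPERTIES (
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read'
  )
""")
```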

Please note that even though atomic reads/writes with a high isolation level are supported, these formats are not intended to replace relational databases, i.e. online transaction processing (OLTP), for transactional non-analytical systems, such as SAP ERP.

Nevertheless, those atomic reads/writes are still useful for analytical scenarios, such as updating individual code lists or reference data, or for data streaming scenarios where events can enter the data processing pipeline at any frequency at any time.

Storage Backends

The new Big Data formats essentially cover metadata aspects around the data to support their new features. The data itself is stored in different formats also referred to as storage backends.

The data is often stored in Parquet, ORC or Avro format, which is referred to as the storage backend of the novel Big Data formats. Some formats, such as Apache Iceberg, additionally support custom storage backends. However, you must always use the Big Data format specific connector (e.g. Iceberg, Hudi, Delta Lake) – NEVER write directly to Parquet, ORC or Avro if you want to use the novel Big Data formats. You will get corrupt and incorrect data.

Selecting a storage backend depends heavily on your access patterns (e.g. write-intensive vs. read-intensive, row-based vs. table-based). Note that only Apache Iceberg supports Parquet, ORC and Avro. Another criterion might be migratability between the novel Big Data formats, where it seems that the data currently needs to be stored in the Parquet storage backend (again: do not confuse this with using the plain Parquet format directly).
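
For example, with Apache Iceberg the storage backend can be selected per table via a table property; a minimal sketch with an illustrative table name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 'write.format.default' selects the data file format (storage backend): parquet (default), orc or avro
spark.sql("""
  CREATE TABLE demo.db.clickstream (user_id BIGINT, url STRING, ts TIMESTAMP)
  USING iceberg
  TBLPROPERTIES ('write.format.default' = 'orc')
""")
```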

Hudi is based on Apache Parquet for copy-on-write tables and a mixture of Apache Avro/Parquet for merge-on-read tables.

Delta Lake is based only on Parquet, plus a proprietary binary format for deleted data as well as JSON files for changed data.

Operational Aspects

There are not many operational aspects to be considered when using the new Big Data formats. We already described schema/partition evolution above (where supported). Of course, you may need to adapt your data processing pipelines if you change the schema.

If you use merge-on-read, then you need to schedule in-place compaction regularly, which in all formats is a simple SQL command executed in any Big Data engine supporting the format.

Apache Iceberg also supports scheduling compaction of metadata.
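
For example, with Apache Iceberg on Spark such compaction jobs could look like the following minimal sketch (catalog/table names are illustrative); Hudi and Delta Lake provide comparable commands.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact data files (merges merge-on-read deltas and small files)
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events')")

# Compact metadata (manifest files)
spark.sql("CALL demo.system.rewrite_manifests(table => 'db.events')")
```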

Object stores, such as AWS S3, may have more efficient ways to delete data (e.g. using S3 lifecycle policies) – those may need to be configured (see the documentation for Apache Iceberg or Delta Lake).

Certain cloud provider object stores, such as AWS S3, may require additional services for optimistic locking (e.g. AWS DynamoDB), which may need to be configured (and possibly checked regularly to ensure no outdated locks remain) – see, for example, Apache Iceberg.
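
A minimal sketch of such a configuration for Apache Iceberg with the AWS Glue catalog and a DynamoDB-based lock manager follows; the catalog name and DynamoDB table name are illustrative assumptions, so verify the exact properties (and whether a lock manager is still needed in your version) against the Iceberg AWS documentation.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
    # DynamoDB-based lock manager for optimistic concurrency (table name is illustrative)
    .config("spark.sql.catalog.demo.lock-impl", "org.apache.iceberg.aws.dynamodb.DynamoDbLockManager")
    .config("spark.sql.catalog.demo.lock.table", "iceberg_locks")
    .getOrCreate()
)
```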

Migrating Data

Nearly all new Big Data formats support migrating data from other new Big Data formats or from „traditional“ Big Data formats, such as Parquet or Avro. They often provide dedicated functionality in the form of SQL commands to migrate the data (see e.g. Apache Iceberg). This means that a traditional Create Table As Select (CTAS) should not be used. You must also avoid simply copying files (e.g. Parquet files into the Iceberg folder), as this may lead to incorrect data because the metadata is not written correctly.
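
For Apache Iceberg, for example, this functionality is exposed as Spark procedures; a minimal sketch with illustrative table names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# snapshot(): creates an Iceberg table that references the files of an existing
# Parquet-backed table without touching the source (useful for testing)
spark.sql("CALL demo.system.snapshot(source_table => 'db.sales', table => 'db.sales_iceberg_test')")

# migrate(): replaces the existing table by an Iceberg table in place
spark.sql("CALL demo.system.migrate(table => 'db.sales')")
```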

You can also migrate from one new Big Data format to another (e.g. Delta Lake to Apache Iceberg). This can also include the migration of metadata, such as snapshots, to ensure that features like time travel can still be used (see e.g. Apache Iceberg).

Support for (Technical) Data Catalogs

The novel Big Data formats require a lot of additional metadata to be stored alongside the data to support their features. While most of the metadata itself consists of files (e.g. in JSON format), some features may require a metadata store or data catalog to store the technical metadata. This means, for instance, that it stores the table metadata so that it can be retrieved by Big Data engines automatically. Do not confuse this with a business data catalog (e.g. Apache Atlas or DataHub) that targets business users so they can easily find data using business and semantic descriptions. Those business data catalogs are used complementary to technical data catalogs.

Different Big Data engines (see next section) support different types of data catalogs.
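
As an example, the following minimal sketch registers an Apache Iceberg catalog backed by the Hive Metastore in Spark; the catalog name, metastore URI and warehouse path are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hive")                        # use the Hive Metastore
    .config("spark.sql.catalog.demo.uri", "thrift://metastore-host:9083")
    .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Tables created under this catalog can now be looked up by other engines via the metastore
spark.sql("SHOW TABLES IN demo.db").show()
```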

Find here some (technical) metadata catalogs and their support by the file formats:

| Metadata Catalog | Support |
| --- | --- |
| Hive Metastore (supported by virtually all Big Data engines – not only Hive) | Apache Iceberg, Apache Hudi, Delta Lake (via Spark data catalog) |
| Spark data catalog | Apache Iceberg, Apache Hudi, Delta Lake |
| Nessie data catalog (git-like versioning, supports many Big Data engines) | Apache Iceberg |
| AWS Glue Data Catalog (commercial) | Apache Iceberg, Apache Hudi, Delta Lake |
| Azure Synapse Workspace Catalog (commercial) | All of Spark data catalog |
| Google BigQuery Data Catalog/BigLake Data Catalog (commercial) | Apache Iceberg, Apache Hudi, Delta Lake (via Dataproc Metastore, i.e. Spark data catalog) |
| Custom (can be used through the format-specific connector for the Big Data engine) | Apache Iceberg (JDBC and REST), Apache Hudi (Hudi Metastore) |

(Technical) metadata catalogs and their support by formats

Note: Certain types of concurrency support require locking mechanisms that are implemented using databases, such as AWS DynamoDB. See, for instance, Apache Iceberg's DynamoDB locking.

Support for new formats in Big Data Technology

New file formats need to be supported by Big Data engines to make them usable. The more engines support a format, the easier it is to migrate to other engines. Of course, the engines themselves need to be designed to leverage the new features of those formats, which is often only the case if you use recent versions of those engines. Additionally, you should ideally use a metadata catalog supported by the engines you need it for (see previous section).

Find here some popular engines in alphabetical order and their support for new formats:

| Engine | Supported formats |
| --- | --- |
| Apache Flink | Apache Iceberg, Apache Hudi, Delta Lake |
| Apache Hive | Apache Iceberg, Apache Hudi, Delta Lake (read-only) |
| Apache Impala | Apache Iceberg (only V1), Apache Hudi (experimental, read-only) |
| Apache Spark | Apache Iceberg, Apache Hudi, Delta Lake |
| AWS Athena (commercial Trino) | Apache Iceberg, Apache Hudi, Delta Lake (read-only, no time travel) |
| AWS Glue (commercial Spark) | same as Apache Spark |
| Azure Synapse Spark (commercial Spark) | same as Apache Spark |
| Azure Synapse SQL | supports generally only few formats; Delta Lake |
| Dremio (commercial) | Apache Iceberg, Delta Lake (read-only), Apache Hudi (only indirectly via the Hive connector) |
| Google BigQuery (commercial) | Apache Iceberg, Apache Hudi (Hive-style partitions) |
| Google Dataproc Serverless (commercial Spark) | same as Apache Spark |
| Snowflake (commercial) | Apache Iceberg, Delta Lake (experimental) |
| Trino (formerly known as Presto) | Apache Iceberg, Apache Hudi, Delta Lake |

Popular Big Data engines and their support for new Big Data formats

Permission management, e.g. at row level, depends on the engine used and not on the file format.

Delta Lake is the only format that is specific to a single company (Databricks), while Apache Hudi and Apache Iceberg are supported by multiple companies/institutions/individuals. This means the latter are more receptive to changes/innovation by the community and are a bit more attractive for support by more Big Data engines. While Delta Lake is open source under the Linux Foundation, certain features may only be available in the commercial Databricks offering. However, more and more has been released as open source recently.

Apache Iceberg has usually been the first format supported by Big Data engines and/or the one supported with most of its features. Only recently have the other formats caught up.

There are also many other projects that support at least one of these formats.

I did not mention here tools that support using these file formats in a local setting, i.e. outside a Big Data engine. For instance, Python/Pandas supports Apache Iceberg, and delta-rs provides Rust/Python support for Delta Lake.
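
For example, a Delta Lake table can be read locally with the delta-rs Python bindings (package deltalake); a minimal sketch with an illustrative local path:

```python
from deltalake import DeltaTable

dt = DeltaTable("./data/events")  # path to an existing Delta Lake table
df = dt.to_pandas()               # load a (reasonably small) table into a pandas DataFrame
print(df.head())
```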

Novel use cases for Big Data formats

As we have seen, support for the novel Big Data formats in Big Data engines is already very broad – but what use cases do they have?

If you already use Apache Parquet, Apache ORC or Apache Avro and you do not have any operational issues/complexity, then there is no immediate need to change to the novel formats – the established formats are used in millions of production installations for petabytes of data.

The novel formats become relevant for the following use cases:

  • You have many partitions and users should be able to leverage them without specific knowledge of the partitions (Apache Iceberg)
  • Evolve partitions in-place (Apache Iceberg) – especially relevant if queries over time change and new ways of using the data emerge
  • Schema evolution of your data in-place. Note: If you have semi-structured data (e.g. JSON with different schemas) then you may consider a NoSQL database (document database) instead
  • Insert/Update/Deletes of individual rows with concurrent read/writes
  • High-performance ingestion of streaming data and making it available to end users for Big Data analytics
  • Change Data Capture of relational databases
  • Improved performance on object stores
  • Implement data privacy regulations for personal data (e.g. the General Data Protection Regulation (GDPR)) – for instance, the right to be forgotten, where the data of individual data subjects needs to be removed (see the sketch after this list)
  • Many more
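
To make two of these use cases concrete (the right to be forgotten and Change Data Capture style upserts), here is a minimal sketch in Spark SQL via PySpark, assuming an illustrative Iceberg table demo.db.customers and a session with the Iceberg connector/extensions configured; Hudi and Delta Lake offer equivalent DELETE/MERGE commands.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Right to be forgotten (GDPR): remove all rows of a single data subject
spark.sql("DELETE FROM demo.db.customers WHERE customer_id = 42")

# Change Data Capture style upserts: apply a batch of changes from a source system
changes = spark.createDataFrame(
    [(7, "new-address"), (42, "updated-address")], ["customer_id", "address"]
)
changes.createOrReplaceTempView("changes")
spark.sql("""
  MERGE INTO demo.db.customers t
  USING changes c
  ON t.customer_id = c.customer_id
  WHEN MATCHED THEN UPDATE SET t.address = c.address
  WHEN NOT MATCHED THEN INSERT (customer_id, address) VALUES (c.customer_id, c.address)
""")
```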

Which one to choose?

The novel Big Data formats evolve continuously – new features might be added soon. Nevertheless, they are stable for production use. Which one is best supported for you depends on your software ecosystem. Certain formats, such as Apache Iceberg, have unique features that other formats do not have (yet).

You should always do performance tests with realistic test data volumes right from the start of your project. This allows you to fine-tune the formats towards your data access patterns and to find the right format for your use cases. Keep in mind that performance characteristics also often improve rapidly with new releases of a file format.

I recommend designing your analytics platform so that all of the formats mentioned here are supported – your analytics users should be able to choose the best format for their use case. Additionally, you should make sure that you can easily convert between the formats. As sketched above, this cannot always be done with a simple CTAS.

However, even if your analytics platform supports all formats, I recommend defining one of them as the default format for your data in your enterprise – to avoid mixing related datasets in different formats, which can cause performance issues as well as compatibility problems.

Finally, some formats can have a more open ecosystem compared to other formats which are dominated by one company (e.g. Delta Lake).

Outlook

We will see in the future that the formats try to reach feature parity with each other. Also, more Big Data engines and commercial offerings will tend to support all of the novel file formats presented here.

Additionally, we can expect support for encryption of parts of the data in the format (e.g. in Apache Iceberg) as an additional level of defense and access management. This will integrate with the (hardware-based) key management services of cloud providers.

Geospatial data is becoming more and more important for Internet of Things and climate-related data. GeoParquet, for instance, provides the possibility to manage vector-based geospatial data. Third-party open source projects already try to bring this to the novel Big Data formats (e.g. Geolake for Apache Iceberg). Especially for these scenarios, the additional features of the novel Big Data formats seem to enable new use cases that are not directly possible with the „traditional“ Big Data formats (despite e.g. GeoParquet supporting geospatial data).

Certain industries may also see support for cross-organisational data analytics across different clouds. This is in early stages and, technically speaking, not file-format specific, but requires additional support from Big Data engines (e.g. see early implementations of delta-sharing). Such cross-organisational data analytics is not centrally defined and enforced, but works in a more distributed data mesh fashion.

