Xenonstack Recommends

Guide to Hudi Architecture, Tools and Best Practices

Acknowledging Data Management
          Best Practices with DataOps


What is Apache Hudi?

Apache Hudi Stands for Hadoop Upserts and Incrementals to manage the storage of large analytical datasets on HDFS. The primary purpose of Hudi is to decrease the data latency during ingestion with high efficiency. Hudi developed by Uber is open source, and the analytical datasets on HDFS serve out via two types of tables, Read Optimized Table and Near-Real-Time Table. The primary purpose of reading Optimized Table is to provide the query performance through columnar Storage and on the other hand Near-Real-Time table provides queries on Real-Time (combination of Row-based storage and Columnar storage). Hudi is an Open Source Spark library for performing operations on Hadoop like the update, insert, and delete. It also allows users to pull only changed data improving the query efficiency. It further scales horizontally like any job and stores datasets directly on HDFS.

Understanding How Apache Hudi Works?

Hudi provides the following primitives over datasets on HDFS -
  • Upsert
  • Incremental consumption
Hudi maintains the timeline of all activity performed on the dataset, to provide instantaneous views of the dataset. Hudi organizes datasets into a directory structure under a basepath very similar to Hive tables. Dataset is broken up into partitions, folders contain files for that partition. Each partition uniquely identified by partition path, relative to the basepath. Each partition records distributed into multiple files. Each file has a unique file id and the commit that produced the file. Multiple files share same file id but written at different commits, in case of updates. Storage Type - it only deals with how data stored.
  • Copy on write.
  • Purely Columnar.
  • Create the new version of files.
  • Merge On Reading.
  • Near-Real-Time.
Views - And on the other hand Views deals with how is data read. Read Optimized View - Input Format picks only Compacted Columnar Files.
  • Parquet Query Performance.
  • ~30 mins latency for ~500GB.
  • Targets existing Hive tables.

Real-Time View

  • Hybrid of row & columnar data.
  • ~1-5 mins latency.
  • Brings Near-Real-Time Tables.

Log View

  • A stream of changes to the dataset.
  • Enables Incremental Pull.
Hudi Storage consists of three distinct parts - Metadata - It maintains the metadata of all activity performed on the dataset as a timeline, which allows instantaneous views of the dataset, stored under a metadata directory in the basepath. The types of actions in the timeline - Commits - A single commit capture information about an atomic write of a batch of records into a dataset. A monotonically increases timestamp, identifies the Commits, denoting the start of the write operation. Cleans - Clean the older versions of files in the dataset that will no longer be used in a running query. Compactions - It is foundation action to accommodate differential information structures. Index - It maintains an index to quickly map an incoming record key to a file if the record key is already present. Index implementation is pluggable - Bloom filter in each data file footer - It's preferred as the default option since there is no dependency on any external system. Index and data are always consistent with one another. Apache HBase: Efficient for a small batch of keys. It's likely to shave off a few seconds during index tagging. Data - Hudi stores ingested data in two different storage formats. The actual formats used are pluggable, but require the following characteristics -
  • Read-Optimized columnar storage format (ROFormat). The default is Apache Parquet.
  • Write-Optimized row-based storage format ( WOFormat). The default is Apache Avro.

Why Apache Hudi Important for Large Scale and Near Real-Time Applications?

Hudi solves the following limitations -
  • Scalability limit in HDFS.
  • Need for faster data delivery in Hadoop.
  • For updates and deletes for existing data, there is no Direct support.
  • Fast ETL and modeling.
  • To retrieve all the records updated, regardless of whether these updates are new records added to recent date partitions or updates to older data, Hudi allows the user the last checkpoint timestamp. This process also disables the process of executing a query which scans the entire source table.

How to Adopt Hudi for Data Pipelines with Apache Spark ?

Download Hudi

$ mvn clean install -DskipTests -DskipITs
$ mvn clean install -DskipTests -DskipITs -Dhive11

Version Compatibility

Hudi requires Java 8 installation. Hudi works with Spark-2.x versions.
Hadoop Hive Spark Instructions to Build Hoodie
Apache Hadoop-2.8.4 Apache Hive-2.3.3 spark-2.[1-3].x Use “mvn clean install -DskipTests”
Apache Hadoop-2.7.3 Apache Hive-1.2.1 spark-2.[1-3].x Use “mvn clean install -DskipTests”

Generate a Hudi Dataset

Setting the Environment Variable

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/

export HIVE_HOME=/var/hadoop/setup/apache-hive-1.1.0-cdh5.7.2-bin

export HADOOP_HOME=/var/hadoop/setup/hadoop-2.6.0-cdh5.7.2

export HADOOP_INSTALL=/var/hadoop/setup/hadoop-2.6.0-cdh5.7.2


export SPARK_HOME=/var/hadoop/setup/spark-2.3.1-bin-hadoop2.7




Supported API'S

Use the DataSource API to quickly start reading or writing Hudi datasets in a few lines of code. Use the RDD API to perform more involved actions and a Hudi dataset.

Best Practices of Hudi

Use a new type of HoodieRecordPayload and keep the previous persisted one as the output of combineAndGetUpdateValue(...). However, the commit time of the previous persisted one updated to the latest, which makes the downstream incremental ETL counting this record twice. A left join data frame with all the persisted data by key and insert the records whose persisted_data.key are null. The concern is not sure bloomIndex /metadata taken full advantage. Put a new flag field in the HoodieRecord reading from HoodieRecordPayload metadata to indicate if a copyOldRecord needed during writing. Pass down a flag during data frame options to enforce this entire job will be copyOldRecord.

Benefits of Apache Hudi

  • Scalability limitation in HDFS.
  • Fast data delivery in Hadoop.
  • For updates and deletes for existing data, there is no Direct support.
  • Fast ETL and modeling.

Apache Hudi Vs. Apache Kudu

Apache Kudu is quite similar to Hudi; Apache Kudu is also used for Real-Time analytics on Petabytes of data, support for upsets. The primary key difference between Apache Kudu and Hudi is that Kudu attempts to serve as a data store for OLTP(Online Transaction Processing) workloads but on the other hand, Hudi does not, it only supports OLAP(Online Analytical Processing). Apache Kudu does not support incremental pulling, but Hudi supports the incremental pulling. There is also other main key difference, Hudi wholly based on the Hadoop compatible file system such as HDFS, S3 or Ceph and Hudi also doesn't have it own storage server. But on the other hand, Apache Kudu has its Storage servers which further talk to each other via RAFT. For heavy lifting, Hudi is dependent on Apache Spark, so that Hudi can be scaled easily just similar to other spark jobs.

Concluding Apache Hudi

Hudi fills a big void for processing data on top of HDFS, and thus mostly co-exist nicely with these technologies. Hudi is best to perform insert/update operation on parquet format data on top of HDFS.

Related blogs and Articles

Overview to Big Data Governance Tools, Benefits and Best Practices

Big Data Governance

Overview to Big Data Governance Tools, Benefits and Best Practices

What is Big Data Governance? Before proceeding, you must know What is Big Data? Big Data Governance is the process and management of data availability, usability, integrity, and security of data used in an enterprise. It includes all the steps from storing the data to secure it from any mishap. It is not just only about technology. Responsible for the particular data asset along with the...