What is Hadoop File System (HDFS)?
Hadoop File System (HDFS) is a distributed file system. All types of files can be stored in the Hadoop file system. All standard formats such as Gifs, text, CSV, tsv, xls, etc are supported by it. Beginners of Hadoop can opt for tab_delimiter (data separated by tabs) files because it is –
- Easy to debug and readable
- The default format of Apache Hive
What are Data Serialization Storage format?
Storage formats are a way to define how information is stored in the file. Most of the time, this information can be assumed from the extension of the dat`a. Both structured and unstructured data can be stored on HADOOP enabled systems. Common Hdfs file formats are –
- Plain text storage
- Sequence files
- RC files
Why Storage Formats?
- File format must be handy to serve complex data structures
- HDFS enabled applications to take time to find relevant data in a particular location and write back data to another location.
- Dataset is large
- Having schemas
- Having storage constraints
Why choose different File Formats?
Proper selection of file format leads to –
- Faster read time
- Faster write time
- Splittable files (for partial data read)
- Dchema evolution support (modifying dataset fields)
- Advance compression support
- Snappy compression leads to high speed and reasonable compression/decompression.
- File formats help to manage Diverse data.
Guide to Data Serialization in Hadoop
- Data serialization is a process to format structured data in such a way that it can be reconverted back to the original form.
- Serialization is done to translate data structures into a stream of data. This stream of data can be transmitted over the network or stored in DB regardless of the system architecture.
- Isn’t storing information in binary form or stream of bytes is a right approach.
- Serialization does the same but isn’t dependent on architecture.
Consider CSV files contains a comma (,) in between data, so while Deserialization wrong outputs may occur. Now, if metadata is stored in XML form, which is a self architected form of data storage, data can be easily deserialized in the future.
Why Data Serialization for Storage Formats?
- To process records faster (Time-bound).
- When Proper format of data need to be maintained and to be transmitted over data without schema support on another end.
- Now when in future, data without structure or format needs to be processed, complex Errors may occur.
- Serialization offers data validation over transmission.
Areas of Serialization for Storage Formats
To maintain the proper format of a data serialization system must have the following four properties –
- Compact – helps in the best use of network bandwidth
- Fast – reduces the performance overhead
- Extensible – can match new requirements
- Inter-operable – not language-specific
Serialization in Hadoop has two areas –
Inter process communication
When a client calls a function or subroutine from one pc to the pc in-network or server, that Procedure of calling is known as a remote procedure call.
It is better than java ‘s inbuilt serialization as java serialization isn’ t compact Serialization and Deserialization of data helps in maintaining and managing corporate decisions for effective use of resources and data available in Data warehouse or any other database -writable – language specific to java
Stacking of Storage Formats Hadoop
- Use splittable formats
- CSV files are lacking in block compression; using CSV files in HADOOP may lead to an increase in reading performance cost. Schema support is also limited..
- Use JSON records instead of JSON files. For better performance, experiments have to perform for each use case of JSON.
- AVRO file format and data serialization framework.
- Sequence files are complex in reading.
- Write operation is slower if RC (Row-Columnar) files are used.
- Optimized RC (ORC) files are also the option to be used from but have less support.
- Parquet Files are using columnar format to store data and process it.
Guide to Apache Hadoop File Format Configuration
- Hadoop distribution information and architecture type (system information where Hadoop system is working and jobs are processed) must be analyzed thoroughly.
- Perform Schema evolution by defining the structure of your metadata.
- Processing requirements for file formats and HADOOP system must also be configured.
- Engine support for reading and write operations.
- Storage size must be evaluated at a production level HADOOP File formats selection.
- Text files are lightweight but splitting of those files and reading data out of it leads to huge maps (MAPS needs to be created in a complicated way to achieve splitting of text (files data).
- Sequence files support block compression. As hive has SQL types, so not worthy of working with Hive.
- RCFILE has a high compression rate, but it takes more time to load data.
- ORC can reduce data size up to 75% and suitable with hive but increases CPU overhead. Serialization in ORC depends on data type (either integer or string).
- AVRO provides features like serialization and Deserialization with file format also. The core of AVRO is its schema. Supports dynamic and static types.
- Parquet provides the facility to partition data into both rows and columns but computationally slow on the write side.
- Avro is a data exchange and data serialization service system. It is based on schemas.
- Schema always get stored in AVRO file with data.
- Specific and Generic data are supported.
- JSON defines AVRO schema (most languages have JSON libraries), and data format is binary, that determines efficiency and make it compact.
- The default schema can be set for some mismatched or failed schema.
Apache Avro Solution Offerings
At the same time, a few questions arise that –
Is it possible to handle any schema change – Yes, in AVRO schema changes like missing fields, changed/modified fields and added/new areas are easy to maintain
What will happen to past data when the schema is changed – AVRO has the capability of reading data with new schema regardless of data generation time.
Is there any problem occurs while trying to convert AVRO data to other formats, if needed – There are some inbuilt specifications in the Hadoop system while using a Hybrid system of data. In such a way, Apache spark can convert Avro files to parquet.
Is it possible to manage MapReduce jobs – MapReduce jobs can efficiently be achieved by using Avro file processing in MapReduce itself.
What connection Handshake do – Connection handshake helps to exchange schemas between client and server during RPC.
How is AVRO different from other systems? (Thrift and protocol buffer)
- AVRO has Dynamic Schema stores for serialized values in binary format in a space-efficient way.
- JSON schema and data in the file.
- No compilation required can be parsed directly by using parser library.
- AVRO is a language-neutral system.
How is Avro implemented?
- Have the schema of data format ready
- To read schema in program
- Compile using AVRO (by generating class)
- Direct read via Parser library
- Serialization can be achieved through serialization API (java)
By creating class – A schema that is fed to AVRO utility and then gets processes it as java file. Now write API methods to serialize data.
By parser library – Fed schema to AVRO utility and access serialization parser library to serialize data
Deserialization can be achieved through deserialization API (java)
By generating class – Deserialize the object and instantiate DataFileReader class
By parser library – Instantiate parser class
Creating schema in Avro
There are three ways in which JSON schemas of AVRO can be defined and created –
- JSON String
- JSON object
- JSON Array
There are some data type that needs to be mentioned in schema –
- Primitive (common data types)
- Complex (when collective data elements need to be process and store)
- Primitive data types contain null, boolean, int, long, bytes, float, string, and double. In contrast, complex contains Records (attribute encapsulation), Enums, Arrays, Maps (associativity), Unions (multiple types for one field), and Fixed (deals with the size of data).
- Data types help to maintain sort order
- Logical types (a super form of complex data types) are used to store individual data that is involved, i.e., time, date, timestamp, and decimal.
Apache Avro Use Cases
- Schema evolution.
- Apture is using AVRO for logging matrices data.
- RichRelevance is using binary formatted key/value pairs.
Overview of Apache Parquet
Apache developed parquet, and it is a columnar storage format for Hadoop ecosystem. It supports all data processing frameworks. Also, it supports all data models or programming languages –
- Provides a compressed and efficient representation of columnar data.
- Record stripping and assembly algorithm from dremel paper.
- Compression schemes can be present to the per-column level.
- Metadata is written for single-pass writing after data.
- The file format contains column-wise data, which is split into row groups.
- Associations with data are done in three types of metadata forms
Files -Includes associations of schema element, principal value and row group which is connected to further column chunk.
Column (chunks) – Contains associations with KeyValue and connected with every ColumnChunk under RowGroup
Page Header – Contains info of size and associations of PageDataHeader, indexPage
Header and Dictionary PageHeader
Had a capability to reuse efficient encoding over primitive types utilizing Annotations that further can be decoded and interpreted and Support many engines such as Hive, Impala, spark, etc.
Apache Parquet Capabilities
- Task-oriented serialization of data can be performed if some column-based decisions need to be managed.
- Data in 3 rows like r1(c1,c2,c3), r2(c1,c2,c3), r3(c1,c2,c3) will be represented in columnar storage as 3 columns like c1(r1,r2,r3), c2(r1,r2,r3), c3(r1,r2,r3).
- Easy to split data partition in column way.
- As data is compressed and metadata is associated, modifications and new additions to schema may reflect changes immediately without interruption.
- The columnar format can store the same primitive types of values together, forming a nested structure of data (by using Definition and repetition levels).
- Columnar storage provides more homogeneous data so that compression rate will be better and best for Hadoop cluster.
- ProtoBuf model can easily be used to describe the data structure of the columnar format using a schema.
- Encoding and decoding of columns data make data more compact and compressed in nature.
Causes of error in Parquet.
- When any metadata is found corrupt, then it will cause particular associative data to be lost.
- While writing file metadata at the end of the file, if any error occurs, it may also cause loss of data.
- When the structure of the external table is changed, how metadata information can be updated?.
Apache Parquet Solution Offerings
- The approach used with sync markers in RC and AVRO can be used here. Write file metadata at some Nth RowGroup, so during error, partially written data can be recovered.
- Easy to integrate with AVRO, thrift, etc. Models.
- When the PURGE option is not provided, managed tables can be retrieved from trash for a defined duration.
Guide to Apache Hive
- Data warehouse software that provides fast processing of massive amount of data.
- Can be used for both Batch and Streaming queries.
- Read, write and manage large datasets of distributed systems using SQL.
- Requires schema to process structured data, follows OLAP.
- SQL typescripts are developed for MapReduce operations, known as HQL.
- Execution engine uses the technique of MapReduce to process the query and to generate results similar to MapReduce.
- To configure Metastore, an external database server is required (Apache Derby).
- Unstructured data first needs to be converted into structured form to process using Hive.
- HIVE has a high level of abstraction as compared to Hadoop MapReduce and code efficiency is very less.
Apache Hive Working Architecture
- First of all, a query is given to the driver to execute
- The driver then sends a request to the compiler and then requests for metadata from Meta store
- That metadata is then processed back to the driver. Then execution code is given to the execution engine, which processed data from the Hadoop system containing job tracker and HDFS with name nodes.
- After MapReduce finishes its processing of tasks and jobs, the result is then sent back to drivers and displayed to user interface (CLI, web UI, etc.).
- Hive organizes data into Databases (namespaces), tables (schemas), partitions (each unit defines its meaning) and clusters (known as Buckets).
- HIVE has user-defined Aggregations (UDAFs), user-defined table functions (UDTFs) and user-defined functions (UDFs) to expand Hive’s SQL.
Query Optimization with Apache Hive
- Use effective schema with partitioning tables
- Use compression on the map and reduce outputs
- Bucketing for increased performance on I/O scans
- MapReduce jobs with parallelism
- Vectorization process can be implemented for processing batch of rows together
- Hive has inbuilt support to SerDes and allows custom SerDes to be developed by users of Hive
- Shared and Exclusive locks can be used for optimizing resource use, and concurrent writers and readers are supported
Apache Hive Use Cases
- Amazon Elastic MapReduce.
- Yahoo is using Pig for data factory and Hive for data warehousing.
Apache Hive data types
- Column types (specific to columns only)
- Integral (TINYINT, SMALLINT, INT, and BIGINT)
- Strings (Varchar, Char)
- Timestamps (Date)
- Decimals (immutable arbitrary-precision as of JAVA)
- Literals (representation of fixed value)
- Decimals (values higher than DOUBLE)
- NULL type
- Complex Types (Arrays, Maps, Structs, Union)
- Hive also can change kind of data when metastore col.type.changes is set to false. If the conversion is successful, data will be displayed otherwise NULL.
Partitioning in Apache Hive
Partitioning is the technique to divide tables into parts based on columns data, the structure of the data, and the nature of the data based on its source of generation and storage –
- Partition keys help to identify partitions from a table
- Hive converts SQL queries into jobs of MapReduce to submit to Hadoop cluster
- Created when data is inserted
There are two different types of tables in the Hive –
- Manages data by default
- Security needs to be maintained at the Schema level
- Best for temporary data
- External warehouse directory needs to be mentioned when external tables are present
- Security must be applied at HDFS level
- When data is required even after DROP statement
- Execution load is horizontal
- When data is stored internally in a Hadoop system, then Dropping a table causes deletion of metadata and data tables permanently
Static partitioning in Apache Hive
- When data file size is significant, then static partitions are preferred only
- Big files are generally generated in HDFS
- From filename, partition column values can be easily fetched
- Partition column value needs to specify whenever Load statement is executed
- Each data file is individually inserted
- When the partition is to be defined for one or two columns only
- Alteration is possible on static partitions
Dynamic partitioning in Apache Hive
- Every row of data is read and then goes to a specific field in the table
- When ETL flow of data pipeline is required
- Partition column value needs not to be mention whenever LOAD statements are executed
- Only single insert
- Loads data from a non-partitioned table
- Takes more time in loading as data needs to be read row by row
- When the partition generation process isn’t static
- Alteration isn't possible, need to perform load operation again
Apache Hive Features
- Interactive shell with essential Linux commands support
- Beeline CLI, a JDBC client on SQLLine (used at HiveServer2)
- AvroSerde is used for Avro data in HIVE tables, HiveQL for ORC file formats
- Hive does not support some data types of AVRO, but AvroSerde automatically do the work for this
- Test cases of Various file formats in Hadoop
- Test cases outputs are directly or indirectly changed with test cases, data model usage and specification and configuration of hardware
Following are some test cases (based on CERN's blog) of Kudu, AVRO, Parquet, HBase –
- Based on space utilization
- Kudu and Parquet serve best compression size with Snappy and GZip algorithms
- Based on the ingestion rate
- AVRO and Parquet are best as compared with writing speed of others
- Based on Lookup latency of random data
- Kudu and Hbase are faster than others
- The scan rate of data
- AVRO has best scan rate of data when non-key columns are involved in output Test Cases Output
Using the hybrid approach in the Hadoop file system provides more efficient and Fast data processing with Structured or unstructured data for SerDes and Analytics on Data. The following are some useful points with some formats that can be used within the Hybrid approach.
- Parquet or kudu, when used with snappy compression, reduces the volume of data by ten times than simple uncompressed serialization format
- Random lookup operations case is best suitable for HBase or Kudu. Parquet is also fast, but resources use more.
- Analytic on data is the best use case for Parquet and Kudu
- HBase and Kudu can alter in-place data
- Avro is the quickest encoder for structured data. Best when all attributes of record are required
- Parquet works seamlessly for scan and ingestion
SerDe in Apache Hive
Hive has some built-in SerDe for Avro, ORC, parquet, CSV, JsonSerDe, and Thrift. Custom support is also available for users. Hive engines automatically handle object conversion. Two types of processing are done for serialization and Deserialization of records.
- The configured input format is used by Execution engine to read
- Then performs Deserialization
- that engine also holds object inspector
- Deserialized object and inspector then passed to operators by the engine to manipulate records
- Serialize method of Serde uses deserialized object and object inspector
- Then performs serialization on it in expected output format for writing
A Holistic Approach
To know more about how Why Adapting Hadoop Storage Format Matters and understanding how to process records faster we recommend taking the following steps –