Best Practices for Data and File Storage Management
What Is a Storage Format?
A storage format defines how information is stored in a file or database. The file extension usually indicates the format. Different data and file formats are used across the Big Data stack, including Spark, Hadoop, Hive, and others.
Standard File Formats
Some standard file formats are text files (CSV, XML) and binary files (images).
- Text Data - This data comes in the form of CSV or unstructured text such as tweets. CSV files are commonly used for exchanging data between Hadoop and external systems.
- Structured Text Data - This is a more specialized form of text file, such as XML or JSON. Processing JSON is more challenging than processing XML because JSON has no tokens that mark the beginning or end of a record.
- Sequence Files - These files store data in a binary format with a structure similar to CSV. Sequence files store only the data, with no metadata describing it. Sequence files support block compression.
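The point about JSON record boundaries can be illustrated with a minimal sketch. It assumes a hypothetical reader that treats each line as a candidate record, and it uses newline-delimited JSON (one record per line) as a contrasting convention that is not discussed above. The names `parse_lines`, `pretty`, and `ndjson` are illustrative only.

```python
import json

records = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]

# Pretty-printed JSON: each record spans several lines, so a reader that
# starts at an arbitrary line cannot tell where a record begins or ends.
pretty = json.dumps(records, indent=2)

# Assumed alternative convention: exactly one JSON record per line.
ndjson = "\n".join(json.dumps(r) for r in records)


def parse_lines(text):
    """Try to parse every line as a standalone JSON record.

    Returns the records that parsed and the number of lines that failed.
    """
    parsed, failed = [], 0
    for line in text.splitlines():
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            failed += 1
            continue
        if isinstance(value, dict):
            parsed.append(value)
    return parsed, failed


pretty_parsed, pretty_failed = parse_lines(pretty)
line_parsed, line_failed = parse_lines(ndjson)

print(len(pretty_parsed), pretty_failed)  # no complete record found line by line
print(len(line_parsed), line_failed)      # every line is a complete record
```

With one record per line, a newline acts as the record delimiter that plain JSON lacks, which is why a line-based reader succeeds only on the second input.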
Serialization is the process of turning a data structure into a byte stream, either for storage or for transmission over a network. Deserialization is the process of converting a byte stream back into data structures. Writable is the main serialization format used by Hadoop. Writables are fast and compact but difficult to extend or to use from languages other than Java. Several other serialization frameworks are used within the Hadoop ecosystem.
- Thrift - Developed for implementing cross-language interfaces to services. It uses an IDL file to generate stub code for RPC clients and servers that communicate seamlessly across programming languages. It does not support MapReduce natively.
- Protocol Buffers (protobuf) - It provides facilities to exchange data between services written in different languages. Like Thrift, its structures are defined via an IDL. It is neither splittable nor compressible and, like Thrift, does not support MapReduce natively.
- Avro Files - Avro is a language-neutral data serialization system. Avro stores the schema (the metadata) together with the data. Avro supports MapReduce. Avro data files are block-compressible and splittable. Avro files support schema evolution, which makes Avro a better choice than Sequence Files.
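The serialization round trip described above, and the idea of storing the schema alongside the data, can be sketched with the Python standard library. This is only an illustration under stated assumptions: the byte layout (a length-prefixed JSON header followed by fixed-width records) and the names `serialize`, `deserialize`, and the `fmt` field are hypothetical, and this is not the actual on-disk format of Avro, Thrift, or Protocol Buffers.

```python
import io
import json
import struct


def _record_format(schema):
    # Big-endian struct format built from the per-field codes in the schema.
    return ">" + "".join(field["fmt"] for field in schema["fields"])


def serialize(schema, rows):
    """Turn rows into bytes: a schema header followed by packed records."""
    buf = io.BytesIO()
    header = json.dumps(schema).encode("utf-8")
    buf.write(struct.pack(">I", len(header)))  # 4-byte header length
    buf.write(header)                          # schema travels with the data
    fmt = _record_format(schema)
    names = [field["name"] for field in schema["fields"]]
    for row in rows:
        buf.write(struct.pack(fmt, *(row[name] for name in names)))
    return buf.getvalue()


def deserialize(data):
    """Rebuild the schema and rows from bytes produced by serialize()."""
    (header_len,) = struct.unpack_from(">I", data, 0)
    schema = json.loads(data[4:4 + header_len].decode("utf-8"))
    fmt = _record_format(schema)
    names = [field["name"] for field in schema["fields"]]
    size = struct.calcsize(fmt)
    rows, offset = [], 4 + header_len
    while offset < len(data):
        rows.append(dict(zip(names, struct.unpack_from(fmt, data, offset))))
        offset += size
    return schema, rows


# Hypothetical schema: "q" is a 64-bit integer, "d" is a 64-bit float.
schema = {"fields": [{"name": "user_id", "fmt": "q"},
                     {"name": "score", "fmt": "d"}]}
rows = [{"user_id": 1, "score": 9.5}, {"user_id": 2, "score": 7.25}]

payload = serialize(schema, rows)
decoded_schema, decoded_rows = deserialize(payload)
print(decoded_rows == rows)
```

Because the reader recovers the schema from the payload itself, it needs no out-of-band description of the fields, which is the property the Avro bullet attributes to storing metadata with the data.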
Columnar data formats are useful for modern Big Data applications. A columnar format provides several benefits over a row-oriented one. When you need to access only a small subset of columns, a columnar format is the better choice; when you need to read many columns of each record, a row-oriented format is preferable.
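The trade-off can be made concrete with a small in-memory sketch. It assumes a toy dataset and counts how many values each layout has to touch to sum a single column; the names `row_store`, `column_store`, and the two helper functions are hypothetical, and real formats add encoding, compression, and I/O effects that are not modeled here.

```python
rows = [
    {"id": 1, "country": "IN", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 20.0},
    {"id": 3, "country": "IN", "amount": 30.0},
]

# Row-oriented layout: all values of one record sit next to each other.
row_store = [list(record.values()) for record in rows]

# Column-oriented layout: all values of one column sit next to each other.
column_store = {name: [record[name] for record in rows] for name in rows[0]}


def sum_from_row_store(store, column_index):
    """Sum one column; every record is read in full to reach its value."""
    total, values_touched = 0.0, 0
    for record in store:
        values_touched += len(record)
        total += record[column_index]
    return total, values_touched


def sum_from_column_store(store, column_name):
    """Sum one column; only that column's values are read."""
    values = store[column_name]
    return sum(values), len(values)


row_total, row_touched = sum_from_row_store(row_store, 2)
col_total, col_touched = sum_from_column_store(column_store, "amount")
print(row_total, row_touched)  # 60.0 9
print(col_total, col_touched)  # 60.0 3
```

Both layouts return the same sum, but the column layout touches a third of the values in this three-column example. If the query needed every column of every record, the counts would be equal and the row layout would avoid reassembling records from separate columns.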
There are several columnar file formats -
- RC Files - This was the first columnar file format used in Hadoop. The RCFile format was developed to provide efficient processing for MapReduce applications, with fast data loading, quick query processing, and highly efficient use of storage space. RC files are well suited to queries, but writing an RC file requires more memory and computation than writing a non-columnar file format. RC files do not support schema evolution.
- ORC Files - ORC files were invented by Hortonworks to optimize performance in Hive. ORC files have the same benefits and limitations as RC files, just done better for Hadoop. ORC files compress better than RC files and enable faster queries. They also do not support schema evolution. ORC was designed specifically for Hive and has had limited support in non-Hive interfaces such as Pig, Java MapReduce, or Impala. An ORC file stores its data in groups of rows called stripes.
- Parquet Files - Parquet is a columnar data format suitable for different MapReduce interfaces such as Java, Hive, and Pig. It is also well suited to other processing engines such as Impala and Spark. Parquet performs as well as RC and ORC for reads but is slower to write than other columnar formats. Parquet supports schema evolution in that new columns can be added at the end of the structure. Currently, Hive and Impala support queries that add new columns, but other tools such as Pig may face challenges.
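The "add new columns at the end" style of schema evolution can be sketched as follows. This is a simplified model under stated assumptions: records are plain dictionaries rather than real Parquet files, and the helper `read_with_schema` and the column names are hypothetical. It shows only the reader-side idea that data written before a column existed is read back with a default in that column.

```python
def read_with_schema(records, schema, default=None):
    """Project each record onto `schema`, filling columns it lacks."""
    return [[record.get(column, default) for column in schema]
            for record in records]


old_schema = ["id", "name"]
new_schema = old_schema + ["email"]  # new column appended at the end

# Data written before and after the schema change (illustrative values).
old_file = [{"id": 1, "name": "a"}]
new_file = [{"id": 2, "name": "b", "email": "b@example.com"}]

table = read_with_schema(old_file + new_file, new_schema)
for row in table:
    print(row)
```

Appending at the end keeps the positions of existing columns stable, so older data stays readable under the newer schema without being rewritten.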
Why Does the Storage Format Matter?
A major bottleneck for HDFS-enabled applications such as Spark and MapReduce is the time taken to find relevant data in a particular location and to write the data back to another location. These issues are exacerbated by the difficulties of managing large datasets, such as evolving schemas or storage constraints. Various file formats have evolved as a way to ease these issues across a range of use cases.
Here are some significant benefits of choosing an appropriate file format -
- Faster read times
- Faster write times
- Splittable files
- Schema evolution support
- Advanced compression support
Some file formats are designed for general use (for example, with Spark or MapReduce), others are designed for more specific use cases, and some are designed with specific data characteristics in mind. So there are a lot of choices.
How to Choose a File Format?
There are three types of performance to consider when choosing a file format.
- Write performance - how fast the data can be written.
- Partial read performance - how fast individual columns within a file can be read.
- Full read performance - how fast every data element in a file can be read.
A columnar, compressed file format like ORC or Parquet is optimized for partial and full read performance, but at the expense of write performance. Uncompressed CSV files are fast to write but, because they lack compression and column orientation, are slow to read.
Many options can be used to store data. If you need to store intermediate data between MapReduce jobs, Sequence files are preferred. If query performance is most important, ORC or Parquet is optimal, but these files take longer to write. If the schema is going to change over time, Avro is best, but query performance will be slower than with ORC or Parquet. CSV files are best if you extract data from Hadoop to bulk load into a database.
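The guidance in this paragraph can be written down as a small lookup, shown here as a sketch rather than a definitive rule set. The `UseCase` enum, the `RECOMMENDATIONS` table, and the `recommend_format` function are hypothetical names introduced for illustration; the recommendations and trade-offs are only those stated in the paragraph above.

```python
from enum import Enum


class UseCase(Enum):
    INTERMEDIATE_MAPREDUCE_DATA = "intermediate data between MapReduce jobs"
    QUERY_PERFORMANCE = "query performance matters most"
    EVOLVING_SCHEMA = "schema changes over time"
    BULK_EXPORT_TO_DATABASE = "extract from Hadoop to bulk load a database"


# Each entry: (recommended formats, trade-off noted in the text).
RECOMMENDATIONS = {
    UseCase.INTERMEDIATE_MAPREDUCE_DATA: (["Sequence file"], ""),
    UseCase.QUERY_PERFORMANCE: (["ORC", "Parquet"], "takes longer to write"),
    UseCase.EVOLVING_SCHEMA: (["Avro"], "slower queries than ORC or Parquet"),
    UseCase.BULK_EXPORT_TO_DATABASE: (["CSV"], ""),
}


def recommend_format(use_case):
    """Return the formats suggested for a use case and the stated trade-off."""
    return RECOMMENDATIONS[use_case]


formats, trade_off = recommend_format(UseCase.EVOLVING_SCHEMA)
print(formats, "-", trade_off)
```

Keeping the trade-off next to each recommendation makes explicit that no single format wins on write, partial read, and full read performance at once.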
Benefits of Choosing the Right Storage Format
- Faster read times
- Faster write times
- Splittable files
- Schema evolution support (allowing you to change the fields in a dataset).
- Advanced compression support (compress the files with a compression codec without sacrificing these features).
Best Practices for File Storage Formats
- When you need to access only a small subset of columns, use a columnar data format.
- When you need to read many columns of each record, use a row-oriented format instead of a columnar one.
- If the schema changes over time, use Avro instead of ORC or Parquet.
- If you need fast query performance, use ORC or Parquet instead of Avro.
- If you need to add columns, use Parquet instead of ORC.