What is Apache Arrow?

"Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby," as quoted by the official website.

The project is a move to standardize the in-memory data representation used between libraries, systems, languages, and frameworks. Before Apache Arrow, every library and language had its own way of representing data in native in-memory data structures. For example, look at the Python world:

Library    In-memory representation
Pandas     Pandas data frame
NumPy      NumPy arrays
PySpark    Spark RDD / data frame / dataset

After Apache Arrow:

Library    In-memory representation
Pandas     Arrow format
NumPy      Arrow format
PySpark    Arrow format

It simplifies system architectures, improves interoperability, and reduces ecosystem fragmentation. The same holds across programming languages: a dataset preprocessed in R can be ported to Python for machine learning without the overhead of serialization and deserialization.
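
To make the idea of a shared columnar representation concrete, here is a minimal sketch using only the Python standard library. It does not use the Arrow libraries; the `rows`, `columns`, and `shared_view` names are illustrative assumptions. It contrasts a row-oriented layout with one contiguous typed buffer per field, and shows a second consumer reading the same bytes without a serialization step.

```python
from array import array

# Row-oriented: one Python dict per record, so the values of a single
# field are scattered across many separate objects.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 12.0},
    {"id": 3, "price": 7.25},
]

# Column-oriented: one contiguous, typed buffer per field.
columns = {
    "id": array("q", (r["id"] for r in rows)),        # 64-bit signed integers
    "price": array("d", (r["price"] for r in rows)),  # 64-bit floats
}

# A second "consumer" views the same bytes; nothing is copied or re-encoded.
shared_view = memoryview(columns["price"])

# A write through the producer's buffer is visible through the consumer's
# view, which shows that both refer to the same memory.
columns["price"][0] = 10.0
total = sum(shared_view)
```

In a real system the agreed-upon layout is the Arrow format itself, and the consumer may be a different library or language runtime rather than another object in the same script.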

How to Adopt Apache Arrow?

  • Extend Pandas using Apache Arrow: Pandas is written in Python, whereas Arrow is written in C++. Using Apache Arrow with Pandas addresses Pandas problems such as speed and takes advantage of modern processor architectures.
  • Improve Python and Spark performance and interoperability with Apache Arrow: Spark, a widely used big data processing framework, has also launched support for Apache Arrow. Spark uses Apache Arrow to provide its Pandas UDF functionality.
  • Vectorized processing of Apache Arrow data using the LLVM compiler: Gandiva is an open-source project supported by Dremio. It is a toolset for compiling and executing queries on Apache Arrow data. LLVM is a compiler infrastructure that supports various language front ends and back ends. Its core is the conversion of code (in any language) to an intermediate representation (IR), from which it generates machine code that makes use of modern processor features such as parallel processing and SIMD.
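
The "compile a query once, then run it over a whole column" idea behind expression compilers can be sketched in plain Python. This is only an analogy: it uses neither Gandiva nor LLVM and generates no machine code. The tuple-based expression format and the `compile_expr` and `evaluate` helpers are hypothetical names invented for this illustration.

```python
from array import array
import operator

# Hypothetical mini expression language built from nested tuples, e.g.
# ("add", ("mul", ("col", "price"), ("lit", 2.0)), ("col", "tax")).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def compile_expr(expr):
    """Walk the expression tree once and return a callable (columns, index)."""
    kind = expr[0]
    if kind == "col":
        name = expr[1]
        return lambda cols, i: cols[name][i]
    if kind == "lit":
        value = expr[1]
        return lambda cols, i: value
    op = OPS[kind]
    left, right = compile_expr(expr[1]), compile_expr(expr[2])
    return lambda cols, i: op(left(cols, i), right(cols, i))

def evaluate(compiled, cols, length):
    """Apply the compiled expression to every row of a batch of columns."""
    return array("d", (compiled(cols, i) for i in range(length)))

batch = {
    "price": array("d", [10.0, 20.0, 30.0]),
    "tax": array("d", [1.0, 2.0, 3.0]),
}
expr = ("add", ("mul", ("col", "price"), ("lit", 2.0)), ("col", "tax"))
result = evaluate(compile_expr(expr), batch, 3)
```

The tree is interpreted only once, at compile time. A real code generator goes further and emits native code for the loop, which is where SIMD instructions come into play.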

Why Does Apache Arrow Matter?

At the time of writing, the latest release is 0.11.0 (8 October 2018). Apache Arrow has a bright future ahead, and it is one of a kind in its field. Coupled with Parquet and ORC, it makes for a great big data ecosystem. Adoption of Apache Arrow has been rising since its first release, and it has been adopted by Spark. Industry has also welcomed Apache Arrow, with adoption at organizations such as Cloudera and Dremio. Developers from more than 11 critical big data open source projects are involved in the project.

The team is working on a messaging framework named Arrow Flight, which is based on gRPC and optimized for transferring data over the wire. It makes transferring data in Arrow format from one system to another even easier. Integration with machine learning frameworks such as TensorFlow and PyTorch would be significant progress and could further reduce training and prediction times. This comes at a good time because, with new-generation GPUs in use, the bottleneck of data read speeds seems to be coming to an end.

How Does Apache Arrow Work?

The main features are:

  • It is a columnar in-memory layout, which allows O(1) random access. This kind of layout is highly cache-efficient in OLAP (analytical) workloads and also benefits from modern processor optimizations such as SIMD.
  • Its data model supports simple and complex data types, handling flat data structures as well as real-world JSON-like nested workloads.
  • Columnar data has become the de facto format for building high-performance query engines that run analytical workloads. Arrow provides canonical in-memory representations for both flat and nested data structures.
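
As a rough illustration of why a columnar layout gives O(1) random access, the sketch below stores a nullable integer column as a contiguous values buffer plus a validity bitmap, in the spirit of Arrow's fixed-width layout. It uses only the Python standard library, and the `build_int64_column` and `get` helpers are hypothetical names for this example, not Arrow APIs.

```python
from array import array

def build_int64_column(values):
    """Pack ints and None into a values buffer plus a validity bitmap."""
    # Null slots still occupy one fixed-width entry; 0 is an arbitrary filler.
    data = array("q", (0 if v is None else v for v in values))
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # a set bit means "value present"
    return data, bitmap

def get(data, bitmap, i):
    """O(1) lookup: one bit test plus one indexed read."""
    is_valid = (bitmap[i // 8] >> (i % 8)) & 1
    return data[i] if is_valid else None

data, bitmap = build_int64_column([7, None, 42, 13])

# A scan touches the values buffer sequentially, which is what makes this
# layout friendly to CPU caches.
valid_sum = sum(
    data[i] for i in range(len(data)) if get(data, bitmap, i) is not None
)
```

Because every slot has the same width, the position of element `i` is a simple multiplication away, regardless of how many nulls precede it.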

Best Practices of Apache Arrow

Use the Arrow columnar format for in-memory representation and Parquet for on-disk representation to make systems faster than before. Any value of interest can be located in constant O(1) time by “walking” the variable-length dimension offsets. The data itself is stored contiguously in memory, which makes it highly cache-efficient to scan the values for analytics purposes.
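
The offset "walking" described above can be sketched for a column of variable-length lists. This is a simplified, standard-library-only illustration of the offsets-plus-values idea. `build_list_column` and `get_list` are hypothetical helper names, and nulls are left out for brevity.

```python
from array import array

def build_list_column(lists):
    """Flatten variable-length lists into an offsets buffer and a values buffer."""
    offsets = array("i", [0])  # C int, 32 bits on common platforms
    values = array("q")        # all child values, stored contiguously
    for item in lists:
        values.extend(item)
        offsets.append(len(values))  # end of this list, start of the next
    return offsets, values

def get_list(offsets, values, i):
    """O(1) lookup: two offset reads, then a view into the values buffer."""
    start, end = offsets[i], offsets[i + 1]
    return memoryview(values)[start:end]  # a view, not a copy

offsets, values = build_list_column([[1, 2, 3], [], [4, 5]])
third = get_list(offsets, values, 2).tolist()
empty = get_list(offsets, values, 1).tolist()
```

Element `i` always spans `offsets[i]` to `offsets[i + 1]`, so no earlier element has to be scanned to find it, and an empty list is simply two equal offsets.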

Benefits of Apache Arrow

  • Provides a common data access layer to all applications.
  • Supports more native data types than Pandas, such as date, nullable int, and list.
  • Highly efficient I/O.
  • Zero-copy interchange with other ecosystems such as Java and R.
  • Optimized for data locality, SIMD, and parallel processing.
  • Well suited to SIMD (single instruction, multiple data) vectorized processing.
  • Accommodates both random access and scan workloads.
  • Low overhead for streaming messaging and RPC.
  • "Fully shredded" columnar format that supports flat and nested schemas.
  • Supports GPUs as well.

Concluding Apache Arrow

Apache Arrow is used as a run-time in-memory format for analytical query engines. It enables zero-copy interchange (no serialization or deserialization) via shared memory, and it can send large datasets over the network using Arrow Flight. It has also led to new file formats that allow zero-copy random access to on-disk data, for example Feather files. It is used in the Ray ML framework from Berkeley RISELab.
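
Zero-copy random access to on-disk data can be illustrated with a memory-mapped file. The sketch below uses only the Python standard library. The raw `column.bin` file is a hypothetical stand-in for a real columnar file, which would also carry schema and metadata, as Feather files do.

```python
import mmap
import os
import tempfile
from array import array

column = array("q", range(1000))  # one int64 column with values 0..999

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")

    # Write the column as raw fixed-width values in native byte order.
    with open(path, "wb") as f:
        column.tofile(f)

    # Map the file and reinterpret the mapped bytes as int64 values.
    # There is no parsing step, and the OS pages in only what is touched.
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm, \
            memoryview(mm) as raw, \
            raw.cast("q") as view:
        value_at_500 = view[500]  # direct indexed read from the mapping
        row_count = len(view)
```

The views are released before the mapping is closed, since a mapping cannot be closed while buffers exported from it are still alive.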
