Rust for Big Data and Parallel Processing Applications

Why Rust for Big Data and Parallel Processing Applications?

Big Data is a rapidly growing field, driven by its wide range of use cases and the precision of its results, and demand for it is spreading to related fields such as IoT, AI, and data analytics. Many stable products and projects have been built on the existing tools and architectures. However, that does not mean everything has been figured out; there is still room for improvement in this field.

Problems with Existing Systems

Most of the widely adopted Big Data tools are JVM-based, and the JVM has its limitations:
  • The JVM abstracts the hardware from the developer. As a result, JVM applications can rarely achieve near-native speed.
  • The garbage collector is part of the JVM and lets developers focus on business logic rather than memory management. It works fine with its default configuration, but it can become disruptive once it has to be tuned for new requirements. Some tools have developed ways to bypass it. For example, Apache Flink has its own off-heap memory management, and Apache Spark has similar off-heap memory management through Project Tungsten.
  • The JVM has a large memory footprint, which makes Java hard to scale down. Linkerd, for example, moved from a Scala + Finagle + Netty stack to Rust.
  • Java requires serialization and deserialization to send data from one process to another, which is slow and also prone to security vulnerabilities.
  • Shared-memory concurrency is hard to program in Java and is prone to data races.

Ongoing Projects in Big Data and Parallel Processing that use Rust

Apache Spark is one of the most popular and widely used tools in Big Data. It is JVM-based, so it has all the pros and cons of the JVM. In his blog post Rust is for Big Data, Andy Grove describes his experience in this field, gained from building distributed data processing jobs with Apache Spark as his daily work, and the places where he saw room for more efficiency. He also points out areas where some brilliant engineering has gone into Spark to handle efficiency issues, and how that work made Apache Spark (a JVM-based tool) rely less on the JVM. According to him, Apache Spark would be more efficient if it had been built in a language like Rust. So he started an open-source project, later named `DataFusion`, which after some time was accepted by the Apache community itself. DataFusion is now an in-memory query engine that uses Apache Arrow as its memory model. It supports executing SQL queries against CSV and Parquet files as well as querying in-memory data directly, and it is developed by many contributors from the Apache and Rust communities.
There is also a blog post, based on a research paper, that describes the Weld project: a Rust-based system that generates code for data analysis workflows that runs efficiently in parallel, using the LLVM compiler framework. According to the researchers from MIT CSAIL, a pipeline spends more time moving data back and forth between its pieces than actually doing work on it. Weld therefore provides a runtime that each library can plug into, giving a standard way to run the critical parts of the pipeline that need parallelization and optimization. They describe Weld as a common runtime for data analytics that takes the disjointed pieces of a modern data processing stack and optimizes them in concert. Because numerous projects already run on these systems, they do not ask developers to change how they use them; instead, they expect framework maintainers and builders to adopt Weld. They also explain that there used to be a significant trade-off between speed and safety, which was resolved by using Rust as the implementation language.
Python has long been a popular choice for newcomers entering data science and ML, and is close to being the first choice, because it is easy to learn and highly dynamic. Yet a subset of the community believes that Python could become faster and more effective if its interpreter were written in Rust. There is an open-source project under development with that aim, `RustPython`. For now, it depends on CPython, but it has two goals: first, a Python 3 interpreter written entirely in Rust without any bindings, and second, an interpreter free from any compatibility hacks. With this, developers could learn Python with the same ease and benefit from the safety features that Rust has to offer.

Why do we need Rust for Big Data and Parallel Processing?

Rust is still a young language, yet it has a lot to offer. It is a low-level language that compiles to native machine code, and its compiler guards against whole classes of memory and threading bugs. It is highly efficient at threading (which Rustaceans call `Fearless Concurrency`) and introduces the unique concepts of 'Ownership' and 'Lifetimes', which let programmers specify exactly who owns each value and how long it lives. Because of this, Rust does not need a garbage collector and does not have one. It has proven to be highly efficient in memory management and thread management, and fast in performance. It integrates easily with other languages and has a lot more to come in the future.
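As a rough illustration of ownership and `Fearless Concurrency`, the sketch below counts words across several text chunks, one thread per chunk, using only the standard library. The function name `count_words_parallel` and the sample data are assumptions made for this example; they do not come from any project mentioned in this article.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Counts words across several text chunks, one thread per chunk.
/// (Hypothetical helper, for illustration only.)
fn count_words_parallel(chunks: Vec<String>) -> usize {
    // Shared total: `Arc` gives shared ownership, `Mutex` guards mutation.
    let total = Arc::new(Mutex::new(0usize));
    let mut handles = Vec::new();

    for chunk in chunks {
        let total = Arc::clone(&total);
        // `move` transfers ownership of `chunk` into the new thread.
        // Using `chunk` again in this loop body would be a compile error.
        handles.push(thread::spawn(move || {
            let n = chunk.split_whitespace().count();
            *total.lock().unwrap() += n;
        }));
    }

    // Wait for every worker before reading the result.
    for handle in handles {
        handle.join().unwrap();
    }

    let result = *total.lock().unwrap();
    result
}

fn main() {
    let chunks = vec![
        "big data in rust".to_string(),
        "fearless concurrency".to_string(),
    ];
    println!("total words: {}", count_words_parallel(chunks));
}
```

If the `Mutex` were removed and the threads tried to update a plain shared counter, the program would fail to compile instead of racing at run time, which is the property the paragraph above describes.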
There is a lot of work going on in almost every field. Rust is a bit hard to learn because of its unique concepts and an approach to OOP that differs from most languages. Rust follows a 'trait-based' approach: structures are not bound to behaviors; instead, behaviors are bound to traits, which structures can implement, and developers can also define traits of their own. This level of abstraction offers a lot to learn in the field of programming. Rust is not perfect, and there are downsides too, primarily because of its safety guarantees. Rust takes memory and thread safety very seriously: if your code could violate them, the compiler refuses to build it. Rust will not resolve the issues for us, but it will not let us build the binary until the compiler can verify that the code is safe. Rust also provides the `unsafe` keyword, which lets us opt out of some of these checks, but only inside code that we have explicitly marked as unsafe. However, the community considers this a deliberate feature of the Rust language and part of the motivation behind it, so it will not be removed. Instead, the programmer needs to be more careful about the program they are writing.
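The trait-based approach can be sketched as follows. The names `Summary`, `CsvBatch`, `NumericColumn`, and `report` are hypothetical and chosen only for this example: the behavior is declared once as a trait, and two unrelated structures opt into it.

```rust
/// A behavior, declared independently of any data type.
trait Summary {
    fn row_count(&self) -> usize;

    // Default method: every implementor gets this unless it overrides it.
    fn describe(&self) -> String {
        format!("{} rows", self.row_count())
    }
}

// Two unrelated structures (hypothetical names) that hold only data.
struct CsvBatch {
    lines: Vec<String>,
}

struct NumericColumn {
    values: Vec<f64>,
}

impl Summary for CsvBatch {
    fn row_count(&self) -> usize {
        self.lines.len()
    }
}

impl Summary for NumericColumn {
    fn row_count(&self) -> usize {
        self.values.len()
    }

    // Overrides the default description.
    fn describe(&self) -> String {
        format!("{} numeric values", self.row_count())
    }
}

/// Works with anything that implements `Summary`, via dynamic dispatch.
fn report(items: &[&dyn Summary]) -> Vec<String> {
    items.iter().map(|item| item.describe()).collect()
}

fn main() {
    let batch = CsvBatch {
        lines: vec!["a,b".to_string(), "c,d".to_string()],
    };
    let column = NumericColumn {
        values: vec![1.0, 2.0, 3.0],
    };
    let items: [&dyn Summary; 2] = [&batch, &column];
    for line in report(&items) {
        println!("{line}");
    }
}
```

There is no inheritance between the two structures here; they share behavior only because both implement the same trait.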

Rust Tools and Frameworks

There are many crates that provide solid solutions to well-known problems or use cases. `serde` is a crate that handles serialization and deserialization of Rust structures, and `rayon` helps the developer program parallel computations: you write the computation with a sequential-style API, and rayon handles running it in parallel, with a guarantee of data-race freedom. Several other crates are still in progress but can already be used for building simple solutions, for example Actix-Web, a web framework for Rust, and Tokio, an event-driven, non-blocking I/O platform for writing asynchronous applications in Rust. Work in a field like Big Data is still in progress, but there are some crates that can be used, such as `JSON data`, which can perform CRUD operations, selection, sorting, and other services on JSON data directly, and `Diesel`, a safe, extensible ORM and query builder for Rust.
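To give a feel for the kind of data-parallel computation described above, here is a minimal sketch that uses only the standard library's scoped threads. It is not rayon; rayon offers parallel iterators that hide the manual chunking shown here. The function name `parallel_sum_of_squares` and the `workers` parameter are assumptions for illustration.

```rust
use std::thread;

/// Sums the squares of `data` by splitting it into chunks and handing
/// each chunk to its own scoped thread. (Hypothetical helper; `workers`
/// is an illustrative tuning parameter.)
fn parallel_sum_of_squares(data: &[u64], workers: usize) -> u64 {
    if data.is_empty() {
        return 0;
    }
    let workers = workers.max(1);
    // Ceiling division, so every element lands in exactly one chunk.
    let chunk_size = (data.len() + workers - 1) / workers;

    // Scoped threads may borrow `data` directly: the compiler knows they
    // all finish before `thread::scope` returns.
    thread::scope(|scope| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().map(|x| x * x).sum::<u64>()))
            .collect();

        // Combine the partial sums from every worker.
        handles
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .sum::<u64>()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1000).collect();
    println!("{}", parallel_sum_of_squares(&data, 4));
}
```

Each thread only reads its own slice and returns a partial result, so no locking is needed. This is the pattern a library such as rayon automates.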
