RDD in Apache Spark Advantages and its Features

Chandan Gaur | 02 November 2022

Overview of RDD in Apache Spark

Every Spark application is made up of a driver program that runs the main function and performs various parallel operations on a given cluster. The primary abstraction in Spark is the Resilient Distributed Dataset (RDD), which Spark uses to achieve faster and more efficient MapReduce-style operations. The RDD is the fundamental data structure of Spark: an immutable, distributed collection of objects of any type. As the name suggests, it is a resilient (fault-tolerant) collection of records that resides on multiple nodes.
Each RDD is divided into logical partitions across the cluster and can therefore be operated on in parallel on different nodes. RDDs can be created through deterministic operations on data in stable storage or on other RDDs, either from an existing Scala collection or from an external file in HDFS (or any other supported file system). Users can choose to persist an RDD in memory so that it can be reused multiple times efficiently, and an RDD can recover automatically from faults that occur in the system.

What are the key features?

The key features of the Resilient Distributed Dataset are:

Lazy Evaluation 

All transformations in Apache Spark are lazy: they do not compute their results when they are declared. Instead, Spark keeps track of the transformation steps in a DAG (Directed Acyclic Graph) and computes them only when an action requires a result for the driver program.
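
A minimal sketch of lazy evaluation, assuming a SparkContext named sc is available as in spark-shell:

  // Nothing is computed yet: filter and map only record steps in the DAG.
  val numbers = sc.parallelize(1 to 1000000)
  val evens   = numbers.filter(_ % 2 == 0)
  val squared = evens.map(n => n.toLong * n)

  // The action below forces Spark to evaluate the whole lineage.
  val total = squared.count()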

In-Memory Computation 

Spark uses in-memory computation to speed up total processing time. With in-memory computation, data is kept in RAM (random access memory) instead of on slower disk drives, which cuts I/O cost and makes it possible to detect patterns and analyze large data more efficiently. The main methods that support this are cache() and persist().
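
A small caching sketch, assuming sc from spark-shell; the file path is hypothetical:

  import org.apache.spark.storage.StorageLevel

  val logs = sc.textFile("hdfs:///data/app.log")      // hypothetical path

  // Keep the RDD in RAM after the first action so later actions reuse it.
  logs.persist(StorageLevel.MEMORY_ONLY)              // cache() is shorthand for this

  val errors   = logs.filter(_.contains("ERROR")).count()  // materializes the cache
  val warnings = logs.filter(_.contains("WARN")).count()   // served from memory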

Fault Tolerance 

RDDs are fault-tolerant because they track data lineage information, which allows lost data to be rebuilt automatically on failure. When a partition is lost, Spark replays the recorded transformations on the executors in the cluster's worker nodes to regenerate it; persisted data can additionally be replicated across executors.
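
The lineage that Spark would replay after a failure can be inspected with toDebugString; a minimal sketch (path is hypothetical):

  val words  = sc.textFile("hdfs:///data/book.txt").flatMap(_.split(" "))   // hypothetical path
  val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

  // Prints the chain of parent RDDs used to rebuild a lost partition.
  println(counts.toDebugString)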

Immutability 

Immutable data can be shared safely among several processes, which makes RDDs a very sound option: immutability rules out many potential problems caused by concurrent updates from different threads. RDDs are not only immutable but also deterministic functions of their inputs, which makes it possible to recreate any part of an RDD at any time. An RDD can therefore be thought of not just as a collection of data but as a recipe for building new data from other data.
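
A tiny sketch of immutability, assuming sc from spark-shell: transformations never modify the source RDD, they only derive new ones.

  val base    = sc.parallelize(Seq(1, 2, 3, 4, 5))
  val doubled = base.map(_ * 2)               // a new RDD; base itself is untouched

  println(base.collect().mkString(","))       // 1,2,3,4,5
  println(doubled.collect().mkString(","))    // 2,4,6,8,10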

Partitioning 

RDDs are generally collections of data items of such massive volume that they cannot fit on a single node and must be partitioned across multiple nodes. Spark performs this partitioning of Resilient Distributed Datasets automatically and distributes the partitions across different nodes. Key points about these partitions (see the sketch after this list):
  • Each node in a spark cluster contains one or more partitions.
  • Partitions do not span multiple machines.
  • The number of partitions in Spark is configurable and should be chosen carefully.
  • By increasing the number of executors on the cluster, parallelism can be increased in the system.
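
A brief sketch of controlling and inspecting partitions, assuming sc from spark-shell; the partition counts are illustrative:

  // Ask for 8 partitions explicitly when creating the RDD.
  val data = sc.parallelize(1 to 100000, numSlices = 8)
  println(data.getNumPartitions)        // 8

  // Increase parallelism later; repartition triggers a shuffle.
  val wider = data.repartition(16)
  println(wider.getNumPartitions)       // 16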

Location Setup capability 

RDDs can define placement preferences for computing their partitions. A placement preference is information about where a partition's data is located. The DAG scheduler uses this information to place tasks close to the data they need, which increases the speed of computation.
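
SparkContext.makeRDD has an overload that accepts explicit location preferences per element; a minimal sketch, assuming sc from spark-shell (the host names are hypothetical):

  // Each record is paired with the hosts that hold its data locally.
  val withPrefs = sc.makeRDD(Seq(
    ("block-1", Seq("host1.example.com")),
    ("block-2", Seq("host2.example.com"))
  ))

  // The scheduler tries to run each partition's task on its preferred host.
  println(withPrefs.preferredLocations(withPrefs.partitions(0)))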

RDD vs Dataset vs DataFrame

The complete comparison is described below:

RDD APIs

The RDD is the fundamental data structure of Apache Spark: an immutable (read-only) collection of objects of any type that is computed on the different nodes of a given cluster. RDDs provide the ability to perform in-memory computations on large clusters in a fault-tolerant manner. Every RDD is partitioned across many servers so that it can be computed efficiently on different nodes of the cluster.

DataSet APIs

In Apache Spark, a Dataset is a strongly typed, object-oriented data structure in Spark SQL that maps to a relational schema. It represents a structured query with encoders and is an extension of the DataFrame API. Datasets are both serializable and queryable, so they can be persisted, and they provide a single interface for both Scala and Java, reducing the need for separate libraries.
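
A minimal Dataset sketch, assuming a SparkSession named spark as in spark-shell; the case class and values are illustrative:

  case class Person(name: String, age: Int)

  import spark.implicits._    // brings in the encoders for case classes

  val people = spark.createDataset(Seq(Person("Ana", 34), Person("Raj", 28)))

  // The field access below is checked at compile time.
  val adults = people.filter(p => p.age >= 30)
  adults.show()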

DataFrame APIs 

A DataFrame is a Dataset organized into named columns, very similar to a table in a relational database. The idea is to allow processing of large amounts of structured data. A DataFrame contains rows with a schema, where the schema describes the structure of the data. DataFrames provide improved memory management and optimized execution plans.
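
A minimal DataFrame sketch, assuming a SparkSession named spark; the column names and values are illustrative:

  import spark.implicits._

  // Named columns plus a schema, similar to a relational table.
  val df = Seq(("Ana", 34), ("Raj", 28)).toDF("name", "age")

  df.printSchema()
  df.filter($"age" >= 30).show()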

Ways to create an RDD in Apache Spark

Three ways to create an RDD in Apache Spark are listed below:

Parallelizing an existing collection

We take an existing collection in the program and pass it to SparkContext's parallelize() method. This is a simple way to create RDDs quickly in the Spark shell and perform operations on them, but it is rarely used in production because it requires the entire dataset to fit on one machine.
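
A minimal sketch of parallelize, assuming sc from spark-shell:

  val fruits = Seq("apple", "banana", "cherry")
  val rdd    = sc.parallelize(fruits)        // distributes the local collection
  println(rdd.map(_.toUpperCase).collect().mkString(", "))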

Referencing External Dataset 

In Spark, a Resilient Distributed Dataset can be created from any data source supported by Hadoop, including local file systems, HDFS, HBase, Cassandra, etc. Here, data is loaded from an external dataset. We can use SparkContext's textFile method, which takes the URL of a file and reads it as a collection of lines; the URL can also be a local path on the machine itself.
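
A minimal sketch of textFile, assuming sc from spark-shell; the paths are hypothetical:

  // From HDFS (or any Hadoop-supported file system).
  val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/input.txt")

  // From a local path on the machine.
  val fromLocal = sc.textFile("file:///tmp/input.txt")

  println(fromLocal.count())   // number of lines read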

Creating an RDD from an existing RDD

A transformation turns one RDD into another, and this is the way to create a new RDD from an existing one; this marks a difference between Apache Spark and Hadoop MapReduce. A transformation takes an RDD as input and produces a new RDD as output. The input Resilient Distributed Dataset does not change: because RDDs are immutable, applying operations always generates new RDDs.
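
A minimal sketch of deriving new RDDs from an existing one, assuming sc from spark-shell (path is hypothetical):

  val lines    = sc.textFile("file:///tmp/input.txt")    // hypothetical path
  val words    = lines.flatMap(_.split("\\s+"))          // new RDD from lines
  val nonEmpty = words.filter(_.nonEmpty)                // new RDD from words
  // lines and words are unchanged; each step produced a fresh RDD.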

Operations on RDDs

Apache Spark RDDs support two kinds of operations: Transformations and Actions. A transformation is a function that produces a new Resilient Distributed Dataset from an existing one: it takes an RDD as input and generates one or more RDDs as output. Every transformation creates a new RDD, and the input RDD is never changed, since RDDs are immutable. Some points about transformations (see the example after this list):
  • Declaring a transformation makes no change on the cluster; nothing runs until an action is called.
  • They produce a DAG that keeps track of which RDD was created at which point in the lifecycle.
  • Examples: map(func), filter(func), reduceByKey(func), intersection(dataset), distinct(), groupByKey(), union(dataset), mapPartitions(func), flatMap(func).
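
A short sketch combining a few of the transformations listed above, assuming sc from spark-shell:

  val nums     = sc.parallelize(1 to 10)
  val others   = sc.parallelize(5 to 15)

  val evens    = nums.filter(_ % 2 == 0)        // narrow
  val shared   = nums.intersection(others)      // wide (shuffles)
  val together = evens.union(shared).distinct()

  println(together.collect().sorted.mkString(", "))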

Types of Transformations 

  • Narrow Transformations: All the elements required to compute the records in a single output partition live in a single partition of the parent RDD. Only a limited subset of partitions is needed to calculate the result. map() and filter() are narrow transformations.
  • Wide Transformations: The elements required to compute the records in a single output partition may live in many partitions of the parent RDD. groupByKey() and reduceByKey() are wide transformations (see the sketch after this list).
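
A sketch contrasting a narrow and a wide transformation on the same pair RDD, assuming sc from spark-shell:

  val sales = sc.parallelize(Seq(("uk", 10), ("us", 25), ("uk", 5), ("us", 40)))

  // Narrow: each output partition depends on exactly one parent partition.
  val inEuros = sales.map { case (country, amount) => (country, amount * 0.9) }

  // Wide: records with the same key must be brought together, causing a shuffle.
  val totals  = sales.reduceByKey(_ + _)

  println(totals.collect().mkString(", "))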

Spark Actions 

  • Transformations in Apache Spark create RDDs from one another, but to work on the actual dataset we perform actions. An action does not produce a new Resilient Distributed Dataset; it returns non-RDD values that are delivered to the driver or written to an external storage system, and it triggers the evaluation of the lazily recorded transformations.
  • Actions are a means of sending data from the Executors to the Driver. The Executors are responsible for executing tasks, while the Driver is a JVM process that manages the workers and the execution of tasks.
  • Some examples include: count(), collect(), take(n), top(), countByValue(), reduce(), fold(), aggregate(), foreach() (see the sketch after this list).
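
A brief sketch of a few of these actions, assuming sc from spark-shell:

  val nums = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6))

  println(nums.count())                 // 8
  println(nums.take(3).mkString(", "))  // first 3 elements
  println(nums.top(2).mkString(", "))   // 9, 6
  println(nums.reduce(_ + _))           // 31
  nums.foreach(n => println(n))         // runs on the executors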

The flow of RDDs in the Spark Architecture

  • Spark builds a graph of operations as you enter code in the Spark console.
  • When an action is called, Spark submits this graph to the DAG scheduler.
  • The DAG scheduler divides the operators into stages of tasks.
  • The stages are passed to the Task scheduler, which launches the tasks through the Cluster Manager.

Limitations of the Resilient Distributed Dataset in Apache Spark

No automatic Optimization 

RDDs have no option for automatic optimization. They cannot make use of Spark's advanced optimizers such as the Catalyst optimizer and the Tungsten execution engine, so RDD optimization must be done manually. This is overcome by the Dataset and DataFrame concepts, which both use Catalyst to generate optimized logical and physical query plans, providing space and speed efficiency.
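
A short sketch of the difference, assuming a SparkSession named spark and sc from spark-shell: Catalyst can rewrite the DataFrame query, while the equivalent RDD chain is executed exactly as written.

  import spark.implicits._

  val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

  // Catalyst produces an optimized logical and physical plan for this query.
  df.groupBy("key").sum("value").explain(true)

  // The RDD version runs the operations literally as we wrote them.
  val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  rdd.reduceByKey(_ + _).collect()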

No static and Runtime type safety 

RDDs do not provide static or runtime type safety in this sense: they do not let the user catch such errors before the job runs. Datasets, by contrast, provide compile-time type safety for building complex data workflows, which helps detect errors at compile time and makes code safer.
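
A sketch of the contrast, assuming a SparkSession named spark; the case class is illustrative. A typo in the typed Dataset lambda fails at compile time, while the equivalent DataFrame column name only fails when the job runs.

  import spark.implicits._
  case class Click(user: String, clicks: Int)

  val ds = spark.createDataset(Seq(Click("ana", 3), Click("raj", 7)))
  // ds.filter(_.clck > 5)            // would not compile: no field named clck
  val heavy = ds.filter(_.clicks > 5)

  val df = ds.toDF()
  // df.filter($"clck" > 5).show()    // compiles, but fails at runtime (AnalysisException)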

The Problem of Overflow 

Performance degrades when there is not enough memory available to store the RDD in memory or on disk. Partitions that overflow from RAM can be stored on disk, but then performance is only comparable to that of disk-based systems rather than in-memory processing. We need to increase RAM and disk capacity to overcome this problem.
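
A sketch of letting overflow partitions spill to disk instead of being dropped and recomputed, assuming sc from spark-shell (path is hypothetical):

  import org.apache.spark.storage.StorageLevel

  val big = sc.textFile("hdfs:///data/huge-dataset")   // hypothetical path

  // Partitions that do not fit in RAM are written to local disk instead of
  // being dropped, at the cost of slower access for those partitions.
  big.persist(StorageLevel.MEMORY_AND_DISK)
  println(big.count())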

Overhead of serialization and garbage collection (Performance limitation) 

Because RDDs are in-memory JVM objects, they involve the overhead of garbage collection and Java serialization, which becomes expensive as data grows. To overcome this, we can use data structures with fewer objects to lower the cost, or persist objects in serialized form.
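
One common mitigation, sketched below as a standalone snippet: switch to Kryo serialization and cache partitions as serialized bytes so the garbage collector has far fewer objects to track. The application name is illustrative.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  // Kryo is usually faster and more compact than Java serialization.
  val conf = new SparkConf()
    .setAppName("serialized-cache-sketch")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  val sc = new SparkContext(conf)

  // MEMORY_ONLY_SER stores each partition as one serialized byte array,
  // giving the garbage collector far fewer objects to track.
  val cached = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY_SER)
  println(cached.count())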

No Schema View of Data 

The Resilient Distributed Dataset has trouble handling structured data because it does not provide a schema view of the data. Datasets and DataFrames do provide a schema view: they are distributed collections of data organized into named columns.
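
A sketch of the difference, assuming a SparkSession named spark and sc from spark-shell: the RDD holds opaque tuples, while converting it to a DataFrame attaches named, typed columns.

  import spark.implicits._

  val rdd = sc.parallelize(Seq(("ana", 34), ("raj", 28)))   // no schema attached

  val df = rdd.toDF("name", "age")    // named columns with inferred types
  df.printSchema()
  // root
  //  |-- name: string (nullable = true)
  //  |-- age: integer (nullable = false)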

Conclusion

Hadoop MapReduce had a number of shortcomings, and Spark RDDs were introduced to overcome them. In-memory processing, immutability, and the other features described above gave users a better option. But RDDs too had limitations that kept Spark from being more versatile, and so the DataFrame and Dataset concepts evolved.