A Complete Guide to RDD in Apache Spark

Overview of RDD in Apache Spark

Every Spark application is made up of a driver program that runs the application's main function and performs various parallel operations on a cluster. The primary abstraction Spark provides is the RDD, which Spark uses to carry out MapReduce-style operations faster and more efficiently.

Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. RDDs are immutable, distributed collections of objects of any type. As the name suggests, an RDD is a resilient (fault-tolerant) collection of data records that resides on multiple nodes.


Each dataset in a Spark RDD is divided into logical partitions across the cluster, so it can be operated on in parallel on different nodes.

RDDs can be created through deterministic operations on data in stable storage or on other RDDs, either by parallelizing an existing Scala collection or by referencing an external file in HDFS (or any other supported file system).

Users can opt to persist an RDD in memory so that it can be reused multiple times efficiently. RDDs can also recover automatically from faults in the system.

Features of Resilient Distributed Dataset (RDD)

Lazy Evaluation 

All transformations in Apache Spark are lazy, which means they do not compute their results at the point where the transformation is stated. Instead, Spark keeps track of the transformations in a DAG (Directed Acyclic Graph) and computes them only when an action requires a result for the driver program.
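
As a minimal sketch of this behaviour (assuming the spark-shell, where sc is the predefined SparkContext, and a hypothetical input file):

    // Nothing is read or computed yet: textFile and filter only record
    // the transformations in the lineage (DAG).
    val lines  = sc.textFile("data/events.log")        // hypothetical path
    val errors = lines.filter(_.contains("ERROR"))

    // Only when an action such as count() is called does Spark build a job
    // from the DAG and actually execute the transformations.
    val numErrors = errors.count()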

In-Memory Computation 

Spark uses in-memory computation to speed up overall processing: data is kept in RAM (random access memory) instead of on slower disk drives, which avoids repeated disk I/O and lets Spark analyze large datasets and detect patterns more efficiently. The main methods that support this are cache() and persist().
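
A short illustration of both methods (a sketch, again assuming the spark-shell and a hypothetical input file):

    import org.apache.spark.storage.StorageLevel

    val words = sc.textFile("data/words.txt")           // hypothetical path

    // cache() keeps the RDD in memory (MEMORY_ONLY) after its first computation.
    words.cache()

    // persist() lets you choose an explicit storage level instead.
    val pairs = words.map(word => (word, 1))
    pairs.persist(StorageLevel.MEMORY_AND_DISK)

    // The first action materialises the data; later actions reuse it.
    println(words.count())
    println(pairs.count())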

Fault Tolerance 

RDDs are fault-tolerant because they track lineage information, which lets Spark rebuild lost partitions automatically after a failure instead of having to replicate the data itself. For RDDs that must survive executor loss without recomputation, replicated storage levels can additionally copy the persisted partitions across Spark executors on different worker nodes.
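
The lineage an RDD would replay to rebuild a lost partition can be inspected with toDebugString (a sketch, assuming the spark-shell and a hypothetical input file):

    val counts = sc.textFile("data/words.txt")          // hypothetical path
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Prints the chain of parent RDDs (the lineage) that Spark would
    // recompute if a partition were lost to an executor failure.
    println(counts.toDebugString)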

Immutability 

Immutable data can be shared safely among several processes, which makes immutability a very sound choice: it rules out a whole class of problems caused by concurrent updates from different threads. But RDDs are not just immutable; they are also deterministic functions of their inputs, which makes it possible to recreate any part of an RDD at any given moment.

We can therefore think of an RDD not only as a collection of data but also as a recipe for building new data from other data.

Partitioning 

RDDs are generally collections of data items of massive volume that cannot fit on a single node and must be partitioned across multiple nodes. Spark performs this partitioning of RDDs automatically and distributes the partitions across different nodes.

Key points related to these partitions:

  • Each node in a Spark cluster holds one or more partitions.
  • A partition never spans multiple machines.
  • The number of partitions in Spark is configurable and should be chosen carefully (see the sketch after this list).
  • Increasing the number of executors in the cluster increases the parallelism of the system.
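
A small sketch of inspecting and changing the number of partitions (assuming the spark-shell):

    // Create an RDD with an explicit number of partitions.
    val nums = sc.parallelize(1 to 1000000, numSlices = 8)
    println(nums.getNumPartitions)           // 8

    // repartition() reshuffles the data into more (or fewer) partitions;
    // coalesce() reduces the count without a full shuffle.
    val wider    = nums.repartition(16)
    val narrower = nums.coalesce(4)
    println(wider.getNumPartitions)          // 16
    println(narrower.getNumPartitions)       // 4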

Placement Preference 

RDDs can define a placement preference for computing their partitions, that is, information about where the RDD's data is located. The DAG scheduler uses these preferences to place each task close to the data it needs, which increases the speed of computation.

Comparison: RDD vs Dataset vs DataFrame

RDD APIs

An RDD, or Resilient Distributed Dataset, is the fundamental data structure of Apache Spark. RDDs are immutable (read-only) collections of objects of varying types that are computed on the different nodes of a given cluster. They provide the ability to perform in-memory computations on large clusters in a fault-tolerant manner. Every dataset in a Spark RDD is partitioned across many servers so that it can be computed efficiently on different nodes of the cluster.

DataSet APIs

In Apache Spark, a Dataset is a Spark SQL data structure that is strongly typed, object-oriented and mapped to a relational schema. It represents a structured query with encoders and is an extension of the DataFrame API. Datasets are both serializable and queryable, so they can be persisted. They provide a single interface for both Scala and Java, reducing the need for separate language-specific libraries.

DataFrame APIs 

DataFrames are Datasets organized into named columns, very similar to tables in a relational database. The idea is to allow the processing of large amounts of structured data. A DataFrame contains rows with a schema, where the schema describes the structure of the data. DataFrames provide memory management and optimized execution plans.
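
To make the comparison concrete, here is a minimal sketch of the same data seen through all three APIs (assuming the spark-shell, where spark is the predefined SparkSession and sc the SparkContext):

    import spark.implicits._

    case class Person(name: String, age: Int)

    // RDD: a plain distributed collection of objects, with no schema attached.
    val peopleRDD = sc.parallelize(Seq(Person("Ann", 31), Person("Bob", 25)))

    // Dataset: strongly typed and schema-aware, built through an encoder.
    val peopleDS = peopleRDD.toDS()

    // DataFrame: the same data organised into named columns (a Dataset of Rows).
    val peopleDF = peopleRDD.toDF()
    peopleDF.printSchema()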

Ways to Create RDDs in Apache Spark 

There are three ways to create an RDD in Apache Spark:

Parallelizing an Existing Collection 

We take an existing collection in the driver program and pass it to SparkContext's parallelize() method. This is the simplest way to create an RDD, and it is handy for creating RDDs quickly in the spark-shell and trying operations on them. It is rarely used in production, because it requires the entire dataset to be present on one machine.
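
A minimal sketch (assuming the spark-shell):

    // Turn a local Scala collection into an RDD. The whole collection must
    // fit in the driver's memory, which is why this is mostly used for
    // small tests and experiments.
    val data     = Seq(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)
    println(distData.reduce(_ + _))          // 15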

Referencing External Dataset 

In Spark, RDDs can be created from any data source supported by Hadoop, including local file systems, HDFS, HBase, Cassandra, etc. Here, the data is loaded from an external dataset. We can use SparkContext's textFile method to create a text file RDD: it takes the URL of the file and reads it as a collection of lines. The URL can also be a local path on the machine itself.
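
For example (a sketch, assuming the spark-shell and a hypothetical file location):

    // Each element of the resulting RDD is one line of the file. The path
    // may be a local file, an HDFS URL, an S3 URL, and so on.
    val lines = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
    println(lines.count())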

Creating an RDD from an Existing RDD 

A transformation turns one RDD into another, and this is the third way to create an RDD: from an existing RDD. This is one of the differences between Apache Spark and Hadoop MapReduce. A transformation takes an RDD as input and produces a new RDD as output; because RDDs are immutable, the input RDD is never changed, and applying an operation always generates a new RDD.
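
A short sketch of deriving new RDDs from an existing one (assuming the spark-shell):

    val words = sc.parallelize(Seq("spark", "rdd", "spark", "scala"))

    // Each transformation returns a new RDD; words itself is never modified.
    val upper     = words.map(_.toUpperCase)
    val noRepeats = upper.distinct()
    println(noRepeats.collect().mkString(", "))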

Operation on RDD 

Apache Spark RDDs support two kinds of operations: transformations and actions. A transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and generates one or more RDDs as output; every transformation creates a new RDD, and the input RDDs cannot be changed, since RDDs are immutable. Some points about transformations:

  • They change nothing by themselves: no work is triggered on the cluster until an action runs.
  • They build up a DAG that keeps track of which RDD was derived from which over the life cycle of the application.
  • Examples: map(func), filter(func), reduceByKey(func), intersection(dataset), distinct(), groupByKey(), union(dataset), mapPartitions(func), flatMap().

Types of Transformations 

  • Narrow Transformations: all the elements required to compute the records of a single output partition live in a single partition of the parent RDD, so only a limited subset of partitions is needed to calculate the result. Narrow transformations result from operations such as map() and filter().
  • Wide Transformations: the elements required to compute the records of a single output partition may live in many partitions of the parent RDD, so data has to be shuffled. Operations such as groupByKey() and reduceByKey() are wide transformations (see the sketch after this list).
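
A sketch contrasting the two (assuming the spark-shell):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Narrow: map() works within each partition; no data moves between partitions.
    val doubled = pairs.map { case (k, v) => (k, v * 2) }

    // Wide: groupByKey() and reduceByKey() need all values of a key together,
    // so records are shuffled across partitions.
    val grouped = pairs.groupByKey()
    val summed  = pairs.reduceByKey(_ + _)
    println(summed.collect().mkString(", "))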

Spark Actions 

  • The transformations in Apache Spark create RDDs from one another, but to work on the actual dataset we perform actions. An action does not create a new RDD; it returns non-RDD values that are collected on the driver or written to an external storage system, and it is what finally triggers the lazy evaluation of the RDDs.
  • Actions are a means of sending data from the executors to the driver. The executors are responsible for executing tasks, while the driver is a JVM process that coordinates the workers and the execution of tasks.
  • Examples include: count(), collect(), take(n), top(), countByValue(), reduce(), fold(), aggregate(), foreach(). A few of these are shown in the sketch below.
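
A few actions in use (a sketch, assuming the spark-shell):

    val nums = sc.parallelize(1 to 10)

    // Each action triggers a job and returns a plain value to the driver.
    println(nums.count())                    // 10
    println(nums.take(3).toList)             // List(1, 2, 3)
    println(nums.reduce(_ + _))              // 55
    nums.foreach(n => println(n))            // runs on the executors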

The flow of RDD in the Spark Architecture 

  • Spark builds a graph of operations as you enter code in the Spark console.
  • When an action is called on a Spark RDD, Spark submits this graph to the DAG scheduler.
  • The DAG scheduler divides the operators into stages of tasks.
  • The stages are passed to the task scheduler, which launches the tasks through the cluster manager.

Limitations of the RDD in Apache Spark

No automatic Optimization 

In Apache Spark, RDDs have no facility for automatic input optimization. They cannot make use of Spark's advanced optimizers, the Catalyst optimizer and the Tungsten execution engine, so RDDs can only be optimized manually. This is overcome by the Dataset and DataFrame APIs, which both use Catalyst to generate optimized logical and physical query plans, giving better space and speed efficiency.

No schema-aware type safety 

Because an RDD carries no schema, Spark cannot check how the data inside it is interpreted, so mistakes such as reading a field with the wrong type only surface when the job runs. Datasets, by contrast, combine a schema with compile-time type safety for building complex data workflows, which helps detect errors at compile time and makes the code safer.
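
A hedged sketch of the difference (assuming the spark-shell; Person is an illustrative case class):

    import spark.implicits._

    case class Person(name: String, age: Int)
    val ds = Seq(Person("Ann", 31)).toDS()
    val df = ds.toDF()

    // With the typed Dataset API, a typo in a field name does not compile:
    // ds.map(_.agee)              // compile error: value agee is not a member of Person

    // With the untyped DataFrame API, the same mistake fails only at runtime:
    // df.select("agee").show()    // AnalysisException when the query is analysed

    ds.map(_.age + 1).show()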

The Problem of Overflow 

RDD performance degrades when there is not enough memory available to hold the data. Partitions that overflow from RAM can be stored on disk, but at a lower level of performance, so the real remedy is to increase the available RAM and disk capacity. One common mitigation is sketched below.
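
A sketch of an explicit storage level that allows spilling to disk (assuming the spark-shell and a hypothetical input file):

    import org.apache.spark.storage.StorageLevel

    val big = sc.textFile("data/very_large_input.txt")  // hypothetical path

    // Partitions that do not fit in memory are written to local disk instead
    // of being recomputed; slower than RAM, but avoids repeated recomputation.
    big.persist(StorageLevel.MEMORY_AND_DISK)
    println(big.count())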

Overhead of serialization and garbage collection (Performance limitation) 

Because RDDs are collections of in-memory JVM objects, they involve the overhead of garbage collection and Java serialization, which becomes expensive as the data grows. To reduce this overhead, we can use data structures with fewer objects or persist the objects in serialized form.
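
A sketch of these two mitigations in a standalone application: a serialized storage level, plus Kryo in place of Java serialization (spark.serializer is a standard Spark configuration key):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    // Kryo is usually faster and more compact than Java serialization.
    val spark = SparkSession.builder()
      .appName("serialization-example")
      .master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 1000000).map(i => (i, i.toString))

    // Storing partitions as serialized byte arrays keeps fewer live objects on
    // the JVM heap, which reduces garbage-collection pressure.
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    println(rdd.count())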

No Schema View of Data 

RDDs are awkward for handling structured data because they do not provide a schema view of the data and have no notion of named columns. Datasets and DataFrames do provide a schema view: they are distributed collections of data organized into named columns.

Conclusion

Hadoop MapReduce had many shortcomings, and Spark RDDs were introduced to overcome them: in-memory processing, immutability and the other features described above gave users a better option. But RDDs have limitations of their own that kept Spark from being more versatile, and so the DataFrame and Dataset concepts evolved.



