Understanding How to Adopt Hadoop with GPU
What is Apache Hadoop?
Apache Hadoop is a framework for storing large Data in distributed mode and distributed processing on that large datasets. It scales from a single server to thousands of servers. Hadoop detects the failures at the application layer and handles that failure. Hadoop 3.0 is major release after Hadoop 2 with new features like HDFS erasure coding, improves the performance and scalability, multiple NameNodes and many more.
How Hadoop with GPU Works?
GPU's are becoming a vital tool for many Big Data apps. There are many apps which rely on GPU Deep Learning, Machine Learning, Data Analytics, Genome Sequencing, etc. In many cases, GPU's speed up to 10x and its speed increases by 300x.
While starting Apache Hadoop 3.0, there is support for operators as well as the admins to configure YARN clusters and to schedule, use GPU resources.
Without native and more comprehensive applications require GPU support, there is no isolation of GPU resource also. Recognize GPU as a resource type while doing scheduling.
With GPU scheduling support, containers with GPU request placed to machines or local with enough available GPU resources. To solve the isolation problem, GPU uses multiple machines at the same time without affecting each other.
WEB UI of YARN includes GPU information. It shows the total used and available resources across the cluster among other resources like CPU & Memory.
Benefits of Hadoop with GPU
- Minimum required Java version increased from Java 7 to Java 8.
- Support for erasure encoding in HDFS.
- YARN Timeline Service v.2.
- Shell script rewrite.
- Shaded client jars.
- Support for Opportunistic Containers and Distributed Scheduling.
- MapReduce task-level native optimization.
- Support for more than 2 NameNodes.
- Customize Default ports of multiple services.
- Intra-data node balancer.
How to Adopt Apache Hadoop with GPU?
Adopt Hadoop with GPU by using Hadoop 3.0 Installation on the server and also add GPU on the Server. Hadoop runs on Unix/Linux based Operating Systems. However, it works with Windows-based machines, but it is not recommended.
There are three different modes of Hadoop Installation -
- Standalone Mode.
- Pseudo-Distributed Mode.
- Fully Distributed Mode.
- Hadoop default mode.
- HDFS not utilized.
- The local file system used for input and output.
- Used for debugging purpose.
- No Configuration required in 3 Hadoop(mapred-site.xml,core-site.xml, hdfs-site.xml) files.
- Faster than the Pseudo-distributed mode.
This configuration requires three files -
- HDFS needed only one Replication factory.
- In this one node used as Master Node / Data Node / Job Tracker / Task Tracker.
- Real Code to test in HDFS.
- Pseudo-distributed cluster to run all Daemons on one node itself.
Fully Distributed Mode
- This mode is a Production Phase mode.
- In this, data used and distributed across many nodes.
- In this mode, different Nodes used as Master Node / Data Node / Job Tracker / Task Tracker.
- Most of Companies Prefer Fully-distributed mode.
Prerequisites for Implementing Hadoop with GPU
- OPERATING SYSTEM - Ubuntu 16.04
- JAVA - Java 8
- Hadoop - You require Hadoop 3.0 package
- Download the Java 8 Package and Save this file in your home directory.
- Extract the Java TarFile.
- Download the Hadoop 3.0 Package.
- Extract the Hadoop tar File.
- Add the Hadoop and Java paths in the bash file (.bashrc).
- Edit the Hadoop Configuration files.
- Open core-site.xml and edit the property inside configuration tag.
- Edit hdfs-site.xml and edit the property inside configuration tag.
- Edit the mapred-site.xml file and edit the property inside the configuration tag.
- Edit yarn-site.xml and edit the property inside configuration tag.
- Edit Hadoop-env.sh and add the Java Path.
- Go to Hadoop home directory and format the NameNode.
- Once the NameNode is formatted, go to Hadoop-3.0/sbin directory and start all the Daemons.
- All the Hadoop services are up and running.
- Now open the Browser and go to localhost:50070/dfshealth.html to check the NameNode interface.
- Also, require used GPU devices on the Server. GPU with Hadoop used because it provides faster results, takes less time to process the data.
Why Hadoop with GPU Matters?
GPU usage with Hadoop in the cluster increases the processing speed of the tasks.
- Minimum Runtime Version for Hadoop 3.0 is JDK 8
- Support for Ensure Coding in HDFS
- Hadoop Shell scripting rewrite
- MapReduce task Level Native Optimization
- Introduces more powerful YARN in Hadoop 3.0
- Agility & Time to Market
- Total Cost of Ownership
- Scalability & Availability
- Increase Processing Speed
- Provides Result Faster
Best Practises of Hadoop with GPU
- Better-quality commodity servers to make it cost-efficient and flexible to scale out for complex business use cases.
- It is one of the best configurations for this architecture is to start with SIX core processors, 96GB of memory and 104TB of local hard drives.
- By using the above configuration, it gives faster and efficiently processing of data, moves the processing close to data instead of separating the two.
- By the above Config Hadoop scales and performs better with local drives so use Just a Bunch of Disks (JBOD) with replication instead of a redundant array of independent disks (RAID).
- For multi-tenancy, by sharing the compute capacity with the capacity scheduler and share HDFS storage, design the Hadoop architecture.
- Do not change with the metadata files as it can corrupt the state of the Hadoop cluster.
Concluding Hadoop with GPU
- Introduced GPU into MapReduce cluster and obtained up to 20 times speedup.
- Reduced up to 19/20 power consumption with the current preliminary solution and workload.
- Compared with upgrading CPUs and adding more nodes, deploying GPU on Hadoop has the high cost-to-benefit ratio.
- Provided practical implementations for people wanting to construct MapReduce clusters with GPUs.