Guide to Apache Hadoop 3.0 Best Practices and Benefits
What is Hadoop?
Apache Hadoop is a framework that allows storing large Data in distributed mode and allows for the distributed processing on that large datasets. It designs such a way that scale from a single server to thousands of servers. Hadoop itself designed to detect the failures at the application layer and handle that failure. Hadoop 3.0 is major release after Hadoop 2 with new features like HDFS erasure coding, improved performance and scalability, multiple NameNodes and many more.
Apache Hadoop 3.0 Benefits
- Support multiple standby NameNodes.
- Supports multiple NameNodes for multiple namespaces.
- Storage overhead reduced from 200% to 50%.
- Support GPUs.
- Intra-node disk balancing.
- Support for Opportunistic Containers and Distributed Scheduling.
- Support for Microsoft Azure Data Lake and Aliyun Object Storage System file-system connectors.
Why Hadoop 3.0 Matters?
- It involves minimum Runtime Version for Hadoop 3.0 is JDK 8.
- Ensures Coding in HDFS.
- Hadoop Shell scripting rewrite.
- MapReduce task Level Native Optimization.
- Introduces more powerful YARN in Hadoop 3.0.
- Agility & Time to Market.
- Total Cost of Ownership.
- Scalability & Availability.
How to Adopt Hadoop 3.0?
Hadoop runs on Unix/Linux based Operating Systems. However, it can also work with Windows-based machines which is not recommended.
There are three different modes of Hadoop Installation -
- Standalone Mode.
- Pseudo-Distributed Mode.
- Fully Distributed Mode.
- Hadoop default mode.
- HDFS not utilized.
- The local file system used for input and output.
- Used for debugging purpose.
- No Configuration has required in 3 Hadoop(mapred-site.xml,core-site.xml, hdfs-site.xml) files.
- This is much faster than the Pseudo-distributed mode.
- In this configuration is required in given three files for this mode.
- HDFS required only one Replication factory.
- In this one node will be used as Master Node / Data Node / Job Tracker / Task Tracker.
- Real Code used for the test in HDFS.
- Pseudo-distributed cluster is a cluster where all daemons are running on one node itself.
Fully Distributed Mode
- This mode is a Production Phase mode.
- In this data are used and distributed across many nodes.
- In this mode, different Nodes used as Master Node / Data Node / Job Tracker / Task Tracker.
- Most of Companies Prefer Fully-distributed mode.
Prerequisites for Implementing Hadoop
- OPERATING SYSTEM - Ubuntu 16.04
- HADOOP - You require Hadoop 3.0 package
- Step 1: Download the Java 8 Package and Save this file in your home directory.
- Step 2: Extract the Java Tar File.
- Step 3: Download the Hadoop 3.0 Package.
- Step 4: Extract the Hadoop tar File.
- Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
- Step 6: Edit the Hadoop Configuration files.
- Step 7: Open core-site.xml and edit the property mentioned below inside configuration tag.
- Step 8: Edit hdfs-site.xml and edit the property mentioned below inside configuration tag.
- Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside the configuration tag.
- Step 10: Edit yarn-site.xml and edit the property mentioned below inside configuration tag.
- Step 11: Edit Hadoop-env.sh and add the Java Path as mentioned below.
- Step 12: Go to Hadoop home directory and format the NameNode.
- Step 13: Once the NameNode is formatted, go to Hadoop-3.0/sbin directory and start all the daemons.
- Step 14: All the Hadoop services are up and running.
- Step 15: Now open the Browser and go to localhost:50070/dfshealth.html to check the NameNode interface.
Best Practises of Hadoop 3.0
Better-quality commodity servers to make it cost-efficient and flexible to scale out for complex business use cases. It is one of the best configurations for this architecture is to start with SIX core processors, 96GB of memory and 104TB of local hard drives.
Hadoop is a good way of configuration but not an absolute one.
By using the above configuration, it gives faster and efficiently processing of data, move the processing near data instead of separating the two.
By the above Config Hadoop scales and performs better with local drives so use Just a Bunch of Disks (JBOD) with replication instead of a redundant array of independent disks (RAID).
For multi-tenancy by sharing the compute capacity with the capacity scheduler and share HDFS storage we Design the Hadoop architecture.