Understanding How to Enable Apache Hadoop on Kubernetes

November 12, 2018 

What is Apache Hadoop?

Apache Hadoop is a framework for storing large datasets across a cluster and processing them in a distributed fashion. It is designed to scale from a single server to thousands of servers. Hadoop itself is designed to detect and handle failures at the application layer. Hadoop 3.0 is the major release after Hadoop 2, adding features such as HDFS erasure coding, support for multiple NameNodes, and improved performance and scalability.


How Does HDFS Work on Kubernetes?

  • Wrap the NameNode in a Service: the NameNode pod is exposed through a Kubernetes Service resource.
  • A Kubernetes Service provides an IP address and hostname inside the cluster and load-balances incoming requests across the pods it selects.
  • Give the NameNode pod a label, say app=namenode, and create a Service that selects the pods carrying that label.
  • Identify DataNodes through StatefulSets: for stateful applications such as HDFS, Kubernetes provides another resource called a StatefulSet.
  • In a StatefulSet, each pod is identified by its name, its storage, and its hostname.
  • Run fully distributed HDFS on a single node: in the Kubernetes world, distribution happens at the container level. If more than one DataNode container runs, each managing a dedicated disk, the deployment is distributed even when all containers run on a single node. In this way, a fully distributed HDFS can run on a single machine.
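
The Service and StatefulSet described above can be sketched as manifests built in Python and printed as JSON, which kubectl accepts alongside YAML. This is a minimal sketch, not a production setup: the label values, the port 8020, the image name example/hadoop:3, the mount path, and the 10Gi volume size are all illustrative assumptions, and a real StatefulSet would also need a headless Service matching its serviceName.

```python
import json

# Assumed labels; the article only suggests "app - namenode" as an example.
NAMENODE_LABELS = {"app": "namenode"}
DATANODE_LABELS = {"app": "datanode"}


def namenode_service(name="namenode", port=8020):
    """Service that selects pods labelled app=namenode (port is an assumption)."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": dict(NAMENODE_LABELS),
            "ports": [{"name": "rpc", "port": port, "targetPort": port}],
        },
    }


def datanode_statefulset(name="datanode", replicas=3, image="example/hadoop:3"):
    """StatefulSet giving each DataNode pod a stable name and its own volume."""
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,  # assumes a headless Service with this name exists
            "replicas": replicas,
            "selector": {"matchLabels": dict(DATANODE_LABELS)},
            "template": {
                "metadata": {"labels": dict(DATANODE_LABELS)},
                "spec": {
                    "containers": [{
                        "name": "datanode",
                        "image": image,  # hypothetical image name
                        "volumeMounts": [
                            {"name": "data", "mountPath": "/hadoop/dfs/data"}
                        ],
                    }]
                },
            },
            # One PersistentVolumeClaim per pod: the "dedicated disk" per DataNode.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": "10Gi"}},
                },
            }],
        },
    }


def pod_names(statefulset):
    """StatefulSet pods are named <statefulset-name>-<ordinal>, starting at 0."""
    name = statefulset["metadata"]["name"]
    return [f"{name}-{i}" for i in range(statefulset["spec"]["replicas"])]


if __name__ == "__main__":
    print(json.dumps(namenode_service(), indent=2))
    print(json.dumps(datanode_statefulset(), indent=2))
    print(pod_names(datanode_statefulset()))
```

Because every replica gets its own volume claim and a predictable name, several DataNode pods can be scheduled onto one machine and still behave as separate HDFS DataNodes.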

Benefits of HDFS on Kubernetes

  • Support for multiple standby NameNodes.
  • Support for multiple NameNodes for multiple namespaces.
  • Storage overhead reduced from 200% (with 3x replication) to 50% (with erasure coding).
  • Support for GPUs.
  • Intra-node disk balancing.
  • Support for Opportunistic Containers and Distributed Scheduling.
  • Support for Microsoft Azure Data Lake and Aliyun Object Storage System file-system connectors.
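
The storage overhead figures in the list above follow from simple arithmetic. The sketch below assumes the comparison is between 3x replication and a Reed-Solomon layout with 6 data units and 3 parity units (the RS-6-3 scheme that ships with Hadoop 3); the function names are illustrative.

```python
def replication_overhead(replicas):
    """Extra storage, in percent, when every block is stored `replicas` times."""
    return (replicas - 1) * 100.0


def erasure_coding_overhead(data_units, parity_units):
    """Extra storage, in percent, for a Reed-Solomon (data, parity) layout."""
    return parity_units / data_units * 100.0


if __name__ == "__main__":
    # 3 copies of each block: 2 extra copies, i.e. 200% overhead.
    print(replication_overhead(3))
    # 6 data units protected by 3 parity units: 3/6, i.e. 50% overhead.
    print(erasure_coding_overhead(6, 3))
```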

Why Does HDFS on Kubernetes Matter?

  • Minimum Runtime Version for Hadoop 3.0 is JDK 8.
  • Support for Erasure Coding in HDFS.
  • Hadoop shell script rewrite.
  • MapReduce task-level native optimization.
  • A more powerful YARN in Hadoop 3.0.
  • Agility & Time to Market.
  • Total Cost of Ownership.
  • Scalability & Availability.

How to Adopt HDFS on Kubernetes?

Deploy an Application to Kubernetes -

  • Create a Dockerfile.
  • Set up a cluster.
  • Connect to the cluster.
  • Add the cluster and log in to the Docker registry.
  • Deploy the Docker image.
  • Build and deploy an image. The deployment needs:
      • a pull secret.
      • the image name and registry.
      • the ports to be used.
  • Deploy the private image to Kubernetes.
  • Automate the deployment process to Kubernetes.
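
The three inputs named in the steps above (a pull secret, the image name and registry, and the ports to be used) map onto fields of a Deployment manifest. The following is a minimal sketch that builds such a manifest in Python and prints it as JSON; the secret name regcred, the registry registry.example.com, the image tag, and port 8080 are hypothetical placeholders, and the pull secret itself is assumed to have been created in the cluster beforehand.

```python
import json


def private_image_deployment(name, image, port, pull_secret, replicas=1):
    """Deployment that pulls `image` from a private registry via `pull_secret`."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Name of an existing registry secret in the same namespace.
                    "imagePullSecrets": [{"name": pull_secret}],
                    "containers": [{
                        "name": name,
                        "image": image,  # registry/repository:tag
                        "ports": [{"containerPort": port}],
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = private_image_deployment(
        name="hadoop-app",                           # hypothetical
        image="registry.example.com/hadoop-app:1.0", # hypothetical
        port=8080,                                   # hypothetical
        pull_secret="regcred",                       # hypothetical
    )
    print(json.dumps(manifest, indent=2))
```

Writing the output to a file and passing it to kubectl apply -f is one way to automate the final deployment step.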

Some basic Kubernetes terminologies

  • Cluster
  • Node
  • Namespace
  • Deployment
  • Pod
  • Container
  • Service

Best Practices Of HDFS On Kubernetes

Keep the image small - Look around for a suitable base image before you start. If an application needs no more than 15MB, using a 600MB image is a waste of resources. A smaller image builds faster and takes up less space.

Use a single image - A Pod is easiest to handle when it runs only one Container, and this also improves Pod performance. When multiple Containers run in one Pod, connecting, managing, and securing the microservices becomes messy because they interfere with each other's communication.

Double check the base image - Many people make mistakes when selecting an image, and everything else depends on the base image. Docker Hub hosts a large number of images, so select one that fits the requirements of the project, and double check it before using it to build a Docker image.

Use Namespaces and Labels - Define Namespaces and Labels properly when deploying the image. Inside a Kubernetes cluster, Namespaces act as virtual clusters that are isolated from one another. Use Labels to select subsets of objects.
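
To illustrate how Labels select subsets of objects within a Namespace, here is a small self-contained sketch of equality-based selection. It only mimics the matching rule (every key in the selector must be present with the same value) and does not call the Kubernetes API; the object names, namespaces, and labels are made up for the example.

```python
def matches(selector, labels):
    """True if every key/value pair in the selector appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(objects, namespace, selector):
    """Names of objects in `namespace` whose labels satisfy `selector`."""
    return [
        obj["name"]
        for obj in objects
        if obj["namespace"] == namespace and matches(selector, obj["labels"])
    ]


# Hypothetical pods spread over two namespaces.
PODS = [
    {"name": "namenode-0", "namespace": "hdfs", "labels": {"app": "namenode"}},
    {"name": "datanode-0", "namespace": "hdfs", "labels": {"app": "datanode"}},
    {"name": "datanode-1", "namespace": "hdfs", "labels": {"app": "datanode"}},
    {"name": "datanode-0", "namespace": "staging", "labels": {"app": "datanode"}},
]

if __name__ == "__main__":
    print(select(PODS, "hdfs", {"app": "datanode"}))
```

The pod in the staging namespace carries the same label but is not selected, which is the isolation that Namespaces provide.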

Use a non-root user inside the Container - Always prefer a non-root user inside the Container for security reasons. A non-root user has only the permissions selected for that Container.
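
In a Kubernetes manifest, this practice is usually expressed through the container's securityContext. The sketch below adds such a block to a container spec built in Python; the user ID 1000 and the container and image names are assumptions, and the image must actually be able to run as that user.

```python
import json


def with_non_root(container, uid=1000):
    """Return a copy of a container spec that refuses to run as root."""
    if uid == 0:
        raise ValueError("uid 0 is root")
    secured = dict(container)
    secured["securityContext"] = {
        "runAsNonRoot": True,               # kubelet rejects the container if it would run as UID 0
        "runAsUser": uid,                   # assumed UID; must be usable by the image
        "allowPrivilegeEscalation": False,  # block gaining extra privileges at runtime
    }
    return secured


if __name__ == "__main__":
    # Hypothetical container spec.
    container = {"name": "datanode", "image": "example/hadoop:3"}
    print(json.dumps(with_non_root(container), indent=2))
```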

Services and Pods - A Service is responsible for making Pods discoverable inside the network or exposing them to the internet. A Pod hosts one or more Containers together with their storage volumes.

Be familiar with Kube components - A multitude of components can be used to enhance the performance, security, and reliability of the setup.

Wrap Namenode in a Service.

Identify data node through Stateful Sets.

Run fully distributed HDFS on a single node.


Tools for Enabling Apache Hadoop on Kubernetes