Overview of Apache Hadoop Security
Apache Hadoop is a platform or framework in which Big Data is stored in a distributed environment and processed in parallel. Hadoop's capacity to handle both structured and unstructured data gives users more flexibility in collecting, processing, and analyzing data than relational databases or data warehouses provide. Apache Hadoop is so widespread that it has been adopted by many famous companies like Facebook, Yahoo, Datadog, Netflix, and Adobe. The Hadoop platform mainly consists of two components -
HDFS (Hadoop Distributed File System) is the place where all the data in a cluster is stored. HDFS appears to users as a single unit storing all the Big Data, but in reality the data is distributed across multiple nodes. HDFS follows a master-slave architecture with two kinds of nodes, the Namenode and the Datanodes: the Namenode acts as the master and the Datanodes are the slaves. The Namenode holds the metadata - information about which data is stored on which node - while the actual data is stored on the Datanodes. Because hardware failure rates are quite high, the data is replicated across multiple Datanodes so that a copy is available in case of emergency.
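The replication described above can be observed from the command line. A sketch, assuming a configured HDFS client; the file path and replication factor are hypothetical, not taken from the article:

```shell
# Write a local file into HDFS; the Namenode records the block
# locations while the Datanodes store the actual blocks
hdfs dfs -put sales.csv /data/sales.csv

# Show the file's replication factor (second column of the listing)
hdfs dfs -ls /data/sales.csv

# Show which Datanodes hold each block of the file
hdfs fsck /data/sales.csv -files -blocks -locations

# Raise the replication factor for a critical file and wait for it
hdfs dfs -setrep -w 5 /data/sales.csv
```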
The data stored in HDFS needs to be processed so that useful information can be derived from it, and this is done with the help of YARN (Yet Another Resource Negotiator). YARN allocates resources and schedules the tasks that carry out this processing. YARN has two major components, the Resource Manager and the Node Managers. The Resource Manager is the master node: it receives processing requests and forwards them to the appropriate Node Managers, where the actual processing takes place.
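This division of labour can be observed from a client node. A hypothetical session, assuming a configured YARN client (nothing here comes from the article itself):

```shell
# List the Node Managers (slaves) and the resources they report
yarn node -list -all

# List the applications the Resource Manager is currently running
yarn application -list -appStates RUNNING

# Live view of queue and container usage across the cluster
yarn top
```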
What is Apache Ranger?
Apache Ranger is used to enable, manage, and monitor the security of data across the Hadoop platform. The main aim of Ranger is to provide security across the Hadoop ecosystem. With the arrival of Apache YARN, Hadoop gained the ability to support a data lake architecture, and we can now run multiple workloads in a multi-tenant environment. The main aims of Apache Ranger are as follows -
- A centralized authorization method across all Hadoop components.
- Centralized auditing of administrative actions and user access.
- Enhanced support for different authorization methods.
- Centralized security administration, with all security-related tasks managed from a central UI.
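Besides the central UI, Ranger Admin also exposes a public REST API through which policies can be listed and managed. A minimal sketch, assuming a Ranger Admin instance on a hypothetical host ranger-host with default credentials and a hypothetical service name:

```shell
# List every policy known to Ranger Admin (public v2 API)
curl -u admin:admin "http://ranger-host:6080/service/public/v2/api/policy"

# List only the policies belonging to one service
curl -u admin:admin \
  "http://ranger-host:6080/service/public/v2/api/service/hadoopdev/policy"
```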
How to secure Apache Hadoop with Kerberos?
Kerberos helps us prevent unauthorized access to the Hadoop cluster that stores and processes large amounts of data. It provides strong authentication in server/client applications. For every operation, the Kerberos server asks the client for its identification (principal). The KDC (Key Distribution Centre) is a centralized service that stores and controls all Kerberos principals and the realm (the domain to which the principals are assigned). The procedure of Kerberos authentication -
- Consider the user principal to be User@EXAMPLE.COM
- Let the service principal be hdfs/node23.example.com@EXAMPLE.COM
To access HDFS data from a Kerberized client machine, first add a principal for the user to the KDC -

kadmin.local -q "addprinc -pw orzota hdfs-user"

Then obtain a ticket with kinit hdfs-user, entering the password at the prompt (Password for hdfs-user@ORZOTAADMIN.COM:). Running klist will show the period for which the ticket is valid and the service principal it was issued for. With this, we can see that the user has been added to the Kerberos database, and whenever a connection needs to be established, authentication is done first with a key stored in the KDC, via the TGS (Ticket Granting Service).
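The steps above can be sketched as a single session; the principal and realm follow the article's example, and the ticket-cache details are left to klist rather than reproduced here:

```shell
# On the KDC host: create the user principal (as above)
kadmin.local -q 'addprinc -pw orzota hdfs-user'

# On the client: obtain a ticket-granting ticket
kinit hdfs-user
# Password for hdfs-user@ORZOTAADMIN.COM:

# Show the ticket's validity window and its service principal
klist

# With a valid ticket, HDFS commands authenticate via Kerberos
hdfs dfs -ls /
```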
How to secure Apache Hadoop with Apache Ranger?
In the case of Ranger, we work with repositories. These repositories are assigned to the agents or plugins operating within the corresponding components. Associated with them are policies that help to administer them: we can specify which operations they may allow, like read, write, etc. The significant advantage of using Ranger is that it gives us the ability to build our own plugins as per requirement. The steps needed to install and configure Ranger are as follows. First of all, we have to install Ranger and its dependencies -

yum install ranger-admin
yum install ranger-usersync
yum install ranger-hdfs-plugin
yum install ranger-hive-plugin

Now we have to set up the Ranger admin. A setup script shipped with it needs to be run, which will make the following modifications -
- it will add the ranger user and group
- it will create the ranger DB
- MySQL users will be created with the proper grants
Set up ranger-usersync by editing the following in its configuration -

Update "POLICY_MGR_URL" to point to your Ranger host

Then start the ranger-usersync service.
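As a sketch, the relevant usersync configuration might look like this; the hostname is hypothetical, and the property names are those used by Ranger's usersync install.properties:

```
# ranger-usersync install.properties (illustrative values)
POLICY_MGR_URL = http://ranger-host:6080
SYNC_SOURCE = unix
SYNC_INTERVAL = 5
```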
Ranger has now been installed and configured properly, so we can move on to the Ranger UI. Open the URL stated in the ranger-usersync step. A window will appear asking for a login and password; by default, both the login ID and the password are admin. Different repositories will appear on the Ranger Console home page. The console header lists the Ranger plugins, and from there we can make a plugin active. Some of the plugins provided by Ranger are as follows -
- Apache Hive
- Apache HBase
- Apache Kafka
- Apache Knox
- Apache Hadoop YARN
- Apache Storm
- Apache Atlas
- Apache Solr
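Activating a plugin also involves a step on the host of the corresponding service. A sketch for the HDFS plugin, assuming Ranger's usual plugin package layout (the property values shown are illustrative):

```shell
# Inside the installed ranger-hdfs-plugin directory:
# edit install.properties first, e.g.
#   POLICY_MGR_URL=http://ranger-host:6080
#   REPOSITORY_NAME=hadoopdev
# then enable the plugin and restart the NameNode
./enable-hdfs-plugin.sh

# To back the change out later
# ./disable-hdfs-plugin.sh
```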