Overview of Apache Hadoop Security
Apache Hadoop is a platform, or framework, in which Big Data is stored in a distributed environment and processed in parallel. Hadoop's capacity to handle both structured and unstructured data gives users more flexibility for collecting, processing, and analyzing data than relational databases or data warehouses provide. Apache Hadoop is so widespread that it has been adopted by many well-known companies such as Facebook, Yahoo, Datadog, Netflix, and Adobe. Apache Hadoop mainly consists of two components –
HDFS (Hadoop Distributed File System) is the place where all the data in a cluster is stored. From the outside, HDFS appears to be a single unit storing all the Big Data, but in reality the data is distributed across multiple nodes. HDFS follows a master-slave architecture.
There are two types of nodes in HDFS, i.e. the Namenode and the Datanode. The Namenode acts as the master and the Datanodes are the slaves. The master node holds metadata, i.e. information about which kind of data is stored on which node, while the actual data resides on the Datanodes. As we all know, hardware failure rates are fairly high, so keeping this in mind the data is replicated across multiple Datanodes so that it remains available in case of emergency.
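Replication can be observed and adjusted from the command line. The sketch below assumes a running HDFS cluster; the file path `/user/demo/data.csv` is purely illustrative:

```shell
# Copy a local file into HDFS (path is illustrative)
hdfs dfs -put data.csv /user/demo/data.csv

# Show how the file's blocks are distributed and replicated across Datanodes
hdfs fsck /user/demo/data.csv -files -blocks -locations

# Raise the replication factor for this file to 4 and wait for it to complete
hdfs dfs -setrep -w 4 /user/demo/data.csv
```

The `fsck` report also reveals the Namenode's metadata view described above: block IDs and the Datanodes each replica lives on.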
Now the data stored in HDFS needs to be processed so that useful information can be derived from it, and this is done with the help of YARN (Yet Another Resource Negotiator). YARN allocates resources and schedules tasks for this processing. YARN has two major components, i.e. the Resource Manager and the Node Manager. The Resource Manager is the master node; it receives requests and forwards them to the appropriate Node Managers, where the actual processing takes place.
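The Resource Manager / Node Manager split can be seen from the YARN command line as well. These commands are a sketch and only work against a running cluster; the application ID is illustrative:

```shell
# List the Node Managers registered with the Resource Manager
yarn node -list

# List the applications the Resource Manager is currently tracking
yarn application -list

# Show the status of one application (ID is illustrative)
yarn application -status application_1700000000000_0001
```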
Introduction to Ranger
Apache Ranger is used to enable, manage, and monitor data security across the Hadoop platform. The main aim of Ranger is to provide security across the Hadoop ecosystem. With the arrival of Apache YARN, Hadoop gained the ability to support a data-lake architecture, so we can now run multiple workloads in a multi-tenant environment. The main aims of Apache Ranger are as follows –
- A standardized authorization method across all Hadoop components.
- Centralized auditing of administrative actions and user access.
- Enhanced support for different authorization methods.
- Management of all security-related tasks from a central UI by security administrators.
So we can say that Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides a centralized platform on which security policies are managed consistently across the Hadoop components, and we can easily manage policies governing access to databases, tables, files, folders, columns, etc.
Ranger is also beneficial for security administrators, as it gives them deep visibility: they can track all access requests in real time, and audit records can be sent to multiple destination sources.
Secure Apache Hadoop with Kerberos and Ranger
Overview of Hadoop Authentication
Hadoop uses Kerberos for authentication, which helps prevent unauthorized access to the Hadoop cluster that is storing and processing large amounts of data. Kerberos provides strong authentication for server/client applications. For every operation, the Kerberos server asks clients for their identification (principal). The KDC (Key Distribution Centre) is a centralized service that stores and controls all Kerberos principals and the realm (to which all principals are assigned).
The procedure of Kerberos authentication –
- Consider the user principal to be User@EXAMPLE.COM
- and the service principal to be hdfs/node23.example.com@EXAMPLE.COM
To create a principal –
kadmin.local -q "addprinc -pw orzota hdfs-user"
To access HDFS data on a kerberized client machine, obtain a ticket with kinit and inspect it with klist –
kinit hdfs-user
Password for hdfs-user@ORZOTAADMIN.COM:
klist
klist shows the window during which the ticket is valid and the service principal it was issued for.
With this, we can see that the user or service has been added to the Kerberos database; whenever a connection needs to be established, authentication is first performed with a key stored in the KDC, via the TGS (Ticket Granting Service). This is how Kerberos authentication works.
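Putting the pieces together, a full ticket life cycle on a kerberized client might look like the following sketch (it reuses the principal and realm names from the examples above and requires a running KDC):

```shell
# On the KDC host: create the principal in the Kerberos database
kadmin.local -q "addprinc -pw orzota hdfs-user"

# On the client: obtain a ticket-granting ticket (TGT) from the KDC
kinit hdfs-user@ORZOTAADMIN.COM

# Inspect the ticket cache: validity window and service principal
klist

# Discard the tickets when finished
kdestroy
```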
Apache Hadoop Security with Ranger
In the case of Ranger, we work with repositories. These repositories are assigned to the agents or plugins that operate with the corresponding components. Along with these, there are policies associated with them that help to administer them: we can specify which operations users are allowed to execute, such as read, write, etc.
The significant advantage of using Ranger is that it gives us the ability to build our own plugins as per requirement.
Steps needed to install and configure Ranger
First of all, we have to install Ranger and its dependencies.
yum install ranger-admin
yum install ranger-usersync
yum install ranger-hdfs-plugin
yum install ranger-hive-plugin
Then set JAVA_HOME.
Now we have to set up the Ranger admin. There is a setup script in the ranger-admin installation directory which needs to be run. It makes the following modifications –
- adds the ranger user and group
- creates the Ranger DB
- creates MySQL users with the proper grants
Now we need to start the ranger-admin service
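On an HDP-style layout, the setup and start steps might look like the sketch below. The setup-script path is an assumption inferred from the ranger-usersync paths used later, and the init-script name is likewise assumed:

```shell
# Run the Ranger admin setup script (path is an assumption)
/usr/hdp/current/ranger-admin/setup.sh

# Start the ranger-admin service and verify it is running
service ranger-admin start
service ranger-admin status
```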
Set up ranger-usersync
Edit the /usr/hdp/current/ranger-usersync/install.properties file and update "POLICY_MGR_URL" to point to your Ranger host. Then run /usr/hdp/current/ranger-usersync/setup.sh.
Start the ranger-usersync service.
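The exact start command is not shown in the original; on an HDP-style install it would typically be something like the following (the init-script name is an assumption):

```shell
# Start the ranger-usersync service and check its status
service ranger-usersync start
service ranger-usersync status
```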
Ranger has now been installed and configured properly. Next we move on to the Ranger UI.
Open the URL noted in the ranger-usersync step. A window will appear asking for a login and password; by default, both the login ID and the password are ADMIN.
Different repositories appear on the Ranger Console home page. The Ranger console header also lists the Ranger plugins, from where each plugin can be made active. Some of the plugins provided by Ranger are as follows –
With the help of the above plugins, a Hadoop administrator can create new policies for users to access the services of Hadoop.
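Besides the UI, policies can be created through Ranger's public REST API. The sketch below is hedged: the service name `hadoopdev`, the user `analyst`, the path, and the host are all illustrative, and default admin credentials are assumed:

```shell
# Create an HDFS policy granting read/execute on /data/raw to user 'analyst'
curl -u admin:admin -X POST \
  -H "Content-Type: application/json" \
  http://ranger-host:6080/service/public/v2/api/policy \
  -d '{
    "service": "hadoopdev",
    "name": "raw-data-read",
    "resources": { "path": { "values": ["/data/raw"], "isRecursive": true } },
    "policyItems": [{
      "users": ["analyst"],
      "accesses": [{ "type": "read",    "isAllowed": true },
                   { "type": "execute", "isAllowed": true }]
    }]
  }'
```

Policies created this way show up in the Ranger console alongside those created through the UI.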
For every plugin we enable, the associated component needs to be restarted; for example, if we have enabled the HDFS plugin, we have to restart HDFS.
A Comprehensive Approach
Apache Hadoop provides a platform for distributed storage and distributed processing of very large data sets. To learn more about Apache Hadoop, we advise taking the following steps –