Overview of Apache Flink Security
This article gives an overview of Apache Flink security. Flink's Kerberos-based security aims to:
- Provide secure data access for jobs in the cluster through connectors.
- Authenticate to ZooKeeper.
- Authenticate to Hadoop components.
Unlike Hadoop delegation tokens or ticket cache entries, Kerberos keytabs are not limited to a fixed time frame. In a production deployment, authentication to secure data sources must remain valid for a long duration: days, weeks, or even months. At present, a Flink cluster runs either with configured keytab credentials or with Hadoop delegation tokens. If a specific job needs a different keytab, a separate Flink cluster with different settings can be launched quickly; multiple Flink clusters can run simultaneously in a YARN or Mesos environment.
How does Apache Flink Security work?
Conceptually, a Flink program may use first- or third-party connectors (HDFS, Cassandra, Flume, Kafka, Kinesis, etc.) that require an authentication method such as Kerberos, password, or SSL/TLS. Apache Flink provides first-class support only for Kerberos authentication, while keeping the security-related requirements on connectors minimal. Kafka (0.9+), HDFS, HBase, and ZooKeeper are the connectors and services supported for Kerberos authentication. The Flink security modules (implementing org.apache.flink.runtime.security.modules.SecurityModule) are installed at startup. The following sections describe each of the security modules.
Hadoop Security Module
The Hadoop security module uses the Hadoop UserGroupInformation (UGI) class to establish a process-wide login user context. The login user is used for all interactions with Hadoop, including HDFS, HBase, and YARN. If security is enabled, the login user carries whatever Kerberos identity is configured; otherwise, the login user conveys only the identity of the OS user that launched the cluster.
JAAS Security Module
This module provides a dynamic JAAS configuration to the cluster for components that rely on JAAS, such as ZooKeeper and Kafka. The user can also provide a static JAAS configuration using the steps described in the Java SE documentation; static entries are overridden by the dynamic entries provided through this module.
Zookeeper Security Module
The Zookeeper security module configures security-related settings such as the ZooKeeper service name (default: zookeeper) and the JAAS login context name (default: Client).
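As a sketch, these ZooKeeper-related defaults can be overridden in flink-conf.yaml; the key names below follow Flink's standard configuration but should be checked against the documentation for your Flink version:

```yaml
# Illustrative flink-conf.yaml fragment overriding ZooKeeper security defaults.
# JAAS login context name used for ZooKeeper authentication (default: Client):
zookeeper.sasl.login-context-name: Client
# ZooKeeper service name used for SASL (default: zookeeper):
zookeeper.sasl.service-name: zookeeper
```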
What are the deployment modes in Apache Flink Security?
The deployment modes are --
- Standalone mode
- YARN/Mesos mode
Standalone mode
- Add the security-related configuration options to the Flink configuration file on all cluster nodes.
- Make sure that the keytab file exists at the path indicated by security.kerberos.login.keytab on all cluster nodes.
- Deploy the Flink cluster.
YARN/Mesos mode
- Add the security-related configuration options to the Flink configuration file on the client.
- Make sure that the keytab file exists at the path indicated by security.kerberos.login.keytab on the client.
- Deploy the Flink cluster.
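The keytab-based setup above can be sketched with a minimal flink-conf.yaml fragment; the keytab path, principal, and context names are placeholders to adapt to your environment:

```yaml
# Illustrative flink-conf.yaml fragment for keytab-based Kerberos login.
security.kerberos.login.keytab: /path/to/flink.keytab
security.kerberos.login.principal: flink-user@EXAMPLE.COM
# JAAS login contexts that should receive the Kerberos credentials
# (e.g. ZooKeeper's "Client" context and Kafka's "KafkaClient" context):
security.kerberos.login.contexts: Client,KafkaClient
```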
Using kinit (YARN only)
In YARN mode, it is feasible to deploy a secure Flink cluster without a keytab, using only the Kerberos ticket cache (as managed by kinit). This avoids the complexity of generating keytabs. The steps for running a secure Apache Flink cluster using kinit are --
- Add the security-related configuration options to the Flink configuration file on the client.
- Log in using the kinit command.
- Deploy the Flink cluster.
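The kinit-based workflow above might look like the following shell session; the principal and script paths are illustrative, and the exact session command depends on your Flink version and deployment:

```shell
# Illustrative session: authenticate with kinit, then deploy Flink on YARN
# using the resulting ticket cache (no keytab configured).
kinit alice@EXAMPLE.COM   # obtain a Kerberos ticket (prompts for password)
klist                     # verify the ticket cache contents
./bin/yarn-session.sh     # start a Flink YARN session using the cached ticket
```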
The security features of Apache Flink are --
- Kerberos Authentication Support
- Service Level Authorization
- Transport Security (SSL/TLS)
Kerberos Authentication Support
- There is a cluster-level Kerberos identity. It is keytab-based and shared by all jobs, so it is not job-specific.
- It enables Kerberos authentication to data sources and sinks such as HDFS and Kafka.
- It protects the state data.
- This is supported in standalone and YARN deployment modes.
Service Level Authorization
- It restricts access to your Flink cluster.
- Protects all the endpoints, including control path, intra-cluster data transfer, web UI, etc.
- A simple shared secret is either configured or automatically generated. It may be stored on clients and in the cluster.
- It is supported in Standalone and YARN deployment modes.
Transport Security (SSL/TLS)
- It enables SSL/TLS for all network connections.
- It may be enabled on a per-endpoint basis.
- It is supported in Standalone and YARN deployment modes.
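As an illustrative sketch, SSL can be switched on for internal (intra-cluster) communication in flink-conf.yaml; key names have varied across Flink versions, so treat the following as assumptions to verify against your version's configuration reference:

```yaml
# Illustrative flink-conf.yaml fragment enabling SSL for internal communication.
security.ssl.internal.enabled: true
security.ssl.internal.keystore: /path/to/flink.keystore
security.ssl.internal.keystore-password: keystore_password
security.ssl.internal.key-password: key_password
security.ssl.internal.truststore: /path/to/flink.truststore
security.ssl.internal.truststore-password: truststore_password
```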
Installation of Apache Flink on AWS
Amazon Web Services (AWS) provides cloud computing services on which you can run Apache Flink.
EMR - Elastic MapReduce
Amazon Elastic MapReduce (Amazon EMR) is a web service that quickly sets up a Hadoop cluster and takes care of all the setup details. Therefore, it is the recommended way to run Flink on Amazon Web Services.
Create an EMR Cluster
Make sure to set up IAM roles when creating your cluster; this allows access to your S3 buckets if required.
Installing Apache Flink on AWS EMR Cluster
After creating your cluster, you can connect to the master node and install Flink. Download a binary version of Flink matching your EMR cluster from the download page. After extracting the Flink distribution and setting the Hadoop configuration directory, you are ready to deploy Flink jobs via YARN -
HADOOP_CONF_DIR=/etc/hadoop/conf bin/flink run -m yarn-cluster examples/streaming/WordCount.jar
S3 - Simple Storage Service
Amazon Simple Storage Service (S3) can be used with Flink for reading and writing data, as well as with the streaming state backends. You can use S3 objects by providing paths of the form -
s3://<your-bucket>/<endpoint>
Set S3 FileSystem
Flink treats S3 as a FileSystem, and interactions are done through a Hadoop S3 FileSystem client. There are two popular S3 file system implementations available: the S3AFileSystem and the NativeS3FileSystem.
S3AFileSystem - A file system for reading and writing regular files. It works with IAM roles and uses Amazon's SDK internally.
NativeS3FileSystem - Also used for reading and writing regular files, but it does not work with IAM roles, and the maximum object size is 5 GB.
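To select one of the two implementations, the Hadoop configuration (core-site.xml) maps the s3:// URL scheme to a FileSystem class. The following sketch uses the standard Hadoop class name for S3A; swap in the NativeS3FileSystem class if you choose that implementation instead:

```xml
<!-- Illustrative core-site.xml fragment selecting the S3A implementation -->
<configuration>
  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
</configuration>
```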
Configure Access Credentials
After setting up the S3 file system, you want to make sure that Apache Flink is allowed to access your S3 buckets.
Identity and Access Management (IAM) (Recommended)
You can use IAM features to give Flink instances secure access to S3 buckets without distributing credentials.
Common issues in Installation of Apache Flink on AWS
- Missing S3 FileSystem Configuration.
- Amazon Web Services access key ID and secret access key not specified.
- ClassNotFoundException
- IOException
- NullPointerException
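The "access key ID and secret access key not specified" issue above arises when neither IAM roles nor static credentials are configured. If IAM roles cannot be used (for example, when running outside EC2), access keys can be supplied in core-site.xml instead; the property names below assume the S3A file system, and the values are placeholders:

```xml
<!-- Illustrative core-site.xml fragment with static S3A access credentials.
     Prefer IAM roles where possible; static keys are a fallback only. -->
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
```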
A Comprehensive Approach
Real-time processing of data has enabled enterprises to perform real-time intelligence and real-time activity monitoring in very little time. To know more about real-time processing, we advise taking the following steps --
- Read more about Streaming Analytics Architecture, Tools and Best Practices
- Understand What is Apache Flink, Its Advantages