Apache Spark is a distributed processing system used primarily for big data workloads. It achieves fast performance through optimized query execution and in-memory caching, and it supports general batch processing, machine learning, ad hoc queries, streaming analytics, graph processing, and more.
This article gives an overview of Apache Spark security and its installation on AWS. On AWS, the Apache Spark service is available through Amazon EMR. You can create a managed Apache Spark cluster from the AWS Management Console, the AWS CLI, or the Amazon EMR API. You can also use additional Amazon EMR features such as Amazon S3 connectivity through EMRFS, integration with the EC2 Spot market and the AWS Glue Data Catalog, and auto-scaling for adding and removing instances from the cluster.
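As a rough sketch, a managed Spark cluster can be created from the AWS CLI with the `aws emr create-cluster` command; the cluster name, release label, instance type and count, and key pair name below are placeholder assumptions, not values from this article:

```shell
# Sketch: create a managed Spark cluster on Amazon EMR via the AWS CLI.
# The name, release label, instance type/count, and key pair are placeholders;
# this requires configured AWS credentials and default EMR roles.
aws emr create-cluster \
  --name "spark-demo-cluster" \
  --release-label emr-5.20.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles
```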
Authentication in Apache Spark Security
Apache Spark supports authentication through a shared secret. The spark.authenticate configuration parameter controls whether Spark's internal communication protocols authenticate using that shared secret.
Both the sender and the receiver must hold the same shared secret to communicate; if the secrets do not match, they are not allowed to communicate with each other. The shared secret is created as follows:
For Spark on YARN and local deployments, setting spark.authenticate to true will automatically generate and distribute the shared secret.
For other Spark deployments, the shared secret must be configured on each node via spark.authenticate.secret.
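As a minimal sketch, the corresponding entries in conf/spark-defaults.conf on each node would look like this (the secret value is a placeholder):

```
spark.authenticate        true
spark.authenticate.secret changeme-shared-secret
```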
The Spark UI can be secured with HTTPS/SSL through the spark.ssl settings and with javax servlet filters through the spark.ui.filters setting.
The user specifies javax servlet filters that authenticate the user. Once a user is logged in, Spark compares that user against the view ACLs to ensure they are authorized to view the UI. The view ACLs are controlled by spark.acls.enable, spark.ui.view.acls, and spark.ui.view.acls.groups. Spark also supports modify ACLs, which control who can modify a running Spark application; these are controlled by spark.acls.enable, spark.modify.acls, and spark.modify.acls.groups.
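A sketch of the corresponding spark-defaults.conf entries (the user and group names are placeholders):

```
spark.acls.enable          true
spark.ui.view.acls         alice,bob
spark.ui.view.acls.groups  analysts
spark.modify.acls          alice
spark.modify.acls.groups   admins
```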
For event logging, the event log files must have the proper permissions set. The directory should be created by a super user, with group permissions that allow users to write to the directory while preventing them from altering or deleting files they do not own; only the owner is permitted to do that.
Encryption in Apache Spark Security
Spark supports SASL encryption and SSL for HTTP protocols, as well as AES-based encryption for RPC connections.
SSL configuration is organized hierarchically: base settings under the spark.ssl namespace apply to all protocols unless overridden for a specific one. SSL must be configured on every node and for each component involved in communication.
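A minimal sketch of base SSL settings under the spark.ssl namespace (the paths and passwords are placeholders):

```
spark.ssl.enabled            true
spark.ssl.keyStore           /path/to/keystore.jks
spark.ssl.keyStorePassword   changeit
spark.ssl.trustStore         /path/to/truststore.jks
spark.ssl.trustStorePassword changeit
```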
The key-store is prepared on the client side and then distributed to the executors, which use it as part of their application. In YARN mode, the user can deploy these files before the application starts by using the spark.yarn.dist.archives configuration setting.
Standalone Mode in Apache Spark Security
In standalone mode, the user must provide key-stores and SSL configuration options for the master and the workers. To let executors use the SSL settings inherited from the worker that launched them, set spark.ssl.useNodeLocalConf to true. When this parameter is set, the executors do not use the SSL settings provided by the user on the client side.
Key-stores are generated with the keytool program. The steps for configuring the key-stores and the trust-store are as follows:
A key pair is generated on each node.
The public key from the pair is exported to a file on each node.
All the public keys are imported into a single trust-store.
The trust-store is then distributed to all the nodes.
Configuring SASL Encryption
Presently, SASL encryption is supported for block transfer when authentication is enabled. To enable SASL encryption, set spark.authenticate.enableSaslEncryption to true. When using an external shuffle service, unencrypted connections can be disabled by setting spark.network.sasl.serverAlwaysEncrypt to true.
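Taken together, a sketch of the SASL-related entries in spark-defaults.conf:

```
spark.authenticate                       true
spark.authenticate.enableSaslEncryption  true
spark.network.sasl.serverAlwaysEncrypt   true
```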
Installation of Apache Spark on AWS
The steps to install Apache Spark on AWS are listed below:
You need an AWS account to use AWS services.
You need to create an EC2 key pair and import it into an SSH client.
You need a server to install Spark on; an Amazon EC2 instance serves this purpose.
Visit the Apache Spark download page in your web browser and generate a download link that can be accessed from the EC2 instance. Copy the download link to your clipboard so you can paste it on the EC2 instance. From the EC2 instance, run these commands:
# Download Spark to the ec2-user's home directory
# (the URL shown is the Apache archive location for this release; use your copied link)
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

# Unpack Spark in the /opt directory
sudo tar zxvf spark-2.4.0-bin-hadoop2.7.tgz -C /opt
# Update permissions on installation
sudo chown -R ec2-user:ec2-user /opt/spark-2.4.0-bin-hadoop2.7
# Create a symbolic link to make it easier to access
sudo ln -fs spark-2.4.0-bin-hadoop2.7 /opt/spark
Set the SPARK_HOME environment variable to complete your installation. You need to log out and log back in for the change to take effect.
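A minimal sketch of setting the variable, assuming the /opt/spark symlink created above and a bash login shell:

```shell
# Set SPARK_HOME for the current session and persist it for future logins.
export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH="$SPARK_HOME/bin:$PATH"' >> ~/.bashrc
```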