Apache Spark on AWS | Installation and Configuration

Overview of Apache Spark

Apache Spark is a distributed processing system used for large data workloads. It uses optimized execution and in-memory caching for fast performance, and it supports general batch processing, machine learning, ad hoc queries, streaming analytics, graph databases, and more.

This article gives an overview of Apache Spark security and of installing Spark on AWS. On AWS, Apache Spark is offered as part of Amazon EMR. You can create a managed Apache Spark cluster from the AWS Management Console, the AWS CLI, or the Amazon EMR API. You can also use additional Amazon EMR features such as Amazon S3 connectivity through Amazon EMRFS, integration with the EC2 Spot market and the Glue Data Catalog, and auto-scaling for adding and removing instances from the cluster.
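As a rough sketch of the AWS CLI route, the script below assembles an aws emr create-cluster call that requests Spark. The cluster name, release label, instance type, instance count and key name are placeholder assumptions rather than values from this article, and the script only prints the command unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: create a managed Spark cluster on Amazon EMR with the AWS CLI.
# Every value below is a placeholder (assumption); replace it with your own.
CLUSTER_NAME="${CLUSTER_NAME:-spark-demo}"
RELEASE_LABEL="${RELEASE_LABEL:-emr-5.36.0}"
INSTANCE_TYPE="${INSTANCE_TYPE:-m5.xlarge}"
INSTANCE_COUNT="${INSTANCE_COUNT:-3}"
KEY_NAME="${KEY_NAME:-my-key-pair}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command only, 0 = really call AWS

# Print a command, and execute it only when DRY_RUN is not 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

run aws emr create-cluster \
  --name "$CLUSTER_NAME" \
  --release-label "$RELEASE_LABEL" \
  --applications Name=Spark \
  --instance-type "$INSTANCE_TYPE" \
  --instance-count "$INSTANCE_COUNT" \
  --ec2-attributes KeyName="$KEY_NAME" \
  --use-default-roles
```

Running it with DRY_RUN=0 requires configured AWS credentials and the default EMR roles to exist in the account.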

What are the features and use cases of Apache Spark?

The features and use cases of Apache Spark are listed below:

Features

Fast Performance
Develop the application quickly
Create a variety of workflows
Integration with the Amazon EMR feature set

Use Cases

Stream processing
Machine learning
Interactive SQL

What is Apache Spark Security?

Apache Spark supports authentication through a shared secret. Authentication is configured with the spark.authenticate configuration parameter, which controls whether Spark's communication protocols authenticate using the shared secret.

Both the sender and the receiver must hold the same shared secret to communicate. They will not be allowed to communicate with each other if their secrets do not match. The shared secret is created as follows:

For Spark on YARN and local deployments, setting spark.authenticate to true generates and distributes the shared secret automatically.
For any other Spark deployment, the secret should be configured on every node through spark.authenticate.secret.
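A minimal sketch of the two settings, written into a Spark properties file. The scratch directory and the secret value are assumptions for illustration; on a real node the file would normally live under the Spark conf directory and the secret would be a strong random value.

```shell
#!/bin/sh
# Sketch: enable shared-secret authentication in a Spark properties file.
# CONF_DIR defaults to a scratch directory here (assumption).
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
CONF_FILE="$CONF_DIR/spark-defaults.conf"

# Placeholder secret (assumption); never commit a real secret to version control.
SPARK_SECRET="${SPARK_SECRET:-change-me}"

cat >> "$CONF_FILE" <<EOF
spark.authenticate        true
# Needed on every node when the secret is not generated automatically:
spark.authenticate.secret $SPARK_SECRET
EOF

cat "$CONF_FILE"
```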

Web UI

The Spark web UI can be secured by using HTTPS/SSL settings and by using javax servlet filters through the spark.ui.filters setting.

Authentication

The user specifies javax servlet filters that can authenticate the user. Once the user is logged in, Spark compares the user against the view ACLs to ensure that they are authorized to view the UI. The behavior of the ACLs is controlled by spark.acls.enable, spark.ui.view.acls and spark.ui.view.acls.groups. Spark also supports modify ACLs, which control who may modify a running Spark application. These are controlled by spark.acls.enable, spark.modify.acls and spark.modify.acls.groups.
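The properties named above could be combined roughly as follows. The filter class, user names and group names are hypothetical and only illustrate the shape of the configuration; the file is written to a scratch directory by default.

```shell
#!/bin/sh
# Sketch: web UI filter plus view and modify ACLs in a Spark properties file.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
CONF_FILE="$CONF_DIR/spark-defaults.conf"

# Hypothetical values (assumptions): a servlet filter class and some users and groups.
AUTH_FILTER="${AUTH_FILTER:-com.example.BasicAuthFilter}"
VIEW_USERS="${VIEW_USERS:-alice,bob}"
VIEW_GROUPS="${VIEW_GROUPS:-analysts}"
MODIFY_USERS="${MODIFY_USERS:-alice}"
MODIFY_GROUPS="${MODIFY_GROUPS:-admins}"

cat >> "$CONF_FILE" <<EOF
spark.ui.filters          $AUTH_FILTER
spark.acls.enable         true
spark.ui.view.acls        $VIEW_USERS
spark.ui.view.acls.groups $VIEW_GROUPS
spark.modify.acls         $MODIFY_USERS
spark.modify.acls.groups  $MODIFY_GROUPS
EOF

cat "$CONF_FILE"
```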

Event Logging

For event logging, the directory that holds the event log files must have the proper permissions set. The owner of the directory should be the super user. The group permissions should allow users to write to the directory while preventing unauthorized users from altering or updating a file. Only the owner of a file is permitted to do that.
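One common way to get this behavior on a local filesystem is the sticky bit, sketched below. The directory path is a scratch location chosen for illustration, and world-writable mode 1777 is an assumption; a stricter group-only mode may suit your environment better.

```shell
#!/bin/sh
# Sketch: prepare an event log directory whose files only their owners can alter.
# EVENT_LOG_DIR is a scratch path here (assumption); use your real log location.
EVENT_LOG_DIR="${EVENT_LOG_DIR:-$(mktemp -d)/spark-events}"

mkdir -p "$EVENT_LOG_DIR"

# 1777 = rwxrwxrwt: users may write new files, but the sticky bit stops them
# from deleting or renaming files they do not own.
chmod 1777 "$EVENT_LOG_DIR"

ls -ld "$EVENT_LOG_DIR"
```

Spark would then be pointed at this directory through spark.eventLog.enabled and spark.eventLog.dir.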

Encryption in Apache Spark Security

Spark supports SASL encryption, and SSL for HTTP protocols. It also supports AES-based encryption for RPC connections.

SSL Configuration for Apache Spark Security

SSL configuration is organized hierarchically. Basic settings can be provided once and are then applied to all the protocols. The SSL settings live in the spark.ssl namespace. SSL must be configured on every node and for each component involved in the communication.
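A sketch of base settings in the spark.ssl namespace follows. The store paths, the password and the protocol choice are placeholder assumptions, and the file is written to a scratch directory by default.

```shell
#!/bin/sh
# Sketch: base SSL settings shared by all protocols, in a Spark properties file.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
CONF_FILE="$CONF_DIR/spark-defaults.conf"

# Placeholder paths and password (assumptions).
KEYSTORE_PATH="${KEYSTORE_PATH:-/opt/spark/conf/keystore.jks}"
TRUSTSTORE_PATH="${TRUSTSTORE_PATH:-/opt/spark/conf/truststore.jks}"
STORE_PASSWORD="${STORE_PASSWORD:-changeit}"

cat >> "$CONF_FILE" <<EOF
spark.ssl.enabled            true
spark.ssl.keyStore           $KEYSTORE_PATH
spark.ssl.keyStorePassword   $STORE_PASSWORD
spark.ssl.trustStore         $TRUSTSTORE_PATH
spark.ssl.trustStorePassword $STORE_PASSWORD
spark.ssl.protocol           TLSv1.2
EOF

cat "$CONF_FILE"
```

Because the configuration is hierarchical, these base values act as defaults that more specific settings can override.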

YARN Mode

The key-store is prepared on the client side and then distributed to the executors, which use it as part of the application. This is possible because the user can deploy files before the application starts in YARN by using the spark.yarn.dist.archives configuration setting.

Standalone Mode in Apache Spark Security

The user should provide the key-stores and configuration options for the master and the workers. The user must also allow the executors to use the SSL settings inherited from the worker that spawned them. This can be done by setting spark.ssl.useNodeLocalConf to true. When this parameter is set, the executors do not use the settings provided by the user on the client side.

Preparing the key-stores of Apache Spark

The key-stores are generated with the keytool program. The steps involved in configuring the key-stores and the trust-stores are as follows:

For each node, a key pair is generated.
The public key from the pair is exported to a file on each node.
All the public keys are imported into a single trust-store.
The trust-store is then distributed to the nodes.
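The steps above can be sketched with keytool as follows. The node names, file names and password are placeholder assumptions, and for brevity the sketch runs every step on one machine, whereas in practice each key pair would be generated on its own node. Commands are only printed unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: key-store and trust-store preparation with keytool (ships with the JDK).
NODES="${NODES:-node1 node2}"            # placeholder node names (assumption)
STORE_PASS="${STORE_PASS:-changeit}"     # placeholder password (assumption)
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print commands only

# Print a command, and execute it only when DRY_RUN is not 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

for node in $NODES; do
  # 1. Generate a key pair for the node.
  run keytool -genkeypair -alias "$node" -keyalg RSA -keysize 2048 \
    -dname "CN=$node" -keystore "$node-keystore.jks" \
    -storepass "$STORE_PASS" -keypass "$STORE_PASS"

  # 2. Export the node's public certificate to a file.
  run keytool -exportcert -alias "$node" -keystore "$node-keystore.jks" \
    -storepass "$STORE_PASS" -file "$node.cer"

  # 3. Import the certificate into the single shared trust-store.
  run keytool -importcert -noprompt -alias "$node" -file "$node.cer" \
    -keystore truststore.jks -storepass "$STORE_PASS"
done

# 4. Distribute truststore.jks to every node, for example with scp.
```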

Configuring SASL Encryption

Presently, SASL encryption is supported for the block transfer service when authentication is enabled. To enable SASL encryption, set spark.authenticate.enableSaslEncryption to true. When using an external shuffle service, it is possible to disable unencrypted connections by setting spark.network.sasl.serverAlwaysEncrypt to true.
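A small sketch of these settings is shown below. It assumes the application and the external shuffle service read separate properties files; both file locations are scratch paths invented for illustration.

```shell
#!/bin/sh
# Sketch: SASL encryption settings for an application and a shuffle service.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
APP_CONF="$CONF_DIR/spark-defaults.conf"        # application side (assumed path)
SHUFFLE_CONF="$CONF_DIR/shuffle-service.conf"   # shuffle service side (hypothetical file)

# SASL encryption requires authentication to be enabled as well.
printf '%s %s\n' \
  spark.authenticate true \
  spark.authenticate.enableSaslEncryption true \
  >> "$APP_CONF"

# Reject unencrypted connections on the external shuffle service.
printf '%s %s\n' \
  spark.network.sasl.serverAlwaysEncrypt true \
  >> "$SHUFFLE_CONF"

cat "$APP_CONF" "$SHUFFLE_CONF"
```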

Installation of Apache Spark on AWS

The steps to install Apache Spark on AWS are listed below:

Prerequisites

You need an AWS account to access AWS services.
You need to create an EC2 key pair and import it into an SSH client.
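If you prefer the command line to the console, the key pair can be created roughly like this. The key name and file name are placeholder assumptions, and the script only prints what it would do unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: create an EC2 key pair and save the private key for use with SSH.
KEY_NAME="${KEY_NAME:-spark-key}"        # placeholder name (assumption)
KEY_FILE="${KEY_FILE:-$KEY_NAME.pem}"
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print only, 0 = really call AWS

if [ "$DRY_RUN" = "1" ]; then
  echo "+ aws ec2 create-key-pair --key-name $KEY_NAME --query KeyMaterial --output text > $KEY_FILE"
  echo "+ chmod 400 $KEY_FILE"
else
  # The private key is returned once, so it is written straight to a file.
  aws ec2 create-key-pair --key-name "$KEY_NAME" \
    --query 'KeyMaterial' --output text > "$KEY_FILE"
  # SSH clients refuse private keys that other users can read.
  chmod 400 "$KEY_FILE"
fi
```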

Why EC2?

Spark needs a server to run on, and Amazon EC2 provides one.

Configuring and Launching a new EC2 instance

The steps to configure and launch a new EC2 instance are below:

Creating an IAM role

Log in to the AWS Management Console and select the Identity and Access Management (IAM) service.
Select ‘Create New Role.’
On step 1, set the role name.
On step 2, set the role type.
Step 3 can be skipped. On step 4, attach the policy.
On step 5, review the settings and select Create Role.
Select the cube icon to return to the list of AWS service offerings.
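The console steps above have a rough AWS CLI equivalent, sketched here. The role name and the attached policy are assumptions, since the article does not say which policy to use; the trust policy lets EC2 assume the role. Commands are only printed unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: create an IAM role that EC2 instances can assume, plus an instance profile.
ROLE_NAME="${ROLE_NAME:-spark-ec2-role}"   # placeholder name (assumption)
POLICY_ARN="${POLICY_ARN:-arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess}"  # example policy (assumption)
TRUST_POLICY="${TRUST_POLICY:-$(mktemp)}"
DRY_RUN="${DRY_RUN:-1}"                    # 1 = print commands only

# Print a command, and execute it only when DRY_RUN is not 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

# Trust policy: allow the EC2 service to assume this role.
cat > "$TRUST_POLICY" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

run aws iam create-role --role-name "$ROLE_NAME" \
  --assume-role-policy-document "file://$TRUST_POLICY"
run aws iam attach-role-policy --role-name "$ROLE_NAME" --policy-arn "$POLICY_ARN"

# An instance profile is what actually gets attached to an EC2 instance.
run aws iam create-instance-profile --instance-profile-name "$ROLE_NAME"
run aws iam add-role-to-instance-profile \
  --instance-profile-name "$ROLE_NAME" --role-name "$ROLE_NAME"
```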

Creating a security group

From the AWS Management Console, select the EC2 service.
Navigate to Network and Security and create a security group.
Set the security group name to a value of your choice, and set the description to something like "Security group protecting the Spark instance".
Select the Inbound tab and then select Add Rule.
Set the type to SSH and the source to My IP. If your IP address changes, the rule can be updated from here.
Select Add Rule again if you need to add another rule.
Select the Outbound tab to review the outbound rules.
Select Create.
If the Name field of the new group is blank, you can set it now.
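A comparable security group could be created from the AWS CLI along these lines. The group name and IP address are placeholder assumptions (the address comes from a documentation-only range), and referring to the group by name assumes the account's default VPC. Commands are only printed unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: security group that allows SSH from a single IP address.
SG_NAME="${SG_NAME:-spark-sg}"           # placeholder name (assumption)
MY_IP="${MY_IP:-203.0.113.10}"           # placeholder address (assumption)
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print commands only

# Print a command, and execute it only when DRY_RUN is not 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

run aws ec2 create-security-group \
  --group-name "$SG_NAME" \
  --description "Security group protecting the Spark instance"

# Inbound rule: SSH (TCP port 22) from one address only.
run aws ec2 authorize-security-group-ingress \
  --group-name "$SG_NAME" \
  --protocol tcp --port 22 --cidr "$MY_IP/32"
```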

Creating the EC2 Instance

Select the EC2 service from the AWS Management Console.
Navigate to Instances and select Launch Instance. This starts a wizard workflow.
On step 1, select an Amazon Machine Image (AMI).
On step 2, select the instance type.
On step 3, configure the instance details. Set IAM Role to the IAM role created earlier.
On step 4, add storage.
On step 5, tag the instance.
On step 6, configure the security group: choose ‘Select an existing security group’ and pick the one you created earlier.
On step 7, review the settings and select Launch.
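The wizard steps map roughly onto a single aws ec2 run-instances call, sketched below. The AMI ID, instance type, tag value and resource names are all placeholder assumptions that must be replaced with real values from your account. The command is only printed unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: launch one EC2 instance with the role, key pair and security group.
AMI_ID="${AMI_ID:-ami-xxxxxxxx}"                       # placeholder AMI (assumption)
INSTANCE_TYPE="${INSTANCE_TYPE:-t3.medium}"            # placeholder type (assumption)
KEY_NAME="${KEY_NAME:-spark-key}"                      # placeholder names (assumptions)
SG_NAME="${SG_NAME:-spark-sg}"
INSTANCE_PROFILE="${INSTANCE_PROFILE:-spark-ec2-role}"
DRY_RUN="${DRY_RUN:-1}"                                # 1 = print the command only

# Print a command, and execute it only when DRY_RUN is not 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

run aws ec2 run-instances \
  --image-id "$AMI_ID" \
  --instance-type "$INSTANCE_TYPE" \
  --count 1 \
  --key-name "$KEY_NAME" \
  --security-groups "$SG_NAME" \
  --iam-instance-profile "Name=$INSTANCE_PROFILE" \
  --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=spark-node}]"
```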

Managing the EC2 Instance

A running EC2 instance keeps accruing charges until it is stopped.

To start or stop an EC2 instance, select it in the table of instances and choose Actions. From the menu, you can start or stop the instance. Instance usage is not charged while the instance is stopped, although attached EBS storage is still billed.
To permanently terminate an instance, select Actions from the table of instances. If termination is blocked, select Instance Settings and change termination protection.
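The same actions are available from the AWS CLI, as in this sketch. The instance ID is a placeholder assumption, and the ACTION variable is a convenience invented for this example. Commands are only printed unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: start, stop or terminate an EC2 instance from the command line.
INSTANCE_ID="${INSTANCE_ID:-i-0123456789abcdef0}"   # placeholder ID (assumption)
ACTION="${ACTION:-stop}"                            # start | stop | terminate
DRY_RUN="${DRY_RUN:-1}"                             # 1 = print commands only

# Print a command, and execute it only when DRY_RUN is not 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

case "$ACTION" in
  start)
    run aws ec2 start-instances --instance-ids "$INSTANCE_ID"
    ;;
  stop)
    run aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
    ;;
  terminate)
    # Termination is permanent. Turn off termination protection first.
    run aws ec2 modify-instance-attribute --instance-id "$INSTANCE_ID" \
      --no-disable-api-termination
    run aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
    ;;
  *)
    echo "unknown action: $ACTION" >&2
    exit 1
    ;;
esac
```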

Connecting to the EC2 Instance

Select the EC2 instance from the dashboard. Details about the instance will appear.
Record the public IP address of the instance. You may access this address from a web browser.
Once the instance is running, you may use an SSH client to connect to the public IP.
A login message is shown on your first login.
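From a terminal, looking up the address and connecting could look like the sketch below. The instance ID, key file and fallback IP address are placeholder assumptions; the ec2-user login matches the user name used in the installation commands later in this article. Commands are only printed unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: find the public IP of an instance and open an SSH session to it.
INSTANCE_ID="${INSTANCE_ID:-i-0123456789abcdef0}"   # placeholder ID (assumption)
KEY_FILE="${KEY_FILE:-spark-key.pem}"               # placeholder key file (assumption)
DRY_RUN="${DRY_RUN:-1}"                             # 1 = print commands only

# Print a command, and execute it only when DRY_RUN is not 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

if [ "$DRY_RUN" = "1" ]; then
  # Documentation-range address used as a stand-in during a dry run.
  PUBLIC_IP="${PUBLIC_IP:-203.0.113.20}"
else
  PUBLIC_IP=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
    --query 'Reservations[0].Instances[0].PublicIpAddress' --output text)
fi

run ssh -i "$KEY_FILE" "ec2-user@$PUBLIC_IP"
```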

What are the steps to install Apache Spark?

The steps to install Apache Spark are listed below:

Downloading Spark

Visit the Apache Spark download page in your web browser and generate a download link that can be accessed from the EC2 instance. Copy the link to your clipboard so that you can paste it on the EC2 instance. Then, from the EC2 instance, run these commands:

# Download Spark to the ec2-user's home directory
cd ~
wget http://www.gtlib.gatech.edu/pub/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

# Unpack Spark in the /opt directory
sudo tar zxvf spark-2.4.0-bin-hadoop2.7.tgz -C /opt

# Update permissions on installation
sudo chown -R ec2-user:ec2-user /opt/spark-2.4.0-bin-hadoop2.7

# Create a symbolic link to make it easier to access
sudo ln -fs spark-2.4.0-bin-hadoop2.7 /opt/spark
Set the SPARK_HOME environment variable to complete your installation. You need to log out and log in again for the change to take effect.
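One way to set the variable persistently is to append it to the login profile, as sketched here. The choice of ~/.bash_profile and the extra PATH entry are assumptions; /opt/spark is the symbolic link created by the installation commands.

```shell
#!/bin/sh
# Sketch: persist SPARK_HOME for future login shells.
# PROFILE_FILE defaults to ~/.bash_profile (assumption); adjust for your shell.
PROFILE_FILE="${PROFILE_FILE:-$HOME/.bash_profile}"

# Append the exports only once, so repeated runs do not duplicate them.
if ! grep -q 'SPARK_HOME' "$PROFILE_FILE" 2>/dev/null; then
  cat >> "$PROFILE_FILE" <<'EOF'
export SPARK_HOME=/opt/spark
export PATH="$PATH:$SPARK_HOME/bin"
EOF
fi

grep 'SPARK_HOME' "$PROFILE_FILE"
```

After logging in again, running spark-submit --version is a quick way to check that the installation is on the PATH.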
