Introduction to Apache Cassandra
Apache Cassandra is a free, open source NoSQL database. It stores data as rows that are looked up by key, much like key-value pairs. Cassandra is highly robust because it uses masterless replication: it works on the principle of clustering, with no master or slave nodes. When you query a Cassandra cluster, any node can serve the request, and you can configure how many nodes in the cluster each piece of data is replicated to.
Facebook created Cassandra and later handed it over to Apache. In designing it, Facebook drew on features of Google's Bigtable and Amazon's Dynamo. That heritage is a large part of why Apache Cassandra is popular and robust when handling large amounts of data.
Apache Cassandra Architecture
The Apache Cassandra architecture is built around a cluster whose nodes are arranged in a ring. Every node is connected to the others, and there is no central node. When a request reaches the Cassandra cluster, it hits one of the nodes, and that node processes the request.
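The ring idea can be sketched in a few lines of Python. This is only an illustration: the MD5-based token function, the 32-bit token space, and the node names are assumptions made for the example, and Cassandra's real partitioner and virtual nodes are more involved.

```python
import bisect
import hashlib

RING_SIZE = 2**32  # size of the hypothetical token space


def token(key: str) -> int:
    """Map a key (or node name) to a position on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


class Ring:
    def __init__(self, nodes):
        # Each node sits at the token derived from its name.
        self._tokens = sorted((token(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        """Return the first node at or after the key's token, wrapping around."""
        positions = [t for t, _ in self._tokens]
        i = bisect.bisect_left(positions, token(key)) % len(self._tokens)
        return self._tokens[i][1]


ring = Ring(["node1", "node2", "node3", "node4"])
for k in ("user:1", "user:2", "user:3"):
    print(k, "->", ring.owner(k))
```

Because every node can compute the same mapping, any node that receives a request can work out which node is responsible for a key without asking a central coordinator.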
For example, say you are writing something to the database. The node that receives the request acts as the coordinator: it makes sure the data is written on the nodes responsible for it, so another replica, say node two, also updates its database in the background. Alongside this there is a concept called Gossip, where each node periodically tells a few other nodes what it knows about itself and about the rest of the cluster, such as which nodes are up. By gossiping with each other, the nodes learn the state of the cluster and know where the data can be persisted.
Now suppose one node goes down. Because replication is masterless, there is no central node that all your commands must pass through. If node one is down, your commands go to node two, and node two can locate the data for you on the remaining replicas.
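A gossip-style exchange of cluster state can be sketched as follows. This is a toy model under stated assumptions: each node tracks only a heartbeat counter per known node, picks one random peer per round, and both sides keep the highest counter they have seen. The class and function names are hypothetical, and Cassandra's actual gossip messages carry more information than this.

```python
import random


class Node:
    def __init__(self, name):
        self.name = name
        # Known cluster state: node name -> heartbeat counter.
        self.state = {name: 0}

    def beat(self):
        """Advance this node's own heartbeat."""
        self.state[self.name] += 1

    def gossip_with(self, peer):
        """Exchange state with a peer; both keep the newest entry per node."""
        merged = dict(self.state)
        for name, version in peer.state.items():
            merged[name] = max(version, merged.get(name, -1))
        self.state = dict(merged)
        peer.state = dict(merged)


def run_rounds(nodes, rounds, seed=0):
    """Each round, every node beats once and gossips with one random peer."""
    rng = random.Random(seed)
    for _ in range(rounds):
        for node in nodes:
            node.beat()
            peer = rng.choice([n for n in nodes if n is not node])
            node.gossip_with(peer)


nodes = [Node(f"node{i}") for i in range(1, 6)]
run_rounds(nodes, rounds=10)
print(nodes[0].state)
```

A node whose heartbeat stops advancing in everyone else's view can be suspected of being down, which is how a masterless cluster can notice failures without a central monitor.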
The next part is replication. Because replication is masterless, you control it yourself. Suppose your Apache Cassandra cluster has ten nodes. You decide how many of those nodes each piece of data is synced to. For example, if you have years' worth of data and want to store it across a cluster of ten machines, you can split it across the nodes as suits you and replicate each part: node 1 and node 2 might hold the January data, and node 3 and node 4 the February data. You configure this before you push the data. So even though the cluster has ten nodes, you define how many nodes a particular piece of data is replicated to.
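One simple way to picture a replication factor is to walk the ring clockwise from a key's position and take the next N distinct nodes. The sketch below assumes that placement rule, an MD5-based token function, and made-up keys such as "2023-01"; it does not show rack- or datacenter-aware placement.

```python
import bisect
import hashlib


def token(name: str) -> int:
    """Hypothetical token function: first 4 bytes of an MD5 digest."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")


def replicas(key: str, nodes: list, replication_factor: int) -> list:
    """Walk the ring clockwise from the key's token and collect RF nodes."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    ring = sorted(nodes, key=token)
    positions = [token(n) for n in ring]
    start = bisect.bisect_left(positions, token(key))
    return [ring[(start + i) % len(ring)] for i in range(replication_factor)]


cluster = [f"node{i}" for i in range(1, 11)]  # ten nodes, as in the example
print(replicas("2023-01", cluster, replication_factor=2))
print(replicas("2023-02", cluster, replication_factor=2))
```

With a replication factor of 2, each key lives on two of the ten nodes, so losing any single node still leaves one copy available.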
If you come to Cassandra from a relational database background, first check whether your requirements fit the Cassandra world. Cassandra can hold loosely structured data, but it does not support relationships between tables, and you cannot do a join. What it does offer is collections: a column in a column family (basically a table in Cassandra) can hold a collection of values under the same key. Because the data is stored as a whole, with no relationship concept, you need to structure your data so that the relationships are already resolved before you put it into Cassandra. This is the main piece of work when moving from a traditional database to Cassandra.
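Resolving relationships before the data is written might look like the following sketch. The users and orders rows, and the field names, are hypothetical; the point is only that the join is done once, up front, and the result is stored as one record per key with a collection inside it.

```python
from collections import defaultdict

# Hypothetical relational rows: two tables linked by user_id.
users = [
    {"user_id": 1, "name": "Ana"},
    {"user_id": 2, "name": "Ben"},
]
orders = [
    {"order_id": 10, "user_id": 1, "item": "keyboard"},
    {"order_id": 11, "user_id": 1, "item": "mouse"},
    {"order_id": 12, "user_id": 2, "item": "monitor"},
]


def denormalize(users, orders):
    """Resolve the join up front: one record per user with its orders inline."""
    items_by_user = defaultdict(list)
    for order in orders:
        items_by_user[order["user_id"]].append(order["item"])
    return {
        u["user_id"]: {"name": u["name"], "items": items_by_user[u["user_id"]]}
        for u in users
    }


rows = denormalize(users, orders)
print(rows[1])  # {'name': 'Ana', 'items': ['keyboard', 'mouse']}
```

Each resulting record can be read with a single lookup by key, which is the access pattern Cassandra is built for.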
Overview of Apache Cassandra on AWS
Apache Cassandra is a scalable, open source NoSQL database used for managing large amounts of data, whether structured, semi-structured, or unstructured. Before looking at best practices for running Apache Cassandra on AWS, we need to know some key concepts.
The largest unit of deployment in Cassandra is a cluster. Every cluster consists of nodes in one or more distributed locations (known as Availability Zones in AWS).
A distributed location contains a collection of nodes that are part of the cluster. When running Cassandra on AWS, you can use multiple Availability Zones to store the data in your cluster. You can also configure Cassandra to replicate the data across these Availability Zones, which keeps your database cluster available during an Availability Zone failure. The number of Availability Zones should be a multiple of the replication factor.
Low-latency links between Availability Zones further help to avoid replication latency.
A node is part of a single distributed location in a Cassandra cluster and stores partitions of data.
Every node in the cluster has a commit log, which is a write-ahead log.
A memtable is a write-back cache of rows of data that can be looked up by key. Each memtable stores data for a single table.
A sorted string table (SSTable) is a logical structure made up of multiple physical files on disk. An SSTable is created when a memtable is flushed to disk. An SSTable is an immutable data structure.
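How the commit log, memtable, and SSTable fit together on a write can be sketched with a toy model. The class name Table and the flush_threshold parameter are hypothetical, a flush here is triggered by row count rather than memory use, and clearing the whole commit log on flush is a simplification of what a real node does.

```python
class Table:
    """Toy write path: commit log append, memtable insert, flush to SSTable."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []      # append-only write-ahead log
        self.memtable = {}        # in-memory rows, looked up by key
        self.sstables = []        # immutable, sorted tuples of (key, value)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Turn the memtable into a sorted, immutable SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replay

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest first
            for k, v in sstable:
                if k == key:
                    return v
        return None


t = Table()
for k, v in [("b", 1), ("a", 2), ("c", 3), ("a", 4)]:
    t.write(k, v)
print(t.sstables)   # [(('a', 2), ('b', 1), ('c', 3))]
print(t.read("a"))  # 4
```

The older value of "a" stays in the SSTable untouched; the newer value in the memtable simply wins at read time, which is what immutability of SSTables implies.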
A Bloom filter is a data structure for testing set membership. It never produces a false negative, but it can produce false positives, and the false-positive rate can be tuned. Bloom filters are off-heap structures. If a Bloom filter reports that a key is not present in an SSTable, the key is definitely not present; if it reports that a key is present, the key may or may not actually be in the SSTable. Bloom filters also help to scale read requests in Cassandra.
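The "no false negatives, possible false positives" behavior is easy to see in a small Bloom filter. This is a generic sketch, not Cassandra's implementation: the bit-array size, the number of hash functions, and the salted SHA-256 hashing are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive num_hashes positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for key in ("alice", "bob"):
    bf.add(key)
print(bf.might_contain("alice"))  # True
print(bf.might_contain("carol"))  # usually False, but a false positive is possible
```

A read can skip any SSTable whose filter answers False, so only the SSTables that might hold the key need to be touched on disk.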
An index file maintains the offsets of keys into the main data file. By default, Cassandra keeps a sample of the index file in memory; this sample stores the offset of every 128th key in the main data file, and the interval is configurable.
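The effect of sampling every 128th key can be illustrated with a sorted list and a binary search. In this sketch the "offset" of a key is simply its position in the sorted key list, and the function names are hypothetical; the idea is that the in-memory sample narrows a lookup to at most 128 entries of the full index.

```python
import bisect

SAMPLE_INTERVAL = 128  # every 128th key is kept in memory, as described above


def build_index(keys):
    """Full index: sorted key -> offset in the data file (here, its position)."""
    return [(k, offset) for offset, k in enumerate(sorted(keys))]


def build_sample(index, interval=SAMPLE_INTERVAL):
    """Keep only every interval-th entry of the full index."""
    return index[::interval]


def lookup(key, index, sample, interval=SAMPLE_INTERVAL):
    """Use the sample to narrow the scan to at most `interval` index entries."""
    sample_keys = [k for k, _ in sample]
    i = bisect.bisect_right(sample_keys, key) - 1
    if i < 0:
        return None  # key sorts before the first indexed key
    start = i * interval
    for k, offset in index[start:start + interval]:
        if k == key:
            return offset
    return None


keys = [f"key{n:05d}" for n in range(1000)]
index = build_index(keys)
sample = build_sample(index)
print(len(sample))                         # 8
print(lookup("key00300", index, sample))   # 300
```

A smaller interval makes lookups touch fewer index entries at the cost of more memory for the sample, which is the trade-off behind making the value configurable.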
A keyspace is a logical container in the cluster that holds one or more tables. The replication strategy is defined at the keyspace level.
A table, also known as a column family, is a logical entity within a keyspace that consists of a collection of ordered columns. Every table definition requires a primary key.
Deploying Apache Cassandra Cluster on AWS
Step 1 – Install the AWS CLI (you need pip and a supported Python version)
Use the command –
$ pip install awscli --upgrade --user    # install the AWS CLI
Step 2 – Configure AWS CLI
The AWS CLI will ask for four pieces of information. The AWS Access Key ID and AWS Secret Access Key are your account credentials.
$ aws configure
AWS Access Key ID [None]: yourAWSAccessKeyID
AWS Secret Access Key [None]: yourAWSSecretAccessKey
Default region name [None]: us-east-2
Default output format [None]: json
Step 3 – Create an AMI. AMIs can be created using Packer and Ansible. To build a Packer image, go to the folder where the Packer file is saved and type the command –
$ packer build Packerfile
Step 4 – After creating the image, edit the CassandraCloudFormationTemplate file as follows –
In the Parameters section, set InstanceTypeParameter, CassandraImageIdParameter, and KeyNameParameter.
Be aware of one restriction: the subnets with the CIDRs mentioned in the Cassandra CloudFormation template file must not already exist in your infrastructure in that AWS region.
Step 5 – When all the changes are made in the CassandraCloudFormationTemplate file, type the command below to create the CloudFormation stack –
$ aws cloudformation create-stack --stack-name CassandraCloudFormationTemplate --template-body file:///pathToTheFolder/CassandraCloudFormationTemplate.txt
If you want to delete the stack and all its resources, type the command –
$ aws cloudformation delete-stack --stack-name CassandraCloudFormationTemplate
Step 6 – After the CloudFormation stack and all its resources have been created, the first thing to do is connect to your instances.
First connect to the bastion host (it has a public IP); from the bastion host you can then connect to your instances in the private subnets.
You can also use SSH agent forwarding, so that you do not have to save your private key on the bastion host.
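For SSH, follow these commands –
$ exec ssh-agent bash
$ ssh-add /pathToTheFolder/yourPrivateKey.pem
$ echo '[ -z "$SSH_AUTH_SOCK" ] && exec ssh-agent bash' >> ~/.bashrc
$ echo 'ssh-add /pathToTheFolder/yourPrivateKey.pem' >> ~/.bashrc
$ ssh -A firstname.lastname@example.org    # go to the bastion host
The two echo commands are optional: they make every new shell start an agent and load the key. The check on SSH_AUTH_SOCK keeps the shell from re-executing itself endlessly each time ~/.bashrc is read.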
From the bastion host, use the following command to connect to an instance –
$ ssh email@example.com
Replace the user and host in this command with the login user and private IP address of the Cassandra server you need (10.0.2.96, for example).
Step 7 – To list all available Cassandra nodes, use the command –
$ sudo nodetool status
This command can be run from any Cassandra node.
A Data Management Strategy
Adopting Apache Cassandra helps enterprises scale efficiently, because Cassandra provides a decentralized, scalable platform for configuring and managing data.