Installing Apache Cassandra on Mac: Local Install of Single Node Instance

http://cassandra.apache.org/

I have always been a big fan of disorder, what the scientific community refers to as entropy. We are all aware that all things come to an inevitable end. Physicists tell us that the entropy of the universe is continuously increasing, that our physical reality is on a one-way course from an ordered state to a disordered state.

This is especially true with machines. We Americans love our cars and we love the bill that comes with the inevitable wear and tear. Computers are no different and are just as susceptible to entropy’s influence.

Apache Cassandra is an open source distributed database management system developed with entropy in mind: things break and hardware sometimes fails.

Cassandra is built on a decentralized, shared-nothing architecture in which the nodes of a cluster are functionally identical commodity servers. Data is distributed and replicated across these autonomous nodes (no masters or slaves here), and with no single point of failure, the cluster handles vast amounts of data and scales linearly as nodes are added.

I wanted to try out a local installation of Cassandra so I could get experience entering data via the Cassandra Query Language (CQL), a query language analogous to SQL in the world of relational database management systems (RDBMS). If you’ve used SQL before, CQL should seem very similar.

Keep in mind, though, that there are fundamental architectural differences between Cassandra and an RDBMS like PostgreSQL: this is not a simple rows-and-columns arrangement. The focus of this article is to show you how to install Cassandra as a single node locally on OS X El Capitan.

Installation

Java 8 is a prerequisite for Cassandra 3.x (Java 7 is only sufficient for the older 2.x releases), so before we can proceed with the installation we will need to download and install the Java Development Kit.

Next we need to create a directory to keep Cassandra.

mkdir -p ~/opt/packages/cassandra/

Change into the directory you created (cd ~/opt/packages/cassandra/) and download Cassandra 3.5. I used GNU Wget but the curl command works as well.

curl -O http://www-us.apache.org/dist/cassandra/3.5/apache-cassandra-3.5-bin.tar.gz

Now extract the tar archive.

tar xzvf apache-cassandra-3.5-bin.tar.gz

In case we need to upgrade Cassandra in the future, let’s create a symbolic link to the Cassandra 3.5 directory. This will keep us from having to change environment variables later on.

ln -s ~/opt/packages/cassandra/apache-cassandra-3.5 ~/opt/cassandra

We want the ability to execute Cassandra commands from any directory so we will add Cassandra to our system PATH. Open your bash profile with your preferred text editor. I am using Sublime Text.

subl ~/.bash_profile

Now add the following to the bash profile file.

# Cassandra
if [ -d "$HOME/opt/cassandra" ]; then
    export PATH="$PATH:$HOME/opt/cassandra/bin"
fi

Source your bash profile.

source ~/.bash_profile

Verify the Cassandra installation.

cassandra -v
# expected output:
# 3.5

Now we will start the Cassandra server. We will run it in the foreground with the -f flag rather than as a daemon, which lets us see its output in the terminal.

cassandra -f

If all goes well you should see something like this:

Screenshot of Cassandra single node initialization on localhost

How It Works

The architectural details of Cassandra lie beyond the scope of this article but a brief overview is warranted. At a level comparable with the height of Mt. Everest, a Cassandra cluster consists of a decentralized network of autonomous nodes (commodity servers) arranged in a ring-like network topology. A client connects to the cluster by connecting to any node in the ring and interfaces with that node via CQL. The node the client connects to is called the coordinator, and it assumes the responsibility of satisfying the client’s request.

Cassandra is a distributed database system and relies on data partitioning to distribute the data evenly amongst the nodes in the cluster. Partitioning alone would leave each piece of data on a single node of the shared-nothing architecture, so to avoid that single point of failure Cassandra employs data replication, storing replicas (copies of the data) on multiple participating nodes. This contributes to the high availability of the database system.
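To make partitioning and replication a little more concrete, here is a minimal Python sketch of a token ring. It is an illustration under stated assumptions, not Cassandra's implementation: the node names, the `RING_SIZE` constant and the helper functions are hypothetical, MD5 stands in for Cassandra's real partitioner, and each node gets exactly one token. Replicas are placed on the next nodes clockwise around the ring, which is roughly how the SimpleStrategy used later in this article behaves.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # hypothetical token space, chosen only for this sketch


def token_for(key: str) -> int:
    """Hash a key onto the ring (MD5 is a stand-in for the real partitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


def build_ring(nodes):
    """Give each node one token and return (token, node) pairs sorted by token."""
    return sorted((token_for(node), node) for node in nodes)


def replicas_for(key, ring, replication_factor):
    """Return the nodes that store `key`.

    The first replica is the first node whose token is at or after the key's
    token (wrapping around the ring); the rest are the next nodes clockwise.
    """
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


ring = build_ring(["node-a", "node-b", "node-c", "node-d"])
print(replicas_for("1", ring, replication_factor=3))
```

With a single local node, as in this tutorial, every key lands on that one node no matter what the replication factor asks for.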

The coordinator will inform the client of a successful read or write operation in accordance with the established consistency level. This parameter lets users configure the number of replicas that must acknowledge a read or write operation before the client is informed of a successful operation. Put another way, it describes the number of replica nodes that must complete the operation before the client is told it succeeded.
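The arithmetic behind a few common consistency levels can be sketched as follows. The level names (ONE, TWO, THREE, QUORUM, ALL) are real Cassandra consistency levels, but the two helper functions are hypothetical and exist only to show how many acknowledgements a coordinator would wait for; they do not talk to a cluster.

```python
def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Number of replica acknowledgements needed for the given level."""
    level = consistency_level.upper()
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"level not covered by this sketch: {consistency_level}")


def operation_succeeds(acks_received: int, consistency_level: str,
                       replication_factor: int) -> bool:
    """True once enough replicas have acknowledged the operation."""
    return acks_received >= required_acks(consistency_level, replication_factor)


# With three replicas, QUORUM is satisfied by two acknowledgements,
# while ALL still waits for the third.
print(operation_succeeds(2, "QUORUM", 3), operation_succeeds(2, "ALL", 3))
```

On the single-node setup in this article the replication factor is 1, so ONE, QUORUM and ALL all come down to a single acknowledgement.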

What is interesting here is that an operation can be considered successful before the data has been written to every replica in the cluster. This matters for performance, because the client does not have to wait for the slowest node to respond.

Nodes in the Cassandra cluster rely on the Gossip Protocol to exchange information with each other. This protocol allows a node to learn the state of other nodes by exchanging what it knows about itself and about the nodes it has already heard from. A particular node does not exchange information directly with every other node in the cluster; it talks to a few nodes at a time, and over time the information propagates throughout the cluster, much as a virus spreads through a population.
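The virus-like spread can be illustrated with a toy simulation. This is a simplified push-style epidemic model written for this article, not Cassandra's actual gossip implementation: the function name, the `fanout` parameter and the fixed random seed are all assumptions made for the sketch.

```python
import random


def gossip_rounds(node_count: int, fanout: int = 2, seed: int = 0) -> int:
    """Count the rounds until every node has heard a piece of state.

    Node 0 starts out knowing the state. In each round, every informed node
    tells up to `fanout` randomly chosen peers.
    """
    rng = random.Random(seed)  # seeded so that repeated runs give the same result
    informed = {0}
    rounds = 0
    while len(informed) < node_count:
        rounds += 1
        for node in list(informed):
            peers = [n for n in range(node_count) if n != node]
            for peer in rng.sample(peers, min(fanout, len(peers))):
                informed.add(peer)
    return rounds


print(gossip_rounds(50))
```

Because the number of informed nodes can roughly triple each round with a fanout of two, the state reaches all 50 simulated nodes in a handful of rounds even though no node ever contacts more than two peers at a time.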

Cassandra Query Language

With the Cassandra server running, open a new terminal window and access the Cassandra Query Language shell by typing:

cqlsh

Create Keyspace via cqlsh

The first thing we will do is create a keyspace. A keyspace is a container for our application data; you could think of it as analogous to a schema in an RDBMS. Creating a keyspace requires specifying a replication strategy and a replication factor, which is the number of nodes that will each hold a replica of the data. Since we are running a single node, a replication factor of 1 is all we can use.

CREATE KEYSPACE test01
WITH REPLICATION = {
'class': 'SimpleStrategy',
'replication_factor': 1
};

To view all keyspaces use:

DESCRIBE KEYSPACES

Switch keyspace:

USE test01;

Create a table:

CREATE TABLE countries (
id INT PRIMARY KEY,
official_name TEXT,
capital_city TEXT
);

Insert a single row (because id is the primary key, this row forms its own partition):

INSERT INTO countries (id, official_name, capital_city) VALUES (1, 'Islamic Republic of Afghanistan', 'Kabul');

Query the data you just entered:

SELECT * FROM countries WHERE id = 1;

Table shown for query

And that should have you running Cassandra on your Mac and entering data into the system. Expect more tutorials to come as I explore Python frameworks and more database systems.
