A Brief Introduction to Cassandra

Cuicui Feng
Nov 7 · 5 min read

As I have gradually learned more about big data, the differences among NoSQL databases have become clearer. There are key-value, column-based, and graph databases, each with a different data structure, and they can be applied to websites and companies’ applications. As for specific tools, from Hadoop to MongoDB, from Hive to Pig, each has its advantages and disadvantages.

I first came across Cassandra in my class, Big Data Architecture & Governance. The professor asked everyone to learn one more big data database and present it in class. When I heard the name Cassandra, my first thought was the story from Greek mythology: Cassandra was a woman cursed to utter true prophecies but never to be believed. I think Apache Cassandra’s name was also inspired by this beautiful mystic seer.

Cassandra is a popular open-source NoSQL database built as a column-family data store. It is used successfully in a variety of contexts such as analytics, time series analysis, monitoring, retail, and e-commerce. I was impressed by how much Cassandra can do.

I will briefly introduce Cassandra in the following five parts.

Part1: Overview

Part2: Characteristics (Key Features) of Cassandra

Part3: The Architecture of Cassandra

Part4: Data Structure of Cassandra

Part5: Data Flow: The Read and Write Process


Part1: Overview

website: http://cassandra.apache.org

When we begin to learn a new programming language or an advanced tool, the most useful resource is its official website. There you can not only download the software and read the documentation but also learn about its most important features and real-world applications.

This is the official introduction to Cassandra on the website.

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

When you read this brief introduction, you can extract three key phrases from it: scalability, high availability, and lower latency. These are Cassandra's obvious advantages compared with some other databases. Let me explain each of these features.


Part2: Key Features of Cassandra

How does Cassandra achieve its scalability? The answer is its structure. Because of its multi-master architecture, Cassandra scales linearly. This means that if you double the number of nodes in a cluster, it can handle roughly twice the writes. I will give more details in the next part.
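One way to picture why adding nodes adds capacity is a toy hash ring, sketched here in Python. This is not Cassandra's actual partitioner: the function names, the MD5-based token, and the choice of 64 virtual nodes per node are all assumptions made for illustration. The sketch only shows that when the node count doubles, each node ends up owning roughly half as many keys.

```python
import bisect
import hashlib
from collections import Counter


def token(value: str) -> int:
    # Stable 64-bit token taken from an MD5 digest (a toy stand-in for a partitioner).
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def build_ring(nodes, vnodes=64):
    # Each node owns several positions ("virtual nodes") on the ring.
    ring = sorted((token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
    return [t for t, _ in ring], [n for _, n in ring]


def owner(ring, key):
    # The first ring position at or after the key's token owns the key (wrapping around).
    tokens, owners = ring
    index = bisect.bisect_left(tokens, token(key)) % len(tokens)
    return owners[index]


def load_per_node(num_nodes, num_keys=10_000):
    # Count how many of the keys land on each node.
    ring = build_ring([f"node{i}" for i in range(num_nodes)])
    return Counter(owner(ring, f"key{k}") for k in range(num_keys))


load4 = load_per_node(4)
load8 = load_per_node(8)
print("busiest node with 4 nodes:", max(load4.values()))
print("busiest node with 8 nodes:", max(load8.values()))
```

Because every node owns only its own slice of the keys, no single node has to see all of the writes, which is the intuition behind the "double the nodes, double the writes" claim.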

Some examples illustrate its scalability. The largest production deployments include Apple’s, with over 75,000 nodes storing over 10 PB of data; Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day); the Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day); and eBay (over 100 nodes, 250 TB).

How does Cassandra achieve its availability? The answer is also its structure. In some other databases, such as MongoDB, when the master node goes down the database stops taking new writes until the rest of the nodes choose a new master. In Cassandra, by contrast, if one node goes down, writes are redirected to other nodes and the system continues to operate. This means you do not have to worry about a single node failing.
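The idea of redirecting writes can be sketched with a small Python model. Everything here is a simplifying assumption rather than Cassandra's real protocol: the placement rule in `replicas_for`, the replication factor of 3, and the rule that 2 live replicas are enough are invented for the example. It shows a write still succeeding while one of the replicas for a key is offline.

```python
def replicas_for(key, nodes, rf=3):
    # Toy placement: start at a position derived from the key, then take the next rf nodes.
    start = sum(key.encode()) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


def write(key, value, nodes, down, stores, rf=3, required=2):
    # Send the write to every live replica; succeed if enough of them accept it.
    live = [n for n in replicas_for(key, nodes, rf) if n not in down]
    if len(live) < required:
        raise RuntimeError(f"only {len(live)} of {rf} replicas are reachable")
    for node in live:
        stores[node][key] = value
    return live


nodes = ["n1", "n2", "n3", "n4", "n5"]
stores = {n: {} for n in nodes}               # one in-memory dict per node
failed = replicas_for("user:42", nodes)[0]    # take one replica offline
acked = write("user:42", "Alice", nodes, down={failed}, stores=stores)
print(f"{failed} is down, write accepted by {acked}")
```

There is no special node whose failure blocks the write: any replica can be the one that is missing.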

Cassandra is also well known for its impressive performance in both reading and writing data. Data is written to Cassandra in a way that provides both full data durability and high performance, which results in low latency.

With such high performance, it is worth getting to know its architecture.


Part3: The Architecture of Cassandra

Maybe you already know the basic architecture of MongoDB or some other databases. They have one master with several slaves. All writes go to the master node, and reads can be executed on the slaves.

Cassandra is totally different. It has a masterless architecture, which you can also think of as multi-master: every node can accept reads and writes.

As a result, it is very hard to bring a whole Cassandra cluster down. When one node goes down, the others continue to work normally.


Part4: Data Structure of Cassandra

The storage layouts we usually know are the row-oriented store and the column-oriented store. Let me take an example. This is some info we need to store.

And this is how each of these layouts stores the data.

The storage method in Cassandra combines the two structures.

Let’s consider a scenario where we want to store temperature values. In a row-based store we would typically create a table, temperatures, with two columns: (1) timestamp and (2) value.
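The row-based temperatures table can be sketched with plain Python structures. The sample readings and the `sensor-1` row key are hypothetical, invented for illustration. The second layout is an assumption about how a column-family store could hold the same data (one row key, with each timestamp acting as a column name); it is not taken from the text above.

```python
# Hypothetical sample readings, invented for illustration.
rows = [
    {"timestamp": "2019-11-07T10:00", "value": 21.5},
    {"timestamp": "2019-11-07T10:05", "value": 21.7},
    {"timestamp": "2019-11-07T10:10", "value": 21.4},
]

# Row-based layout: every reading is its own record with the same two columns.
for row in rows:
    print(row["timestamp"], row["value"])

# Assumed alternative: one row key, with each timestamp acting as a column name.
wide_row = {"sensor-1": {row["timestamp"]: row["value"] for row in rows}}
print(len(wide_row["sensor-1"]), "columns in one row")
```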

Now consider another example where we need to store various attributes of fruits. The following is a perfectly valid data model in Cassandra.
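A rough way to picture such a model is a nested dictionary in which each row carries its own set of columns. The fruit names, attributes, and helper functions here are hypothetical, and the sketch imitates the column-family idea in plain Python rather than showing real Cassandra syntax.

```python
# Hypothetical fruit data: each row key maps to its own set of columns.
fruits = {
    "apple": {"color": "red", "weight_g": 150},
    "banana": {"color": "yellow", "length_cm": 18},
    "grape": {"color": "purple", "seedless": True, "bunch_size": 40},
}


def columns_of(row_key):
    # Rows are not forced to share the same columns, so they are listed per row.
    return sorted(fruits[row_key])


def get(row_key, column, default=None):
    # A missing column is simply absent; nothing is stored for it.
    return fruits[row_key].get(column, default)


print(columns_of("apple"))
print(columns_of("grape"))
print(get("apple", "length_cm"))
```

The point of the sketch is that rows in the same table do not need identical columns, so no space is spent on attributes a fruit does not have.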


Part5: Data Flow: The Read and Write Process

When you write data to Cassandra, it is first written to the commit log. At the same time, it is sent to the memtable, which lives in memory. Once a large amount of data has accumulated in the memtable, it is flushed to a special on-disk component called an SSTable.
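The write path described above can be modelled in a few lines of Python. This is a toy sketch under stated assumptions: the `TinyStore` class, its in-memory lists, and the flush threshold of 3 entries are invented for illustration, and a real commit log and SSTable live on disk.

```python
class TinyStore:
    """Toy model of the write path: commit log, memtable, then SSTables."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only record of every write (for durability)
        self.memtable = {}     # in-memory table holding the most recent writes
        self.sstables = []     # flushed tables, oldest first; never modified afterwards
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. record the write in the log
        self.memtable[key] = value             # 2. update the in-memory table
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                       # 3. spill to an SSTable when full

    def flush(self):
        # Freeze the memtable as a key-sorted, read-only table and start a new one.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}


store = TinyStore(flush_threshold=3)
for i in range(4):
    store.write(f"k{i}", i)
print(len(store.commit_log), len(store.sstables), store.memtable)
```

After four writes the first three keys sit in one flushed table and the fourth is still in the memtable, while the commit log has recorded all four.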

Reading data from Cassandra involves a number of processes that can include various memory caches and other mechanisms designed to produce fast read response times.
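A simplified view of such a read can be sketched in Python. The lookup order used here (cache first, then the memtable, then SSTables from newest to oldest) and the `read` function itself are assumptions made for illustration; the sketch leaves out whatever other mechanisms a real read involves.

```python
def read(key, cache, memtable, sstables):
    # 1. A cached result is returned immediately.
    if key in cache:
        return cache[key]
    # 2. The memtable holds the most recent writes.
    if key in memtable:
        value = memtable[key]
    else:
        # 3. Otherwise search the SSTables from newest to oldest.
        value = next((t[key] for t in reversed(sstables) if key in t), None)
    if value is not None:
        cache[key] = value   # remember the answer for the next read
    return value


cache = {}
memtable = {"k3": "v3"}
sstables = [{"k1": "old"}, {"k1": "new", "k2": "v2"}]   # oldest first
print(read("k1", cache, memtable, sstables))
print(read("k3", cache, memtable, sstables))
```

Searching newest first means the most recently flushed value for a key wins, and the cache lets a repeated read skip the tables entirely.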


I also drew some conclusions about its functional and non-functional requirements.

This is my first time posting an article on Medium. If you find any problems or issues, please let me know.

