Beginners guide to learn Cassandra-Part 2: Query First Approach

Ayman Nait Cherif
6 min readMay 6, 2020

--

This is the second part of the series, we’ll talk about column-oriented databases and Cassandra Query First Approach

Cassandra logo

Column Oriented Database

While a relational database is optimized for storing rows of data, typically for transnational applications, a columnar database is optimized for fast retrieval of columns of data, typically in analytical applications. Column-oriented storage for database tables is an important factor in analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need to load from disk.

Cassandra is an open-source, column-oriented database designed to handle large amounts of data across many commodity servers. Unlike a table in a relational database, different rows in the same table (column family) do not have to share the same set of columns.

Row vs Column storage

Example

https://youtu.be/s1xc1HVsRk0?list=PLalrWAGybpB-L1PGA-NfFu2uiWHEsdscD

This is an employee table, we can see that there are several columns in the database, there’s an ID a name and age a gender and a car, while traditional row-oriented databases will store the data in rows, column-oriented databases tend to store each column in a separate file on a disk, so that gives the advantage of accessing the columns that we’re looking for, for a particular query, so if we’re only interested in the name and age columns we only have to access those two columns, in a relational database if we want to access name we have to go through every row, even through data that we’re not concerned with.

We can see that in a column-oriented database it’s like a two-dimensional key-value pair so each row has its ID and then column within the row has a key. the column-oriented database helps to compact the data, so if we look at the gender column we can see there’s a lot of data duplication.

What column orientated databases sometimes do is that we can replace the ‘{Gender: M}’ with something like the ‘{Gender M}*4’ so this stands for that the next four rows including this row will be ‘{Gender: M}’ and we can do the same for the female, this process means we’re storing less data on disk and the database is overall more efficient.

Another interesting thing is that a column-oriented database allows for certain rows to not contain data from a particular column, in our case the car some value on car column doesn’t exist, in a relational database, these would have to be filled with “null” or some sort of void statement which would mean that we’re storing something on disk and filling up room on the disk with basically nothing.

Cassandra Query First Approach

In the traditional database, there are usually database tables that represent entities in the real world.

in this example, we have an employee table and a company car table, that’s called relational modeling where every table in your database represents a real-world entity and we tie these tables together using relationships, the ID in the Employee table is the primary key which means that uniquely represents an employee, the “ CompanyCarId ” here is a reference to the “Company Car” table, so the company car ID in the row one means that Elvis Presley drives a BMW 5-series.

That’s what’s known as a secondary index or foreign key, and one of the most common features of a relational database that we use joins in our queries.

Joins are good when our database is running on a single machine but in a distributed database where our data could be lying across multiple machines joins are not very efficient because a single transaction or single request on the database could end up tying up multiple machines on our cluster at the same time and we wouldn’t be able to serve as very many requests.

Cassandra gets a lot of its speed in database reads and writes from the fact that it never has to perform any joins on the database in fact in Cassandra it’s impossible to perform joins, so instead of taking a relational model approach we take a query first approach, this means we design our tables for a specific query, as a result, we want to have every table which is catered for a specific query, rather than flexible tables such as this employee table and the company car table, hence, we only ever have to query one table when we’re reading or writing data, obviously there are some consequences to this approach, as we might end up writing the same data to multiple tables just to satisfy different queries but it’s ideal for a Cassandra’s distributed architecture.

we need to be careful when we’re modeling our data in Cassandra to make sure the tables we have set up will satisfy the queries that we think we’re gonna need, this seems like a bad thing but in a lot of databases that are under high load we end up doing this anyway in a process called denormalization.

if we take a look at the query first approach of what these tables would look like in a relational database

we can see here that we now have two different tables reflecting the two queries we want to make, the get employee by car make and get company car by ID, this will allow us to quickly get all the car makes for the employees that drive that particular car make by only querying one table, in a Big Data world this can lead to great performance gains.

Finally, we can look at how these tables might be implemented in a Cassandra database

we can see that the two tables are supported and the table name is the same as in the query first approach in a relational database however we can now see that we’ve taken a columnar database approach where every column has its key and its value for each row, we can also see that every row has its partition key in Cassandra and that’s something we will look at next article, it’s also important to note that we can make some savings in data storage, for instance, we can reduce and compact the fields, we can also have empty columns for some rows, some people might not have a surname or a salary and that will be supported in our database design.

this saves us some space in our database and can lead to more efficiency, so even though often we duplicate data by different tables performing different queries on the same data we can still save some room.

Conclusion

some of the most important things to remember when beginning with Cassandra that we should always take a query first approach to our database design, our tables should reflect the queries we are trying to make. It’s also important to note that if we need to perform queries that we haven’t supported in our data model, it may be hard or underperforming to do so.

However, even with the apparent downsides of the Apache Cassandra database model, it can lead to huge gains in performance and storage space in a distributed system, while a relational database might be a better choice for small-scale data in a machine or on a single data center.

Apache Cassandra comes into its own when we need to store and write a huge amount of data and when we’ve got a lot of clients.

In the next article, we’ll see how Cassandra stores data on its nodes and how it decides which data is stored in which node

--

--