Out of the many training days I get at the office to learn anything in the world, I chose to learn something related to work. I settled on Cassandra, which was a revelation to me, coming from a strong SQL background.
I come from a world of data warehousing with ACID-compliant databases. When our team started implementing Cassandra to accommodate high-volume analytics, I wanted to get a taste of it by working through the tutorials on DataStax.
There are several courses on DataStax, of which I took up “DS220: Data Modeling”, covering conceptual data modeling techniques, principles and methodology, design techniques, and optimisation.
The course is stated to take 12 hours. In the two days I wanted to spend on it, I could cover about 75% of the course, thanks to NCR Edinburgh giving me dedicated days to take up this learning.
The tutorial was quite easy to get started with, as it presented simple use cases and provided ready-to-use VM images: start one up and write your CQL (Cassandra Query Language). As always, going through some Stack Overflow pages and the official documentation helped.
In this post are some of the things I found most interesting during my learning.
In Cassandra, a primary key is composed of a Partition Key and Clustering Columns.
The partition key is the column (or set of columns) that determines which partition a row is stored in. Every table must have a partition key.
Clustering columns define the order in which records are stored within a partition. This makes querying a range of rows in a partition efficient.
If a primary key contains only a partition key, then essentially there is one record per partition.
- Primary key (a, b, c) → a is the partition key, and columns b, c are clustering columns.
- Primary key (a) → a is the partition key, and there are no clustering columns.
- Primary key ((a, b), c, d) → a, b together form a composite partition key, and columns c, d are clustering columns.
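These variations can be sketched in CQL; the table and column names below are just illustrative:

```sql
-- Single-column partition key a, with clustering columns b and c
CREATE TABLE example_simple (
    a int,
    b int,
    c int,
    data text,
    PRIMARY KEY (a, b, c)
);

-- Composite partition key (a, b), with clustering columns c and d
CREATE TABLE example_composite (
    a int,
    b int,
    c int,
    d int,
    data text,
    PRIMARY KEY ((a, b), c, d)
);
```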
Only columns that do not participate in the primary key can be marked static. These columns are stored only once per partition.
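A minimal sketch of a static column; the table and column names are illustrative:

```sql
-- 'email' is declared STATIC, so it is stored once per partition
-- (per user_id), not once per row (per order_id).
CREATE TABLE orders_by_user (
    user_id  int,
    order_id int,
    email    text STATIC,
    amount   decimal,
    PRIMARY KEY (user_id, order_id)
);
```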
In a typical RDBMS, rows arriving over time are stored sequentially based on the clustering index, which helps to efficiently query such records in that order.
This may be efficient, but it still means reading the information row after row. The concept of wide rows in Cassandra instead stores all of these events in a single partition, as an ordered sequence of cells.
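As a sketch, here is a time-series table in which all events for one sensor land in a single wide partition, ordered by timestamp (the schema is illustrative):

```sql
CREATE TABLE events_by_sensor (
    sensor_id  text,
    event_time timestamp,
    reading    double,
    PRIMARY KEY (sensor_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- An efficient range query within one partition:
SELECT * FROM events_by_sensor
WHERE sensor_id = 's42'
  AND event_time >= '2019-01-01' AND event_time < '2019-02-01';
```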
In simple terms, this is the process of keeping the right versions of a record and removing stale rows from a partition. There are several reasons for a record to become stale.
- It depends on the time-to-live (TTL) property set for a record. If the record outlives its TTL, it is marked with a tombstone.
- If there are multiple versions of a record in a partition, the latest version is identified and kept in storage. The rest of the rows are discarded during compaction.
Cassandra does not overwrite in place; it stores every new record being inserted, and it does not check for duplicates at insert time.
During compaction, the older records get deleted from Cassandra. When reading a partition for a primary key, all versions are retrieved and only the one with the latest timestamp is returned.
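The upsert and TTL behaviour can be seen with a few statements (the table is hypothetical):

```sql
-- Both statements target the same primary key, so the second
-- simply becomes the newest version; a read returns only it.
INSERT INTO users (user_id, name) VALUES (1, 'Alice');
INSERT INTO users (user_id, name) VALUES (1, 'Alicia');

-- This row is marked with a tombstone one hour after insert:
INSERT INTO users (user_id, name) VALUES (2, 'Bob') USING TTL 3600;
```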
Cassandra doesn’t provide any option to join tables. Instead, when modeling, a table is created for every query pattern, and data is written to multiple denormalized tables based on the data modeling outcomes of those query patterns.
Initially it is difficult to come to terms with the fact that there are no joins between tables; it is unsettling for a developer coming from an ever-ACID database. The entire mindset changes from ACID properties to the CAP theorem.
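As a sketch of “one table per query pattern”, the same video data might be written twice, once per access pattern (the schema is illustrative):

```sql
-- Query pattern 1: find videos uploaded by a given user
CREATE TABLE videos_by_user (
    user_id  int,
    video_id timeuuid,
    title    text,
    PRIMARY KEY (user_id, video_id)
);

-- Query pattern 2: find videos with a given tag
CREATE TABLE videos_by_tag (
    tag      text,
    video_id timeuuid,
    title    text,
    PRIMARY KEY (tag, video_id)
);

-- The application writes each new video to both tables.
```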
It’s very much like standard SQL syntax. However, there are some additional constructs and limitations.
One particular limitation to consider is the WHERE clause, which can only contain columns from the primary key. If other columns need to be used for querying, then a secondary index must be created.
The columns in the primary key help identify the location of rows in the cluster.
If a query needs an equality clause on non-primary-key columns, then we can use the option ALLOW FILTERING; however, this way the entire cluster is searched for the rows. Using this option can lead to unpredictable performance hits.
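A sketch of the two options for querying a non-primary-key column (the table and column names are hypothetical):

```sql
-- Option 1: create a secondary index, then use a normal WHERE clause
CREATE INDEX ON users (country);
SELECT * FROM users WHERE country = 'UK';

-- Option 2: no index; this forces a scan across the cluster
SELECT * FROM users WHERE country = 'UK' ALLOW FILTERING;
```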
Further to wide rows, understanding the concept of 2 billion columns has been an uphill task for me. This is not literally 2 billion columns: Cassandra can have up to 2 billion cells in a partition. If there is only one record in a partition, then in that case 2 billion columns are possible.
RDBMS Data Modeling to Cassandra Data Modeling
In a traditional data warehouse, there are multiple layers, each at its own level of normalisation. Reporting schemas are generally in a de-normalised form, which is efficient for reporting tools to work on.
In such a de-normalised form, a reporting schema is designed either as a Star or a Snowflake design.
Designing such a schema allows for ad-hoc querying and lets reporting tools leverage that flexibility.
In Cassandra, tables are designed based on the queries that will be used. Access patterns are determined first, and these feed the design of the tables.
For example, if a query filters on a column, then that column must be part of the primary key. Essentially, it’s not flexible for ad-hoc querying.
Data Modeling in Cassandra
This method is much like the methodology used in data warehouse design: it moves from a conceptual data model to a logical data model and then a physical data model.
- The conceptual model describes the entities and relations, and results in an ER diagram.
- The logical data model describes the various access queries and leads to the creation of a Chebotko diagram. Several mapping rules help in creating the diagram, which is the table design.
- The physical data model describes the various aspects of creating the tables and optimising them according to the logical model.
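As a sketch of the logical-to-physical step, an access query such as “find comments for a video, newest first” maps to a table whose key mirrors the query (the names below are illustrative, not from the course):

```sql
-- Access pattern Q1: comments for a given video, newest first.
-- Equality column (video_id) → partition key;
-- ordering column (comment_time) → clustering column.
CREATE TABLE comments_by_video (
    video_id     timeuuid,
    comment_time timestamp,
    user_id      int,
    comment      text,
    PRIMARY KEY (video_id, comment_time)
) WITH CLUSTERING ORDER BY (comment_time DESC);
```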
These are just a few concepts I could explore during my training days. My final thoughts on learning Cassandra:
- It’s a database geared very specifically to handling high volumes, with a lot of restrictions on querying.
- For creating reports based on fixed queries, it’s a good choice of database.
- Tutorial by DataStax: https://academy.datastax.com/resources/ds220-data-modeling
- The official documentation: https://wiki.apache.org/cassandra/