Cassandra Short Note and Polyglot Persistence Experiment
Cassandra (NoSQL database) supports scaling out/horizontal scaling.
The scaling is linear. So what is linear scaling?
If I have 2 servers and each can handle 1000 operations, then after adding 2 more servers the cluster can handle a total of 4000 operations.
That is (2 + 2) * 1000 = 4000; capacity grows in proportion to the number of servers.
The benefit of horizontal scaling is you can make the overall system more powerful by combining the power of several cheap commodity servers.
The opposite of this is vertical scaling/scale up — we usually do that with traditional database management systems like MSSQL. We try to make the database servers more powerful by adding more CPU, RAM etc. This is in general an expensive option.
- To add a new node we do not need to take the system down.
- Continuous availability
- No master slave concept — so no single point of failure.
- Natively distributed — the sharding and replication options are built in.
- Cassandra supports write scalability and reasonable read scalability.
- Cassandra can handle thousands of concurrent requests.
Language Drivers: available in popular languages.
Big data - high volume alone does not mean we are dealing with big data. For big data, velocity, variety and volume are all high.
In CAP terms (Consistency, Availability and Partition Tolerance), Cassandra ideally ensures AP.
Tunable consistency - as you know, in a distributed system that has to tolerate partitions, if we want to ensure the A of CAP then we cannot also ensure the C. And if C is important then we have to let go of some of the A. Cassandra lets us tune this tradeoff.
However the system is eventually consistent — given enough time the change will propagate to all servers where the data is replicated.
Cassandra use cases: durable shopping cart, IoT, analytics, mobile phone and messaging apps, user activity tracking, product catalog.
The company Change.org is a good candidate to use Cassandra - petitions can be seen as a product catalog and user votes can be seen as user activity.
Wide column - when a row contains many columns (a row is identified by the same key and may spread over multiple column families).
Narrow column - when a row contains only a few columns.
Data that we are writing goes to two places at first: the disk-based ‘commit log’ and the in-memory Memtable. From the Memtable it goes to an SSTable (which is immutable and on disk).
Compaction - merging multiple SSTables into one.
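The write path above can be sketched as a toy in-memory model. This is only an illustration under simplifying assumptions: the class name TinyLsmStore, the memtable_limit parameter and the use of Python lists and tuples in place of real files are all made up for the example, and none of it is Cassandra's actual implementation.

```python
class TinyLsmStore:
    """Toy model of the write path: commit log, then memtable, then SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for the append-only commit log on disk
        self.memtable = {}     # in-memory and mutable
        self.sstables = []     # each flushed table is an immutable, sorted tuple
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. record the write for durability
        self.memtable[key] = value             # 2. apply it to the in-memory table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a new immutable SSTable."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def compact(self):
        """Merge all SSTables into one; newer tables win on duplicate keys."""
        merged = {}
        for table in self.sstables:            # oldest first, later ones overwrite
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))] if merged else []

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # check the newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None
```

Note how a key that is written twice ends up in two SSTables until compaction merges them into one.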
For reads Cassandra uses a Bloom filter (it is in memory). The Bloom filter tells us whether an SSTable (remember it is on the disk) may contain the requested data. A ‘yes’ from the Bloom filter is a probabilistic yes, but a ‘no’ from it is deterministic and dependable, so the disk read can be skipped.
Sharding/horizontal partitioning as well as replication are baked into Cassandra.
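A minimal Bloom filter sketch shows why the ‘no’ is dependable and the ‘yes’ is only probable. The sizes, the class name and the use of SHA-256 with a seed prefix are assumptions picked for readability; Cassandra's own filter uses different hashing and sizing.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus several hash positions per key."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means the key was never added (no false negatives).
        # True means "probably": other keys may have set the same bits.
        return all(self.bits[pos] for pos in self._positions(key))
```

If might_contain returns False for an SSTable's filter, that SSTable can be skipped without touching the disk.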
The partitioner saves data in a way that avoids hotspots (congestion): the data is uniformly spread or randomly spread (one does not imply the other), but the placement is predictable and reproducible for retrieval. Obviously the partitioner has to depend on a good hash algorithm. But with data dispersed this way, a range query is not easy to handle.
So we have a tradeoff - we are avoiding hotspots at the cost of not being able to have an efficient range query at the database level. Essentially we have to gather all rows from different places and then combine them.
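The tradeoff can be seen in a small token ring sketch. It assumes one token per node, derived by hashing the node name with SHA-256; both choices, like the names TokenRing and nodes_for_range, are hypothetical simplifications rather than what Cassandra's partitioners actually do.

```python
import bisect
import hashlib


def token_for(key):
    """Map a key to a token; the same key always yields the same token."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    def __init__(self, node_names):
        # Each node owns the arc of the ring that ends at its own token.
        self.ring = sorted((token_for(name), name) for name in node_names)
        self.tokens = [token for token, _ in self.ring]

    def node_for(self, key):
        # First node whose token is >= the key's token, wrapping around the ring.
        idx = bisect.bisect_left(self.tokens, token_for(key)) % len(self.ring)
        return self.ring[idx][1]

    def nodes_for_range(self, keys):
        # A "range" of adjacent keys usually maps to scattered nodes,
        # so all of these nodes have to be asked and the results combined.
        return {self.node_for(key) for key in keys}
```

Placement depends only on the hash, so any client can recompute where a key lives, but keys that are neighbours in sort order are generally not neighbours on the ring.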
A Cassandra keyspace is equivalent to the database or schema concept we know from the world of relational databases.
A column family is the table equivalent. We have to define it in advance, but not every row needs to have all the columns from the column family. The columns can be sparse.
Replication is configured at the database/keyspace level, which implies that different keyspaces may have different replication factors.
The same data can be in multiple data centers (DCs). From one DC we can support the base system operations, from another we can handle analytics - to make sure one type of workload is not blocking the other.
Cassandra does not have a concept of join and it encourages denormalized and query based system design.
Compound Primary Key - a partition key (like ssn) plus clustering columns (the fields that tell rows apart when more than one row is available for the same partition key).
Cassandra supports Lightweight Transactions, for example an insert with IF NOT EXISTS (or an update with IF EXISTS).
To deal with confusion during an update (note: when no pre-existing data with the key exists then an update acts like an insert) we can introduce the concept of a row version number.
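A compound primary key can be pictured as a two-level lookup: the partition key picks the partition, and the clustering key orders the rows inside it. The sketch below is a toy model; the Table class and the idea of using a visit date as the clustering key are assumptions made for illustration.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy table with a compound primary key: (partition_key, clustering_key)."""

    def __init__(self):
        # partition key maps to a list of (clustering_key, row), sorted by clustering key
        self.partitions = defaultdict(list)

    def upsert(self, partition_key, clustering_key, row):
        rows = self.partitions[partition_key]
        keys = [ck for ck, _ in rows]
        idx = bisect.bisect_left(keys, clustering_key)
        if idx < len(rows) and rows[idx][0] == clustering_key:
            rows[idx] = (clustering_key, row)      # same full primary key: overwrite
        else:
            rows.insert(idx, (clustering_key, row))

    def read_partition(self, partition_key):
        # All rows of one partition, already in clustering order.
        return list(self.partitions.get(partition_key, []))
```

All rows that share a partition key live together, which is why a query that names the partition key is cheap.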
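One way to picture the row version idea is a compare-and-set: a write only succeeds if the caller saw the latest version. The sketch below is an in-memory stand-in with hypothetical names (VersionedStore, VersionConflict); in Cassandra itself a similar check could be expressed with a conditional (lightweight transaction) update on a version column.

```python
class VersionConflict(Exception):
    """Raised when a writer's expected version is stale."""


class VersionedStore:
    def __init__(self):
        self.rows = {}   # key maps to (version, value)

    def read(self, key):
        return self.rows.get(key)   # None, or a (version, value) pair

    def write(self, key, value, expected_version):
        current = self.rows.get(key)
        current_version = current[0] if current else 0   # 0 means "no row yet"
        if expected_version != current_version:
            raise VersionConflict(
                f"{key!r}: expected version {expected_version}, found {current_version}"
            )
        self.rows[key] = (current_version + 1, value)
        return current_version + 1
```

Writing with expected_version=0 behaves like an insert that must not clobber an existing row; any other value behaves like an update of a row the caller has already read.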
Transaction — not ACID but AID.
Atomicity is at the level of one partition key.
There is no concept of Primary Key (PK) and Foreign Key (FK) based referential integrity in Cassandra. So referential consistency is not guaranteed by the database. We have to deal with that at the application level.
A cluster is made up of nodes, and the same data is replicated across several of them.
Consistency is configurable on a per-query basis.
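The usual rule of thumb behind per-query consistency is simple arithmetic: if the number of replicas that acknowledge a read plus the number that acknowledge a write exceeds the replication factor, the two sets must overlap in at least one replica. The function names below are made up for this sketch.

```python
def quorum(replication_factor):
    """Smallest majority of replicas, e.g. 2 out of 3 or 3 out of 5."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor
```

With a replication factor of 3, reading and writing at quorum (2 + 2 > 3) overlaps, while reading and writing at a single replica (1 + 1) does not, which is the eventually consistent but more available setting.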
Cassandra does not overwrite rows in place, so the storage engine is a “log-structured” merge tree. (I hope you recall that log entries are added sequentially.)
As we are not writing randomly on the hard disk, we are able to reduce disk seeks.
Also in most cases we are not doing read before write.
Cassandra is good for a write-heavy system. An update operation is fast in Cassandra due to the way rows are added. Basically, instead of an in-place update of pre-existing rows, in Cassandra new rows are added sequentially for an update - which is very similar to the way transaction logs are handled in various relational database systems. But for the same reason (and due to the tombstones added during deletion) a read may not be super fast, since it may have to reconcile several versions of a row.
What is “Frozen”? When multiple values are serialized into one and treated as a blob, so that each part can no longer be manipulated (i.e. updated) on its own, the group is called frozen. For a non-frozen item or column we can update each part of it if we wish. Please note that we cannot use non-frozen collections as primary keys.
I intend to add some notes on Quorum, the Paxos consensus protocol, the Gossip protocol and isolation levels.
If we change/update cassandra.yaml we have to restart the instance for that change to take effect.
An ad hoc query on an arbitrary column is not possible in Cassandra without an index on that column. If there is an index then we can do a lookup on an attribute even though it is not the partition key. To save index information Cassandra uses a hidden table (it is a column family too. :) ).
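The cheap write and costlier read can be sketched with an append-only list where the newest timestamp wins and a delete is just another appended marker. The class name, the logical clock and the single flat log are assumptions of this toy model, not Cassandra's storage format.

```python
TOMBSTONE = object()   # marker appended on delete instead of removing data


class AppendOnlyTable:
    def __init__(self):
        self.log = []    # (timestamp, key, value) entries, only ever appended
        self.clock = 0   # logical clock standing in for write timestamps

    def _append(self, key, value):
        self.clock += 1
        self.log.append((self.clock, key, value))

    def upsert(self, key, value):
        self._append(key, value)       # no read before write, no in-place update

    def delete(self, key):
        self._append(key, TOMBSTONE)   # the old versions stay until compaction

    def read(self, key):
        latest = None
        for ts, k, v in self.log:      # the read pays: it must look at every version
            if k == key and (latest is None or ts > latest[0]):
                latest = (ts, v)
        if latest is None or latest[1] is TOMBSTONE:
            return None
        return latest[1]
```

Every upsert or delete is a single append, while a read has to pick the winner among all versions, tombstones included.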
Yet to add note on secondary index, Nodetool utility, Log4j-server.properties (to log more or increase logging levels) etc.
In Cassandra a collection can be treated as a field. Why add a collection type as a field? To save, for example, more than one email for a user.
What is Polyglot Persistence? Using multiple database management systems for data persistence is called Polyglot Persistence.
Now let us consider some stories:
Assume that we are a music delivery company which ingests music data from a label, parses it, comes up with rationalized rights and policy, encodes the audio according to the business partner's specification, generates metadata (also called syndication) according to the business partner's specifications, combines/packages the metadata, audio etc., and then transfers the package to the location from where the business partner will pull it and ingest it into their systems.
We want to design a flexible (ease of adding and removing nodes; modification of code) and scalable distributed system. How would we do that?
Here we shall explore the whole design and implementation but one piece at a time.
Though the storage needs of the use cases can be satisfied with a relational database, a relational database is not a good candidate for every situation when scalability has to be kept in mind as well.
Let us think about the profiles of labels, users (not viewers but internal staff doing metadata correction, data inspection and investigations on behalf of the labels) and business partners. That data does not change frequently and the volume is not too high either. So a relational database management system suffices in this case.
However, assuming a single user name can have any number of emails and phone numbers, we would need extra tables to represent the 1:m relationship. We can use a document database too in case we want to save user, user email and phone number together. Assuming someone does not own a house but rents it, the address can change frequently. Using a document database like MongoDB we get the benefit of built-in replication - so the system would not suffer the way it would when we try to read all data from the single node of an RDBMS (a bottleneck).
Please note that we are not using a traditional (key, value) store. The reason is that in a key-value store the value is binary/opaque and we cannot easily modify part of it. Using a document database or RDBMS makes more sense in such a case as we can update the content of a single field.
If we chose to use a column family data store (like Cassandra) then we could keep the less frequently changing elements of the user profile (firstname, lastname, ssn, phone, email) together in one family, which could be queried or pulled together; and we could place the more frequently changing user profile elements (like address) in another column family (we have to make sure that, except for the key, there is no overlap in columns between the column families). Of course what changes less frequently and what does not is our assumption until the pattern is clearly observed in real data.
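BTW column families have to be defined beforehand. In very old Cassandra versions, where column families were declared in the configuration file, a change in or addition to a column family needed a restart; current versions apply schema changes (for example ALTER TABLE) on a live cluster. Adding or removing a node does not need downtime either.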
Product Catalog: ideally we can use a document database if there are only a few fields to handle.
When we have to deal with a huge number of fields and there is no standard set of columns then we can go for Cassandra.
Car inventory management system — daily feed of 1M entries.
Car dealer — could be key value database; but for part by part lookup a document database may be better suited
Let us consider a hypothetical universal health care system based on (for example) music therapy. As usual different patients have different needs. In such a system we have to consider storing the following information:
User profile — mostly static with a support for slight changes in current address. Why would the address change? It may change when a person travels due to the need of getting better treatment and we want to keep address history.
Sample fields for this system can be username, password, universalPatientID. For storing those we can have an RDBMS or a distributed key value store (distributing the data so that we do not have a hotspot). When the fields do not change much we could use a column store database/document database to store user information too. But if we want auto sharding and auto replication with none of the machines being the main replica then we can use the column family database.
In the hypothetical universal healthcare system these are the fields which do not change much — Lastname, firstname, emails, phones, social security/national ID/ universalPatientID, DOB, permanent address (a super column as it contains other children columns), and user occupation. We can declare a column family for this.
We can have another column family to save fields whose values change frequently. It would hold universalPatientID (the key), CurrentAddress, TimeToLive/TTL (so that the row is automatically removed after the TTL) and the history of all such addresses.
Coworker relationships such as who is managing whom and who communicates with whom, as well as the users' social network, could be stored using a graph database like Neo4j. But we need to consider the data volume here - can Neo4j handle that much data?
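The TTL behaviour can be sketched as a table that stores an expiry time next to each value. This toy version expires rows lazily on read and takes an injectable clock so it can be tested; TtlTable and its parameters are made-up names, and Cassandra itself handles expired data through its own tombstone and compaction machinery.

```python
import time


class TtlTable:
    """Toy table whose rows disappear once their time to live has passed."""

    def __init__(self, clock=time.time):
        self.clock = clock   # injectable so tests can control "now"
        self.rows = {}       # key maps to (value, expires_at)

    def put(self, key, value, ttl_seconds):
        self.rows[key] = (value, self.clock() + ttl_seconds)

    def get(self, key):
        entry = self.rows.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.rows[key]   # lazily drop the expired row
            return None
        return value
```

For the current address family this means a temporary treatment address can simply be written with a TTL and needs no cleanup job.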
In every database in the polyglot system the universalPatientID would be the key to represent the user and for linking purposes.
Now how do we store the profile of a business with branches or centers worldwide? How do we handle the case efficiently when more and more branches are added or the business expands beyond one country? We know that for some areas we use a zip code and for other areas we use a postal code. To support such variability we can save the profile of such a business in Cassandra.
We need to store information about therapists, patients and researchers interested in using the hypothetical universal health care system.
We need to consider types of services too. Is the universal/unified health care system providing service at various branches only, or also providing in-home service or long distance service?
While storing the actual music (as we are considering a music therapy based healthcare system), the corresponding tone, and the Gracenote id or MusicBrainz id, we may end up dealing with any number of fields. This implies that this is a case of a wide column family, which in turn implies that using a column family database would be useful (some data may be non-existent too - so the data is essentially sparse).
Music related documents - do we need to save much at all? Would performing lookups on the online music databases be okay? I have yet to figure that out.
We are using polyglot (multiple and different database systems) persistence, with a data store facade in front.
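A data store facade could look roughly like the sketch below: callers name the kind of record and the universalPatientID, and the facade decides which store to talk to. InMemoryStore is only a stand-in for real database clients, and the record kinds ("profile", "address") and class names are assumptions made for the example.

```python
class InMemoryStore:
    """Stand-in for a real database client (RDBMS, Cassandra, document store, ...)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def put(self, key, record):
        self.data[key] = record

    def get(self, key):
        return self.data.get(key)


class DataStoreFacade:
    """Single entry point that routes each kind of record to its own store."""

    def __init__(self, routes):
        self.routes = routes   # record kind maps to the store that holds it

    def _store_for(self, kind):
        try:
            return self.routes[kind]
        except KeyError:
            raise ValueError(f"no store configured for {kind!r}") from None

    def save(self, kind, patient_id, record):
        self._store_for(kind).put(patient_id, record)

    def load(self, kind, patient_id):
        return self._store_for(kind).get(patient_id)

    def load_all(self, patient_id):
        # universalPatientID is the linking key across every store.
        return {kind: store.get(patient_id) for kind, store in self.routes.items()}
```

Because the rest of the code only sees the facade, swapping the store behind one record kind (say, moving addresses from an RDBMS to Cassandra) does not ripple through the callers.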
One more interesting piece to discuss — what database to use when?
Riak Versus Redis:
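Riak is a distributed key-value database.
Redis is a key-value database as well, but a single Redis instance is not distributed (spreading data over nodes needs Redis Cluster or client-side partitioning).
So the choice is Riak as we want a KV database that is distributed from the start.
Yet to add a note on Cassandra caching so that Riak would not be needed.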
We can save a user’s past listening history, playlists, recommended music packages (used as part of non-medicinal treatment) and the kind of emotion those caused/cause in a document database (the way we would handle historical order information in an online bookstore).
We know that Cassandra's built-in support for aggregation is limited compared with what relational database systems have (CQL offers basic functions such as COUNT and SUM, but no joins or rich GROUP BY analytics). To provide analytics on how many times the patient changed places to have better treatment - we can use a pre-computed aggregate or Apache Spark based aggregation.
I have yet to add detail on where to store user profile data such as likes, dislikes, level of education, language proficiency, faith, mood and playlist for the user.
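The pre-computed aggregate option can be sketched as a counter that is maintained at write time, so the analytics question becomes a single lookup. The AddressHistory class and its method names are hypothetical; in Cassandra the count might live in its own table or a counter column, which is an assumption about the design rather than something settled in these notes.

```python
from collections import defaultdict


class AddressHistory:
    """Keeps address history and a pre-computed count of moves per patient."""

    def __init__(self):
        self.history = defaultdict(list)      # patient id maps to addresses, oldest first
        self.move_counts = defaultdict(int)   # patient id maps to number of moves so far

    def record_address(self, patient_id, address):
        past = self.history[patient_id]
        if past and past[-1] == address:
            return                            # same address again: not a move
        if past:
            self.move_counts[patient_id] += 1 # aggregate is updated on the write path
        past.append(address)

    def times_moved(self, patient_id):
        # A plain lookup: no scan over the history is needed at read time.
        return self.move_counts.get(patient_id, 0)
```

The cost is a little extra work on every write, which fits a write-friendly store better than scanning the full history for every report.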