Cassandra

The new lady in my life.

Just kidding. Cassandra is not that kind of lady. She’s a database. A database that I’ve recently had the pleasure of courting. Ok, maybe that is too far with the lady metaphor. I am writing this article because my team has just gone live with a project using Cassandra, and I wanted to share my opinions about that process, some of the lessons I learned along the way, and my vision for the future of my relationship with Cassandra. Let’s start with my opinions.

First, I would like to provide a bit of context about my previous experience with various databases and with persistence layers in general. I started my career working with MySQL, then moved on to PostgreSQL, did a tour with MongoDB, a stint with memcached, and some experimentation with Redis. All in all, I enjoyed my time with each of them, and each provided its own challenges; however, I feel that PostgreSQL is clearly superior to MySQL. I won’t go into too much detail here as to why I feel this way, but I formed that opinion when I used PostgreSQL to implement a large portion of an application’s business logic. Whether or not that was a good idea is up for debate, so my opinion is biased in that regard. The easiest database on that list to work with from a high-level language would have to be MongoDB. Writing routines in JavaScript against a document store makes writing simple, clean code a breeze. For example, you only need a very thin abstraction layer between you and the database when using MongoDB with NodeJS, which in turn means there is very little to learn if you are already familiar with NodeJS. I hope to reflect this ease of use in my vision of working with Cassandra in the future. As for Redis and memcached, their simplicity means they are simple to work with, and there is not much else to be said about them in the context of this article. So, that’s a brief overview of my previous experience with persistence layers. Now, how does Cassandra fit into this context?

Cassandra’s learning curve is similar to that of SQL in general: it is a different way of thinking about persisting your data. It took me several weeks of implementing features before I even felt productive using it, and I know that I still have a long way to go. The hard part for me was understanding that I need to be sure I know how I want to read my data before I design the way that I write it. This is completely backwards from the way I used to start thinking about schema design with PostgreSQL, or from MongoDB, where it feels like you do not need to worry about your schema at all. That last bit about MongoDB is not entirely true, but that is how it feels when you are working with it and NodeJS. Another challenge when using Cassandra is that even when you understand how you want to read your data to solve a particular problem or implement a particular feature, it is difficult to anticipate how you will want to read the same data throughout the rest of your application, at least until you get around to implementing the rest of it. I like challenges, though, so Cassandra has piqued my interest.

Now, I am anxious to talk about my vision for working with Cassandra in the future, but first let’s talk about some of the pitfalls we ran into along the way, because they will eventually form the gooey center of our Cassandra cookie. Indexes. Too many of them, to be precise. What I did not understand about indexes in Cassandra is that they form partitions in your data, altering the way that it is persisted. In the SQL variants I mentioned above, an index is a separate entity altogether and is maintained as such; it generally does not affect the way your data is persisted, at least not in a way that the programmer needs to be conscious of. Taking advantage of the way that Cassandra persists its data has to be an overriding concern in all schema design. Too many partitions and your data will be fragmented, making it difficult to write data [1]; too few partitions and your data can be segmented across several internal structures, making it difficult to read data [2]. The dichotomy just established should not be viewed as a law; it’s just an abstraction, and maybe an obfuscation, of the true inner workings of Cassandra. Be sure to check out those reference links to gain a deeper insight, especially the second one.
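To make that tradeoff concrete, here is a small sketch in plain JavaScript (no driver involved, and the `user_id`/`photo_id` names are purely hypothetical) of how Cassandra conceptually groups rows by their partition key, and how the choice of key pulls in opposite directions:

```javascript
// Conceptual sketch only: Cassandra stores rows grouped by partition key.
// A coarse key yields a few potentially unbounded partitions; a fine key
// yields one tiny partition per row, scattering related data.
function groupByPartitionKey(rows, keyFn) {
  const partitions = new Map();
  for (const row of rows) {
    const key = keyFn(row);
    if (!partitions.has(key)) partitions.set(key, []);
    partitions.get(key).push(row);
  }
  return partitions;
}

const photos = [
  { user_id: 'u1', photo_id: 'p1' },
  { user_id: 'u1', photo_id: 'p2' },
  { user_id: 'u2', photo_id: 'p3' },
];

// Coarse key: all of a user's photos land in one partition.
const byUser = groupByPartitionKey(photos, (r) => r.user_id);

// Fine key: one partition per photo, so "all photos for u1"
// has to touch many partitions.
const byUserAndPhoto = groupByPartitionKey(
  photos,
  (r) => `${r.user_id}:${r.photo_id}`
);
```

This is an abstraction of the real storage engine, of course, but it is the mental model the schema designer has to carry around.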

Now that we have a better understanding of some of the challenges the developer faces when working with Cassandra, let’s talk about a framework for dealing with those challenges. There are two important concepts that form the foundation of this framework: Buckets and Relationships. I capitalize them because I will eventually assign each a specific definition.

A Bucket is simply an identifier and a document. Since we are working with NodeJS, it makes sense to use JSON as the structure of the document, so that is how I will refer to documents. I suppose “it makes sense” may not be a good enough reason to use JSON, so let me point out a few other reasons. JSON can be natively serialized and deserialized in NodeJS, which makes those operations fast and efficient. It is also trivial to reference a deserialized JSON object, because its interface is both well-defined and familiar to the JavaScript developer. If another document type were chosen, XML for example, then a non-native library would need to be introduced, its interface learned, and its performance optimized; this argument holds for any non-native document type. That is good enough for me, but we are getting mired in implementation details. Let’s return to the framework.
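A Bucket entry can be sketched in a few lines. This is a minimal in-memory model, not a Cassandra-backed one, and the `users` bucket and its fields are hypothetical:

```javascript
// A Bucket is a set of { identifier, document } entries.
// JSON.stringify/JSON.parse are the native (de)serialization in NodeJS.
const usersBucket = new Map(); // identifier -> serialized document

function putDocument(bucket, identifier, document) {
  bucket.set(identifier, JSON.stringify(document));
}

function getDocument(bucket, identifier) {
  const raw = bucket.get(identifier);
  return raw === undefined ? undefined : JSON.parse(raw);
}

putDocument(usersBucket, 'user-1', { username: 'ada', photoCount: 3 });
const doc = getDocument(usersBucket, 'user-1');
```

In a real deployment the `Map` would be replaced by a Cassandra table keyed on the identifier, but the shape of the interface stays the same.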

In my team’s projects, we have seen a steady pattern emerge for how data is accessed. Reads are the most common operation, followed closely and occasionally superseded by writes, and finally by updates. So, the framework should reflect these priorities; this is the reason I introduced the idea of documents. Documents provide a common way to read and write a subset of the total data, but they are not simple to update. Updates are the least common operation, though, so it is okay if we lose a little efficiency in this regard. I will, however, propose one way that updates on documents could be handled: transactionally, as changesets. If a client wants to update, for example, a user’s username, then three things are stored as a transaction that will eventually be transmitted to the server: the key that was updated (in the case of JSON, and in this example, that key is the username), the new value, and the time that the update was made. This eventual transmission allows for offline modification of documents. The server is responsible for applying the latest changes (according to the timestamp) to the documents and persisting the new documents to Cassandra. This update process is a bit lengthy, but that is by design. If you do not exactly follow me just yet, fear not: I will include a formalized ruleset for the construction of Buckets, along with rules for the rest of the framework constructs, at the end of this article. I just want to describe these concepts from a high level in this introduction.
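The changeset idea can be sketched as follows. This is one possible implementation of the last-write-wins behavior described above, with hypothetical field names:

```javascript
// A changeset records which key changed, the new value, and when.
// The server applies changesets in timestamp order, so the latest
// write for a given key wins.
function applyChangesets(document, changesets) {
  const sorted = [...changesets].sort((a, b) => a.timestamp - b.timestamp);
  const result = { ...document };
  for (const { key, value } of sorted) {
    result[key] = value;
  }
  return result;
}

const original = { username: 'old_name', bio: 'hello' };

// Two offline edits to the same key, received out of order.
const pending = [
  { key: 'username', value: 'newest_name', timestamp: 300 },
  { key: 'username', value: 'newer_name', timestamp: 200 },
];

const updated = applyChangesets(original, pending);
```

Note that keys untouched by any changeset pass through unchanged, which is what makes the scheme safe for partial, offline edits.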

The next and final concept is called a Relationship. A Relationship is exactly what it sounds like, plus a little bit more. It’s a standard “has-many”, “has-one”, etc. type of relationship, as well as a definition of how that relationship will be referenced. So, for example, a User’s relationship to Photos (both of which are Buckets) might be “a User has-many Photos, and a User references those Photos in the order that they are created”. That’s all well and good, but a Relationship is more than that. It can represent any interaction, indeed any “User Story” or use case, that needs persistence. For example, a typical “User Story” would be “As a User, I can follow other Users”. How do we define this in terms of Buckets and Relationships? Each “User Story” contains an actor and an action. The actor is a Bucket and the action is a Relationship, and since we know that in this case we want to remember, or persist, the result of this “User Story”, we must apply the Buckets and Relationships framework. So, that’s a high-level view of the Buckets and Relationships framework for working with Cassandra and NodeJS (later referred to as Kasnoden).
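The “User has-many Photos, ordered by creation” example can be sketched like this. The sorter here is a creation timestamp, and all of the names are hypothetical:

```javascript
// A Relationship is an ordered set of { sorter, identifier } entries.
function createRelationship() {
  return [];
}

function relate(relationship, sorter, identifier) {
  relationship.push({ sorter, identifier });
}

// A query against a Relationship always resolves to a list of
// identifiers, which can then be looked up in the target Bucket.
function queryRelationship(relationship) {
  return [...relationship]
    .sort((a, b) => a.sorter - b.sorter)
    .map((entry) => entry.identifier);
}

const userPhotos = createRelationship(); // "a User has-many Photos"
relate(userPhotos, 1002, 'photo-b'); // sorter: creation timestamp
relate(userPhotos, 1001, 'photo-a');

const photoIds = queryRelationship(userPhotos);
```

The important property is that the Relationship stores only sorters and identifiers, never documents; resolving the documents is a second step against the Bucket.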

Splitting these two concepts up is key to the success of an application that uses Cassandra as its persistence layer. It gives the developer easy access to the data they reference most often, the documents, and a clear way to define, without excessive indexing, the relationships and interactions within their application. Now, I’d like to provide a slightly more formal definition of this framework.

However, a brief interlude is in order. I normally try to limit these blog posts to my opinions, but this seems like a good platform to share some of my visions as well, because from opinion comes hypothesis, and from hypothesis come solutions. At least, that’s my opinion about how the problem-solving process goes, sometimes. Therefore, it should be noted that this vision, or framework, is only a starting point. It is likely to go through several iterations before it ever makes it into my team’s code. So, the following rules are meant to be as general as possible; it is only through their application that their usefulness will be realized.

The Kasnoden Framework

(pronounced: Ka snowed in)

Bucket rule: must be a set where each entry is { identifier, document }.

Identifier rule: must be unique among the Bucket set.

Document rule: a single structure that must remain consistent in the Bucket.

Relationship rule 0: must be an ordered set where each entry is composed of the same combination of sorters and identifiers.

Relationship rule 1: the order of the set can be derived from the combination of sorters.

Sorter rule: any mechanism that determines the order in which Relationship entries are stored. This is usually a timestamp, but it could be an edge in a graph, a priority, or indeed any construct that has an order when compared to itself. It could be as arbitrary as “cats are greater than dogs”.
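Even that last, deliberately silly sorter satisfies the rule, as long as the comparison is deterministic. A sketch, with a hypothetical ranking:

```javascript
// A Sorter only needs a consistent ordering when compared to itself.
// Here "cats are greater than dogs" is encoded as an explicit rank.
const speciesRank = { dog: 0, cat: 1 }; // hypothetical ordering

function compareSpecies(a, b) {
  return speciesRank[a] - speciesRank[b];
}

const sorted = ['cat', 'dog', 'cat'].sort(compareSpecies);
```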

Now, I have mentioned one possible theorem of this framework already, but I would like to reiterate it and list a few more that should help make it clear how to build something using Kasnoden. By theorem, I mean a statement that describes a way to derive a production, and a production is a piece of code written within the framework. I am introducing a bit of a formal system in this regard, while leaving out some key parts that prevent me from calling it a truly formal system; it is more an amalgamation of practice and system. Anyway, these are not the only theorems in the framework, nor should these theorems be considered rules, but any future theorem must satisfy the rules stated above so that the system is preserved. In this way, the rules become a decision procedure for whether or not a statement is a theorem.

Theorem 0. If the content of a document changes, the changes should be recorded as a transaction. Only the transactions should be distributed between systems.

Theorem 1. Relationships are used to define a subset of a Bucket such that the result of a query against a Relationship should always be a list of identifiers.

Theorem 2. If the structure of a document changes, then all of the documents in the Bucket need to be updated to preserve the Document rule.
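Theorem 2 can be illustrated with a small migration sketch: when the document structure changes (here, a hypothetical split of `name` into `first` and `last`), every document in the Bucket is rewritten so the Document rule still holds.

```javascript
// Migrate every document in a Bucket to a new structure,
// preserving the Document rule (one consistent structure per Bucket).
function migrateBucket(bucket, migrate) {
  for (const [identifier, raw] of bucket) {
    bucket.set(identifier, JSON.stringify(migrate(JSON.parse(raw))));
  }
}

const bucket = new Map([
  ['u1', JSON.stringify({ name: 'Ada Lovelace' })],
  ['u2', JSON.stringify({ name: 'Alan Turing' })],
]);

migrateBucket(bucket, (doc) => {
  const [first, last] = doc.name.split(' ');
  return { first, last };
});

const migratedU1 = JSON.parse(bucket.get('u1'));
```

Against a real Cassandra cluster this would be a batch job over the table rather than a loop over a `Map`, but the invariant being preserved is the same.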

So, that’s all for now. I will write a follow up article containing more theorems and describing the life of the Kasnoden framework thus far in the coming months. Thanks for reading.

Written by Chris Lacko.