Coursera’s Adoption of Cassandra

Published in

Coursera Engineering

5 min readSep 23, 2014

By Daniel Chia

Like many startups, Coursera began its data storage journey with MySQL, a familiar and industry-proven database. As Coursera’s user base grew from several thousand to many millions, we found that MySQL provided limited availability and restricted our ability to scale easily. New product initiatives and requirements provided a perfect opportunity to revisit our choice of core workhorse database.

After evaluating several NoSQL databases, including MongoDB, DynamoDB and HBase, we elected to transition to Cassandra . Cassandra’s relative maturity, masterless architecture (for availability), tunable consistency, and stable low-latency performance made it a clear winner for our needs.

Transitioning from MySQL to Cassandra

Given the significant differences between MySQL and Cassandra, the transition to Cassandra took time and thoughtful effort. Relational data models designed for MySQL will perform very poorly on Cassandra, because Cassandra is optimized for many simple point or range queries, while relational data models encourage normalization. MySQL uses joins and filters to extract data, but Cassandra does not support joins, and provides only weak support for arbitrary filtering; instead, each Cassandra query should only read a subset of data for a partition key.

Here are some takeaways from our transition to Cassandra.

Start Small

Don’t migrate your entire product to Cassandra on day one. You will inevitably make mistakes on your first attempt; limiting the scope of early migrations will minimize both migration effort and end-user impact. Start with a small feature for which you can afford some downtime. For example, moving your user login information to Cassandra would probably be a bad idea; better choices would be a new feature that can be slowly rolled out, or an optional feature like a feedback gathering tool that can be omitted if it fails to load.

We started our first Cassandra project with a lightweight polling feature called Quick Questions. As the resident Cassandra expert, I worked closely with the engineers building Quick Questions to ensure smooth development on top of Cassandra.We found it helpful to provide data store libraries, as well as data modelling consulting, both of which are explained in more detail below.

Effect a mindset shift

Cassandra has a very specific set of capabilities. We found that we needed to give developers time to properly understand Cassandra’s data models in order to get good query latencies. As mentioned earlier, treating Cassandra as a relational database like MySQL is a guaranteed way have terrible query performance. Few developers at Coursera had applicable experience, so we used tech talks — informal engineering seminars held over lunch — to help educate our team.

One of the biggest differences between Cassandra and MySQL is that Cassandra requires much earlier development and understanding of intended query patterns. In MySQL, you store your data in a normalized fashion, and then write appropriate queries to retrieve data based on your application needs. In Cassandra, however, you start with your application first, figure out which queries you need to ask of your data, and store your data in a denormalized fashion according to the queries.

For example, at Coursera we have a many-to-many relationship between learners and courses. We need to keep track of for each learner, which courses they are part of, and for each course, which learners are part of the course.

A possible SQL table for this relationship would be:

CREATE TABLE `courses_learners` ( `id` INT(11) NOT NULL auto_increment, `course_id` INT(11) NOT NULL, `learner_id` INT(11) NOT NULL, PRIMARY KEY (`id`), UNIQUE KEY `c_l` (`learner_id`, `course_id`), UNIQUE KEY `l_c` (`course_id`, `learner_id`) )

Translating this schema directly to Cassandra would be problematic for several reasons:

The use of an auto-increment primary key: We don’t have a way to generate auto-incrementing keys in Cassandra. Instead, we favor UUIDs or natural keys.
Requiring indexes: Our two required queries would require scanning all the data in Cassandra, since we can’t easily create indexes on (course_id, learner_id).

Instead, in Cassandra we’d opt to denormalize our data and write into two tables optimized for our queries:

CREATE TABLE courses_by_learner ( learner_id uuid, course_id uuid, PRIMARY KEY (learner_id, course_id) ) CREATE TABLE learners_by_course ( course_id uuid, learner_id uuid, PRIMARY KEY (course_id, learner_id) )

Here, we’ve used the pair (learner_id, course_id) as the natural key for membership in a course. By denormalizing the data into two tables, we’re able to answer both queries using a single read of one partition in Cassandra, rather than scanning and filtering all data. Cassandra excels at reading one row, or a slice of rows, from exactly one partition. By denormalizing our data and organizing it according to our query pattern, we’re able to ensure we can obtain our data using read patterns that play to Cassandra’s strengths.

Data modeling consulting

Most applications probably don’t have very complicated query patterns, and application developers should be able to easily identify the right library or Cassandra pattern to apply. For the minority of use cases that are more advanced, it’s useful to have a set of experts whom developers can consult.

Libraries

Libraries offer a way to abstract away the complexities of Cassandra for simpler use cases. For example, many applications only need a simple key-value store, as concurrency is low and read-modify-write is not a problem. As a bonus, our libraries also maintain a history of previous values per key to aid with debugging and troubleshooting. We plan to extend the library in the future to make it easy to build custom secondary indexes. In such cases, developers can use a pre-written Cassandra key-value store library, and not have to worry about the low-level Cassandra details at all. The library uses a good Cassandra data model, sets read and write consistency levels appropriately, converts Java Futures to Scala Futures, and might even provide other goodies in the future.

Conclusion

The transition from MySQL to Cassandra has been an interesting one for Coursera. Cassandra has required us to leave our SQL mindset behind and design our data models according to our queries. We can’t say that our journey is complete, but so far, we’ve been reaping the benefits. Maintenance is much simpler, we’ve had practically no downtime, and performance on SSDs has been great with 95th percentile read latencies of less than 5ms.

If working on Cassandra and other infrastructure problems interest you, or you just want to help scale our education platform to reach every learner in the world, we’re always looking for passionate and talented engineers. Check out our careers page or email us at joinus@coursera.org.

Originally published at building.coursera.org on September 23, 2014.