Build a Goodreads Clone with Spring Boot and Astra DB — Part 3
Author: Pieter Humphrey
Want to learn how to build a quick and simple Spring Boot application that is highly available, easily scalable, and can handle millions and millions of data records? In part three of this blog series, we learn how to create an Apache Cassandra® data model that ensures speed and reliability from our Spring Boot application. To understand more about the design of this application, check out part one and part two.
The Code With Me — Better Reads series from Java Brains focuses on building a clone of the popular website, Goodreads.com, which can handle large amounts of data. In this blog post, we dive into the world of Cassandra Data Modeling. You will understand what factors to consider when modeling schemas for Cassandra, and discover ways to denormalize data to optimize for efficient reads at scale.
Let’s start by figuring out what our most commonly-used queries will be. We will create an ER diagram with the main entities for this application: Book, Author, User, and User-Book.
The Book Entity: has the properties name, description, cover, id, and date.
The Author Entity: has the properties id, name, and image.
Between the two there is a many-to-many relationship. A book can have multiple authors, and an author can write multiple books.
The User Entity (as in the person who has logged in): has the properties id and name. A user can read (or has read) multiple books, and a book can be read by multiple users. This is a many-to-many relationship.
Finally, the User-Book entity: this will contain the information that a user wants to store about a book, like when a user started and finished reading a book, the status (currently reading or not), and the rating. One user has multiple user-books, or user-ratings. A single user-rating can only be associated with one user, but a single user could have rated multiple books.
These are the core entities that we will be dealing with in our application. This is a good starting point for building out our database schema for the application. Next, we need to think about how we can make this fit with Cassandra schema design.
What are the queries that this application would need to perform? First, for a no-login scenario:
Q1: Given an ID, fetch the book (Book by ID)
- Each book will have its own page with information
- We want to build the schema so that this action is super fast
Q2: Find all the books written by a given author (Books by Author ID)
- If you are on a book page, you can click the author to be brought to an author page. There you will see all the books written by that author.
If a user is logged in, they will also want to get their information, such as whether they have read a book, and their ratings. For a logged-in user:
Q3: Get a User-book by BookID and UserID
- Basically, we need to get the user-book (from the user-book entity) information when somebody is logged in.
- When a book page loads for a user, we want to get the information about the current logged in user’s interaction with that book.
- Q3 returns a single record (the user-book for that user and book)
Q4: User-Books by UserID
- When someone is logged into their dashboard, we want all the recent books that the user has read to be displayed, fetched with a single query. Q4 will return a list.
We have now established the only queries our application will need. Each of these queries will correspond to a table, which is a recommended best practice in Cassandra. As long as we build a schema that addresses these queries, we are good to go.
We may be duplicating data across our tables, but that is to be expected when using Cassandra, and with denormalized data in general. We don’t care so much about how efficiently the data is stored; we care about how efficiently it can be fetched. So we duplicate the data to make sure each of these queries (fetches) results in just one lookup.
Build the schema
Now we are going to create Chebotko diagrams to give us an idea of what our Cassandra tables will look like. These will include the primary keys we will be using, and also our clustering columns. We should be able to take these diagrams and easily transform them into Cassandra tables.
Let’s start with a books_by_id table. We need a table where someone can look up a book by a given ID:
- bookID will be the primary key, so we can look up a book by a given ID
- All the other attributes will show up as columns
- authorName will be duplicated in another table, but that does not matter
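A sketch of this table in CQL (the column names and types here are my own guesses from the entity description earlier in the post; the real schema comes later in the series):

```sql
-- books_by_id: one book per partition, fetched by its ID (Q1).
CREATE TABLE books_by_id (
    book_id      TEXT PRIMARY KEY,  -- sole primary key column: one lookup per book
    name         TEXT,
    description  TEXT,
    cover        TEXT,
    published_dt DATE,
    author_names LIST<TEXT>         -- duplicated author data; denormalization is fine
);
```

Since bookID is the entire primary key, Q1 is a single-partition read: `SELECT * FROM books_by_id WHERE book_id = ?`.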
We need to create a table to fetch books by authorID. This table, books_by_authorID, represents all the books by a particular author (no surprise there).
- authorID will be the partition key
- All the other attributes will show up as columns
- The published date will be the sort criteria. We want the most recently written book to show up first on the author page, in reverse chronological order: published_dt desc. Basically, this data will be saved in an ordered fashion, with no need to sort it at runtime. This is why we make published_dt a clustering column in descending order. A clustering column tells Cassandra that we want the data within a partition sorted in that order.
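In CQL, that clustering order is declared when the table is created. A sketch, again with illustrative column names (I have also added bookID as a tie-breaking clustering column, since two books could share a publication date):

```sql
-- books_by_authorID: all of an author's books in one partition,
-- kept sorted newest-first on disk (Q2).
CREATE TABLE books_by_authorID (
    author_id    TEXT,
    published_dt DATE,
    book_id      TEXT,
    book_name    TEXT,
    PRIMARY KEY ((author_id), published_dt, book_id)
) WITH CLUSTERING ORDER BY (published_dt DESC, book_id ASC);
```

With this layout, `SELECT * FROM books_by_authorID WHERE author_id = ?` returns the author’s books already in reverse chronological order, with no sorting at query time.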
How is this data managed in Cassandra? Since Cassandra is a distributed database, your data is not all going to be on one server. This also means that your data is not going to be on every node of your cluster (unless you want that, of course). In Cassandra, a primary key uniquely identifies a row, and consists of at least one partition key column and zero or more clustering columns.
The partition key here (K) is authorId. The partition key is how Cassandra distributes data across the cluster. Tables are organized in rows and columns, and groups of related rows, called partitions, are stored together on the same node (or nodes). Each row’s partition key is hashed to determine which node(s) store that data.
Cassandra computes a hash for the partition key using the Murmur3 hash function, and then places the row in the corresponding partition. This makes data placement deterministic, and lookups very efficient.
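You can watch this happen from cqlsh: the built-in token() function returns the Murmur3 token Cassandra computed for a row’s partition key (assuming a books_by_authorID table whose partition key column is named author_id; the column name is illustrative):

```sql
-- token() exposes the Murmur3 hash used for placement; every row with
-- the same author_id gets the same token, hence the same partition.
SELECT author_id, token(author_id) FROM books_by_authorID;
```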
A partition can contain one record or many. If we use bookID as the partition key, we essentially have partitions of one record each. The books_by_authorID table, on the other hand, has multi-record partitions: we partitioned by authorID, and an author may have written multiple books.
If an author writes three books, they will all share the same authorID partition key. This also means that all the author’s books will be stored on one node, within the same partition. Having the schema this way also allows us to fetch all of an author’s books without reading from multiple partitions on multiple nodes.
Now let’s move on to Q3: given a book and a user, what is the user’s rating of that book? We will create a table books_by_userID_and_bookID. This information is loaded when somebody is on a book page while logged in.
- userID and bookID together will make up the primary key
- A user is on a book page, and we want to get that user’s information (rating, etc.) about the book (a single record). That is why the primary key has two columns.
- The properties will be columns in the table (start date, end date, rating, and status)
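A CQL sketch of this table (column names are illustrative; I am treating userID and bookID as a composite partition key, since Q3 always supplies both values):

```sql
-- books_by_userID_and_bookID: one user's record for one book (Q3).
CREATE TABLE books_by_userID_and_bookID (
    user_id        TEXT,
    book_id        TEXT,
    started_date   DATE,
    completed_date DATE,
    rating         INT,
    status         TEXT,  -- e.g. currently reading vs. already read
    PRIMARY KEY ((user_id, book_id))
);
```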
Finally, we will have a table for all the recent books read by a user, covering Q4.
- We want all the books that have been read by a user in one place.
- userID will be the partition key, but not the entire primary key.
- The most recently read books need to show up at the top of the user’s dashboard.
- A TimeUUID for when a book was read will be a clustering column, in descending order. We will also make the status a clustering column to ensure that books currently being read are always displayed before “already read” books. So we sort first on status (currently reading, then already read), and then on time.
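The post does not name this table, so call it userbooks_by_userid purely for illustration. A CQL sketch with both clustering columns; note that the status values (here a numeric code, which is my own assumption) must be chosen so that “currently reading” sorts before “already read”:

```sql
-- userbooks_by_userid (hypothetical name): a user's reading history,
-- grouped by status, then newest-first within each status (Q4).
CREATE TABLE userbooks_by_userid (
    user_id   TEXT,
    status    INT,       -- e.g. 0 = currently reading, 1 = already read
    time_uuid TIMEUUID,  -- when the book was read
    book_id   TEXT,
    book_name TEXT,
    rating    INT,
    PRIMARY KEY ((user_id), status, time_uuid)
) WITH CLUSTERING ORDER BY (status ASC, time_uuid DESC);
```

Because both clustering orders are baked into the table, `SELECT * FROM userbooks_by_userid WHERE user_id = ?` returns the dashboard list in exactly the display order, with no runtime sorting.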
These are our four Cassandra tables. Each of the queries we developed earlier maps to a particular table.
These are the tables that we will need to create in Cassandra. Now we are ready to start building our application. Once we have these tables in Cassandra, our Spring Boot application can store and fetch data to and from them. But, that will be in a later blog post. So stay tuned.
This has been a very different approach compared to how you would build relational database schemas, since we did not translate our ER diagram directly to tables. There are no joins, and we keep data readily available in exactly the format we need for our queries, such that fetching data is very efficient.
Now that we have designed the schema, the next post in this series will focus on setting up a Cassandra instance using DataStax Astra DB. Check out the Github repository with the full code for this project.