Build a Goodreads Clone with Spring Boot and Astra DB — Part 3

Want to learn how to build a quick and simple Spring Boot application that is highly available, easily scalable, and can handle millions and millions of data records? In part three of this blog series, we learn how to create an Apache Cassandra® data model that ensures speed and reliability from our Spring Boot application. To understand more about the design of this application, check out part one and part two.

The Code With Me — Better Reads series from Java Brains focuses on building a clone of the popular website, Goodreads.com, which can handle large amounts of data. In this blog post, we dive into the world of Cassandra Data Modeling. You will understand what factors to consider when modeling schemas for Cassandra, and discover ways to denormalize data to optimize for efficient reads at scale.

Data model

Let’s start by figuring out what our most commonly used queries will be. We will create an ER diagram with the main entities for this application: Book, Author, User, and User-Book.

The Book Entity: has the properties name, description, cover, id, and date.

Figure 1. The Book entity.

The Author Entity: has the properties id, name, and image.

Figure 2. The Author entity.

Between the two there is a many-to-many relationship. A book can have multiple authors, and an author can write multiple books.

Figure 3. A many-to-many relation between the Book and Author entities.

The User Entity (as in the person who has logged in): has the properties id and name. A user can read (or has read) multiple books, and a book can be read by multiple users. This is a many-to-many relationship.

Figure 4. A many-to-many relation between the Book and User entities.

Finally, the User-Book entity: this will contain the information that a user wants to store about a book, like when the user started and finished reading it, the status (currently reading or not), and the rating. One user has multiple user-books (or user-ratings), but a single user-book is associated with exactly one user, which makes this a one-to-many relationship.

Figure 5. A one-to-many relation between the User and User-Book entities.

These are the core entities that we will be dealing with in our application. This is a good starting point for building out our database schema for the application. Next, we need to think about how we can make this fit with Cassandra schema design.

Figure 6. Core entities for the Better Reads application.

Application requirements

What are the queries that this application would need to perform? First, for a no-login scenario:

Q1: Given an ID, fetch the book (Book by ID)

  • Each book will have its own page with information
  • We want to build the schema so that this action is super fast

Q2: Find all the books written by a given author (Books by Author ID)

  • If you are on a book page, you can click the author to be brought to an author page. There you will see all the books written by that author.

If a user is logged in, they will also want to get their information, such as whether they have read a book, and their ratings. For a logged-in user:

Q3: Get a User-book by BookID and UserID

  • Basically, we need to get the user-book (from the user-book entity) information when somebody is logged in.
  • When a book page loads for a user, we want to get the information about the currently logged-in user’s interaction with that book.
  • Q3 returns a single record

Q4: User-Books by UserID

  • When someone is logged into their dashboard, we want all the books that the user has recently read to be displayed, fetched with a single query. Q4 returns a list.
Figure 7. Queries that the Better Reads application needs to perform.

We have now established all of the queries that we will need for our application. Each of these queries will correspond to its own table, which is a recommended best practice in Cassandra. As long as we build a schema that addresses these queries, we are good to go.

We may be duplicating data across our tables, but that is to be expected with Cassandra, and with denormalized data in general. We care less about how efficiently the data is stored than about how efficiently we can fetch it. So, we duplicate data to make sure each of these queries (fetches) results in just a single lookup.

Build the schema

Now we are going to create Chebotko diagrams to give us an idea of what our Cassandra tables will look like. These will include the primary keys we will be using, and also our clustering columns. We should be able to take these diagrams and easily transform them into Cassandra tables.

Let’s start with a books_by_id table. We need a table where someone can look up a book by a given bookID.

  • bookID will be the primary key, so we can look up a book by a given ID
  • All the other attributes will show up as columns
  • authorName will be duplicated in another table, but that does not matter (see the CQL sketch below)
Figure 8. The books_by_id table.
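
To make this concrete, here is a minimal CQL sketch of what such a table could look like. The column names are illustrative assumptions (not necessarily the exact ones used later in the series), and author names are modeled as a list since a book can have multiple authors:

    -- Illustrative sketch of the books_by_id table; column names are assumptions.
    CREATE TABLE books_by_id (
        book_id          TEXT,
        book_name        TEXT,
        book_description TEXT,
        cover_url        TEXT,
        published_date   DATE,
        author_names     LIST<TEXT>,   -- author names duplicated here so a book page is a single lookup
        PRIMARY KEY (book_id)          -- bookID alone is the partition key
    );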

We need to create a table to fetch books by authorID. This table, books_by_authorID, represents all the books by a particular author (no surprise there).

  • The authorID as the primary key
  • All the other attributes will show up as columns
  • The published date will be the sort criterion. We want the most recently published book to show up first on the author page, in reverse chronological order, i.e. published_dt desc. The data is saved already ordered, so there is no need to sort it at runtime. That is why we make the published date a clustering column with descending order: a clustering column tells Cassandra how to sort the rows within a partition, as sketched below.
Figure 9. The books_by_authorID table.
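
A hedged CQL sketch of this table might look like the following. The column names are again illustrative, and bookID is added as a second clustering column only to keep rows unique when two books share a published date:

    -- Illustrative sketch of the books_by_authorID table.
    -- Note: Cassandra lower-cases unquoted identifiers, so this is stored as books_by_authorid.
    CREATE TABLE books_by_authorID (
        author_id      TEXT,
        published_date DATE,       -- clustering column: newest books first
        book_id        TEXT,       -- clustering column: keeps rows unique per date
        book_name      TEXT,
        cover_url      TEXT,
        PRIMARY KEY ((author_id), published_date, book_id)
    ) WITH CLUSTERING ORDER BY (published_date DESC, book_id ASC);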

How is this data managed in Cassandra? Since Cassandra is a distributed database, your data is not all going to live on one server. It is also not going to be on every node of your cluster (unless you want that, of course). In Cassandra, a primary key uniquely identifies a row, and it is made up of at least one partition key column and zero or more clustering columns.

The partition key here (K) is authorId. Partition keys are how Cassandra distributes data across the cluster. Tables are organized into rows and columns, and groups of related rows, called partitions, are stored together on the same node (or nodes). Each row contains a partition key, whose value is hashed to determine which node(s) store that data, along with one or more other columns.
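
In CQL, this structure is spelled out directly in the PRIMARY KEY clause: the first element (wrapped in its own parentheses when it is composite) is the partition key, and everything after it is a clustering column. A generic sketch:

    -- one partition key column, two clustering columns
    PRIMARY KEY (partition_col, clustering_col_1, clustering_col_2)

    -- composite partition key (both columns are hashed together), one clustering column
    PRIMARY KEY ((partition_col_a, partition_col_b), clustering_col_1)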

Cassandra computes a token for the partition key using the Murmur3 hash function and uses that token to place the data in the right partition on the right node(s). This makes data retrieval straightforward and lookups very efficient.

A partition can contain a single record or multiple records. With bookID as the partition key, we essentially get partitions of one record each. The books_by_authorID table, on the other hand, has multi-record partitions: we partitioned by authorID, and a given author could have written multiple books.

If an author writes three books, they will all share the same authorID partition key, which means all of the author’s books are stored within the same partition on one node (and its replicas). Designing the schema this way lets us fetch all of an author’s books without reading from multiple partitions on multiple nodes.
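
For example, loading an author page boils down to one single-partition query against the table sketched earlier (the ID value below is an illustrative, OpenLibrary-style placeholder), and the built-in token() function lets you inspect the Murmur3 token Cassandra derives from that partition key:

    -- Single-partition read: every book by one author, already sorted newest-first.
    SELECT book_id, book_name, published_date
    FROM books_by_authorID
    WHERE author_id = 'OL1394865A';

    -- Optional: see the Murmur3 token used to place this partition on a node.
    SELECT token(author_id) FROM books_by_authorID WHERE author_id = 'OL1394865A' LIMIT 1;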

Now let’s move on to Q3, the user-book by userID and bookID query. We will create a table called books_by_userID_and_bookID. Given a book and a user, what is that user’s rating of the book? This information will be loaded when somebody who is logged in views a book page.

  • Both userID and bookID together make up the partition key
  • A user is on a book page, and we want to get that user’s information (rating, dates, status) about that one book, so a single record. That is why the primary key is made up of both columns, as sketched below.
  • The properties will be columns in the table (start date, end date, rating and status)
Figure 10. The books_by_userID_and_bookID table.
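
A minimal CQL sketch of this table, again with illustrative column names, could be:

    -- Illustrative sketch of the books_by_userID_and_bookID table.
    CREATE TABLE books_by_userID_and_bookID (
        user_id    TEXT,
        book_id    TEXT,
        start_date DATE,
        end_date   DATE,
        rating     INT,
        status     TEXT,
        PRIMARY KEY ((user_id, book_id))   -- composite partition key: one record per user + book
    );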

Finally, we will have a table for all the recent books read by a user, the books_by_userID table.

  • We want all the books that have been read by a user in one place. UserID will be the partition key, but not the BookID.
  • The most recently read books need to show up at the top of the user’s dashboard. A TimeUUID recording when a book was read will be a clustering column, in descending order. We will also make status a clustering column so that books currently being read are always displayed before “already read” books. So we sort first on status (currently reading, then already read), and then on time, as sketched below.
Figure 11. The books_by_userID table.
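
A hedged sketch of this last table is below. The column names and status values are assumptions; with text statuses like 'currently reading' and 'already read', a descending clustering order on status happens to put the 'currently reading' rows first:

    -- Illustrative sketch of the books_by_userID table.
    CREATE TABLE books_by_userID (
        user_id   TEXT,
        status    TEXT,       -- clustering column: groups "currently reading" ahead of "already read"
        time_uuid TIMEUUID,   -- clustering column: when the book was read, newest first
        book_id   TEXT,
        book_name TEXT,
        cover_url TEXT,
        rating    INT,
        PRIMARY KEY ((user_id), status, time_uuid)
    ) WITH CLUSTERING ORDER BY (status DESC, time_uuid DESC);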

These are our four Cassandra tables:

  • books_by_id
  • books_by_authorID
  • books_by_userID_and_bookID
  • books_by_userID

Each of the queries we developed earlier maps to a particular table.

Figure 12. The Cassandra database schema for the Better Reads application.

These are the tables that we will need to create in Cassandra. Now we are ready to start building our application. Once we have these tables in Cassandra, our Spring Boot application can store and fetch data to and from them. But, that will be in a later blog post. So stay tuned.

Conclusion

This has been a very different approach compared to how you would build relational database schemas, since we did not translate our ER diagram directly to tables. There are no joins, and we keep data readily available in exactly the format we need for our queries, such that fetching data is very efficient.

Now that we have designed the schema, the next post in this series will focus on setting up a Cassandra instance using DataStax Astra DB. Check out the Github repository with the full code for this project.

Follow the DataStax Tech Blog for more developer stories. Check out our YouTube channel for tutorials and follow DataStax Developers on Twitter for the latest news about our developer community.

Resources

  1. Code With Me — Better Reads
  2. Astra DB: Multi-cloud DBaaS built on Apache Cassandra
  3. Create an Astra DB account
  4. Apache Cassandra
  5. Java Brains on YouTube
  6. OpenLibrary
  7. BetterReads project — Github
  8. DataStax Tech Blog
  9. DataStax YouTube channel
  10. DataStax Twitter for Developers
  11. DataStax Community
