NoSQL data Modeling

Sagar Arora
7 min readFeb 27, 2023

--

Data modeling is the process of identifying entities and their relationships. In relational databases, entities and their associations define the modeling. Please check out my article to know more about the relational database modeling.

In Cassandra & DynamoDB, data modeling is query-driven. The data access patterns and application queries determine the structure and organization of data which then used to design the database tables.

Unlike a relational database model in which queries make use of table joins to get data from multiple tables, joins are not supported in Cassandra so all required fields (columns) must be grouped together in a single table. Since each query is backed by a table, data is duplicated across multiple tables in a process known as denormalization. Data duplication and a high write throughput are used to achieve a high read performance.

Design Differences Between RDBMS and Cassandra

Let’s take a minute to highlight some of the key differences in doing data modeling for Cassandra versus a relational database.

No joins

You cannot perform joins in Cassandra. Cassandra doesn’t support it. The joins on big table can degrade the performance sometimes. Hence, the cassandra data modeling is done in a way that all the data can be easily lookup with just partition key & clustering column.

No referential integrity

Although Cassandra supports features such as lightweight transactions and batches, Cassandra itself has no concept of referential integrity across tables. In a relational database, you could specify foreign keys in a table to reference the primary key of a record in another table. But Cassandra does not enforce this. It is still a common design requirement to store IDs related to other entities in your tables, but operations such as cascading deletes are not available.

Query-first design

Relational modeling, in simple terms, means that you start from the conceptual domain and then represent the nouns in the domain in tables. You then assign primary keys and foreign keys to model relationships. When you have a many-to-many relationship, you create the join tables that represent just those keys. The join tables don’t exist in the real world, and are a necessary side effect of the way relational models work. After you have all your tables laid out, you can start writing queries that pull together disparate data using the relationships defined by the keys. The queries in the relational world are very much secondary. It is assumed that you can always get the data you want as long as you have your tables modeled properly. Even if you have to use several complex subqueries or join statements, this is usually true.

By contrast, in Cassandra you don’t start with the data model; you start with the query model. Instead of modeling the data first and then writing queries, with Cassandra you model the queries and let the data be organized around them. Think of the most common query paths your application will use, and then create the tables that you need to support them.

Designing NoSQL Database

To understand the data modeling in NoSql database - let’s work out an example. In this tutorial we will build a database for website like IMDB (database directory for movies, ratings, actors & reviews).

Process:

  1. List out all the entities and relations like you are building a relational database.
  2. Next we will define the Application queries required for our system.
  3. Represent the logical data model with Chebotko diagram.
  4. Convert the diagram into NoSql tables.

1. Define entities & relations

Movie
- movie_id
- name
- category
- plot_summary
- s3_image_path
- s3_trailer_path


Actor
- actor_id
- name
- birth_date
- s3_images_path
- active_since

MovieActors
- movie_id (PK)(FK)
- actor_id (PK)(FK)

Reviews
- review_id (PK) string
- movied_id (FK) string
- user_id (FK) string
- review_text string
- num_likes long
- rating double
- creation_timestamp
- update_timestamp

User
- user_id (PK)
- user_name
- first_name
- last_name
- last_login

Watchlist
- watchlist_id (PK)
- watchlist_name
- user_id (FK)
- creation_time
- update_time
- description
- tags

WatchlistMovies
- watchlist_id (PK)(FK)
- movied_id (PK)(FK)

Once we have listed out all the entities in the system as we define for Relational databases, we can redefine it our NoSql database. As mentioned above Cassandra & DynamoDb the datamodeling is done for specific user queries. So next thing to do is to define the Application queries required for our system.

2. Defining Application Queries

Let’s try the query-first approach to start designing the data model for our application. The user interface design for the application is often a great artifact to use to begin identifying queries. Let’s assume that you’ve talked with the project stakeholders and your UX designers have produced user interface designs or wireframes for the key use cases. You’ll likely have a list of shopping queries like the following:

  1. Q1: List movies based on a category such as drama, comedy etc.
  2. Q2: Find information about a given movie, such as its name, actors and ratings.
  3. Q3 Look up an actor for a movie.
  4. Q4 Look at the review comments for a movie.
  5. Q5 Add the movie in the watchlist.

3. Chebotko notation for a logical data model

Similar to ER diagrams, database like Cassandra has it’s own way of representing the entities & relations. Shown below is the Chebotko logical data model:

Each table is shown with its title and a list of columns. Primary key columns are identified via symbols such as K for partition key columns and C↑ (ascending) or C↓ (descending) to represent clustering columns. Lines are shown entering tables or between tables to indicate the queries that each table is designed to support.

4. IMDB Logical Data Model

The figure below shows a Chebotko logical data model for the queries involving movies, review, cast & crew, watchlist etc. One thing you’ll notice immediately is that the Cassandra design doesn’t include dedicated tables for Cast & Crew or reviews, as you had in the relational design. This is because the workflow didn’t identify any queries requiring this direct access.

Let’s explore the details of each of these tables.

The first query Q1 is to find movies based on category, so we’ll call this table movies_by_category. Searching by a category is a clue that the category should be a part of the primary key. Let’s reference the category by name, because according to the workflow that is how users will start their search.

You’ll note that you certainly could have more than one movie for a given category, so you’ll need another component in the primary key in order to make sure you have a unique partition for each hotel. So we need to add the movie key as a clustering column.

Now for the second query (Q2), we’ll need a table to get information about a specific movie. So, we will define movies table. Another approach would have been to put all of the attributes of a movie in the movies_by_category table, but we’ve added only those attributes in movies_by_category that were required by the application workflow.

From the workflow diagram, we know that the movies_by_category table is used to display a list of movies with basic information on each movie, and the application knows the unique identifiers of the movie returned. When the user selects a movie to view details, we can then use Q2, which is used to obtain details about the movie. Because we already have the movie_id from Q1, we use that as a reference to the movie we’re looking for. Therefore the second table is just called movies.

Q3 is to look up the cast and crew in the movie so we can add the table crew_by_movie table. Based on attribute crew_category we can display the director or actor associated in the movie. The top 4–5 actors can be shown based on the attribute screen-time in the movie.

Q4 Next thing is we want to load all the reviews by movie, so we have created a table reviews_by_movie. To add a new review, we can ask the fields like rating, review_text etc from the user. Here, we have also added the num_likes as clustering key to get the reviews sorted by the number of likes.

Q5 is to add this movie in a watch_list. So that a user can check the watch list later to decide which movies to see.

Physical Data Modeling

Once we have a logical data model defined, creating the physical model is a relatively simple process.

Before getting started, let’s look at a few additions to the Chebotko notation for physical data models. To draw physical models, we need to be able to add the typing information for each column. This figure shows the addition of a type for each column in a sample table.

Now let’s get to work on the physical model. First, we need keyspaces to contain the tables. To keep the design relatively simple, create a movies keyspace to contain tables for movies and review data. In a real system, we might divide the tables across even more keyspaces in order to separate concerns. Below is one way we can define the physical data modeling for movies:

--

--