SQL to NoSQL: Guide to Data Modeling for Retail from DataStax

Author: Jeff Davies

DataStax
Building Real-World, Real-Time AI
8 min readNov 1, 2021

--

Welcome to the first installment of this SQL to NoSQL series from DataStax. The purpose of this series is to demonstrate how to solve common database challenges with NoSQL. We’ll start with commonly accepted solutions in a SQL relational database, then show how to accomplish the same thing in a NoSQL environment — except faster and cheaper.

Let’s begin with the most common implementation in databases: the ability to create and fulfill customer orders for products. Below is an Entity-Relationship (E-R) diagram for a standard design solution:

Figure 1: A standard E-R diagram for orders.

This is a fairly normalized view of these entities. If you’re not accustomed to reading an ER diagram, the asterisk character on the relationship lines indicates that there may be any number of associated records. For example, each Customer record may have 0 or more Order records associated with it. Each Order record is related to a single customer, etc. There’s one data point that we have denormalized, and that’s the product’s unit price in the LineItem table. We do this to record the price of the product at the time of sale, which allows the product price to change over time while accurately recording the price at the time of the order.

Caveat emptor: In this post I will show sample solutions to achieve our goals using NoSQL technology. This doesn’t mean that my example designs are the way you need to design your tables! I’m simply demonstrating how the design thinking is different between the relational and NoSQL worlds.

So, how would we model this in a NoSQL database where we don’t use JOINS to relate tables? In NoSQL, table design is “query-driven”. We must identify the queries that we will use before we design the tables. Table 1 shows the queries for our simple purposes:

Table 1 : Queries we need to support.

In the NoSQL world, we design our tables to meet the needs of our queries. This is how we eliminate expensive joins and gain vast improvements in speed. Now let’s take a look at the first query and how we would design our table to meet its needs. Note that the following examples are using DataStax Astra DB, a database-as-a-service (DBaaS) based on Apache Cassandra® that allows you to create databases in minutes with reduced operational complexity.

Query 1 — Get recent customer orders

The title of this query immediately tells us what we need to set as our primary keys: customer_id (which represents a customer) and the order date. Here’s the table definition:

As you can see in the CQL1 snippet above, we’ll create a table dedicated to tracking customer orders. In the relational world, this is represented by two tables: customer and order. In the NoSQL world, we create this relationship in a single table, thus eliminating JOINs in our queries.

Take a look at CQL 1/line 6. Our customer_order table actually joins four of the tables from the relational world: customer, order, line_item and product. It does this by incorporating the line_item and product information into the line_items list. You should read that line as line_items is a frozen list of line_item types. We’ll skip over what the keyword frozen means. The more important question is, “What is a line_item?”

The line_item is a User Defined Type. if you refer to the create_tables.cql file on the GitHub repository for this post, you’ll see the following entry directly above the create table customer_order code:

Essentially, this custom data type represents the data that would normally be JOINed in a relational database. Since the line_items field contains all of the information about the products for the order, and it exists in the customer_order table, the customer_order table effectively joins all four of the tables we modeled in our relational database ER diagram.

Let’s take a look at a sample INSERT statement for the customer_order table and this should become clearer.

CQL 3 shows how you can insert multiple line items into a customer_order table. As you can see, it’s an array of two line_item types. So a single record in the Astra DB database actually encompasses the relationships between the relational DB tables of customer, order, line_item, and product! This is why there are no JOINs in NoSQL.

Let’s compare the relational and NoSQL queries to get the recent orders for a customer.

The relational way

If we wanted to find the five most recent orders for a customer in a relational database like MySQL, Oracle, etc.; our query might take the following form:

SELECT customer.id as cust_id, customer.name as “customer”, order.id as order_id, order.status as status, order.created as create_date, order.total as total, lineitem.quantity as qty, lineitem.unit_price as unit_price, product.name as product FROM customer, order, lineitem, product WHERE customer.id = 1 and customer.id = order.customer_id and order.id = lineitem.order_id and product.id = lineitem.product_id order by order.created desc limit 5;

I’m not going to get into the intricacies of INNER JOINS vs. LEFT, RIGHT or CROSS JOINS. The above SQL may not work “as is” in your relational database. However, it does clearly represent the complexity of the query. There are three joins to get the data that we want:

  • Customer to order
  • Order to line item
  • Line item to product

Additionally, the complexity of the query requires that the query writer:

  • Knows SQL syntax at an expert level
  • Understands the data model intimately

The NoSQL way

Let’s compare and contrast that statement with the same query written for NoSQL:

select * from customer_order where customer_id = 1 limit 5;

Not only is the NoSQL query easier to write and understand at a glance, but it also executes in a fraction of the time that its relational counterpart will. Imagine you had a relational database of 100 million customers, each of which has an average of 10 orders for six products. How long do you think that query will take to execute in a relational database? How many of those queries can you run in parallel before your relational database is overwhelmed?

Remember, this is a very simple example. Real-world databases are much more complex, and asking them to run queries of three, five, or more than seven joins is not unheard of. How do they scale? Vertically. Buy bigger and faster hardware. Spend exponentially more as you scale up your relational database.

What about denormalization?

Good question. Denormalizing tables is done quite commonly because a fully normalized database simply cannot perform at any reasonable scale by today’s standards.

Ask yourself, “What does denormalization look like?” The answer is that denormalization looks a lot like our table definition in the NoSQL world. You “flatten” out the relational hierarchy to eliminate costly joins. So the question is, “Why not do that from the beginning?” That’s the essence of NoSQL. Design your tables for the queries you will run.

But that leads to data duplication!

Yes, it does. Data duplication is not the boogeyman that we were taught in the relational world. Relational databases were created in the late 70’s when storage and memory were wildly expensive by today’s standards. Those days are gone. Memory and storage are cheap. Now we’re focused on the customer experience, and that requires fast reads and writes.

Query 2 — Get order details by the Order ID

Query 1 covers the first use case. A user logs in and wants to see his recent orders, sorted from the most recent to the oldest. That leads us to the second query. In this use case, the user has clicked on an order to see its details. To fulfill this query we create the order_by_id table, very similar to the customer_order table, but with a different primary key (the order id).

Since we know the ID of the order that the user has clicked on, our query is simplicity itself:

Even if you have billions of records, Cassandra can return the one you’re looking for in a matter of milliseconds!

Note: This is one way to handle this query. We could have also created a Storage Attached Index on the order_id field on the customer_order table. As is always the case with design, there are multiple ways to accomplish the same result.

Query 3 — Get the amount of a specific product that was sold last month

For this query, we’re looking for some analytical information about the sales of a specific product. Except Cassandra is not an analytics engine. To get full-on analytics you need to use an analytics engine with Cassandra, like Apache Spark, the ELK Stack, or some other analytics tool.

However, for our purposes, we want to be able to see the sales of a specific product over the last month (or some other arbitrary timeframe). To accomplish this, we create the product_sales table to record the number of each product type that was sold, and its date of sale.

Our table definition is pretty simple. It records the information we need for the query. The primary key is based on the product_id and the order_date. Let’s take a look at our query to get sales or product_id = 1 for 2021–09.

The query will return the following:

Conclusion

I hope you found this article useful for migrating your SQL databases to NoSQL, or using NoSQL for new projects. NoSQL provides a lot of power and speed for a negligible cost. There is some learning curve when moving to NoSQL, like with any new technology, but many of your existing database skills will be of use in the NoSQL world. The upside is: learning NoSQL will enable you to create much more robust databases that operate at blazingly fast speeds.

You can try NoSQL by registering for a free Astra DB account and creating your databases there. You can always start by copying and pasting the sample code from this post into the CQL Console of Astra DB. There’s no software to install and no credit card needed. You’ll automatically be in the free tier, which enables you to perform 20 million reads/writes and get up to 80 GB of storage on your databases each month. This is more than enough bandwidth to do your testing and proof of concept work — even to run a small business!

To keep learning, explore free tutorials on our DataStax Developers YouTube channel, subscribe to our event alert to get notified about our latest developer workshops, and follow DataStax on Medium to keep up with the latest in all things data, open-source, and more.

Resources

  1. DataStax Astra DB
  2. Apache Cassandra
  3. GitHub Repo SQL 2 NoSQL: Customer order examples
  4. DataStax Academy
  5. DataStax Certifications
  6. DataStax Workshops

--

--

DataStax
Building Real-World, Real-Time AI

DataStax provides the real-time vector data tools that generative AI apps need, with seamless integration with developers' stacks of choice.