NoSQL — What did we learn — Part 1?

Rahul Shetty
Mar 6, 2019 · 6 min read

This is how we discovered NoSQL. We started working on GCP in 2015. Cloud DataStore was probably the standard database of GCP during that time. To our delight, it served the four objectives we had identified for our system:

  1. Horizontally scalable
  2. Pricing
  3. Automatically scalable
  4. Speed

We then realized that it was a NoSQL database. And thus began our story with NoSQL. We now use it in almost all our products. Here are a few things we learned while using NoSQL (and Cloud DataStore) over these years.

Types:

NoSQL comes in different flavors: document-based (Cloud DataStore, MongoDB), key-value (DynamoDB), column-store (Cassandra), and graph-based (Neo4j).

Pick a type based on your data and the queries you run against it. For example, a graph-based database suits a social networking site that maintains connections. A key-value database is handy when each value is self-contained and fast retrieval by key is the primary purpose.

Azure has a good article on NoSQL databases — Non-relational data and NoSQL

Query centric design:

For starters, we make a conscious effort not to visualize the schema as a table. The intuition with any database is to view it as rows and columns. There is nothing wrong with that (even for NoSQL). But at the start, it helps to break the mold of what an entity, kind, tuple, or table means in NoSQL.

Secondly, relational databases take data as the starting and central point. It is all about the structure of the data (including relations) and its optimization (reducing repetition). So the question we ask while designing a SQL database is: “What answers do we need to provide?”

On the other hand, NoSQL is schema-less. The database does not enforce a fixed structure on the data. This means a ‘row’ (I know, I know, I am breaking the rule) can have a different number of fields or different types of fields. (Imagine each row as a self-contained document represented by JSON. Now you can digest the schema-less visualization better.)

Coming back to design. When you design a NoSQL database, the principle is to ask: “What questions (queries) might be asked of this data?”
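To make the schema-less idea concrete, here is a minimal sketch in plain Python. The two "Employee" documents, their field names, and the `field_names` helper are all made up for illustration; no real database API is involved.

```python
import json

# Two hypothetical "Employee" entities of the same kind; names and values are made up.
employees = [
    {"id": "e1", "name": "Asha", "email": "asha@example.com"},
    {"id": "e2", "name": "Ravi", "email": "ravi@example.com",
     "phone": "555-0100", "skills": ["python", "sql"]},
]


def field_names(doc):
    """Return the sorted field names of one document."""
    return sorted(doc)


for doc in employees:
    # Each document serializes to JSON on its own, whatever fields it carries.
    print(json.dumps(doc), "->", field_names(doc))
```

Both documents live in the same collection even though the second one carries two extra fields, one of them a list.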

Embed:

The best-case scenario is that the query directly targets a single document (in the case of a document-based NoSQL database). This means we avoid JOIN queries. For that matter, JOINs are generally unavailable or very limited in NoSQL databases (remember, non-relational). And since we want documents to be self-contained, we might have to repeat data.

If you think Normalisation is a virtue, this is nothing short of an anathema.

E.g.: SQL: Employee, Contact details, and Address could be three different tables.

NoSQL: Employee details, Contact details, and Address will be one document. Being schema-less, Contact details could be just an email, or an email and a phone number (the number of fields can vary).

You should try to embed all data in one document under the following circumstances (from the Microsoft article on modeling data):
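As a rough sketch of what such an embedded Employee document could look like, the snippet below uses a plain Python dict as a stand-in for a document store. The ids, field names, and the `get_employee` helper are assumptions made for illustration only.

```python
# In-memory stand-in for a document store, keyed by a hypothetical employee id.
employees = {
    "emp_1": {
        "name": "Asha",
        "contact": {"email": "asha@example.com"},  # email only
        "addresses": [{"type": "home", "city": "Bengaluru"}],
    },
    "emp_2": {
        "name": "Ravi",
        "contact": {"email": "ravi@example.com", "phone": "555-0100"},
        "addresses": [
            {"type": "home", "city": "Mumbai"},
            {"type": "office", "city": "Pune"},
        ],
    },
}


def get_employee(employee_id):
    """One key lookup returns the employee with contact details and addresses."""
    return employees[employee_id]


print(get_employee("emp_2")["contact"])
```

A single read by key returns everything about the employee, with no second lookup for contact details or addresses.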

  • There are contained relationships between entities.
  • There are one-to-few relationships between entities. (The number of contacts or addresses of an employee is usually small.)
  • There is embedded data that changes infrequently. (E.g., Address and Contact details do not change frequently.)
  • There is embedded data that won’t grow without bound. (The number of addresses and contact details does not grow beyond small limits.)
  • There is embedded data that is integral to data in a document.

This quest for self-containment contributes greatly to the read performance.

But while doing so, there are two things that you will have to do:

  1. Denormalisation
  2. Repetition of data

We will encounter them shortly.

Designing for m:n, 1:n queries:

Data does tend to have relations in the real world. It is not always self-contained. Let’s see how NoSQL handles queries that in a relational database would require JOIN.

1:n : Reference

One to many. E.g: Posts -> Comments.

A post can receive many comments. They are unbounded. And the comment list could be frequently updated. Embedding them inside the post for the sake of self-containment will not serve the purpose, because every new comment means rewriting an ever-growing post document. The cost of writing and updating is high.

So you do not embed. Instead, you reference the post inside each comment. It will look like the following:

Posts = { post_id1,….}
Comments = { comment_id1, text, time, post_id, ……}

This means you can now run one query to get all the comments with a particular post_id.

As a rule of thumb, store the reference in the data that grows. But again, you can still embed if your application has a low write frequency.
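Here is a minimal in-memory sketch of the referencing approach, assuming each comment carries a `post_id` field. The sample data and the helpers `comments_for_post` and `add_comment` are hypothetical; a real store would answer the filter from an index on `post_id` rather than scanning a list.

```python
posts = {
    "post_id1": {"title": "First post"},
    "post_id2": {"title": "Second post"},
}

# Each comment is its own document and references its post.
comments = [
    {"comment_id": "comment_id1", "post_id": "post_id1", "text": "Nice", "time": 1},
    {"comment_id": "comment_id2", "post_id": "post_id2", "text": "Hmm", "time": 2},
    {"comment_id": "comment_id3", "post_id": "post_id1", "text": "Agreed", "time": 3},
]


def comments_for_post(comments, post_id):
    """A single filter on the reference field returns all comments of a post."""
    return [c for c in comments if c["post_id"] == post_id]


def add_comment(comments, comment_id, post_id, text, time):
    """Adding a comment writes one new document; the post itself is not touched."""
    comments.append(
        {"comment_id": comment_id, "post_id": post_id, "text": text, "time": time}
    )


print([c["comment_id"] for c in comments_for_post(comments, "post_id1")])
```

The write path only ever appends a small comment document, which is why the reference sits on the side that grows.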

Post = { post_id1, ....
  [
    { comment_id1, text, time... },
    { comment_id2, text, time... },
    { comment_id3, text, time... }
  ]
}
Comments = {
  post_id1, {
    [
      { comment_id4, text, time... },
      { comment_id5, text, time... },
      { comment_id6, text, time... }
      .
      .
      .
      { comment_id48, text, time... },
      { comment_id49, text, time... },
      { comment_id50, text, time... }
    ]
  },
  post_id1, {
    [
      { comment_id51, text, time... },
      { comment_id52, text, time... },
      { comment_id53, text, time... }
      .
      .
      .
      { comment_id98, text, time... },
      { comment_id99, text, time... },
      { comment_id100, text, time... }
    ]
  }
}

In the above solution, we break the embedded comments into buckets so that we can manage the size of each array. Embedding works too. But it might not be an elegant solution for your problem.
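The bucketing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bucket size of 50, the document shape, and the helpers `add_comment` and `comments_for_post` are all hypothetical, and the buckets live in a plain list rather than a real database.

```python
BUCKET_SIZE = 50  # assumed cap on comments per bucket document


def add_comment(buckets, post_id, comment, bucket_size=BUCKET_SIZE):
    """Append to the newest bucket of the post, or open a new bucket if it is full."""
    for bucket in reversed(buckets):
        if bucket["post_id"] == post_id:
            if len(bucket["comments"]) < bucket_size:
                bucket["comments"].append(comment)
                return bucket
            break  # the newest bucket for this post is full
    new_bucket = {"post_id": post_id, "comments": [comment]}
    buckets.append(new_bucket)
    return new_bucket


def comments_for_post(buckets, post_id):
    """Read all buckets of a post and flatten them in insertion order."""
    return [c for b in buckets if b["post_id"] == post_id for c in b["comments"]]


buckets = []
for i in range(1, 121):
    add_comment(buckets, "post_id1", {"comment_id": f"comment_id{i}", "text": "..."})

print([len(b["comments"]) for b in buckets])
```

Each write touches one bounded document, and reading a post's comments costs one read per bucket instead of one per comment.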

m:n

Things become far more fluid here. This is where you wish you had used a relational database. You start thinking of duplicating the data into a relational database and so on.

And you might be right.

So far, we have learned about embedding and referencing. In the m:n landscape, you might do a bit of both. Let us consider the example of Video -> Viewing users.

A video can be viewed by multiple users, and a user can view multiple videos. Suppose we need to support two queries: one that gets all the users who have viewed a given video, and one that gets all the videos a given user has watched. Given this, how do we design?

Option 1: Referencing

Video = { video_id, title, desc,..., [user_id1, user_id2, user_id3] }
User = { user_id, name, age,.....,[video_id1, video_id2]}

In this case, both arrays are unbounded and could be updated frequently in a popular system. So is referencing still a good idea?

The answer depends on your use case: the number of users and videos in your system and how heavily the users access the videos.
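To see what Option 1 costs on the write path, here is a small in-memory sketch. The field names `viewer_ids` and `video_ids` and the `record_view` helper are assumptions for illustration; the point is only that one view updates two documents.

```python
videos = {"video_id1": {"title": "Intro", "viewer_ids": []}}
users = {
    "user_id1": {"name": "Asha", "video_ids": []},
    "user_id2": {"name": "Ravi", "video_ids": []},
}


def record_view(videos, users, video_id, user_id):
    """One view means two document updates, and both arrays keep growing."""
    video = videos[video_id]
    user = users[user_id]
    if user_id not in video["viewer_ids"]:
        video["viewer_ids"].append(user_id)
    if video_id not in user["video_ids"]:
        user["video_ids"].append(video_id)


record_view(videos, users, "video_id1", "user_id1")
record_view(videos, users, "video_id1", "user_id2")
record_view(videos, users, "video_id1", "user_id1")  # repeat view, no duplicate

print(videos["video_id1"]["viewer_ids"])
print(users["user_id1"]["video_ids"])
```

Both of the required queries become a single document read, at the price of two writes per view and arrays that grow with popularity.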

Option 2: Another document

Video = { video_id, title, desc}
User = { user_id, name, age}
Views = { video_id, user_id, time, location}

The above solution does look like a relational one. Right? Now you know that the world is not perfect.

The advantage here is that the Views document can store extra information, such as the time the video was viewed, the location, and other things.

But the disadvantage shows up if you have queries like these:

Get all users under 20 who have watched this video.
Get all videos watched by users who are paid subscribers.

The above will lead to multiple reads for a single query, plus the processing time to combine the results.
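The multiple reads can be made visible with a small in-memory sketch of the first query. The sample data, the `age` field, and the `users_under_20_who_watched` helper are hypothetical; the function counts one read for the Views query and one more for every viewer it has to look up.

```python
users = {
    "user_id1": {"name": "Asha", "age": 17},
    "user_id2": {"name": "Ravi", "age": 25},
    "user_id3": {"name": "Meera", "age": 19},
}

views = [
    {"video_id": "video_id1", "user_id": "user_id1", "time": 1, "location": "IN"},
    {"video_id": "video_id1", "user_id": "user_id2", "time": 2, "location": "IN"},
    {"video_id": "video_id1", "user_id": "user_id3", "time": 3, "location": "US"},
    {"video_id": "video_id2", "user_id": "user_id2", "time": 4, "location": "IN"},
]


def users_under_20_who_watched(views, users, video_id):
    """Join Views and User in application code and count the reads involved."""
    viewer_ids = {v["user_id"] for v in views if v["video_id"] == video_id}
    reads = 1  # the query against Views
    matches = []
    for user_id in sorted(viewer_ids):
        user = users[user_id]  # one extra read per viewer
        reads += 1
        if user["age"] < 20:
            matches.append(user_id)
    return matches, reads


print(users_under_20_who_watched(views, users, "video_id1"))
```

The read count grows with the number of viewers, even when only a few of them end up matching the age filter.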

Depending on your visibility into the requirements and future changes, you can do something like this:

Views = { video_id, user_id, time, location, <some immutable user information duplicated here>, <some immutable video information duplicated here> }

Immutable because, if you embed mutable information, you run into the problem of a huge increase in writes. For instance, each time the user information changes or video information changes, you will have to update all the views. This does not make sense.

But the duplicated fields make some of the above queries possible (to an extent).
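As a sketch of how duplicated immutable fields help, assume each view carries the viewer's birth year and the video's upload year. Both fields, the sample data, and the `viewers_born_after` helper are hypothetical choices for illustration. Birth year is used instead of age because age changes every year, while birth year never does.

```python
views = [
    {"video_id": "video_id1", "user_id": "user_id1", "time": 1, "location": "IN",
     "user_birth_year": 2004, "video_upload_year": 2018},
    {"video_id": "video_id1", "user_id": "user_id2", "time": 2, "location": "IN",
     "user_birth_year": 1990, "video_upload_year": 2018},
    {"video_id": "video_id2", "user_id": "user_id3", "time": 3, "location": "US",
     "user_birth_year": 2005, "video_upload_year": 2019},
]


def viewers_born_after(views, video_id, year):
    """Answer an age-style query from Views alone, with no lookups into User."""
    return sorted(
        {
            v["user_id"]
            for v in views
            if v["video_id"] == video_id and v["user_birth_year"] > year
        }
    )


print(viewers_born_after(views, "video_id1", 1999))
```

The query is now a single filter on one kind of document, and since the copied fields never change, no view ever needs to be rewritten.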

In the next part, we will cover the non-query aspects of NoSQL.

What about ACIDity?

If NoSQL is your primary database, when do you require a SQL database?

How do implementations of NoSQL databases (like Cloud DataStore, Firebase, and MongoDB) deliver on the promise of performance?

Do keep reading. And post your thoughts on what you have already read.

Thank you.

GrowthBeats

Product, Tech & Design musings of GrowthBeats
