Redesigning Persistence Entities for NoSQL

Normalization in RDBMS

In RDBMS data modeling, we generally use the 3rd normal form. Even though the borders between the forms are not strict, the normalization forms can be summarized as below. (There are gray areas between forms.)

1st normal form means dividing data into non-divisible (atomic) models and setting up relations between these models. For instance, if all the photos of an album are stored in columns like photo1, photo2, photo3… in the album table, this violates the first normal form. Instead, we need to create a photo table, store the photos row by row, and refer to the album from the photo rows.
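
As a minimal sketch (the column names are hypothetical), here is the same album shown as JSON-like rows: first the single row that violates 1st normal form, then the normalized photo rows that refer back to it.

{ "albumId": 1, "name": "Holiday", "photo1": "a.jpg", "photo2": "b.jpg", "photo3": "c.jpg" }

{ "photoId": 101, "albumId": 1, "file": "a.jpg" }
{ "photoId": 102, "albumId": 1, "file": "b.jpg" }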

2nd normal form means that all the attributes in a model must depend on the candidate key (whether it is a simple or a composite key). An attribute must depend on all the columns of a composite key. Ask this question for every non-key column in the table: “Can this column exist without one of the other parts of the concatenated primary key?” If the answer is yes, separate this data from the model.
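
As a hedged sketch with a hypothetical order_items table keyed by (orderId, productId): the first row below violates 2nd normal form because productName depends only on productId, not on the whole composite key, so it is moved into its own product row.

{ "orderId": 500, "productId": 42, "quantity": 3, "productName": "M4 screw" }

{ "productId": 42, "productName": "M4 screw" }
{ "orderId": 500, "productId": 42, "quantity": 3 }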

3rd normal form adds one more rule: non-key attributes must not depend on other non-key attributes (no transitive dependencies). In a fully normalized schema, referential integrity should also be satisfied: data cannot be deleted while another entity depends on it, and the functional dependencies between tables should be preserved.
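
For example (hypothetical columns), an employee row holding both departmentId and departmentName has a transitive dependency: departmentName depends on departmentId, not on the employee key. 3rd normal form moves it into a department row.

{ "employeeId": 7, "name": "Jane", "departmentId": 3, "departmentName": "Logistics" }

{ "departmentId": 3, "departmentName": "Logistics" }
{ "employeeId": 7, "name": "Jane", "departmentId": 3 }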

Breaking normalization for the better

Data normalization is the accepted paradigm for data modeling in RDBMS, and its value is not in question. But with NoSQL we may break it for query performance and horizontal scalability.

That means NoSQL does not replace SQL or the data normalization forms; it only breaks some rules for better performance in exchange for some consistency and integrity guarantees (actually, all the properties of ACID).

The normalization rules form a strict structure and do not depend on the usage scenarios. Rules are rules!
Using NoSQL data models lets us remodel the entities for better query performance depending on the usage scenarios.

Designing a NoSQL data model

1- Break the first normalization rule for the performance of specific queries

We break the first normalization rule, which says “divide the model into small pieces that cannot be divided any more” and prevents data duplication across rows.

Data duplication is prevented, but accessing the whole object by joining multiple tables might be expensive: you need to join the small pieces back together. Let's say we are storing all the pieces of a bicycle. We can easily find the wheels in the wheel bucket and the seat in the seat bucket, since the numbers of seats and wheels are limited.

But what happens when you search for hundreds of specific screws from the screw bucket containing thousands of screws? That will be more time consuming.

Since join performance depends on the table with the largest data set (even when indexes are created perfectly), and here that is the “screws” table, we may exclude it from the join by embedding the screws into the entity they are mostly used together with.
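
A minimal sketch of such an embedded model (all field names are hypothetical), with the screws bundled directly into the wheel document instead of living in a separate screw bucket:

{
  "type": "wheel",
  "id": "wheel::front",
  "diameterInch": 28,
  "screws": [
    { "size": "M5", "lengthMm": 12, "count": 24 },
    { "size": "M6", "lengthMm": 16, "count": 4 }
  ]
}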

The NoSQL data model says, “try to bundle the immutable items that are hard to find into the parent model”. In this example, we may bundle the screws into the wheel or into the body object, depending on the scenario. But what if the screws are not immutable?

This introduces the “query patterns” problem: we need to estimate which select/update queries will be used the most.

When child documents are mutable and embedded into parent documents, updating a single child document will require searching for and updating several parent documents.
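
For instance (hypothetical documents), if the same mutable screw specification is embedded into both the wheel and the body, changing its length means finding and rewriting both parent documents:

{ "type": "wheel", "id": "wheel::front", "screws": [ { "screwId": "screw::m5", "size": "M5", "lengthMm": 12 } ] }

{ "type": "body", "id": "body::1", "screws": [ { "screwId": "screw::m5", "size": "M5", "lengthMm": 12 } ] }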

2- Break the second normalization rule for the performance of specific queries

Keys don’t have to depend on the values of the entity. We use “logical keys” that do not depend on the attributes of the entity.

We also break the rule that “all attributes must depend on a specific object identified by the primary key”. We no longer separate data out, even when the answer to the question “Can this attribute exist without one of the other parts of the concatenated primary key?” is yes.

For quick access, we bundle otherwise unrelated data together.
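
A hedged sketch of what that can look like (the document and field names are made up): an order document with a logical key that says nothing about its values, bundling attributes such as the product and customer names that do not depend on the order itself:

{
  "type": "order",
  "id": "order::1001",
  "quantity": 3,
  "productName": "M4 screw",
  "customerName": "Jane Doe"
}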

3- Break the third rule to eliminate foreign key checks or any other checks

Referential integrity might be broken between entities. At the very least, the NoSQL vendor does not force you to preserve the integrity of the data. You may delete an item that other items depend on.

This is a part of the “eventual consistency” problem.
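
As a small illustration (hypothetical documents): nothing stops us from deleting the photo below, and the album is left holding a dangling reference.

{ "type": "photo", "id": "photo::101", "file": "a.jpg" }

{ "type": "album", "id": "album::1", "photoRefs": ["photo::101", "photo::102"] }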

NoSQL and Denormalization

Denormalization is our gun and powder, but using it correctly is very important. While shifting to NoSQL, we tried several levels of denormalization, weighing each against data consistency. Every denormalization is a crucial decision while migrating to NoSQL. You may use normalized or denormalized forms according to your needs.

Design decisions are ours and tightly bound to the usage scenarios in the application and expected data in the production environment.

If our estimations are not right, we may harm performance badly. For instance, if data consistency becomes crucial later on even though it is not now, remodeling will be very hard.

While breaking the rules, we try to increase performance by considering these patterns:

1- Query patterns

  • Which entities will be used bundled (embedded) with other entities?
  • What will be the read-write frequencies of these specific types of entities?

2- Production data patterns:

  • What will be the cardinality of the relations: one to one, one to a few, one to many, one to billions?
  • Will aggregation (like counting or summing values) and grouping be done over and over again?
  • Is eventual consistency harmful for us?

How to set up entity relations?

1- Embedding the referred document

2- Just putting the key of the referred document into the referring document and storing the referred document independently. If the cardinality is something like 1 to 10,000 (one to zillions), the reference should be stored the other way around: each child document keeps the key of its parent (see the sketch after this list).

3- Using an intermediate document for the relation if the cardinality is many to many (also shown in the sketch below).
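
A minimal sketch of options 2 and 3 with hypothetical keys. For a one-to-zillions relation, each comment document keeps the key of its post instead of the post holding a huge array; for a many-to-many relation, a small intermediate document holds one pair of keys:

{ "type": "comment", "id": "comment::9001", "postKey": "post::1", "text": "Nice bicycle" }

{ "type": "studentCourse", "id": "student::7::course::42", "studentKey": "student::7", "courseKey": "course::42" }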

In this presentation, the decision flow according to production data patterns and query patterns is summarized perfectly. Digesting these rules is crucial.

I created a simple cheat sheet for deciding how to store entity relations according to cardinality and read-write frequencies. First we check the cardinality to choose between embedded documents and separate documents, and then, by checking the read and write frequencies of the parent/child documents, we decide where to store the references. Green rectangles are our decisions.

“Embed array of B references into A” means: get the keys of the B documents related to A and put them into an attribute inside the A document.

{
  ...other attributes of A...,
  "bReferences": ["b1Key", "b2Key", "b3Key", ...]
}

These are not the final decisions, though.

That is because putting the keys into an array introduces another problem: if a B document is deleted, its key will still exist in the array inside the A document. How do we keep integrity even when documents are deleted? This is also a “production data” problem.

In Couchbase, we keep integrity by maintaining the relations in a map-reduce view instead of putting them inside arrays. When a related entity is required, we first query the view and then get the documents.
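
As a hedged sketch (the view, document type, and field names here are assumptions, not taken from our production code), such a view can be defined in a Couchbase design document whose map function emits the parent key of every child document. Querying the view by a parent key returns only the children that still exist, so deleted documents simply drop out of the result:

{
  "views": {
    "b_by_a": {
      "map": "function (doc, meta) { if (doc.type === 'b' && doc.aKey) { emit(doc.aKey, null); } }"
    }
  }
}

Querying the b_by_a view with key “a::1” then returns the keys of the existing B documents, which we fetch afterwards.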

This is a quick decision tree, and I would be very happy if you contribute to developing it further.

