A journey from SQL to (a Document-oriented) NoSQL: Schema manipulation and denormalization

Alex Garofalo
Published in LARUS
Jul 19, 2022 · 6 min read

More and more often these days, we come across applications that need to change DBMS for the most varied reasons. In scenarios that involve migrating a database from a relational to a NoSQL model, it is always necessary to rework the migrated schema, either to take advantage of the peculiarities of the new DBMS or, more simply, to follow the modeling best practices of that particular quadrant of the NoSQL world. In this article, we will see how to do this when migrating a database from an RDBMS to a document database, in this case MongoDB.

Normalizing a database is a necessary practice when we are dealing with an RDBMS, as it allows us to organize data properly, avoid redundancy, and guarantee consistency and referential integrity. The data schema must be designed and implemented before any business logic that manipulates the data can be developed. Updates can be made later, but significant changes can be tricky.

In a NoSQL database, data can be added anywhere, anytime. There is no need to specify a document design or even an initial collection. A NoSQL database may therefore be more suitable for projects where the initial data requirements are difficult to pin down, but difficulty should not be confused with laziness: neglecting to design a good database at the beginning of the project will lead to problems later.

Let’s talk about the performance and use of a normalized schema migrated from an RDBMS, MySQL in this example, to a document DBMS, MongoDB.

In this case, the database used is a well-known one, the MySQL employees sample dataset, which you can find here.

The employees dataset’s EER diagram

The EER diagram shows the one-to-many and many-to-many relationships that exist in the dataset. There are two one-to-many relationships: between employees and titles, and between employees and salaries. There are also two many-to-many relationships between employees and departments: one representing the department an employee is assigned to, and a second representing who manages a department.

It is not recommended to use the exact same schema on MongoDB, since this kind of modeling is not designed for a NoSQL document DBMS. As a matter of fact, if we compare a simple SQL query that joins employees with salaries and titles against an equivalent aggregation on MongoDB, MySQL wins by a wide margin.

To give a concrete example on this dataset, using MySQL 8.0 and MongoDB 5.0 on the same hardware, the SQL query takes an average of 0.011 sec to run, while MongoDB needs 2 minutes and 19 seconds to perform the same operation on the normalized schema.

The SQL query
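The article shows the query as a screenshot. A plausible reconstruction, written against the standard tables and columns of the employees dataset, is the three-way join below:

```sql
-- Join every employee with all of their titles and all of their salary records
SELECT e.emp_no,
       e.first_name,
       e.last_name,
       t.title,
       t.from_date AS title_from,
       t.to_date   AS title_to,
       s.salary,
       s.from_date AS salary_from,
       s.to_date   AS salary_to
FROM employees e
JOIN titles   t ON t.emp_no = e.emp_no
JOIN salaries s ON s.emp_no = e.emp_no;
```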
The Aggregation function in MongoDB
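The aggregation is also shown as a screenshot in the original post; an equivalent pipeline on the normalized collections (assuming they keep the original table names and the emp_no key) is two $lookup stages on employees:

```javascript
// Equivalent of the SQL join on the normalized schema:
// pull the matching titles and salaries into each employee document
db.employees.aggregate([
  { $lookup: { from: "titles",   localField: "emp_no", foreignField: "emp_no", as: "titles" } },
  { $lookup: { from: "salaries", localField: "emp_no", foreignField: "emp_no", as: "salaries" } }
])
```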

This is to be expected: the structure of the dataset is perfectly designed for an RDBMS, but a NoSQL DBMS has different needs and must be modeled so that its peculiarities can be exploited.

To do so, data modeling needs to start from the needs that the database must satisfy. Denormalization is a process that starts from an analysis of how the database is used, in order to adapt the schema to the queries it will have to serve. If a given query is used often by the application, a correct denormalization step is to collapse one side of a one-to-many relationship into the other, producing a single document with all the information that query needs.

An example of this situation in our dataset is the relationship between employees and salaries and the one between employees and titles. If the data in those two collections is always used together with the employees collection, a good solution is to collapse the ‘many’ side of the relationship (e.g. a title belongs to one employee, while an employee can have many different titles). That is, instead of keeping those documents in separate collections, we merge the information into the employee by embedding documents that carry everything the original records provided.

A document from the employees collection
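The screenshot is not reproduced here; a document in that collection looks roughly like this (the values are those of the well-known first row of the dataset, shown only for illustration):

```javascript
{
  "_id": ObjectId("..."),
  "emp_no": 10001,
  "birth_date": ISODate("1953-09-02T00:00:00Z"),
  "first_name": "Georgi",
  "last_name": "Facello",
  "gender": "M",
  "hire_date": ISODate("1986-06-26T00:00:00Z")
}
```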
A document from the employees collection with the embedded documents from the titles and salaries collections
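After embedding, the same employee carries its titles and salaries as arrays of sub-documents (values abridged and purely illustrative):

```javascript
{
  "emp_no": 10001,
  "birth_date": ISODate("1953-09-02T00:00:00Z"),
  "first_name": "Georgi",
  "last_name": "Facello",
  "gender": "M",
  "hire_date": ISODate("1986-06-26T00:00:00Z"),
  "titles": [
    { "title": "Senior Engineer", "from_date": ISODate("1986-06-26T00:00:00Z"), "to_date": ISODate("9999-01-01T00:00:00Z") }
  ],
  "salaries": [
    { "salary": 60117, "from_date": ISODate("1986-06-26T00:00:00Z"), "to_date": ISODate("1987-06-26T00:00:00Z") }
    // ...one sub-document per salary record
  ]
}
```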

A handy way to perform this kind of denormalization is an aggregation that starts from the ‘one’ side of the relationship, that is employees, and embeds the records of the collections on the ‘many’ side using the $lookup operator, followed by a $project operator to avoid keeping unnecessary information such as the duplicated emp_no (used to relate the documents to the employees collection) or the object ID (_id, used to make every document in a collection unique).

The denormalizing aggregation
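The pipeline itself is shown as a screenshot in the article; following the description above, a sketch of it could be two $lookup stages followed by a $project that strips the duplicated emp_no and the _id of the embedded documents, plus an $out stage to materialize the result (the target collection name employees_denorm is an assumption, the article does not name one):

```javascript
db.employees.aggregate([
  // Embed the 'many' side of each one-to-many relationship into the employee
  { $lookup: { from: "titles",   localField: "emp_no", foreignField: "emp_no", as: "titles" } },
  { $lookup: { from: "salaries", localField: "emp_no", foreignField: "emp_no", as: "salaries" } },
  // Drop the duplicated emp_no and the _id from the embedded documents
  { $project: {
      "titles._id": 0,   "titles.emp_no": 0,
      "salaries._id": 0, "salaries.emp_no": 0
  } },
  // Write the result to a new, denormalized collection (hypothetical name)
  { $out: "employees_denorm" }
])
```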

The output is a document with embedded arrays of documents in the titles and salaries properties, containing the documents previously stored in the titles and salaries collections.

In this way, all the information about an employee lives in a single document, and we can query the new denormalized DB to obtain the same information as the normalized version. The execution time is tiny: an average of 0.004 sec in the same configuration as the other databases.
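For reference, the read that replaces the three-way join is now a plain find() on the denormalized collection (again assuming the hypothetical employees_denorm name used above):

```javascript
// Everything the join returned is already inside each document
db.employees_denorm.find(
  {},
  { _id: 0, emp_no: 1, first_name: 1, last_name: 1, titles: 1, salaries: 1 }
)
```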

This kind of denormalization is not the perfect method for every database; it is simply a proven approach that relies on a friendly tool MongoDB already provides for this need. It is recommended when there are many read queries that need information spread across the properties of different documents. For this denormalization to remain efficient in every scenario, the embedded documents should not need frequent updates.

Denormalization is not a process with the same steps in every situation, but an analysis driven by the dataset and the needs of the user: the queries the database must answer, and the requirements around scalability and around updating the information contained in the embedded documents. We can keep a normalized schema, but in many cases that kind of solution gives terrible performance. At the other extreme, excessive denormalization (in the worst case, putting all the information in a single document) can cause problems when we need to modify values, especially if they are duplicated across multiple documents, or it adds useless complexity: a document ends up with a great number of properties when the main queries only need a few values.

It is not simple, but the only way to achieve good denormalization is an in-depth analysis of the needs and of the queries that will be run against the database, using the data we have in the best way and with the best document structure possible. Fortunately, we have tools like the one we have just seen that help us do this job.
