Normalizing your data with normalizr

Miguel Oliveira
Feb 11, 2019 · 11 min read

How did it take me this long to learn it?

When developing a React application, you almost always need to traverse data, either an array or the keys of an object, in order to display it, whether in a table, a list, or some other component.

Since the data could potentially have hundreds, thousands or even hundreds of thousands of elements, it is important that we traverse it as quickly as possible in order to give our users the best experience we can. How fast that traversal is depends on how the data itself is structured.

Let’s see an example (based on the example from the official normalizr library, which we’ll get to further on).

Let’s say we have a blog website where people can post articles and/or comment on other peoples’ articles (sound familiar?).

So, we query the api and get a response looking something like this:
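(The sample below is made up just to illustrate the shape: two articles, three comments and two users, with placeholder ids and names.)

const data = [
  {
    id: 'a1',
    title: 'lorem ipsum dolor',
    author: {
      _id: '5c603f744ac8328b664324ea',
      name: 'Jane Doe'
    },
    comments: [
      {
        id: 'c1',
        content: 'Great article!',
        commenter: {
          _id: '5c603f74d37c6bf663b2b096',
          name: 'John Roe'
        }
      },
      {
        id: 'c2',
        content: 'Thanks! (yes, I am commenting on my own article)',
        commenter: {
          _id: '5c603f744ac8328b664324ea',
          name: 'Jane Doe'
        }
      }
    ]
  },
  {
    id: 'a2',
    title: 'sit amet consectetur',
    author: {
      _id: '5c603f74d37c6bf663b2b096',
      name: 'John Roe'
    },
    comments: [
      {
        id: 'c3',
        content: 'Nice write-up.',
        commenter: {
          _id: '5c603f744ac8328b664324ea',
          name: 'Jane Doe'
        }
      }
    ]
  }
];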

Basically, we get an array of articles, in which each article has an author, an id, a title and an array of comments. Each comment has a commenter (which is also a user), an id and the comment’s content.


Sidenote

This data was generated using https://www.json-generator.com/#, which is a great website to generate mock JSON.

You can use the following JSON generator schema to reproduce the same structure as above:

[
  '{{repeat(5, 7)}}',
  {
    id: '{{objectId()}}',
    title: '{{lorem(3, "words")}}',
    author: {
      _id: '{{objectId()}}',
      name: '{{firstName()}} {{surname()}}'
    },
    comments: [
      '{{repeat(5, 7)}}',
      {
        id: '{{objectId()}}',
        content: '{{lorem(1, "paragraphs")}}',
        commenter: {
          _id: '{{objectId()}}',
          name: '{{firstName()}} {{surname()}}'
        }
      }
    ]
  }
]

Then just tweak it a bit so multiple people have multiple articles/comments and you are good.


So anyways, now that we have our data, let us use it.

One common use case in our application would be to retrieve all the articles given a user id.

We would do something like this:

const id = '5c603f744ac8328b664324ea'
const articles = data.filter(article => article.author._id === id)

Which is good. We traverse all the articles once and find the ones that belong to our author.


Next, if we wanted to get all the comments from that user, we would do something like this:

const id = '5c603f744ac8328b664324ea'
let comments = [];
data.forEach(article => {
  article.comments.forEach(comment => {
    if (comment.commenter._id === id) {
      comments.push(comment);
    }
  })
});

Aside from not being very elegant, this is where it starts to become a problem. Not for small datasets, but imagine our website becomes huge and we now have hundreds of thousands of articles and even more comments.

The above has a time complexity of O(n*m), where n is the number of articles and m is the max number of comments:

For 1,000 articles, each with 50 to 100 comments: ~10 ms
For 100,000 articles, each with 50 to 100 comments: ~40 ms
For 500,000 articles, each with 50 to 100 comments: ~85 ms

This is not very noticeable because we are only doing one operation inside the inner loop, but imagine this was happening in a reducer, and instead of pushing the comment we were calling another reducer with the comment object to do something else. Doing this hundreds of thousands of times in a short timespan gets problematic real quick.


Last one.

How can we get the user with the most comments?

We could do something like this:

let commentsPerUser = {};
data.forEach(article => {
  article.comments.forEach(comment => {
    if (commentsPerUser[comment.commenter._id]) {
      commentsPerUser[comment.commenter._id]++;
    } else {
      commentsPerUser[comment.commenter._id] = 1;
    }
  })
});

let userWithMostComments = {};
Object.keys(commentsPerUser).forEach(userId => {
  if (commentsPerUser[userId] > (userWithMostComments.comments || -1)) {
    userWithMostComments = {
      user: userId,
      comments: commentsPerUser[userId]
    };
  }
})

It might not be the most elegant solution, but it works.


Hopefully with these examples you can see where I’m going with this. Even in this very simple example, not only is it not obvious how to get data out of this dataset, it is also not efficient.

Enter normalizr

https://github.com/paularmstrong/normalizr

normalizr aims to solve this problem by letting us restructure the data into a shape that actually suits us, instead of being limited by whatever the api returns or having to write all the transformation code ourselves.

One of the most obvious problems our dataset has (besides the ones mentioned above) is multiple sources of truth. We can see that here:
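In the made-up sample from earlier, for instance, the very same user is spelled out in full, once as the author of the first article and again as a commenter:

// as the author of the first article...
author: {
  _id: '5c603f744ac8328b664324ea',
  name: 'Jane Doe'
}

// ...and again, in full, as the commenter of a comment
commenter: {
  _id: '5c603f744ac8328b664324ea',
  name: 'Jane Doe'
}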

Both author and commenter have a user attached to them. But instead of being a reference pointing to a user, it is a full definition of that user. This can cause problems in multiple ways. For example, how would we handle a case where two copies of the same user end up with different names?

Let us say that the author comments on their own post. That user would be defined twice in the article, instead of being defined in one place only. Having a single definition would be less prone to errors and bugs, whether in the api or in our own code.

So let’s change it and then we can compare!


The normalizr way

Normalizr works by defining schemas for our entities and declaring how those entities relate to each other.

For example, if we think about which entities we have in the context of our dataset we get the following:

  • Users
  • Comments
  • Articles

So let’s create them.


Users

import { normalize, schema } from 'normalizr';

const user = new schema.Entity('users')

Here we are creating a new entity called “users”. We’ll see more in a bit.

Comments

If we look at our dataset, we see that a comment has the following properties: an id, its content and a commenter (which is a user).

We want to remove the multiple sources of truth, like we said above, and we know that the commenter is always a “user”. So, instead of declaring the full user in the “commenter” key, let’s just reference it.

const comment = new schema.Entity('comments', {
  commenter: user
})

Here we are saying that we want to keep every key as is, except for “commenter”, which we want connected to the user entity. This will replace the “commenter” value with an id referencing a user.


Articles

Now let us analyze what an article has.

So we see that an article has an author (which is a user), an array of comments (which belong to the comment entity), an id and a title.

So the entity becomes something like this:

const article = new schema.Entity('articles', {
  comments: [comment],
  author: user
})

Output

Now, if we run our data through normalizr with this schema:

const normalizedData = normalize(data, [article])

We need to provide the schema for the top-level data as the second argument. We know that our dataset is an array of articles, so we pass [article].

This is what we get.
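With the made-up sample from earlier, the output looks roughly like this (trimmed, and leaving the users slice aside for a moment):

const normalizedData = {
  entities: {
    articles: {
      a1: { id: 'a1', title: 'lorem ipsum dolor', author: undefined, comments: ['c1', 'c2'] },
      a2: { id: 'a2', title: 'sit amet consectetur', author: undefined, comments: ['c3'] }
    },
    comments: {
      c1: { id: 'c1', content: 'Great article!', commenter: undefined },
      c2: { id: 'c2', content: 'Thanks! (yes, I am commenting on my own article)', commenter: undefined },
      c3: { id: 'c3', content: 'Nice write-up.', commenter: undefined }
    }
  },
  result: ['a1', 'a2']
};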

This is great! Well, kind of.

As you can see, we now have an object instead of an array of objects as before.

We have an articles object, indexed by article id (great!), and inside each article the comments array now holds comment ids instead of the full comments (single source of truth)!

We now also have a comments object, indexed by each comment’s id (great!), and each commenter is an id referencing a user! (gre…wait..)

Well, it’s almost a victory. As you can see, each comment’s commenter should be an id referencing a user but is, in fact, undefined. God damn it.


This was actually intentional, as I wanted to demonstrate something.

If we look at an article, we have a comments array with ids to those comments. Like this:
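With the placeholder ids from the sample, the first article entity looks like this:

// entities.articles['a1']
{
  id: 'a1',
  title: 'lorem ipsum dolor',
  author: undefined,
  comments: ['c1', 'c2']
}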

However, we haven’t actually told normalizr which field is the id. This only worked because normalizr uses the “id” field as the id by default, and since every article and comment has that field, it works.

A user, however, doesn’t have that field; it has “_id” as its id field instead. So normalizr can’t resolve it, because it never finds an id field, unless we tell it which field to use. Which we’re gonna do, like so:

const user = new schema.Entity('users', {}, {
  idAttribute: '_id'
})

We haven’t actually talked about the third parameter of “Entity” yet, but it accepts an object with some options, one of which is “idAttribute”, which solves the exact problem I have just described.

Now, normalizr will know to use the “_id” field in user as the id field.

If we run normalizr again on our dataset with this updated schema, we get this:
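With the sample data, the (trimmed) result now looks something like this:

const normalizedData = {
  entities: {
    articles: {
      a1: { id: 'a1', title: 'lorem ipsum dolor', author: '5c603f744ac8328b664324ea', comments: ['c1', 'c2'] },
      a2: { id: 'a2', title: 'sit amet consectetur', author: '5c603f74d37c6bf663b2b096', comments: ['c3'] }
    },
    comments: {
      c1: { id: 'c1', content: 'Great article!', commenter: '5c603f74d37c6bf663b2b096' },
      c2: { id: 'c2', content: 'Thanks! (yes, I am commenting on my own article)', commenter: '5c603f744ac8328b664324ea' },
      c3: { id: 'c3', content: 'Nice write-up.', commenter: '5c603f744ac8328b664324ea' }
    },
    users: {
      '5c603f744ac8328b664324ea': { _id: '5c603f744ac8328b664324ea', name: 'Jane Doe' },
      '5c603f74d37c6bf663b2b096': { _id: '5c603f74d37c6bf663b2b096', name: 'John Roe' }
    }
  },
  result: ['a1', 'a2']
};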

Now we’re talking! How awesome is this??

It did this with almost no configuration besides those little entities!


Going beyond the basics (not too far!)

Okay, so this was almost done automatically and it is already better than what we had before! But I want more.

Let’s play a “What if..” game. I’ll start.

What if, each user had a comments field, which displayed all the comments written by them?

We can do that!

For that, we will once again use the third parameter passed to “Entity”.

If we look at the api docs, we can see the following two options we can pass in.

  • mergeStrategy(entityA, entityB): Strategy to use when merging two entities with the same id value. Defaults to merge the more recently found entity onto the previous.
  • processStrategy(value, parent, key): Strategy to use when pre-processing the entity. Use this method to add extra data, defaults, and/or completely change the entity before normalization is complete. Defaults to returning a shallow copy of the input entity.
    Note: It is recommended to always return a copy of your input and not modify the original.
    The function accepts the following arguments, in order:
  • value: The input value of the entity.
  • parent: The parent object of the input array.
  • key: The key at which the input array appears on the parent object.

https://github.com/paularmstrong/normalizr/blob/master/docs/api.md#entitykey-definition---options--

We’re going to use both of them for this.


processStrategy

According to the api:

Strategy to use when pre-processing the entity. Use this method to add extra data, defaults, and/or completely change the entity before normalization is complete.

Let us start by seeing what is passed to this method when we add it to the “users” entity.

import _ from 'lodash';

const userProcessStrategy = (value, parent, key) => {
  console.log(_.cloneDeep(value), _.cloneDeep(parent), _.cloneDeep(key));
  // return the entity untouched for now, we only want to inspect the arguments
  return value;
}

const user = new schema.Entity('users', {}, {
  processStrategy: userProcessStrategy,
  idAttribute: '_id'
})

We use cloneDeep from lodash in the console.log only to make sure that the content of the reference we are printing isn’t changed after the fact.

This is what we get:
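For one of the commenters in the made-up sample, a single call logs roughly the following:

// value
{ _id: '5c603f74d37c6bf663b2b096', name: 'John Roe' }

// parent
{
  id: 'c1',
  content: 'Great article!',
  commenter: { _id: '5c603f74d37c6bf663b2b096', name: 'John Roe' }
}

// key
'commenter'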

Value

Simply our “user”. Nothing new here.

Parent

This is the comment our user was found on, still in its non-normalized form.

Key

The key under which the value was found, and which we will be changing (we also get calls with one other key, but let us ignore that for now).


Next, we are going to edit the method like so:

const userProcessStrategy = (value, parent, key) => {
  if (key === 'commenter') {
    return {
      ...value,
      comments: [parent.id]
    }
  }
  return value;
}

This might not make sense now, but we are adding a new field to each “commenter”, so we can later merge them all together.

The field we are adding is “comments” and we are setting it to an array containing the “id” of the “parent”, which we know from above to be the comment itself.


If we run normalizr now, we get the following:
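With the sample data, the users slice comes out roughly like this:

// normalizedData.entities.users
{
  '5c603f744ac8328b664324ea': {
    _id: '5c603f744ac8328b664324ea',
    name: 'Jane Doe',
    comments: ['c3'] // she actually wrote 'c2' and 'c3', but only the most recently found one survives
  },
  '5c603f74d37c6bf663b2b096': {
    _id: '5c603f74d37c6bf663b2b096',
    name: 'John Roe',
    comments: ['c1']
  }
}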

It does not quite work yet: no user ends up with more than one comment in their array, and we know that is not true. But hey, progress.

What is happening is what it says on the description of “mergeStrategy”.

  • Strategy to use when merging two entities with the same id value. Defaults to merge the more recently found entity onto the previous.

When it finds two users with the same id but different elements in the “comments” array, it does not know how to handle the merge, so it simply merges the “new” user onto the “previous” one, and the last comments array found overwrites the earlier ones. We can solve this by overriding the mergeStrategy.


mergeStrategy

Let us see what we get when we add this to our users entity.

const userMergeStrategy = (entityA, entityB) => {
  console.log(entityA, entityB);
  // mimic the default merge for now, we only want to inspect the arguments
  return { ...entityA, ...entityB };
}

const user = new schema.Entity('users', {}, {
  processStrategy: userProcessStrategy,
  mergeStrategy: userMergeStrategy,
  idAttribute: '_id'
})
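With the sample data, one of the calls for the user who commented twice logs something like this:

// entityA: the user as it is already stored
{ _id: '5c603f744ac8328b664324ea', name: 'Jane Doe', comments: ['c2'] }

// entityB: the same user, found again on another comment
{ _id: '5c603f744ac8328b664324ea', name: 'Jane Doe', comments: ['c3'] }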

We can see the problem clearly here. The user is the same, just the comments differ.

So we can simply tell normalizr how to merge the comments.

const userMergeStrategy = (entityA, entityB) => {
  return {
    ...entityA,
    ...entityB,
    // users found through the "author" key have no comments array yet, hence the fallbacks
    comments: [...(entityA.comments || []), ...(entityB.comments || [])]
  }
}

And bingo, we have this:
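For the sample data, every user now keeps all of the comments they wrote:

// normalizedData.entities.users
{
  '5c603f744ac8328b664324ea': { _id: '5c603f744ac8328b664324ea', name: 'Jane Doe', comments: ['c2', 'c3'] },
  '5c603f74d37c6bf663b2b096': { _id: '5c603f74d37c6bf663b2b096', name: 'John Roe', comments: ['c1'] }
}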

Awesome!


What if we also wanted an “articles” key with all the articles written by that user?

It is basically the same thing.

Since we want this information to appear on our users entity, we need to modify our “userProcessStrategy”.

We will now handle the other key, which we ignored previously:

Both keys show up as the last argument logged by our strategy (one is “commenter” and the other is “author”).

We can see that when the key is “commenter”, the parent is the comment and when the key is “author”, the parent is the article. So we do this:

const userProcessStrategy = (value, parent, key) => {
  if (key === 'author') {
    return {
      ...value,
      articles: [parent.id]
    }
  }
  return {
    ...value,
    comments: [parent.id]
  }
}

(This of course assumes we never get a third key; otherwise we would also add an explicit “if” for the “commenter” key instead of handling it in the fallback.)

And then we merge:

const userMergeStrategy = (entityA, entityB) => {
  return {
    ...entityA,
    ...entityB,
    comments: [...(entityA.comments || []), ...(entityB.comments || [])],
    articles: [...(entityA.articles || []), ...(entityB.articles || [])]
  }
}

This gives us the following object back (analysing one user):
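For the sample data, one of the users now looks like this:

// normalizedData.entities.users['5c603f744ac8328b664324ea']
{
  _id: '5c603f744ac8328b664324ea',
  name: 'Jane Doe',
  comments: ['c2', 'c3'],
  articles: ['a1']
}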

Awesome!


Now, to wrap this up (10 min read already, yikes!), let us just try to solve the 3 problems from the beginning of the article.

Retrieve all the articles given a user id

entities.users[id].articles.map(articleId => entities.articles[articleId])

Get all the comments from a user

entities.users[id].comments.map(commentId => entities.comments[commentId]);

Get the user with the most comments

let userWithMostComments = {};
Object.keys(normalizedData.entities.users).forEach(userId => {
  const userComments = normalizedData.entities.users[userId].comments || [];
  if (userComments.length > (userWithMostComments.comments || -1)) {
    userWithMostComments = {
      user: userId,
      comments: userComments.length
    }
  }
})

Hopefully you can see both the differences and the advantages of using this approach as opposed to the original one.


Anyways, that’s it for this week. Hope you guys enjoyed it!

As always, if you could clap or follow, that would be great, so I can shamelessly put it on my resume & LinkedIn.


Quote of the week

Code is like humor. When you have to explain it, it’s bad.

Cory House

See ya next week! 👌
