Modelling Data and Advanced Mongoose

13 min readNov 26, 2019

These are my notes for Jonas Schmedtmann’s Node.js Bootcamp on Udemy: https://www.udemy.com/course/nodejs-express-mongodb-bootcamp/

What is Data Modelling?

Data modelling is the process of taking unstructured real-world data and putting it into a structured, logical data model. This part of development takes a lot of thought, and there are often no right answers.

Types of Relationships

We can categorize relationships within data into three types: 1:1, 1:Many, and Many:Many. In the case of a movie app, one movie will only ever have one title. That’s an example of a 1:1 relationship, and isn’t super crucial to data modelling.

The most important relationships are 1:Many. In fact, we can further break them down into 1:Few, 1:Many, and 1:Ton. A 1:Few relationship might be one movie and the awards it has received. A 1:Many relationship might be a movie and its hundreds or thousands of reviews. A 1:Ton relationship might be an app and it’s log messages as it grows to millions of users. We need to distinguish between these relationships when we need to determine whether to normalize or denormalize data in our Mongo database. This granularity isn’t necessary in relational databases.

Finally, a Many:Many relationship would be movies and actors. One movie can have many actors, and one actor can appear in many movies.

Referencing vs. Embedding

This concept is also called normalizing vs. denormalizing. A normalized (referencing) database would have separate documents for each movie and each actor. Movie documents then use IDs to reference actors documents. Meanwhile, a denormalized (embedded) movie document would have all the actor information directly on it, i.e. in the form on an array of objects. A denormalized database allows us to make fewer requests and get more information in one query. The downside is, we have limited ability to reference embedded documents on their own.

We can look at three criteria when deciding whether to normalize or denormalize data.

Relationship type- For 1:Few relationships, we always embed. For 1:Ton and Many:Many relationships, we almost always reference. 1:Many relationships are a little fuzzier and could go either way.
Data access patterns- If the data is mostly read and does not change quickly, we should probably embed. If the data is updated frequently, we should probably reference. An example might be a photo gallery related to a movie: once we gather the photos, we probably won’t be updating them too frequently, so we should just embed them. On the other hand, let’s consider movie reviews that users can upvote or downvote. Those vote counts are going to be changing all the time, and we don’t want to query the whole movie document every time a vote is cast, so we should make those reviews their own documents.
Data closeness- Here, we just consider how closely the data sets are related. Users and their email addresses are pretty inextricable, so we embed them. However, for any datasets that we will frequently query independently, we should use referencing.

Types of Referencing

We need to distinguish between three types of referencing: child referencing, parent referencing, and two-way referencing. Let’s consider an “app” document with thousands or millions of “log” documents. In child referencing, the parent “app” document would store the ObjectID of each log. The outcome is that the “app” document becomes bloated with thousands upon thousands of ObjectIDs. In fact, in a large app, we’d quickly encroach upon the 16MB document size limit, so it’s best to save child referencing for 1:Few relationships. A more preferable approach for 1:Many relationships, as is the case with logs, is parent referencing. In this case, the “app” document knows nothing of it logs, but instead each log stores a reference to the “app” document.

In the case of Many:Many relationships, like actors and movies, we can use two-way referencing. Each movie document will contain an array of ObjectIDs to refer to actors documents, and each actor document will contain an array of movie ObjectIDs.

Designing a Data Model

Going back to our tours app, we’re going to have five data sets: Tours, Users, Locations, Reviews, and Bookings.

We’ll start with Reviews. They are children of both Users and Tours, as one user can have many reviews, and so can one tour. Because the number of reviews can grow indefinitely, we’ll use parent referencing.

Next is Locations. We can describe Locations as having a Few:Few or Many:Many relationship with tours, as one tour will usually have a few locations and one location might be used by multiple tours. Alas, since there won’t be that many locations per tour, we’ll use embedding.

When considering the relationship between Users and Tours, we have to remember that tour guides are also users. We have the option of copying the tour guides’ information and embedding it into a tour document, but we’ll run into a problem if we have to change a guide’s information and apply that same update to multiple locations within our database. Therefore, it’s best to use child referencing.

Finally, Bookings has a 1:Many relationship with both Users and Tours. One user can create many bookings, but a booking can only be created by one user. One tour can have many bookings, but a booking can only be for one tour. We’ll use referencing so as not to pollute any of the tour or user documents with booking data.

Geospatial Data with MongoDB

Geospatial data is locations, lines, and shapes represented by longitude and latitude coordinates. Luckily for us, MongoDB supports geospatial data by default. We want to embed location data into our tour documents, so we’ll add a couple of new fields to our tour schema:

We use GeoJSON for the startLocation. GeoJSON always requires a type and coordinates. The type is an object that has its own type, which we must set to String for GeoJSON to work. The value of that string will default to 'Point'. We could make it 'Line' or 'Polygon', but in this case we only ever want to use 'Point', so we’ll set enum accordingly. Then, coordinates is an array of numbers, longitude and latitude. Additionally, we’ll add a couple of our own fields: address and description.

When we want to embed documents into other documents, we need to use an array. So, we’ll set locations to be an array of objects, each of which looks like startLocation. The one difference is the day field, which describes what day of the tour the attendees will arrive at that location.

Adding Tour Guides

As discussed above, we can either embed or reference our tour guides in our tour documents. Although embedding our tour guides won’t be our final strategy, we’ll try it out for the sake of example.

Embedding

The goal is to POST an array of user IDs when we create a new tour and then have our schema automatically fill in the corresponding documents. We’ll start by adding guides: Array to the tour schema. Then, we can create a new tour like this:

The next step is to write a pre-save middleware that fetches user data based on the user IDs.

Here, we’ve marked our map() callback as async because we’re using the asynchronous findById() method. The result of this map() will be an array of promises. Therefore, before we call next() we have to use Promise.all().

Referencing

To implement child referencing, we’ll update our tour schema in a special way:

Now, we don’t have to use a pre-save middleware. We don’t even have to require the user model. Instead, POST an array of object IDs in the guides field. Once we’ve done so, the response of a GET request for that tour will look something like (most fields omitted for the sake of brevity):

To make the tour guides’ information visible we chain a populate() method onto the tour query:

Problem is, to make the population work on getAllTours as well, we’d have to copy and paste the code. Furthermore, we probably have some fields, like passwordChangedAt, that we don’t want to send in our response. We can solve both of these problems with query middleware.

Now the response will have the user data embedded:

Virtual Populate

Let’s suppose we’ve implemented a review model that uses parent referencing:

If we use populate(), we can access tours from the reviews. But how can we access reviews from tours? We don’t want to use child referencing on the tours because the number of reviews will grow indefinitely. Thankfully, Mongoose provides us with a feature called virtual populate. This feature will allow us to populate the tour with reviews without actually persisting an array of review IDs on the tour document. We begin by writing this method onto our tour model:

The first argument is simple the name of the virtual field. Within the options object, ref is the name of the model we want to reference. Next, foreignField is the name of the field in the Review model that contains the reference to the current model (the tour model). Last, localField is the name of the field where the ID is stored on the current model.

Then, to actually populate this virtual field, we just have to use the populate() method within our getTour handler.

const tour = await Tour.findById(req.params.id).populate('reviews');

Our response will now look like this:

Inside our tour response, we have a review that we populate with the same tour. We should get rid of this redundancy. In this case, we can just decide not to populate the tour field in our reviews.

Nested Routes

Right now, we can create a review by manually passing in the tour ID and the user ID. Eventually, we’ll want the user ID to automatically come from the currently logged in user. Also, we’ll want to be able to specify a tour by ID in the URL and then GET or POST reviews, like /tours/5c88fa8cf4afda39709c2961/reviews. That is an example of a nested route: we have two resources in the same URL.

We’ll start by POSTing a review to a tour by ID. This is the route we use:

Now we’ll update the createReview handler to use the :tourId parameter and the current user’s ID. Once we’ve done that, our new route works and we can POST a review to a tour!

That being said, we’ve created a problem for ourselves. For one, we’ve created a route for posting reviews but put it in the same file as our tour routes. Also, the route itself is duplicate code for our vanilla createReview endpoint earlier, so now we have a maintainability issue. The solution is to delete the code we just wrote, require the review router in our tour router file, and write this code instead:

Remember we’ve already used the Express .use() method in our app.js. Now, when a user visits this endpoint, Express will see /api/v1/tours and route the user into the tour router, simply because we’ve written app.use('/api/v1/tours', tourRouter);. Once inside tour router, Express will see /:tourId/reviews and re-route the user into the reviews router.

The issue now is that the reviews router doesn’t have access to the :tourId parameter. Thankfully, the solution is as simple as passing in an options object into the express.Router() call inside the reviews router:

Factory Functions

Right now, a bunch of our handlers are near duplicates of each other. For maintainability, we want to factor our their common points. We’ll start by making a new file called handlerFactory.js and then copying and pasting our exports.deleteTour code inside it. From there, we’ll make a new handler called deleteOne() that takes Model as an argument and returns a generic version of deleteTour:

Then, we can replace deleteTour in the tour controller:

Of course, we can now use this same pattern to implement deleteUser and deleteReview with ease.

We can go even further and make factory.updateOne() and factory.createOne(). We’ll hit a snag when get get to creating reviews, though, because that handler has some specific logic that extracts parameters from the URL and JWT. So, we’ll extract that logic out to a middleware:

We’ll also hit a snag with our GET handlers because getTour has a populate method in it. We’ll account for this method by passing in an options object to our factory function:

Improving Read Performance with Indices

We can see information about our requests by chaining an explain() method on our query:

When we send a GET request to our /tours?price[lt]=1000 endpoint, we’ll get this as part of our response:

As we can see, MongoDB had to scan all nine documents to return the relevant three. With only nine total documents, this isn’t much of an issue, but it would become a huge issue if we had thousands or millions of documents. The solution to this problem is to create indices. Mongo creates an _id index by default, which is something we’ve used quite often to query specific documents.

We’ll create another index for price since that’s another common query. To do so, we’ll simply go to the tour model file and add tourSchema.index({ price: 1 }). The 1 signifies ascending order, while -1 would be descending. If we send that same request again, MongoDB will have only had to scan three documents since they are now indexed by price.

The price index is called a “single query index” because we often query based on price alone. If we often query by two fields at the same time, we should create a compound index. For example, to optimize a query like /tours?price[lt]=1000&ratingsAverage[gte]=4.7, we’ll write tourSchema.index({ price: 1, ratingsAverage: -1 });. In this case, we would not ever need to create additional individual indices for price and ratings; this compound query handles them both.

Why not just index everything? Well, each index uses resources in our database. We must examine the access patterns of our app before blindly creating indices.

Preventing Duplicate Reviews

Ideally, each user should only be allowed to review a tour once. We can’t simply set the user field to be unique, because then a user would only be able to write one review on our site ever. We can’t set tour to be unique because then a tour would only be able to receive one review ever. Instead, we need to set user and tour combinations to be unique.

Again, indices to the rescue! Below our review schema definition, we’ll add this code:

reviewSchema.index({ tour: 1, user: 1 }, { unique: true });

Note: for some reason, indices might not take effect right away. If in doubt, don’t panic and give it some time!

Calculating Average Ratings

Up until now, the ratingsAverage field on our tours has been a hard-coded value, not responsive to the reviews in our database. Let’s fix that! We’ll create an aggregation pipeline like this:

Then, we’ll actually need to call the function. We’ll do so in the bottom of our code using a post-save middleware. Note that we’re using post save because we can’t perform our calculations until the document is in the database.

Here we have a problem. Our middleware is declared above our Review declaration, so it doesn’t have access to the review model. However, if we moved the middleware below Review, Review wouldn’t have access to the middleware. The solution is to use this.constructor, which will point to the model that created the document:

Now we just have to persist these new values on the tour document. The output of console.log(stats) looks like this:

[
  {
    _id: 5ddc8e93c5728c193ee69694,
    nRatings: 3,
    avgRating: 2.6666666666666665
  }
]

So, we’ll access the values inside like this:

We also want to update these values when someone edits or deletes a rating. Unfortunately for us, neither findByIdAndUpdate norfindByIdAndDelete have access to document middleware. They only get access to query middleware. We can make a workaround though! Since findByIdAndUpdate(id, ...) is just shorthand for findOneAndUpdate({ _id: id }, ...) (with findByIdAndDelete being a similar shorthand), we’ll create a hook that uses a regular expression:

Inside this hook, this will point to the current query, not the current review. Therefore, to access the review, we’ll need to execute the query:

Now we’re in a pickle. If we want to run our calculations with the most current data, we’ll need a post() middleware, not pre(). But if use post(), we won’t have access to our query, since it will already have executed. The solution is to turn our r variable into a property, which we can then access in a post() middleware:

Consider this: r points a review. We’ll need access to the review model to get calcAverageRatings, so we write this.r.constructor.calcAverageRatings(). Then, we pass in the tour ID located on that review. And we’re done! Our ratings information will now automatically update upon review creation, edit, and deletion.

Geospatial Queries

Suppose someone wants to participate in a tour but doesn’t want to drive more than 250 miles to get to the start point. A geospatial query is just what we need!

We start with this route:

Because this route has a lot of options by default, we’ve opted to deviate from a normal query string. While a query string would have worked just fine, the above format looks cleaner.

In our handler, we start by using destructuring to get our variables ready:

From here, all we have to do is specify a query string. We’ll specify a startLocation equal to an object with the geospatial operator $geoWithin. We’ll then set $geoWithin equal to a $centerSphere, which in turn is set to an array of 1) an array of coordinates and 2) a radius. Crazy, right? Here’s how it looks:

Note that, counter-intuitively, we switch up our coordinates and pass them in longitude first. Also, we can’t pass in distance directly for radius because MongoDB expects a specific unit called radiants. A radiant is equal to the radius of the Earth, so we calculate radiants like this:

const radius = unit === 'mi' ? distance / 3963.2 : distance / 6378.1;

Next, we need to create an index for the start location. This time, instead of a 1 or -1, we’ll specify '2dsphere' because we’re using real points on Earth’s surface:

Calculating Distances from a Point

Now we want to create an endpoint that gives us the distances of all the tours’ start points from a given location. We start by creating this route:

Then, we create at getDistances handler and extract our variables in pretty much the same we did above. The next step is to make an aggregation pipeline. As it turns out, $geoNear is the only geospatial aggregation pipeline stage. It must always be the first stage in the pipeline, and must always contain field with a geospatial index. For our geospatial index, we’ll use the near operator, which must be in GeoJSON. Below that, on the distanceField field, we specify the name of the field to which the pipeline will output its results. Since the output will be in meters, we’ll convert to kilometers using distanceMultiplier.

Whew! Now, a few finishing touches. First, we want to change the distance multiplier to output either miles or kilometers, depending on the units specified in the URL. Then, for readability, we’ll use a $project stage to limit the fields in out output.