MongoDB Lookups and Populates: An Unexpected Journey

Published in

cameoeng

5 min readOct 9, 2020

Overview

I wanted to share a little journey I went down not too long ago. Recently there has been some talk about how we can be more responsible with our schema design, queries, and overall code architecture at Cameo. So I hope to share some short posts on how we can be more informed about our current architecture. My journey into learning more about populate first began when I saw it being used heavily within many of our API endpoints. Once I figured out that it was essentially a join, my learning stopped. But a few weeks ago a message came across our channel and it spoke to me:

This then propelled me even further down the journey to learn what populate does and how its different than $lookup. Before we dive in I want to set a baseline here first:

The above message is more or less true. Using a populate in MongoDB is essentially the same as using a SQL JOIN in theory but not in practice. MongoDB is most performant when all information needed is on the document at query time. No one is going to argue this here. This same argument applies to $lookup as well.
We are using MongoDB 4.2 — this is important as I’ll explain later.
We are inserting 60000 User Documents and 60000 Order Documents with the user id attached to them. Here is the schema:

const UserSchema = new Schema();
const OrderSchema = new Schema({
  user: {
    type: Schema.Types.ObjectId,
    ref: 'User'
  },
  completed: Boolean
});

How does populate work?

For a short while I just assumed populate was an api method that just wraps the aggregate method for $lookup. Seems obvious, no? Well I was wrong. Had to own up to my mistake as well (double 🔥).

Turns out every time you run populate you are performing two queries. So let’s take the above schema and we want all Orders with their user hydrated on them. When you do Order.find().populate(‘users’) here is what happens:

NoSql Queries

Mongoose: orders.find({}, { projection: {} })
Mongoose: users.find({ _id: { '$in': [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... 60000 more items ] }}, { projection: {} })

Javascript

const orders = await Order.find();
const users = await User.find({ _id: { $in: orders.map(f => f.user) } })const userMap = users.reduce((map, b) => map.set(b._id.toString(), b), new Map())for (const order of orders) {
  order.user = userMap.get(order.user.toString())
}

$lookup vs populate

Setup

mongoose.connect('mongodb://localhost:27017/benchmark_test', {
  useNewUrlParser: true,
  useUnifiedTopology: true,
  poolSize: 1
});
db = mongoose.connection;await db.dropDatabase();const UserSchema = new Schema();
const OrderSchema = new Schema({
    user: {
      type: Schema.Types.ObjectId,
      ref: 'User'
  }, 
  {
    completed: Boolean
  }
});Order = db.model('Order', OrderSchema);for (let i = 0; i < 60000; ++i) {
  await db.collection('Users').insertOne({ _id: i })
}
console.log('Inserted Users docs')for (let i = 0; i < 60000; ++i) {
  await db.collection('Orders').insertOne({ _id: i, user: i })
}await db.collection('orders').createIndex({user: 1});
await db.collection('orders').createIndex({completed: 1});

Just like many DB’s Mongo adds a unique index on the primary key to prevent duplicates (Link).

Many documents > 10000

$lookup query

async function lookup() {
  const times = [];
  
  for (let i = 0; i < 10; ++i) {
    let startTime = Date.now();
    const orders = await Order.aggregate([
      {$match: {completed: true}},
      { $lookup: { from: 'users', localField: 'user', foreignField: '_id', as: 'user' } }
    ])
    times.push(Date.now() - startTime);
  }  return times;
}

populate query

async function populate() {
  const times = [];
  for (let i = 0; i < 10; ++i) {
    let startTime = Date.now();
    const orders = await Order.find({completed: true}).populate('user');
    times.push(Date.now() - startTime);
  }  return times;
}

Less documents ≤ 1000

$lookup query

async function lookup() {
  const times = [];
  
  for (let i = 1; i < 1000; ++i) {
    let startTime = Date.now();
    const orders = await Order.aggregate([
      {$match: {completed: true}},
      {$limit: i},
      { $lookup: { from: 'users', localField: 'user', foreignField: '_id', as: 'user' } }
    ])
    times.push(Date.now() - startTime);
  }return times;
}

populate query

async function populate() {
  const times = [];
  for (let i = 1; i < 1000; ++i) {
    let startTime = Date.now();
    const orders = await Order.find({completed: true}).limit(i).populate('user');
    times.push(Date.now() - startTime);
  }return times;
}

Conclusion

The only times $lookup is more performant than populate is when dealing with a document join count < 20. So if you’re using a findOne or limit(<20) it would be more “performant” to use $lookup. Now while lookup is twice as fast with findOne than populate is the difference in the grand scheme of things is marginal (0.5ms vs 1.5ms). The chart above demonstrates that as the number of documents being joined/hydrated increases as does the query time. For reference querying Order.find({completed:true}takes an average of 24ms.

So when can you use populate or $lookup?

If the situation warrants it.
Basically if you’re dealing with a subdocument that may need to be updated heavily but the find/join will only be across a finite amount of documents <= 100, you may want to just use one of these methods. Your CRUD endpoints become quite complicated when dealing with subdocuments that have a lot properties. So pick your battles, if speed is not necessary and/or the endpoint can be cached then it might be worth to use populate.
When you have no other choice
Sometimes you inherit a schema that has already chosen this path. For the time being it may not be worth it to refactor but it will be pertinent that we note these things that could cause issues during high traffic times.

A few things to keep in mind when using populate or $lookup also:

Always make sure your foreignField has an index. (“Changed in version 3.2: Starting in MongoDB 3.2, indexes can cover an aggregation pipeline. In MongoDB 2.6 and 3.0, indexes could not cover an aggregation pipeline since even when the pipeline uses an index, aggregation still requires access to the actual documents.”)
Avoid running a populate/$lookup on a large number of documents.
Sometimes populate/$lookup isn’t horrible in the correct situations.

Helpful Links

Come Work With Us!

We are tackling a lot of interesting problems at Cameo and are seeking for highly talented individuals. We’re a fully distributed team — and offer high amounts of ownership on whatever project you’re working on. If you have access to stable WiFi and interested in our product, let’s chat. View open positions here!