Optimizing the MongoDB DataLake (Pt 1)

Guy Harrison
MongoDB Performance Tuning
Nov 25, 2019 · 5 min read


The idea of the “Data Lake” became popular following the explosion of interest in Big Data and Hadoop. The Data Lake was seen as a modern and more efficient alternative to the Enterprise Data Warehouse (EDW). Instead of Extracting, Transforming and Loading (ETL) data into a highly normalized Star Schema, we could simply drop our raw data into a mass storage system (typically Hadoop) and then analyze the data if and when necessary.

MongoDB introduced its Data Lake product at MongoDB World 2019. It is perhaps most similar to the “External Table” feature that has long been available in Oracle and other relational databases. An External Table is a table whose data resides outside the database. Likewise, in the Atlas Data Lake, a collection’s data resides not within a MongoDB database but as a file within a cloud object store (initially Amazon’s S3 service).

To create a MongoDB Data Lake, we create a new special-purpose MongoDB server and supply that server with credentials allowing it to connect to one or more S3 buckets. Those S3 buckets can contain files in Parquet, Avro, JSON or CSV format, optionally in compressed form. Collections are mapped to one or more of these files and can be queried using standard MongoDB find() and aggregate() commands. The files are, of course, read-only.

There’s a guide to creating a MongoDB Data Lake here.
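
Once the Data Lake is up and running, you connect with the standard mongo shell (or any driver) and query the mapped collections just like ordinary, read-only MongoDB collections. A minimal sketch, assuming a database named sampleDB containing the customers and orders collections used later in this post:

use sampleDB
// Ordinary read operations work against Data Lake collections
db.customers.findOne();
db.orders.aggregate([
  { $match: { customerId: "1234" } },
  { $count: "ordersForCustomer" }
]);
// Writes fail, since the underlying S3 files are read-only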

What, no indexes?

We can’t have indexes in a Data Lake, but we can achieve something similar by splitting the data across multiple files and leveraging MongoDB’s ability to derive an attribute of a Data Lake collection from the file name.

Let’s say we have two collections — orders and customers. We might set up our Data Lake initially with the data for these collections each in its own file:
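
(The storage configuration below is a sketch rather than the original post’s configuration: the field names follow the Atlas Data Lake storage configuration format, and the store name, region, bucket and database name are placeholders.)

{
  "stores": [
    {
      "name": "s3store",
      "provider": "s3",
      "region": "us-east-1",
      "bucket": "my-datalake-bucket"
    }
  ],
  "databases": [
    {
      "name": "sampleDB",
      "collections": [
        {
          "name": "customers",
          "dataSources": [ { "storeName": "s3store", "path": "/customers.json" } ]
        },
        {
          "name": "orders",
          "dataSources": [ { "storeName": "s3store", "path": "/orders.json" } ]
        }
      ]
    }
  ]
}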

Then we will have two collections: customers will be defined by a single file, /customers.json, and orders by /orders.json.

Of course, there’s no way to index the entries in these files. However, we can define the collection using multiple files, and use the filename to define an attribute. For instance, we could redefine the orders collection so that orders for each customer are stored in separate files. The name of the file defines the customerId. The definition of the new collection looks like this:
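
(Again, this is a sketch rather than the original configuration. The {customerId string} path segment uses the Data Lake’s path-parsing syntax to derive the customerId attribute from each file’s name; check the Data Lake documentation for the exact syntax supported.)

{
  "name": "orders",
  "dataSources": [
    {
      "storeName": "s3store",
      "path": "/orders/{customerId string}.json"
    }
  ]
}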

orders is now defined by the union of all of the files in the /orders folder. In the original single-file arrangement, the value of the customerId attribute was included within the file data. In the new collection, the customerId attribute is set based on the file name. For instance, if the file name is /orders/1234.json, then the customerId will be defined as “1234”.

The interesting thing here is that if we do queries against orders based on the customerId attribute, then the data lake only needs to scan files for the customerIds concerned, rather than having to scan all of the files. You can see this clearly if you do an explain:
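
For example (the query below is illustrative; the explain output itself is not reproduced here):

db.getCollection("orders").find({ customerId: "1234" }).explain();
// The plan output shows the partitions (files) the Data Lake had to read:
// for this predicate, only /orders/1234.json rather than every file in /orders.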

The explain output reveals that we only accessed a single partition (that is, a single file) and even lists the files accessed.

Join optimization

This technique is particularly critical if you are going to perform $lookup operations between collections. As I noted in this post, indexes are essential for optimizing joins performed with $lookup: without an index, each $lookup must scan the entire joined collection, so join performance degrades dramatically as data volumes increase.

Consider an aggregation query like the one below, which returns the top 10 customers by order value:
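
(The pipeline below is a sketch of such a query, not the exact one used for the timings; it assumes each order document carries an orderValue attribute, which is an illustrative field name.)

db.customers.aggregate([
  // Join each customer to their orders in the Data Lake
  { $lookup: {
      from: "orders",
      localField: "customerId",
      foreignField: "customerId",
      as: "customerOrders"
  } },
  { $unwind: "$customerOrders" },
  // Total order value per customer
  { $group: {
      _id: "$customerId",
      totalOrderValue: { $sum: "$customerOrders.orderValue" }  // orderValue is an illustrative field name
  } },
  { $sort: { totalOrderValue: -1 } },
  { $limit: 10 }
]);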

If the orders collection is created from a single very large file, then that file will need to be scanned once for every customer. However, if we create a separate file for each customer, then only a much smaller file needs to be read for each $lookup operation. The performance comparison looks like this:

Performance of a Data Lake join ($lookup): join data in a single file versus one file per join key

In each case, join performance degrades linearly with the number of rows being joined, but the degradation is much steeper in the case of the single file option since MongoDB must scan this massive single file once for every row being joined.

Multiple file overhead

As you might expect, there are some overheads for using multiple files. Consider this simple aggregation:
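
(The original query isn’t reproduced here; a simple aggregation that must read every order, such as the count below, behaves the same way.)

db.getCollection("orders").aggregate([
  // Count every order, which forces a scan of all the underlying files
  { $group: { _id: null, orderCount: { $sum: 1 } } }
]);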

When we have all the orders in a single file, this query is almost instantaneous. However, when the data is split across 5,000 files (one per customer), it takes much longer to process:

Scanning all of the data in a single file, vs split into 5,000 files

The overhead of opening each of the files dominates, leading to a far poorer scan time. On the other hand, if we are looking for a single customerId, as in:

db.getCollection("orders").find({customerId:"1234"}).count();

Then the performance results are reversed: the multiple-file scenario outperforms the single massive file:

Lookup for a single customerId, where all orders are in a single file, vs split up by customer

Conclusion

MongoDB Data Lake is an exciting new service that allows us to leverage the power of MongoDB over files stored on cheap cloud storage.

You can’t directly index the files in the Data Lake, but you can use the file name as a sort of high-level index which provides significant performance improvements over storing data in a single file. This is particularly important when you are performing joins between two Data Lake collections using $lookup.

However, if you always access the complete dataset, scan performance is best when data is in a single file.

ProvenDB integrates MongoDB with the public Blockchain, creating a fully trustworthy MongoDB-compatible database service. ProvenDB allows MongoDB developers to build high-performance applications that include cryptographic proof of data integrity and provenance without having to understand blockchain programming complexities.
