Large Document Storage in MongoDB

Angad Sharma
GDSC VIT Vellore
Sep 6, 2019 · 4 min read

A brief intro to GridFS

Photo by Carlos Muza on Unsplash

Introduction

MongoDB has become a go-to database for NoSQL storage and runs on thousands of production servers. Yet most businesses still use cloud-native bucket storage such as Amazon S3 for their day-to-day file storage needs.

Ever tried saving files directly in your own MongoDB instance, or in a managed instance in the cloud? Either way, you’ll find answers here.

Pros of saving files directly in MongoDB

  • ACID consistency, including rollback of an update, which is complicated to achieve when the files are stored outside the database.
  • Files stay in sync with the database and cannot be orphaned from it, which gives you an upper hand in tracking transactions.
  • Backups automatically include file binaries.
  • More secure than saving files in a file system.

Cons of saving files directly in MongoDB

  • You may have to convert files to blobs in order to store them in the database.
  • Database backups become larger and slower to handle.
  • Memory inefficient: databases are often RAM driven, so all file data has to pass through RAM first.

How MongoDB stores large files

One word, GridFS.

MongoDB stores all documents as BSON, which is just a binary encoding of the good old JSON format.

The maximum BSON document size in MongoDB is 16 MB, which is very little when we want to store large files like videos and songs that can easily exceed the limit. GridFS is a specification for storing and retrieving files that exceed this 16 MB BSON document size limit.

The idea behind GridFS

Instead of storing a file in a single document, GridFS divides the file into parts, or chunks, and stores each chunk as a separate document.

By default, GridFS uses a chunk size of 255 kB; that is, GridFS divides a file into chunks of 255 kB each, with the exception of the last chunk, which is only as large as necessary.
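To make the chunk arithmetic concrete, here is a minimal Python sketch. It assumes the default chunk size means 255 × 1024 = 261,120 bytes; the function names `chunk_layout` and `split_into_chunks` are made up for illustration and are not part of any driver.

```python
DEFAULT_CHUNK_SIZE = 255 * 1024  # 261,120 bytes, the default GridFS chunk size


def chunk_layout(file_size, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return (number_of_chunks, size_of_last_chunk) for a file of file_size bytes."""
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size must be > 0")
    if file_size == 0:
        return 0, 0
    # Integer ceiling division: every chunk is full except possibly the last one.
    count = (file_size + chunk_size - 1) // chunk_size
    last = file_size - (count - 1) * chunk_size
    return count, last


def split_into_chunks(data, chunk_size=DEFAULT_CHUNK_SIZE):
    """Split a bytes object into consecutive pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


if __name__ == "__main__":
    # A 1 MiB file needs four full chunks plus a small final one.
    print(chunk_layout(1024 * 1024))
```

With the default size, a 1 MiB file (1,048,576 bytes) ends up as four full chunks and a final chunk of 4,096 bytes.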

GridFS uses two collections to store files. One collection stores the file chunks, and the other stores file metadata.

  • fs.chunks: Stores the file data itself, in binary format.
  • fs.files: Stores file metadata.
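As a rough picture of what ends up in the two collections, the sketch below builds plain Python dictionaries shaped like an fs.files document and its fs.chunks documents. The helper name `to_gridfs_documents` is hypothetical. A real driver would also assign each chunk its own `_id`, typically use an ObjectId as the file id, and store `data` as BSON binary rather than Python bytes.

```python
from datetime import datetime, timezone


def to_gridfs_documents(file_id, filename, data, chunk_size=255 * 1024, metadata=None):
    """Build dicts shaped like one fs.files document and its fs.chunks documents."""
    files_doc = {
        "_id": file_id,                 # referenced by every chunk as files_id
        "length": len(data),            # total file size in bytes
        "chunkSize": chunk_size,        # size of every chunk except possibly the last
        "uploadDate": datetime.now(timezone.utc),
        "filename": filename,
        "metadata": metadata or {},     # free-form, application-defined
    }
    chunk_docs = [
        # n is the zero-based position of the chunk within the file.
        {"files_id": file_id, "n": n, "data": data[start:start + chunk_size]}
        for n, start in enumerate(range(0, len(data), chunk_size))
    ]
    return files_doc, chunk_docs


if __name__ == "__main__":
    files_doc, chunks = to_gridfs_documents(1, "song.mp3", b"0123456789", chunk_size=4)
    print(files_doc["length"], [(c["n"], c["data"]) for c in chunks])
```

The `files_id` and `n` fields are what let a driver find the chunks of a given file and put them back in order.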

When you query GridFS for a file, the driver will reassemble the chunks as needed. You can perform range queries on files stored through GridFS. You can also access information from arbitrary sections of files, such as to “skip” to the middle of a video or audio file.
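The reassembly and "skip to the middle" behavior can be sketched without a database. The example below is a simplified model, not driver code: it assumes chunk documents are dictionaries with `n` and `data` keys, and the function names `reassemble` and `read_range` are hypothetical.

```python
def reassemble(chunk_docs):
    """Rebuild the whole file by concatenating chunks in order of their index n."""
    ordered = sorted(chunk_docs, key=lambda c: c["n"])
    return b"".join(c["data"] for c in ordered)


def read_range(chunk_docs, chunk_size, start, length):
    """Read length bytes from byte offset start, using only the chunks that overlap it."""
    if start < 0:
        raise ValueError("start must be >= 0")
    if length <= 0:
        return b""
    first = start // chunk_size                 # index of the chunk holding the first byte
    last = (start + length - 1) // chunk_size   # index of the chunk holding the last byte
    needed = sorted(
        (c for c in chunk_docs if first <= c["n"] <= last),
        key=lambda c: c["n"],
    )
    window = b"".join(c["data"] for c in needed)
    offset = start - first * chunk_size         # position of start inside the window
    return window[offset:offset + length]


if __name__ == "__main__":
    data = bytes(range(20))
    chunks = [{"n": i, "data": data[i * 6:(i + 1) * 6]} for i in range(4)]
    print(read_range(chunks, 6, 7, 8) == data[7:15])
```

Because the chunk index follows directly from the byte offset, a reader only has to fetch the chunks that cover the requested range rather than the whole file.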

The example program for this article retrieves files from GridFS as follows.

This program takes the path to the fs.files.bson dump, reads the file IDs from it, and then calls GridFS to reassemble and serve each of the files back to us.

We then take the data received and write it to a file in our local filesystem (in a folder called files). Replace YOUR_DB_URI with the URI of your MongoDB instance to see this in action.
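For readers who want something concrete, here is a minimal sketch of the same idea in Python. It is not the program described above: it assumes PyMongo's `gridfs.GridFSBucket` API and lists the stored files with `find()` instead of reading the fs.files.bson dump. The function name `export_all` and the database name `mydb` are placeholders chosen for illustration.

```python
import os


def export_all(bucket, out_dir="files"):
    """Download every file in a GridFS bucket into out_dir and return the paths written.

    bucket is expected to behave like PyMongo's gridfs.GridFSBucket:
    find(filter) yields stored files with _id and filename attributes, and
    download_to_stream(file_id, destination) writes the reassembled file.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for stored in bucket.find({}):
        # Prefix with the id so two files sharing a filename do not overwrite each other.
        base = os.path.basename(stored.filename or "unnamed")
        path = os.path.join(out_dir, f"{stored._id}_{base}")
        with open(path, "wb") as out:
            bucket.download_to_stream(stored._id, out)
        written.append(path)
    return written


def main():
    # Imported here so export_all can be read and tested without PyMongo installed.
    from pymongo import MongoClient
    import gridfs

    client = MongoClient("YOUR_DB_URI")
    db = client["mydb"]  # hypothetical database name, replace with your own
    for path in export_all(gridfs.GridFSBucket(db)):
        print("wrote", path)
```

Calling `main()` after replacing YOUR_DB_URI and the database name should write each stored file into the files folder; passing the bucket in as a parameter keeps the download logic separate from the connection setup.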

Pros of GridFS

  • GridFS can store as many files as needed.
  • It can be used to recall sections of files without reading the entire file into memory.
  • It can be used to keep your files and metadata automatically synced and deployed across a number of systems and facilities.
  • It can be indexed and sharded. Drivers that support the GridFS specification automatically create indexes for it.

Cons of GridFS

  • It does not support multi-document transactions.
  • It is not ideal for a system where you need to update the content of an entire file atomically. In that case, it is cheaper to store multiple versions of the file.
  • If the files are all smaller than 16 MB, it is a lot cheaper to store each one in a single document using the BinData data type in MongoDB.
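The last point suggests a simple size check when deciding how to store a file. The sketch below is one way to express it, assuming the 16 MB limit means 16 × 1024 × 1024 bytes; the function name `choose_storage` and the 1 kB allowance for the other fields in the document are illustrative guesses, not values from MongoDB.

```python
BSON_MAX_DOCUMENT_SIZE = 16 * 1024 * 1024  # 16,777,216 bytes


def choose_storage(file_size, overhead=1024):
    """Return "bindata" if the file fits in a single document, otherwise "gridfs".

    overhead is a rough, assumed allowance for the other fields stored
    alongside the binary data (filename, metadata and so on).
    """
    if file_size < 0:
        raise ValueError("file_size must be >= 0")
    if file_size + overhead <= BSON_MAX_DOCUMENT_SIZE:
        return "bindata"
    return "gridfs"


if __name__ == "__main__":
    for size in (200 * 1024, 16 * 1024 * 1024, 700 * 1024 * 1024):
        print(size, choose_storage(size))
```

A file of exactly 16 MB already falls on the GridFS side here, because the document that wraps it needs some room of its own.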
