How to Transfer a Large Number of Files From MongoDB to an S3

Migrating lots and lots of files in a day

Sotiris Kaniras
Quick Code
10 min read · Dec 17, 2019


In this story, which is my very first, we will dive into the process of migrating your MongoDB files to S3 object storage (although, if you want to migrate them elsewhere, you can still follow every step up to the transfer).

We’ll take a brief look at how mongo stores files and how you can access them. Then we’ll see how to export those files to your server’s disk and, finally, how to upload them to your S3 instance.

But first, a little backstory on how I ended up needing to transfer 41GB of 300K image files.

(In order to follow through, it would be helpful if you had a basic understanding of the Linux ecosystem)

A couple of years ago we started developing a social mobile app, which used Parse Server as its backend. For those who don’t know, Parse Server is a Node.js infrastructure that leverages the benefits of the express framework and uses MongoDB as a database.

By default, Parse Server has a file adapter that lets you store your app files directly in MongoDB, very easily (they also offer S3 and GCS file adapters). So my ignorant past self, who was responsible for the backend as well as the iOS part of the app, thought it was a very convenient, money-saving idea to leave it the default way.

Two years and 300 thousand image files later, our app had grown a lot and we decided that storing all these files in MongoDB probably wasn’t the best idea! And so our file-transferring journey began!

So first things first, how the hell does MongoDB store files?

Well, after some research I figured out that MongoDB has a filesystem called GridFS, which is responsible for storing and retrieving files. Here’s how it works: there are 2 collections, fs.files and fs.chunks. The first contains the metadata of every file you store, like its name and creation date. The latter contains, among other things, the actual data of the file in BSON binary format.

Here’s how a file actually looks in fs.files:
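Something along these lines (the values below are made up for illustration, but the field names follow the GridFS spec):

    {
        "_id" : ObjectId("5df8a1b2c3d4e5f6a7b8c9d0"),
        "filename" : "filename.jpeg",
        "contentType" : "image/jpeg",
        "length" : 409600,
        "chunkSize" : 261120,
        "uploadDate" : ISODate("2019-05-14T09:21:33.551Z"),
        "md5" : "5d41402abc4b2a76b9719d911017c592",
        "aliases" : null,
        "metadata" : null
    }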

And here’s how it looks in fs.chunks:
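Again an illustrative document, with the binary data truncated:

    {
        "_id" : ObjectId("5df8a1b2c3d4e5f6a7b8c9d1"),
        "files_id" : ObjectId("5df8a1b2c3d4e5f6a7b8c9d0"),
        "n" : 0,
        "data" : BinData(0, "/9j/4AAQSkZJRgABAQAAAQABAAD...")
    }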

So, what exactly are we looking at right now?

fs.files is pretty straightforward, so let’s start from there. Every document has 9 standard fields, 3 of which we care about right now: _id, filename and uploadDate. Those three fields will come in handy later.

Let’s now have a look at fs.chunks. We see it has an _id field, like any other document in mongo. Next, we see it has a files_id field, which is the id reference to its associated fs.files document. Moving on, there’s the n field, which is the sequence number of the chunk. Now, if you are a beginner in MongoDB you’re probably all like “Dude, you lost me!”. Well, you see, mongo doesn’t just take a file and put it in a document. It actually splits it into parts, or chunks 😉, of 255kB each. So if you stored a file of 400kB, it would create 2 fs.chunks documents: one of size 255kB with n set to 0 and another of size 145kB with n set to 1. Lastly, we have the data field, which contains part of the file’s data, or all of it if the file fits in a single chunk.

If you want to see first hand how your files look in mongo, connect to your mongo shell and type:
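For example, to look up a single file by name (swap in one of your own filenames):

    db.fs.files.find({ filename: "filename.jpeg" }).pretty()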

And to find its associated chunks, type:
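Here we filter on files_id using the _id we saw in the fs.files document, and exclude the data field so the shell doesn’t get flooded with binary output:

    db.fs.chunks.find({ files_id: ObjectId("5df8a1b2c3d4e5f6a7b8c9d0") }, { data: 0 }).pretty()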

Now that we have a good understanding of how files are structured in MongoDB, let’s move on to the next part!

How to export files from the database to the disk?

That’s the part where I struggled the most! Before all this, I didn’t have a good understanding of mongo, so I did what every developer does when they find themselves at a dead end… googled my way out!

Apparently mongo offers a very convenient command-line tool called mongofiles (meaning you run it from your bash shell, not from your mongo shell). One of the things it does is locally export the actual files you ask it to, wherever specified.

Let’s look at a simpler example first:
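Something like this, where mydatabase is a placeholder for your own database name:

    mongofiles -d mydatabase list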

What it will do is display a list of all the files you have stored in your database (just the filenames and their sizes, not the actual files).

Note: If you have enabled authenticated access to the database, meaning you have set username and password, you will need to include the following arguments to the command:

-u: username
-p: password
--authenticationDatabase: auth_db_name // probably admin

Now to the more juicy stuff:
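Again with a placeholder database name:

    mongofiles -d mydatabase get filename.jpeg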

Here we actually export a file from the database, named filename.jpeg, and store it in the current directory. Pretty easy, right? Well, if we want just one file, yeah! But we want to export every single one, so things get a bit more complicated! You see, since we need the name of every file we want to export, we have to somehow get those names first and then call mongofiles get for each one. So how do we do that?

We could use the list argument like before to get every filename but we will face three problems:

  1. By using list, we actually request every single filename from the database! If we have like a thousand or two thousand files, sure… go ahead! But what if you have 100K? 200K? You would need to wait quite some time just to fetch the filenames and a looong time to have them exported. I actually calculated it and for my machine, it would take about 10 hours to export 41GB of 300K files!
  2. We will have to parse out the returned size of each file, because we only need the filename.
  3. While we are exporting all those files, new ones could be stored at any moment! Especially if your app has lots of traffic, there could be a lot of them! What will happen to those? We can’t just say “Who cares? Leave them!” We also can’t keep running this whole thing over and over again.

Clearly it wasn’t a viable solution, so this is what I decided to do: I used the mongoexport command-line tool to get the filenames based on their uploadDate (mongoexport offers a query argument, so we can be very specific about which files we want). That way, I could export filenames from one uploadDate to another; for example, between 2018-07-01 and 2018-12-01. After that I would upload them to my S3 (more on that later), and then I would export filenames between 2018-12-02 and 2019-03-01. I repeated that cycle until I reached the last day’s files.

Here’s how I used mongoexport:
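The command below is a sketch based on the arguments explained right after it; the database name, date range and output filename are placeholders, and depending on your mongoexport version you may need --type=csv instead of --csv:

    mongoexport -d mydatabase -c fs.files -f filename \
      -q '{"uploadDate": {"$gte": {"$date": "2018-07-01T00:00:00.000Z"}, "$lte": {"$date": "2018-12-01T00:00:00.000Z"}}}' \
      --csv -o filenames.csv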

Note 1: Again, if you have enabled authenticated access to the database, you will need to include the -u, -p and --authenticationDatabase arguments to the command.

Note 2: After we run that command, it will print the number of documents (just the filenames, in our case) that were exported. It’s helpful to remember that number, to cross-validate later against the number of exported files.

Let’s have a quick look at the arguments we used:

-d: the database name
-c: the collection we’ll search (fs.files in our case)
-f: the field we want to export (since we only need the filenames, that’s what we’ll set)
-q: the query based on which the fs.files collection will be searched
--csv: the type of the file that the filenames will be exported to // we use csv because it suits us best
-o: the name of the file the output will be stored to

So we have our desired filenames. If we cat the .csv file, we’ll see that list.

Now it’s time to use that list and finally export our files from the database!

For that, we will create a bash script named export_files.sh. We are doing it that way ’cause it’s better practice than typing the whole command each time:
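Here’s a sketch of the script, reconstructed from the breakdown below. The connection variables are placeholders you’ll need to fill in, and it assumes the input file contains one filename per line (if your .csv still has the filename header row that mongoexport adds, strip it first, e.g. with tail -n +2):

    #!/bin/bash
    # export_files.sh - export every file listed in the given file from GridFS
    # to the current directory.
    # Usage: ./export_files.sh filenames.csv

    # Helper that prints the current time, so we can see how long the export took.
    timestamp() {
      date +"%Y-%m-%d %T"
    }

    # Connection details (placeholders - adjust them to your own setup).
    MONGO_HOST="localhost"
    MONGO_DATABASE="mydatabase"
    DATABASE_USERNAME="myuser"
    DATABASE_PWD="mypassword"
    DATABASE_AUTH_DB="admin"

    echo "Started at $(timestamp)"

    # Read the input file ("$1") line by line and run one mongofiles get per filename,
    # with as many parallel processes as the machine has CPUs.
    xargs <"$1" -P$(nproc) -n1 sudo mongofiles -h "$MONGO_HOST" -d "$MONGO_DATABASE" \
      -u "$DATABASE_USERNAME" -p "$DATABASE_PWD" --authenticationDatabase "$DATABASE_AUTH_DB" \
      --quiet get

    echo "Finished at $(timestamp)"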

A quick description of the code:

  • We first declare a timestamp function, which we’ll call before we start exporting files and after we’re finished, just to have a sense of how much time it took.
  • Then we declare some necessary variables that we need to use as arguments to the mongofiles command.
  • Next up, the fun stuff! Here we’ll break down the command into 2 parts:
  1. xargs <"$1" -P$(nproc) -n1: xargs reads the input ("$1") line by line and passes each line (filename) as the argument to the process. The -n1 tells xargs to run the command for each line/filename. nproc is a Linux utility that returns the number of CPUs, and that value is passed to -P, which specifies the number of processes to run concurrently.
  2. sudo mongofiles -h "$MONGO_HOST" -d "$MONGO_DATABASE" -u "$DATABASE_USERNAME" -p "$DATABASE_PWD" --authenticationDatabase "$DATABASE_AUTH_DB" --quiet get: this takes each filename from the previous part of the command and finally exports it.

Note: In order to run the script above, you need to give it execution permissions, like so: chmod +x export_files.sh.

If you go ahead and run ./export_files.sh filenames.csv, the exporting will begin! (Remember, the more files you specified to be exported, the more time it’ll take.)

After the script finishes, run ls to see all the exported files. Next, you could run find . -type f | wc -l to make sure the number of exported files equals the number of filenames exported to the .csv file (remember, I mentioned that number before). If you want to see the total size of these files, run du -sh . (the trailing dot is the current directory).

Tip: Make sure you create a directory where all the files will be exported to. What I did was to create directories for every set of files I would export. So keep it clean! You don’t want to end up having thousands of files in your home directory, trust me!

That solution had many advantages: I could stop at any moment if needed and then continue from where I left off, I wouldn’t have to keep my server under that pressure for 10 hours straight and, more importantly, I could very easily transfer the last remaining files, those that were created while we were exporting. Also, by using xargs we have the ability to run stuff faster, so that’s definitely a plus!

So we got the files… now let’s transfer them to our S3!

How to transfer these files to my S3-compatible object storage?

Note: Did you notice the compatible keyword I used in the title? It turns out, not only Amazon offers S3 object storage services… there’s DigitalOcean as well, and others too!

So here we go! Our final step! If you managed to come this far, then trust me, it’s an easy one!

In order to transfer the files from your server to your S3, we’re going to need a tool that does just that. If you google it, you will find plenty! There are even GUI browsers for that! The tool we’ll use is called s3-parallel-put (its project page explains how to install it).

The reason I chose this tool is because, as the name suggests, it can transfer files in parallel. That’s very important if you have multiple GB of data! Before we use s3-parallel-put, make sure you have installed it, you are in the right directory (the one where you exported the files) and that you don’t have any other file in there you don’t want to transfer, like the bash script.

Ready? Type this:
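Here’s a sketch with placeholder values, using only the arguments explained below (the tool typically picks up your S3 key and secret from the standard AWS environment variables or your boto config):

    s3-parallel-put --bucket=my-bucket --bucket_region=us-east-1 --host=s3.amazonaws.com --processes=12 --grant=public-read .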

Simple, right? Let’s have a look at how to use it:

--bucket: that’s the name of your S3 bucket
--bucket_region: the region you chose for your bucket, like us-east-1, or fra1 etc
--host: your provided hostname, like s3.amazonaws.com or digitaloceanspaces.com
--processes: that is how many processes it can create, in order to transfer the files faster. A good starting point is the number of CPUs your machine has times 3 (3 is not the “right” number… you should test different values to find what works best for you)
--grant=public-read: the kind of access your files should have
.: the local directory where your exported files live (the current directory in my case)

That’s it! It will take some time for the files to be transferred, but we’re done (at least for the current batch of files). It wasn’t that hard, right? All you have to do now is repeat the exporting and uploading steps for the rest of the files, until you are done!

Final thoughts

Now that we are done, let’s summarize all the needed steps:
First, find the filenames based on their uploadDate; we did that with the mongoexport command. Next, export those files using the bash script. Finally, upload the exported files with s3-parallel-put. Then repeat for the next date range.

I would also like to mention s3cmd, a tool that makes it easy to search, delete and even upload files on an S3, though we won’t use it for uploading, ’cause it’d be much slower. I think you will find it useful in case something goes wrong and you need to start over (hopefully it won’t).

Note: Keep in mind that you will probably need to replace the stored file URLs in your database with the new ones from the S3, since they are not hosted in your database anymore. Thankfully, I didn’t have to deal with it… Parse Server’s file adapter took care of it.

When I had to make this change, I couldn’t find a complete guide on how to do any of this! I searched and searched, and I had to collect pieces from here and there… it was frustrating and exhausting! So I hope that, after reading this, you had an easier time than I did!

Good luck! 🙏
