You shouldn’t store your images/videos in a database

Yasser Douslimi
Published in ENSIAS IT
8 min read · Mar 29, 2021

One of the classic dilemmas developers face when they need to store media files is choosing the solution best suited to their needs. While it is certainly possible to use a database for this, it is unwise because database transactions are expensive. We also don’t want our database to become bloated, especially given the amount of data that has to be processed in order to store these files.

I hear you: “Then what should I do?” The answer is simple: filesystems. As we progress through this article, we will look in more depth at why this is a much better solution, especially when it is paired with a database.

Why a database is a bad idea

The first thing that should cross your mind is that databases don’t recognize your file. There is no “image” or “video” data type. This is why you will need to manually convert your files into blobs and manage them yourself in whatever format you choose (base64, binary, etc.). This is an extra operation that needs to be handled each time you do any transaction with the file. Want to store an image? Convert. Want to extract an image? Convert. Want to update an image? Convert.
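To get a feel for the cost of the conversion step described above, here is a minimal sketch that measures what base64 does to a payload. The 3 MB of random bytes is only a stand-in for a real image, and the overhead shown applies to base64 specifically; a raw binary blob column avoids the size inflation but not the extra round trip through the database.

```python
import base64
import os


def base64_overhead(raw: bytes) -> float:
    """Return encoded_size / raw_size for a base64-encoded payload."""
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw)


# Stand-in for an image: 3 MB of random bytes (hypothetical payload).
fake_image = os.urandom(3_000_000)

ratio = base64_overhead(fake_image)
print(f"base64 size is {ratio:.2f}x the raw size")

# Decoding restores the original bytes, but it costs CPU time on every read.
restored = base64.b64decode(base64.b64encode(fake_image))
```

Base64 maps every 3 bytes to 4 characters, so the stored value is about a third larger than the file itself, and that is before counting the time spent encoding and decoding.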

This is extremely bad for scalability: every one of those conversions costs CPU time and memory, and you can easily imagine how that adds up if you manage a service with over a million active users. Yeah, that company would go bankrupt really quickly.

If you thought that this was the only performance issue, then you’re mistaken because, as we said, transactions are expensive. Unless you dedicate an entire database instance or network to media files, separate from your main data model, the overhead that these operations generate will obstruct other crucial database transactions. In other words, you are burning valuable RAM that could be better used for other CRUD operations relating to your business logic.

Storing blobs in the database is also generally inefficient because you are filling up your database cache with raw data. As time goes on and the cache fills up, this can create bottlenecks very quickly. Instead, we should delegate caching to a separate intermediate layer such as Redis and cache the ready-to-serve image rather than the raw blob. This avoids overloading the database, but the real goal is abstraction: with this layer in between, we no longer care what kind of persistence sits behind it.
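The idea of an intermediate caching layer can be sketched as a small read-through cache. Everything here is illustrative: `InMemoryCache` is a hypothetical stand-in for a cache server such as Redis, and `CachedMediaReader` and `fake_storage` are names invented for this example.

```python
from typing import Callable, Dict, Optional


class InMemoryCache:
    """Hypothetical stand-in for a cache server such as Redis."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value


class CachedMediaReader:
    """Read-through cache: callers never see where the bytes are persisted."""

    def __init__(self, cache: InMemoryCache,
                 load_from_storage: Callable[[str], bytes]) -> None:
        self._cache = cache
        self._load = load_from_storage
        self.storage_hits = 0  # counts how often we fell through to storage

    def read(self, media_id: str) -> bytes:
        cached = self._cache.get(media_id)
        if cached is not None:
            return cached
        self.storage_hits += 1
        data = self._load(media_id)
        self._cache.set(media_id, data)
        return data


# Pretend persistence layer; it could be a filesystem, a database, or S3.
fake_storage = {"cat.png": b"fake image bytes"}
reader = CachedMediaReader(InMemoryCache(), fake_storage.__getitem__)

first = reader.read("cat.png")   # falls through to storage
second = reader.read("cat.png")  # served from the cache
```

Because the reader only receives a loader function, swapping the persistence layer later does not change any calling code.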

The alternative

On the other hand, when we think of files and how to store them, filesystems come to mind instantly since that’s what we use on a daily basis. Filesystems are built to store and serve files of any type as they are, which makes them a logical choice.

Naturally, filesystems come with their own paradigms which need to be adapted in order to get the most out of them. Instead of tables, we talk about tree structures. This hierarchy can be very useful as well as very tricky to use. The relationships between different types of files and their properties will need to be mapped in a different way than traditional relational databases.

One other thing to keep in mind is that, just like databases, there is more than one type of filesystem, and each has its own advantages and disadvantages. Which one you choose can influence many variables, such as speed and security.

What’s certain is that for unstructured data like media files, filesystems are better suited for the job. However, this does not eliminate the need for a traditional database. We need this database to store the metadata of the files we are managing, and this sort of symbiosis can really improve the performance of our application. In many cases, the user won’t even need the actual file, only some superficial information about it. Think of YouTube, for example: as you scroll your home feed, you certainly don’t watch every single video you see. If YouTube’s servers were to fetch every video as you scroll, the load on them would skyrocket.

The database should only store the path to the file. It should leave the heavy lifting to the filesystem. The benefits of this are huge. For one, you can get extremely high speeds by sending the file directly from the filesystem to the network interface. There is also no need to convert the file from a blob since it remained intact while it was stored.
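A minimal sketch of this split, using SQLite for the metadata and a temporary directory for the files. The table layout, the `media_root` directory, and the function names are assumptions made for illustration, not a prescribed schema.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import List

media_root = Path(tempfile.mkdtemp())  # hypothetical media directory
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE media ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT,"
    " mime_type TEXT,"
    " size_bytes INTEGER,"
    " path TEXT NOT NULL)"
)


def save_media(title: str, mime_type: str, content: bytes, filename: str) -> int:
    """Write the bytes to disk untouched and record only metadata + path."""
    path = media_root / filename
    path.write_bytes(content)
    cur = db.execute(
        "INSERT INTO media (title, mime_type, size_bytes, path) VALUES (?, ?, ?, ?)",
        (title, mime_type, len(content), str(path)),
    )
    db.commit()
    return cur.lastrowid


def list_titles() -> List[str]:
    """Browsing a feed touches only the database, never the files."""
    return [row[0] for row in db.execute("SELECT title FROM media ORDER BY id")]


def load_media(media_id: int) -> bytes:
    """Resolve the path in the database, then read straight from disk."""
    row = db.execute("SELECT path FROM media WHERE id = ?", (media_id,)).fetchone()
    if row is None:
        raise KeyError(media_id)
    return Path(row[0]).read_bytes()


video_id = save_media("Holiday clip", "video/mp4", b"\x00\x01\x02", "clip.mp4")
```

Listing titles never opens a file, and loading a file returns exactly the bytes that were written, with no conversion in between.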

The drawbacks of filesystems

As you may have guessed, it’s not all good and dandy with filesystems. As with any solution, you are getting benefits and making sacrifices.

When you use filesystems, you lose some benefits that you get out of the box from databases. You are giving up transactions, which makes referential integrity harder, though certainly not impossible, to maintain. Many database solutions are ACID compliant, which guarantees that the metadata and the file stay synchronized. With filesystems, you will need to apply certain rules to avoid losing the link between the metadata and the file, which is the file’s address on disk. For example, the filename should be randomly generated, to avoid collisions with other files, and should remain immutable, since if we can’t point to the file on disk then we have essentially lost it. Keep in mind that when you delete a file, you will also need to delete its metadata record.
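These rules can be sketched in a few lines. This is one possible approach rather than a complete integrity scheme: the `files` table, `store_dir`, and both function names are hypothetical, and random names come from `uuid.uuid4()`.

```python
import sqlite3
import tempfile
import uuid
from pathlib import Path

store_dir = Path(tempfile.mkdtemp())  # hypothetical storage directory
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (name TEXT PRIMARY KEY, original_name TEXT)")


def add_file(original_name: str, content: bytes) -> str:
    """Store content under a random, immutable name and record it."""
    name = uuid.uuid4().hex  # random name: never derived from user input
    path = store_dir / name
    path.write_bytes(content)
    try:
        with db:  # commits on success, rolls back on error
            db.execute("INSERT INTO files VALUES (?, ?)", (name, original_name))
    except Exception:
        path.unlink()  # do not leave an orphaned file behind
        raise
    return name


def delete_file(name: str) -> None:
    """Remove the metadata row first, then the file it pointed to."""
    with db:
        db.execute("DELETE FROM files WHERE name = ?", (name,))
    (store_dir / name).unlink(missing_ok=True)


first = add_file("cat.png", b"first upload")
second = add_file("cat.png", b"second upload")  # same user-facing name, no clash
delete_file(first)
remaining = db.execute("SELECT COUNT(*) FROM files").fetchone()[0]
```

Deleting the row before the file means a crash in between leaves, at worst, an unreferenced file that a cleanup job can collect, rather than a record pointing at nothing.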

Furthermore, databases often come with a backup strategy out of the box, but with filesystems, you will likely need to set one up yourself. (Docker volumes can make this easier, since a volume can be archived and restored as a unit, but the backups still have to be scheduled.)

One last thing that is often mentioned is that you lose the security the database provides for you. In my opinion, if you rely on a Unix-style filesystem, you can achieve an even more secure model, but that will likely require a security expert. In any case, filesystems in general provide at least a bare minimum of protection, and if you’re still worried from a security standpoint, you can always encrypt your files before storing them.

How do Big Tech companies store their files?

If we’re thinking about companies such as Alphabet or Twitter, then it’s safe to say that the amount of data they’re dealing with is considered Big Data. I think you can see where this is going.

One popular solution to manage big data is Apache Hadoop, and for our purposes, we will ignore how it works and some concepts like MapReduce that are related to it. What matters to us right now is what it uses under the hood which is HDFS or Hadoop Distributed File System. When you want to store a file, it gets broken into blocks and it gets replicated across the cluster. This redundancy is very helpful to increase data availability and accessibility.
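The block-and-replica idea can be illustrated with a toy model. This is not how HDFS is implemented: real block sizes are many megabytes and real placement is rack-aware, whereas the round-robin placement, the node names, and the tiny block size below are assumptions chosen to keep the sketch readable.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement: Dict[int, List[str]] = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


data = bytes(range(10))                         # a 10-byte "file"
blocks = split_into_blocks(data, block_size=4)  # block sizes: 4, 4, 2
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, holders)
```

With three copies of every block on different nodes, losing any single node still leaves two readable replicas of each block.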

Of course, Hadoop is not the only way of achieving this in the context of big data but we can clearly see that the file system is the more attractive choice here.

Facebook Haystack

For Facebook, at least as of four years ago at the time of this writing, when the GitHub repository was last updated, the solution integrates Apache Cassandra for the Haystack Directory. The Directory tracks file metadata and needs to be consulted first, before reaching the Haystack Store, which is built on top of Docker volumes that are themselves based on filesystems. There’s also a Haystack Cache layer implemented with Redis, but we can ignore it for now.

To write data, the client first goes through the web server, which queries the Haystack Directory, where photo information is stored. The server then sends the file directly to the Haystack Store, which will host our file safely.

To read data, we first talk to the web server, which provides us with the file information. We then request the file directly from the filesystem without going through the server, which can be very fast.
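The write and read paths described above can be modeled with two small classes. This is a toy sketch of the flow only: the `Directory` and `Store` classes, their methods, and the in-memory dictionaries are invented for illustration and do not reflect the actual Haystack code or its APIs.

```python
import uuid
from typing import Dict


class Directory:
    """Toy metadata service: maps a photo id to its location in the store."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def allocate(self, photo_id: str) -> str:
        store_key = uuid.uuid4().hex
        self._locations[photo_id] = store_key
        return store_key

    def locate(self, photo_id: str) -> str:
        return self._locations[photo_id]


class Store:
    """Toy file store: holds the bytes, knows nothing about photo metadata."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def put(self, store_key: str, content: bytes) -> None:
        self._files[store_key] = content

    def get(self, store_key: str) -> bytes:
        return self._files[store_key]


def write_photo(directory: Directory, store: Store,
                photo_id: str, content: bytes) -> None:
    """Ask the directory where the file goes, then send the bytes to the store."""
    store_key = directory.allocate(photo_id)
    store.put(store_key, content)


def read_photo(directory: Directory, store: Store, photo_id: str) -> bytes:
    """Look up the location first, then fetch the bytes from the store."""
    return store.get(directory.locate(photo_id))


directory, store = Directory(), Store()
write_photo(directory, store, "photo-1", b"jpeg bytes")
fetched = read_photo(directory, store, "photo-1")
```

The point of the split is that the metadata lookup stays small and fast, while the heavy bytes only ever travel between the client and the store.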

What if I don’t have a big company?

Many small to medium companies that don’t want to host their own filesystem solution turn to cloud-based solutions. Amazon S3 is one of the most popular choices here because it’s already a mature platform and offers some other perks that go along with it. Other options exist too, such as Backblaze B2.

Others prefer a more in-house solution. The easiest option is to host a web server and use its Unix filesystem. A more elegant solution would be to use Docker containers, which give us access to Docker volumes that can be more performant and flexible. If the solution needs to be distributed across a cluster of nodes, we can use either Docker Swarm or Kubernetes. Many abstractions exist on top of them, such as MinIO, which serves as a Kubernetes-native object storage suite.

Some databases, however, blur this line. SQL Server, for example, provides FILESTREAM storage for varbinary(max) columns, which stores the blobs as files on disk, essentially in a filesystem, while still managing them like regular database data. This can be very helpful to some types of companies.

Final thoughts

Storing media files is a very challenging feat that many developers don’t even pay attention to. While I do advise against using databases to store such files, I do recognize however that each case is different and I hope that this article has given people enough information to help them make the right decision if faced with the same problem.

I highly encourage those of you who have never tried storing images to start a mini-project and try it out! I am sure it will be a fun experience.

Yasser Douslimi
ENSIAS IT

Aspiring software engineer. Curious about everything tech.