Published in CodinGurukul

How Google Drive / Dropbox Works behind the scenes 💡

Technology evolves every day and has become a major part of daily life. Every day we capture lots of photos and videos and share them with loved ones anywhere in the world using services and apps like Google Drive and Dropbox.

But have you ever wondered how these apps work behind the scenes? In this article we will explore the technology behind apps like Google Drive and Dropbox, or more generally any file hosting and sharing app.

First, let’s list the basic features we want in our file sharing system/service:

  1. Upload/Download files.
  2. Sync with local folders of one or more devices/clients.
  3. History of updates (keeping a record of the changes made across different versions of the same file).

Now let’s define the scale at which our application will be used:

  1. 10+ million users.
  2. 100 million requests per day.
  3. Very high volume of read and write operations.

Before developing any application, we need a clear picture of both the features it must provide and the scale at which it will be used.

Let’s start small, with a simple example of a wireless external hard drive attached to a computer:

Understanding this simple scenario:

We have a local file named “My_resume.doc”. We edited it several times and uploaded it to our wireless hard drive after each edit. Every edit produces a new version, and we save all the versions to keep a history of changes and updates. Bandwidth usage is the amount of network resources used to upload the file to the hard drive. Let’s calculate the resources this single file consumes when it is stored on the external wireless hard drive.

Let’s sum the total resources used after 3 versions:

Network Bandwidth: 5 MB + 6 MB + 8 MB = 19 MB

Storage Space on Hard drive: 5 MB + 6 MB + 8 MB = 19 MB
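The tally above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative and not part of any real tool.

```python
# Sizes of the three saved versions of My_resume.doc, in MB (from the sums above).
version_sizes_mb = [5, 6, 8]

# With whole-file uploads, every version crosses the network in full
# and is stored in full, so both totals are the plain sum of the sizes.
bandwidth_used_mb = sum(version_sizes_mb)
storage_used_mb = sum(version_sizes_mb)

print(f"Bandwidth: {bandwidth_used_mb} MB, storage: {storage_used_mb} MB")
```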

In our example the file is small, so the resource cost is not a big issue. The problem arises when the file is huge (say 20 GB): then the resource cost really hurts.

From the above approach we found the following problems:

  1. Excessive bandwidth utilization.
  2. Redundant data in storage.
  3. Network latency.
  4. Concurrency issues during upload and download over the network.

Now we have to find a solution. Let’s see how.

Instead of uploading the complete file, the file is broken into smaller chunks, which are then uploaded.

A file is broken into various smaller chunks.
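As a minimal sketch of the chunking step, assuming fixed-size chunks (the article does not specify a chunk size, so `CHUNK_SIZE` here is an arbitrary choice, and `split_into_chunks` is a hypothetical helper name):

```python
from typing import Iterator

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB; an assumed value, not taken from the article


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield consecutive fixed-size slices of data; the last one may be shorter."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


# Tiny demo with a 10-byte "file" and 4-byte chunks.
chunks = list(split_into_chunks(b"x" * 10, chunk_size=4))
print([len(c) for c in chunks])  # [4, 4, 2]
```

Joining the chunks back together in order reproduces the original bytes, which is what lets the file be rebuilt on download.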

Until now we used a wireless hard drive as storage, but in reality these kinds of apps use highly scalable cloud storage. In this article we will use Amazon S3 (Simple Storage Service) as our storage unit.

Scenario when the file is uploaded for the first time

When we upload the file for the first time, all the chunks are uploaded and a metadata file is created, which keeps the address of every chunk along with its position in the sequence.

Now here comes the interesting part: what happens when we update the file? In practice, we rarely change something in every chunk.
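A first upload might look roughly like the sketch below. It makes several assumptions that are not in the article: a plain dictionary stands in for Amazon S3, each chunk's address is its SHA-256 digest, and `upload_new_file` is a hypothetical function name.

```python
import hashlib


def upload_new_file(name, data, store, chunk_size=4):
    """Upload every chunk and return metadata listing chunk addresses in order."""
    addresses = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        address = hashlib.sha256(chunk).hexdigest()  # assumed addressing scheme
        store[address] = chunk                       # stand-in for an S3 upload
        addresses.append(address)
    # The list order records the arrangement sequence of the chunks.
    return {"file": name, "version": 1, "chunks": addresses}


store = {}  # in-memory stand-in for the storage service
metadata = upload_new_file("My_resume.doc", b"abcdefghij", store)
print(len(metadata["chunks"]))  # 3
```

Because the metadata keeps the addresses in order, the file can be reassembled by fetching each address in turn.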

Suppose we updated something in chunks 7 and 8 of the file. Then only these chunks are uploaded as Version 2, and new metadata is created with the latest information.

Scenario when a file is edited/updated and then uploaded.
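One way to detect which chunks changed is to compare content hashes and skip any chunk the storage already holds. The sketch below assumes that approach; the article does not say how changed chunks are detected. The dictionary `store` again stands in for S3, and `upload_version` is a hypothetical name.

```python
import hashlib


def upload_version(previous, name, data, store, chunk_size=4):
    """Upload only chunks the store lacks; return (metadata, bytes_uploaded)."""
    addresses, uploaded = [], 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        address = hashlib.sha256(chunk).hexdigest()
        if address not in store:      # unchanged chunks are already stored
            store[address] = chunk
            uploaded += len(chunk)
        addresses.append(address)
    version = previous["version"] + 1 if previous else 1
    return {"file": name, "version": version, "chunks": addresses}, uploaded


store = {}
v1, sent_v1 = upload_version(None, "My_resume.doc", b"AAAABBBBCCCC", store)
v2, sent_v2 = upload_version(v1, "My_resume.doc", b"AAAAXXXXCCCC", store)
print(sent_v1, sent_v2)  # 12 4
```

Only the middle chunk differs between the two versions, so the second upload sends 4 bytes instead of 12. Note that fixed-size chunks only line up like this for in-place edits; inserting bytes near the start of a file shifts every later chunk boundary.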

With this approach we have solved the following problems:

  1. Excessive Bandwidth Utilization.
  2. Redundant Data in Storage.

Now let’s calculate the resource utilization again.

File Size: 24 MB (Previously 40 MB)

Network Bandwidth Used: 24 MB (Previously 40 MB)

Now we have to solve the 2 other problems, i.e. network latency and concurrency.

Suppose our network speed is 1 MB/s on a single thread; then a 20 MB file takes 20 seconds to upload. So we introduce a script, “multi_thread.py”, between the chunking and uploading mechanisms. It uploads 5 chunks at a time over 5 threads, so, assuming each thread still gets 1 MB/s, the 20 MB upload takes about 4 seconds instead of the 20 seconds of the single-threaded model.

So far we have solved the issues we listed above.
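The effect of uploading chunks in parallel can be sketched with Python's standard `concurrent.futures` module. This is not the author's “multi_thread.py”, just an illustration: `upload_chunk` is a hypothetical stand-in that sleeps instead of touching the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def upload_chunk(chunk: bytes) -> int:
    """Stand-in for a network upload: sleeps briefly, returns bytes 'sent'."""
    time.sleep(0.05)  # simulated network time per chunk
    return len(chunk)


def upload_all(chunks, workers):
    """Upload all chunks using the given number of threads; return total bytes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(upload_chunk, chunks))


chunks = [b"x" * 4 for _ in range(5)]

start = time.perf_counter()
sent_serial = upload_all(chunks, workers=1)
serial_seconds = time.perf_counter() - start

start = time.perf_counter()
sent_parallel = upload_all(chunks, workers=5)
parallel_seconds = time.perf_counter() - start

print(f"1 thread: {serial_seconds:.2f}s, 5 threads: {parallel_seconds:.2f}s")
```

Threads suit this job because each upload spends most of its time waiting on the network. In a real system the speed-up is capped by the total available bandwidth, so five threads give a five-fold gain only if the link is not already saturated.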

Now let’s dive into the real system design and how it functions:

Complete system design of a file hosting/sharing service

In the above diagram, there are 4 components on the client side.

Watcher: It is responsible for watching changes in local files.

Chunker: It is responsible for breaking a large file into smaller chunks.

Indexer: It is responsible for creating the metadata and storing it in the internal database.

Internal Database: It is responsible for storing the metadata.
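To make the four roles concrete, here is a rough sketch of how they could fit together. Everything in it is an assumption made for illustration: the class and function names, the use of an in-memory SQLite database as the internal database, SHA-256 digests as chunk addresses, and a dictionary of bytes standing in for the local folder.

```python
import hashlib
import sqlite3


class InternalDatabase:
    """Stores chunk metadata locally; SQLite is an assumed choice."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE chunks (file TEXT, seq INTEGER, address TEXT, "
            "PRIMARY KEY (file, seq))"
        )

    def save(self, file, addresses):
        # Replace any older entry for this file with the new chunk list.
        self.conn.execute("DELETE FROM chunks WHERE file = ?", (file,))
        self.conn.executemany(
            "INSERT INTO chunks VALUES (?, ?, ?)",
            [(file, seq, addr) for seq, addr in enumerate(addresses)],
        )

    def load(self, file):
        rows = self.conn.execute(
            "SELECT address FROM chunks WHERE file = ? ORDER BY seq", (file,)
        )
        return [row[0] for row in rows]


class Watcher:
    """Reports files whose content changed since the last check."""

    def __init__(self):
        self.seen = {}

    def changed(self, files):
        names = []
        for name, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            if self.seen.get(name) != digest:
                self.seen[name] = digest
                names.append(name)
        return names


def chunker(data, chunk_size=4):
    """Break the file content into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def indexer(file, chunks, db):
    """Create metadata (ordered chunk addresses) and store it in the database."""
    addresses = [hashlib.sha256(c).hexdigest() for c in chunks]
    db.save(file, addresses)
    return addresses


files = {"My_resume.doc": b"abcdefghij"}  # stand-in for a local folder
watcher, db = Watcher(), InternalDatabase()
for name in watcher.changed(files):
    indexer(name, chunker(files[name]), db)
print(len(db.load("My_resume.doc")))  # 3
```

A real watcher would typically react to file system events rather than rehash every file, but the division of responsibilities is the same.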

Let’s understand it with a more real-life example:

In the above image, suppose we are logged in on 3 different devices and our files are synchronized across all of them. The question is: how?

Let’s go step by step. First, we upload a file from our smartphone. The chunker splits it into smaller chunks and uploads them to the storage. After the upload, the storage service returns the address of each chunk to the indexer, which stores it in the internal database. At the same time, the storage service also updates the Metadata DB. When something changes in the Metadata DB, it informs the Messaging and Synchronization service, which tells the other clients (the laptop and tablet in this case) to fetch the latest metadata and update their internal databases with the latest file information. This is how synchronization works between multiple devices.

For every update this process repeats.
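The notification flow can be sketched as a tiny in-process publish/subscribe model. This is a simplification under stated assumptions: `Client` and `SyncService` are hypothetical names, dictionaries stand in for the Metadata DB and each device's internal database, and direct method calls replace real network messaging.

```python
class Client:
    """A device with its own internal database of file metadata."""

    def __init__(self, name):
        self.name = name
        self.internal_db = {}

    def on_metadata_changed(self, metadata_db, file):
        # Fetch the latest metadata and update the local copy.
        self.internal_db[file] = dict(metadata_db[file])


class SyncService:
    """Stand-in for the Messaging and Synchronization service."""

    def __init__(self, metadata_db):
        self.metadata_db = metadata_db
        self.clients = []

    def subscribe(self, client):
        self.clients.append(client)

    def publish(self, file, metadata, source):
        # Record the change, then notify every client except the uploader.
        self.metadata_db[file] = metadata
        for client in self.clients:
            if client is not source:
                client.on_metadata_changed(self.metadata_db, file)


metadata_db = {}
service = SyncService(metadata_db)
phone, laptop, tablet = Client("phone"), Client("laptop"), Client("tablet")
for device in (phone, laptop, tablet):
    service.subscribe(device)

# The phone uploads a file: its indexer records the metadata locally,
# then the change is published so the other devices catch up.
meta = {"version": 1, "chunks": ["addr-1", "addr-2"]}
phone.internal_db["My_resume.doc"] = dict(meta)
service.publish("My_resume.doc", meta, source=phone)

print(laptop.internal_db == phone.internal_db == tablet.internal_db)  # True
```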

Throughout this article we talked a lot about metadata; put simply, it is just a JSON object that contains information about each chunk.

Sample metadata format of a chunk
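As a purely hypothetical illustration of what such a JSON object might contain (none of these field names or values come from a real service or from the original figure), a chunk's metadata could be built and serialized like this:

```python
import json

# Every field below is an assumed example, including the placeholder
# checksum and the made-up bucket address.
chunk_metadata = {
    "file_name": "My_resume.doc",
    "version": 2,
    "chunk_index": 7,
    "chunk_size_bytes": 4194304,
    "checksum_sha256": "<hex digest of the chunk>",
    "address": "s3://example-bucket/chunks/<chunk id>",
}

encoded = json.dumps(chunk_metadata, indent=2)
print(encoded)
```

The chunk index gives the position in the sequence, and the address says where to fetch the chunk from storage.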

So, this is all about the engineering behind a file sharing/hosting platform like Google Drive/Dropbox.

I hope you enjoyed it. Feel free to share and clap. In the future, I will be writing this kind of article only at Coding Gurukul.

Happy Exploring!

Suraj Kumar

A Super Dynamic Human with perfect combination of Technical and Life Skills. @CodinGurukul