Don’t be afraid of the big bad blob!

Avihoo Mamka
Onfido Product and Tech
6 min read · Apr 19, 2017

At Onfido, we aim to automate verification of identity documents using machine learning models that extract different data properties.

If you’re familiar with machine learning, you probably know that these trained models can sometimes get pretty big, up to multiple GBs per model.

Now, think of these models stored in your Git repository. Every change to one of these large files causes the repository to grow by the size of the file. You can then easily end up with a repository that weighs dozens of GBs, which is much harder to maintain, and not really what Git was built for.

In this blog post, we'll explore possible solutions to the problem, explain why they didn't work out for us, and finally go in depth on the solution we chose.

Round one

So, like any team of good engineers, we checked what the best practices are and what the rest of the world is doing.

We found that by far the most common solution was moving to Git LFS (Git Large File Storage), where blobs are written to a separate server and only a pointer file is saved in the Git repository. We tried that approach, and it didn't end well for us, mainly for two reasons:

  1. At the time, Jenkins didn't officially support Git LFS, which caused us a lot of pain: our continuous deployment flow broke and needed a lot of workarounds to make it work-ish.
  2. That approach broke our simple workflow of:
  • Clone a repository
  • Start working
  • Build the app locally

To:

  • Install Git LFS extension
  • Clone the repository
  • Initialise the LFS
  • Pull the large files
  • Start working

At that point, we knew this kind of solution wasn't suited to us, and we needed to think outside the box to find a better way.

Round two

After spending quite some time researching and brainstorming, we came up with the following architecture:

Onfido's new blob architecture

We have our code repository. Part of that repository is the Dockerfile, which includes a specific instruction to resolve all of the project's dependencies, as described in the dependencies.json file.

Each dependency is listed with the path where it can be found and the version that is required.

All of the dependencies are stored in S3, which is a fast, reliable, scalable and robust remote file store.

Architecture in depth

Our models are stored and versioned in an S3 bucket by our machine learning engineers. This architecture scales easily and helps us industrialise our machine learning pipelines.

This led us to develop, and recently open-source, the s3-uploader command line tool. It's written in Python, and its purpose is to upload any resource to S3 while taking care of the versioning for you.

To use it, you simply run the following command:

s3-uploader -b s3-bucket -f /path/to/large_blob -l project/blobs/data/

You only need to provide the name of your bucket, the path to the blob and the location in which the blob should be placed in your project. After that, you get a JSON snippet to add to the dependencies.json file.
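
Conceptually, the versioned upload looks something like the sketch below: a simplified illustration using boto3, with an assumed key layout of location/name/version/name. This is not the actual s3-uploader code.

# Simplified sketch of the idea behind s3-uploader (not the real tool's code)
import datetime
import json
import os

import boto3

def upload_versioned_blob(bucket, file_path, location):
    """Upload a blob under a timestamped version prefix and return the
    snippet to add to dependencies.json."""
    name = os.path.basename(file_path)
    version = datetime.datetime.utcnow().strftime("%Y-%m-%d-%H-%M-%S")
    key = location + name + "/" + version + "/" + name  # assumed key layout
    boto3.client("s3").upload_file(file_path, bucket, key)
    return {"location": location, "name": name, "version": version}

if __name__ == "__main__":
    entry = upload_versioned_blob("s3-bucket", "/path/to/large_blob", "project/blobs/data/")
    print(json.dumps(entry, indent=2))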

Inside the code repository, we have the dependencies.json file: a configuration file that lists all of the project's dependencies that are not stored in Git. Its content looks like this:

{
  "dependencies": [
    {
      "location": "project/models/data/",
      "name": "trained_model",
      "version": "2017-03-27-15-22-20"
    },
    {
      "location": "project/blobs/data/",
      "name": "large_blob",
      "version": "latest"
    }
  ],
  "repository": "s3://s3-bucket"
}

In addition, we have the Dockerfile, which contains all the instructions to build a Docker container with the code and the trained models that will later be deployed and run on Kubernetes.

The part of the Dockerfile that is in charge of resolving all the project’s dependencies is this one-liner:

# Resolve project's dependencies
RUN dependencies-resolver -c dependencies.json

This also led us to develop, and recently open-source, the dependencies-resolver command line tool. It's written in Python and handles all of the project's S3 dependencies for you. In our use case, the resolver is invoked from the Dockerfile: while building the container, it downloads all the dependencies and places them in their specified locations. The tool isn't required to run as part of a Dockerfile and can be used outside of that context.

The tool was designed to be as simple as possible, with only one parameter: the path to a dependencies configuration JSON file.

The file is structured according to a predefined schema that is enforced by the tool, which then simply downloads each dependency from the specified bucket to the location provided.
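
To make that behaviour concrete, here's a minimal sketch of such a resolution step, again using boto3 and the same assumed key layout. The real dependencies-resolver also validates the schema and handles errors more carefully:

# Minimal sketch of the resolution step (not the actual dependencies-resolver code)
import json
import os

import boto3

def resolve_dependencies(config_path):
    """Download every dependency listed in the configuration file to its location."""
    with open(config_path) as f:
        config = json.load(f)

    bucket = config["repository"].replace("s3://", "")
    s3 = boto3.client("s3")

    for dep in config["dependencies"]:
        location, name, version = dep["location"], dep["name"], dep["version"]
        key = location + name + "/" + version + "/" + name  # assumed key layout
        os.makedirs(location, exist_ok=True)
        s3.download_file(bucket, key, os.path.join(location, name))

if __name__ == "__main__":
    resolve_dependencies("dependencies.json")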

That way, our models are downloaded only when the container is built, before the app starts running, so if a problem occurs, the build phase fails and we can fix it before breaking production. This also helps us keep our Git repositories relatively small, making them easy to clone and maintain.

Solutions Comparison

At first glance, the two solutions (Git LFS and the S3-based store) appear quite similar: both store the blobs remotely, and both keep a pointer to where each blob can be found.

However, if you look more closely at the design of both solutions, you'll notice a small difference that turns out to be much more significant: when the dependencies are actually resolved, and by whom.

In the Git LFS solution, in order to start working on your code and later build the container and test your application, you first need to install the Git LFS extension, initialise LFS in the project and then pull the blobs. Every time a blob changes, it's the developer's responsibility to pull the latest version manually. In other words, the developer is in charge of the resolution process.

On the other hand, in the S3-based store solution, the flow is simpler and less manual. There is a file which contains all the project's dependencies, and the developer's only responsibility is to maintain that file, making sure all the dependencies are present with their correct versions. That's it.

The dependencies are resolved at the time the container is built, which keeps the developer's flow as simple as cloning the repository and starting to work.

Once the container is built, the dependencies are pulled automatically by the resolver, not by the developer.

In addition, there's also an option to always use the latest version of a dependency, which entirely removes the developer's responsibility to maintain versions and coordinate that work.
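
One plausible way to resolve such a "latest" version, given that the versions are sortable timestamps, is to list the version prefixes stored for a dependency and pick the most recent. The snippet below is only an illustration of that idea, not necessarily how dependencies-resolver implements it:

# Illustration only: mapping "latest" to a concrete version under the assumed key layout
import boto3

def resolve_latest_version(bucket, location, name):
    """Return the newest timestamped version prefix stored for a dependency."""
    s3 = boto3.client("s3")
    prefix = location + name + "/"
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
    versions = [p["Prefix"][len(prefix):].rstrip("/")
                for p in response.get("CommonPrefixes", [])]
    # Timestamps like 2017-03-27-15-22-20 sort lexicographically by recency
    return max(versions) if versions else None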

To conclude, although both solutions might look like they're designed the same way, there are some clear behavioural differences that make the S3-based store solution much more elegant, keeping the developer focused only on what's necessary for the project.

Recap

We always try to tackle our problems in an iterative and engineering-based way.

Here, we tackled the problem of having large blobs stored as part of the Git repository, which had made our lives a living hell.

We first tried to go with the flow and follow the common practice of moving our large blobs into Git LFS. However, at Onfido we always try to find a better way, so we designed a new solution that we're happy with.

That solution consisted of developing two new command line tools that we recently open-sourced: dependencies-resolver and s3-uploader.

These tools complement each other, and together they let large blobs be part of our working projects without actually being stored in them.
