Data engineering at Photomath

Preprocessing millions of images in a few minutes (without breaking the laws of physics)

Marko Miskovic
Photomath Engineering
7 min read · Jan 12, 2022


If you are following Photomath, you are probably aware that we are actively searching for a Data Engineer to complement the ML efforts in our AI team. Until we find the right fit for the job, it is up to us, the ML engineers, to design and implement temporary data ingestion and preprocessing pipelines for various downstream tasks, but primarily for training our deep learning models. If you want to hear how we solved the task of preparing the image dataset for one of our projects, buckle up and read on.

Problem statement

Suppose you are experimenting with a similarity search approach for finding related mathematical tasks among all tasks for which a step-by-step solution is known and stored, solutions generously provided by Photomath's math experts.

Similarity search for finding the tasks most similar to a reference task

A viable solution is to use a deep neural network to embed an image of the mathematical task in some high-dimensional latent space, in which similarity under some metric, e.g. cosine distance, can then be measured. Training this neural network requires a large number of images of mathematical tasks. Before feeding the images to the network, one needs to fetch them from the central repository, preprocess them (crop, resize, reduce quality, …) and transfer them to the VM where the training will take place. The images in the central repository are continuously collected from user devices in a streaming fashion, and each is on the order of a few megabytes, which means the raw dataset is on the order of a few terabytes or more.
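To make the retrieval step concrete, here is a minimal sketch of similarity search over precomputed embeddings. The embedding dimension and the random stand-ins are hypothetical placeholders for our actual model and data; only the cosine-similarity ranking is the point here.

```python
import numpy as np

def cosine_top_k(query_emb: np.ndarray, corpus_embs: np.ndarray, k: int = 5):
    """Return indices of the k corpus embeddings most similar to the query.

    query_emb:   (d,)   embedding of the reference task image
    corpus_embs: (n, d) embeddings of all tasks with known solutions
    """
    # Normalise so that a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                      # (n,) cosine similarities
    return np.argsort(-sims)[:k]      # indices of the k most similar tasks

# Usage (random stand-ins for real embeddings):
corpus = np.random.randn(10_000, 256)
query = np.random.randn(256)
print(cosine_top_k(query, corpus, k=5))
```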

If only it could be this easy…

In the following sections I will briefly describe a few approaches we tried for tackling this problem. If you want to take a second to ponder the solution yourself, now would be the time; if you only wish to enjoy the show, that's fine too. If you came up with some of the solutions yourself, congratulations. If you think you have a better one, do not hesitate to contact me; I will be more than happy to discuss it.

VM + GCP Client Libraries

Since we are in the cloud, Google Cloud Platform to be precise, the first and most obvious solution was to use the client libraries provided by Google. I implemented the image downloading, preprocessing and storing logic as a dockerized application and spun it up on one of our Compute Engine VMs.

Google Cloud Platform to the rescue…
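The core of that container boiled down to something like the sketch below: fetch a blob with the Cloud Storage client library, preprocess it with Pillow, and upload the result, with a thread pool to overlap the network calls. The bucket names, target size, quality setting and worker count are illustrative placeholders, not our production values.

```python
from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

from google.cloud import storage
from PIL import Image

client = storage.Client()
src_bucket = client.bucket("central-image-repository")   # placeholder name
dst_bucket = client.bucket("preprocessed-dataset")       # placeholder name

def preprocess(blob_name: str) -> None:
    """Download one image, shrink and recompress it, upload the result."""
    raw = src_bucket.blob(blob_name).download_as_bytes()
    img = Image.open(BytesIO(raw)).convert("RGB")
    img.thumbnail((512, 512))                 # resize in place, keep aspect ratio
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=80)  # reduce quality to save space
    dst_bucket.blob(blob_name).upload_from_string(
        buf.getvalue(), content_type="image/jpeg"
    )

# Process everything with a thread pool to hide per-request latency.
blob_names = [b.name for b in client.list_blobs("central-image-repository")]
with ThreadPoolExecutor(max_workers=32) as pool:
    pool.map(preprocess, blob_names)
```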

The positive side of this approach is that the solution is easy to maintain: most of our engineers are proficient with GCP, the documentation is for the most part pretty clear, and it depends only on the stable and well-supported GCP Python Client Libraries and an image processing library (Pillow). If a problem shows up, it will be relatively easy for any engineer to fix, and we will most likely not be the only team facing it, so a fix is likely to land quickly and with little risk.

The negatives are that the GCP client libraries force a certain amount of boilerplate, their dependency list is likely not optimal and, most importantly, the time taken to process the images is unforgiving: more than 12 hours for a million images, even with multithreading. I abandoned this approach very early, so I haven't measured the time taken to transfer the processed images to the dedicated dataset storage, i.e. another bucket, but a safe bet is around 3–4 hours; this transfer is necessary so that the dataset can be fetched by all of our training machines. The result is a real inconvenience: iterating training on various datasets is slow, so overall delivery and experimentation speed is far from optimal. Can we do better?

Ain’t got no time for this…

VM + GCSFuse (and ultimately rclone)

Why not abstract the previous approach a bit further? In essence we are reading an image from one location, processing it and saving it to another location. Although the source is somewhere on the network, at a high level it is no different from reading from a local folder in your filesystem. So why not mount the central storage as a filesystem on our VM? It turns out that is more than feasible with the help of either GCSFuse (officially supported by Google) or rclone (an open source, vendor-agnostic command line program for managing resources on cloud storage). I experimented with both, but for the sake of argument (and because of its benefits over GCSFuse) I will stick with rclone.

rclone

The improvement over the first approach is that our Python dependencies are now only image processing libraries; accessing an image is as straightforward as if it were stored locally, so no more client libraries and no more boilerplate, just pure Python. rclone works well dockerized, although there were some minor issues setting it up, mostly because it was a new tool to us, which could be a disadvantage depending on your use case. However, I believe the wide range of tasks you can do with rclone makes it well worth the time to learn, especially because of its vendor-agnostic philosophy: if one day Photomath wishes to switch to AWS or Azure, engineers do not have to learn another vendor-specific tool but can keep using rclone. On the other hand, the throughput improvement compared to the first approach is negligible; it is still in the range of 12 hours for one million images. Again, I haven't measured the time to transfer the dataset to its bucket, so once more add 3–4 hours for the complete process. A sketch of what the processing loop looks like with a mounted bucket follows below.
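For illustration, here is roughly what the loop looks like once the buckets are mounted. The remote name, mount points and image parameters are hypothetical examples; the point is that the Python side needs nothing beyond Pillow and the standard library.

```python
# Mount the buckets first (example remote/paths, run outside Python), e.g.:
#   rclone mount gcs:central-image-repository /mnt/raw --read-only &
#   rclone mount gcs:preprocessed-dataset /mnt/processed &

from pathlib import Path
from PIL import Image

RAW = Path("/mnt/raw")
PROCESSED = Path("/mnt/processed")

for src in RAW.glob("*.jpg"):
    img = Image.open(src).convert("RGB")
    img.thumbnail((512, 512))                            # resize, keep aspect ratio
    img.save(PROCESSED / src.name, "JPEG", quality=80)   # writing the file "uploads" it
```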

The problem both of these methods suffer from is, of course, the network: transferring and processing millions of images through a single VM is just not going to cut it. It was time to go back to the drawing board and consider a complete architectural paradigm shift.

PubSub + Cloud Functions

And now for the moment you've all been waiting for: the throughput of the final solution (drumroll intensifies). As the title suggests, with this solution processing millions of images is achievable in a matter of minutes. This time the images were saved directly to the dedicated dataset bucket, since a Cloud Function has nowhere else to store them.

It takes about 3 hours to download the images from the dedicated dataset storage to the training machine. This means that this solution is about 5x faster than the original.

PubSub is basically a queue: you push batches of images onto it (more precisely, their locations in the central repository), and on the other side of the queue a Cloud Function subscriber downloads each image in the batch, processes it and saves it to the dedicated dataset storage. In addition, the processing inside each function invocation is multithreaded, giving us an additional speed-up. Since we only invoke this function when preparing a new dataset, the price of the solution is more than acceptable. This solution also combines well with the storage mounting options mentioned above, so it benefits from everything learned while experimenting with a single processing instance. A sketch of both sides of the queue follows below.
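To make the architecture concrete, here is a minimal sketch of both sides of the queue, reusing the same per-image preprocessing as in the earlier client-library example. The project, topic and bucket names, the batch size and the worker count are illustrative placeholders, not our production setup.

```python
import base64
import json
from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

from google.cloud import pubsub_v1, storage
from PIL import Image

# --- Publisher side (runs wherever the dataset job is kicked off) ---
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "image-preprocessing")  # placeholders

def publish_batches(blob_names, batch_size=100):
    """Split the list of image locations into batches and push them onto the queue."""
    for i in range(0, len(blob_names), batch_size):
        batch = blob_names[i : i + batch_size]
        publisher.publish(topic_path, data=json.dumps(batch).encode("utf-8"))

# --- Subscriber side: a Pub/Sub-triggered Cloud Function ---
storage_client = storage.Client()
src_bucket = storage_client.bucket("central-image-repository")  # placeholder
dst_bucket = storage_client.bucket("preprocessed-dataset")      # placeholder

def _process_one(blob_name):
    """Download one image, shrink and recompress it, upload the result."""
    raw = src_bucket.blob(blob_name).download_as_bytes()
    img = Image.open(BytesIO(raw)).convert("RGB")
    img.thumbnail((512, 512))
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=80)
    dst_bucket.blob(blob_name).upload_from_string(
        buf.getvalue(), content_type="image/jpeg"
    )

def preprocess_batch(event, context):
    """Cloud Function entry point; event["data"] carries one batch of blob names."""
    blob_names = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(_process_one, blob_names))
```

Each invocation handles only one small batch, and the platform scales the number of concurrent function instances with the queue, which is where the orders-of-magnitude speed-up over a single VM comes from.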

The solution is fast, elegant and scales well.

Drawing a line

I hope you learned something from my story of downloading images. Ultimately this enabled my team at Photomath to iterate 5 times faster: we can now prepare datasets with different image sizes, resolutions and crop sizes in a single afternoon, whereas with the first or second approach this would have been nearly impossible. All of this enables further optimisations to the training procedure and eventually to the models themselves, for example finding the threshold above which increasing image size and quality yields diminishing returns, implicitly maximising the amount of useful information in our datasets for the same amount of storage used.

This is most likely not the end of possible optimisations, and more specialised tools exist. Nonetheless, this journey made me a better developer and I learned a ton. If there is one thing I would like you to take away from this article, it is to never stop questioning current and accepted solutions and to always try to find a better one. The line must be drawn somewhere, and at some point you must implement and move on, but if you are still a junior like me, that line is most likely much farther away than you think.

Line is far away :) #inspirational

To those of you who are still questioning my last approach or have found a better one: you might just be the one. The Data Engineer position is still open, so go ahead and contact us, step up our data engineering to another level if you dare, and don't be surprised if we expect you to write a blog post to share your knowledge with the rest of us.

Like what you’ve read? Learn more about #LifeAtPhotomath and check out our job postings: https://careers.photomath.com/
