A Hidden Bug in a Machine Learning Pipeline for Understanding Math Problems

Every time I think I know where the error is coming from, I dig deeper and find out that I was wrong.

Alen Carin
Photomath Engineering
7 min read · Mar 17, 2022


Photo source: https://pixabay.com/images/id-2615008/

For several days I had been debugging my machine learning pipeline, trying to spot the bug that kept killing my training process each time it came near the end of the first epoch. Since each experiment takes a while to run, this was slowly driving me crazy.

To explain what exactly happened, let me first put it into the context of the project.

Math Word Problem Solver

The project I am working on started as my master's thesis. I was looking for an interesting topic in the field of natural language processing, and Photomath needed a way to understand and solve math word problems: millions of students get stuck on these daily, and we would like to help them get unstuck. If you don't know what math word problems are, think of those short mathematical stories from elementary and high school that most students struggled to solve. For example, consider the following problem:

Brian is twice as old as Charlie. Three years from now, the sum of their ages will be 33. How old are Brian and Charlie today?

My focus is on creating machine learning models that will try, and hopefully succeed, to understand these problems and transform them into mathematical expressions with numbers and mathematical symbols, which can then be solved with existing specialised tools. Transforming math word problems (MWPs) into mathematical expressions requires a very deep level of language understanding, which makes this an extremely difficult task. Since we are still far from crafting models that can completely understand human language, we are in some ways limited in how good and robust MWP solvers can be.
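To make the goal concrete, here is a minimal sketch of what the end of such a pipeline could produce for the problem above. The equation system and the use of SymPy as the downstream solver are illustrative assumptions, not the representation or tooling we actually use:

from sympy import Eq, solve, symbols

# Hypothetical target output of the model for the Brian/Charlie problem,
# with b = Brian's age today and c = Charlie's age today.
b, c = symbols("b c")
equations = [
    Eq(b, 2 * c),               # "Brian is twice as old as Charlie."
    Eq((b + 3) + (c + 3), 33),  # "Three years from now, the sum of their ages will be 33."
]

# A specialised solver (SymPy here, purely as an example) does the rest.
print(solve(equations, [b, c]))  # {b: 18, c: 9}

The hard part is producing something like the equations above from raw text; evaluating them afterwards is the easy part.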

The desire to understand and solve math word problems automatically has existed since the early days of machine learning (and probably even before that), but to date, no method has been developed that can match human expert performance.

After a few months of working on this project, doing both research and development, I finally have a machine learning pipeline that reads a dataset of math word problems, does some text preprocessing, runs training and validation loops and all the usual NLP stuff. But since this is a somewhat larger project that uses multiple libraries, let me describe it in a bit more detail.

Setup for my project (multi-GPU training)

Modern machine learning models are usually quite big, and it would take a very long time to train them locally on your machine, especially on a CPU. For that reason, training is done on server machines containing multiple GPUs that can be used in parallel for a single experiment. To take advantage of this common practice, we either have to implement a multiprocessing pipeline ourselves or use a library that already has this implemented, which is what I did. Since this runs on a GPU server shared with the rest of the team, we also need a tool for scheduling experiments, tracking metrics and other MLOps features. So when you start an experiment, it goes to a remote machine where you can track metrics and see the output logs.
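For readers unfamiliar with this setup, here is a minimal sketch of the kind of multi-GPU training the library manages for me under the hood, using PyTorch's DistributedDataParallel; the model and the address are placeholders, not the actual pipeline:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # One spawned process per GPU; they synchronise gradients after every backward pass.
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(128, 128).to(rank)  # placeholder for the real model
    ddp_model = DDP(model, device_ids=[rank])
    # ... training and validation loops would go here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)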

Why am I writing about all these specifics? Because I would like to convey how hard it can be to debug a complex machine learning pipeline.

Err…or

After an hour or so of running the experiment, I would get a full stack trace that points to the internal files of the libraries I'm using and ends with the following line:

torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL

My first hunch was that it was probably a memory issue. Most likely my batch size was too big and one of the GPUs died with an out-of-memory error. But after looking at the stats for both GPUs, I saw that neither of them ran out of memory. I did, however, notice some very strange behaviour. The following plot shows the utilisation percentage for both GPUs, one in green and the other in blue. It doesn't really matter which is which; the important thing is that one dropped to 0% utilisation while the other soared to 100%. Eventually, the process would die.

Utilisation percentage for two GPUs during training
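As a side note, ruling out the out-of-memory theory is easy to do from inside the training process as well. A few lines along these lines (a sketch, not the monitoring we actually use) print how much memory each GPU currently holds, and nvidia-smi gives the same picture from the shell:

import torch

for i in range(torch.cuda.device_count()):
    # Memory currently allocated by tensors vs. the card's total capacity, in GiB.
    used = torch.cuda.memory_allocated(i) / 2**30
    total = torch.cuda.get_device_properties(i).total_memory / 2**30
    print(f"GPU {i}: {used:.1f} / {total:.1f} GiB allocated")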

“Okay, so why is this happening?” I asked myself. The first logical explanation was that some kind of deadlock was occurring: while one GPU is working hard, the other is waiting so that they can average gradients and update the model parameters together (this is how the distributed data parallel module in PyTorch works). But if it really was a deadlock, why was one GPU at 100% utilisation, and what was it doing? That part didn't really make sense, so I went deep into debugging and logged everything to identify what it was doing. After some time, I managed to find the exact line where it stopped. To better explain what happened, I have to make a small digression.

The metric that I, like most researchers working on this problem, use is plain accuracy. Even though the labels, and consequently the outputs of the model, are mathematical expressions, we don't compare the expressions themselves. Instead, we evaluate both and check whether the result of the predicted expression matches the result of the label expression. Researchers settled on this type of evaluation because there are multiple correct solution expressions for a given problem, and the output of the model doesn't have to perfectly match the label in order to be correct.
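In code, this answer-level accuracy boils down to something like the following sketch (the function name and the error handling are mine, added for illustration):

def answer_accuracy(predicted_exprs, label_exprs):
    correct = 0
    for pred, label in zip(predicted_exprs, label_exprs):
        try:
            # Compare evaluated results, not the expression strings themselves.
            if eval(pred) == eval(label):
                correct += 1
        except Exception:
            pass  # malformed predictions simply count as wrong
    return correct / len(label_exprs)

# "6*7" does not match the label string "42", but it still gives the right answer.
print(answer_accuracy(["6*7", "40+3"], ["42", "42"]))  # 0.5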

Now that you know how the model is evaluated, I can continue where I left off. The GPU that was at 100% utilisation “stopped” on the line:

eval(label_expression)

eval is a built-in Python function that I used to evaluate both the label expression and the one generated by the model, so that I could compare their results. This function has some serious security vulnerabilities, but in this isolated environment it is a good fit for evaluating mathematical expressions. But why did it stop there? I had already preprocessed all the expressions in the dataset, and all of them could be evaluated with this function. Then again, when I tried to run an experiment with only a sample of the data, it worked. It always stopped during validation, so I figured it had to be something about the validation data, and I logged everything that went into that function to see which expression was causing it to hang. It was very interesting to see that it got stuck evaluating this crazy exponential expression: 14166452**12**32**7.

My reaction? This is a huge number, so of course it takes a very long time to compute, and no wonder the GPU was at 100%. But how did this expression even end up in the dataset? I had already tested evaluating all the equations from the dataset, and that worked perfectly fine. So it obviously wasn't the dataset itself; the problem was more likely in tokenisation. I managed to find the equation in the dataset, and its correct form is 1+4+16+64+5+2**1+2**3+2**7. Strip out the '+' signs and you get exactly the monster above: 14166452**12**32**7. Clearly, the addition signs were going missing somewhere. I checked the logs for other equations and saw that none of them had the addition symbol either; this was simply the only one that became so computationally expensive to evaluate.
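To see how dramatic the effect of the missing '+' signs is, here is a tiny illustration; the string replace is just a stand-in for what the decoding step was effectively doing:

label = "1+4+16+64+5+2**1+2**3+2**7"
print(eval(label))               # 228 -- evaluates instantly

broken = label.replace("+", "")  # '14166452**12**32**7'
# eval(broken) would try to compute 14166452**(12**(32**7)), a number so
# enormous that the call effectively never returns; this is the line the
# training process was stuck on.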

The search was still not over, but soon afterwards I found the root of the whole problem, and it was not tokenisation: when I tried tokenising the equation, the plus sign did not disappear. The problem was in the process of building the vocabulary for the model. While building the vocabulary of all the tokens used, I included only the original problem texts and forgot to add the labels. All the other symbols happened to appear somewhere in the problem texts, but the plus sign obviously didn't. The bug manifested when I decoded equations in order to evaluate them: since the addition token was not in the vocabulary, it was mapped to the <UNK> (unknown) token, the substitute used whenever a token is missing from the vocabulary. In the decoding phase I remove all such special tokens (padding, unknown, start-of-sequence and end-of-sequence), so the bug stayed hidden until the eval function choked on the resulting absurd expression.
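Here is a simplified, hypothetical reconstruction of the bug. The real pipeline uses proper tokenisers and vocabulary classes, but the failure mode is the same: numbers appear in the problem texts, '+' does not, so only '+' gets mapped to <UNK> and silently dropped.

SPECIAL = ["<PAD>", "<UNK>", "<SOS>", "<EOS>"]

def build_vocab(sequences):
    # Collect every whitespace-separated token seen in the given sequences.
    tokens = sorted({tok for seq in sequences for tok in seq.split()})
    return {tok: i for i, tok in enumerate(SPECIAL + tokens)}

problem = "Ana has 2 apples and buys 2 more . How many apples does she have now ?"
label = "2 + 2"

vocab = build_vocab([problem])              # the bug: labels were never included
inv_vocab = {i: t for t, i in vocab.items()}

encode = lambda seq: [vocab.get(t, vocab["<UNK>"]) for t in seq.split()]
decode = lambda ids: " ".join(inv_vocab[i] for i in ids if inv_vocab[i] not in SPECIAL)

print(decode(encode(label)))                # '2 2'  -- the '+' silently vanished

vocab = build_vocab([problem, label])       # the fix: include the labels as well
inv_vocab = {i: t for t, i in vocab.items()}
print(decode(encode(label)))                # '2 + 2'

This is exactly how 1+4+16+64+5+2**1+2**3+2**7 collapsed into 14166452**12**32**7 before reaching eval.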

Takeaways that could save you precious time

Hopefully, I have managed to demonstrate how hard it can be to debug machine learning pipelines and how good bugs can be at camouflaging themselves. Here are some of the things I will put more emphasis on the next time I develop an ML pipeline from scratch. Use them to save time and keep yourself sane:

  • patiently implement every single part of the ML pipeline; do not rush unless you want to spend a lot of time debugging
  • test often during development to make sure everything works as expected; you can always open a terminal, run Python locally and execute a few commands to check that things behave the way you expect
  • first experiment with small baseline models locally to confirm the pipeline is okay, instead of waiting for big models to finish an epoch on a GPU server just to see whether it works

A very useful blog post on how to train neural networks, written by one of the most popular AI researchers and, in my opinion, required reading for everyone starting in the field, can be found here: http://karpathy.github.io/2019/04/25/recipe/

If you have a better idea of how to avoid bugs like these, comment below, or even better, apply on our careers page and tell me in person. We are always looking for great talent.

Like what you’ve read? Learn more about #LifeAtPhotomath and check out our job postings: https://careers.photomath.com/
