Mean Squared Error Can Be Too Mean

Allison Bishop
Published in Proof Reading · Apr 15, 2024

TL;DR: We develop a new metric for evaluating predictions of closing auction sizes for US equities trading. This leads us to an improved prediction model, as well as some general insights into how to adapt out-of-the-box machine learning tools to perform better for financial data. Our full whitepaper can be found here.

My mother is retired now, but she was an amazing high school teacher. I know this only second hand, as she never taught my class. This was not a coincidence, as my mother has always understood phenomena of conditionality — that your “dessert stomach” can be empty while your “vegetable stomach” is full, or that the reservoir of patience you hold for other people’s children can evaporate instantly when it comes to your own. My mother wisely stepped aside to let others teach me the things that the combination of my anxiety and her perfectionism made untenable — how to dive into water headfirst, how to read music, how to drive. This way, when I came bounding out of gymnastics class eager to show her my cartwheels, she could just clap and say “that’s great, honey!”, which she always did.

Learning (especially the learning of hard things) requires failure to be met with grace. If you try to stamp failure out completely, you become stuck — sidetracked by an obsessive avoidance that leads to overcorrection. This effect is most debilitating when the failure you try to avoid is mostly random and out of your control in the first place.

Something like this was in my mind as I stared at scatter plots of closing auction sizes in US equities, like this one for the ticker CVS in the year 2022:

Some days are expected to be unusual, like the blue days when indices rebalance or the gray days when options expire. But still, the “regular day” red dots bounce around like popping corn kernels, defying a discernible pattern (at least to the human eye).

Nonetheless, the task of predicting closing auction sizes as a function of recent history and intraday indicators falls under the classic template of supervised machine learning. We have lots of stocks, we have lots of days of historical data — we can simply try different models and see how they fit! But there’s a catch — what do we really mean by “fit”?

The typical answer is to measure Mean Squared Error (MSE). Given a true value x and a predicted value y, the squared error is (x-y)². So to grade a model that makes predictions y₁, …, yₙ when the real answers were x₁, …, xₙ, we would compute:

MSE = ( Σᵢ wᵢ (xᵢ - yᵢ)² ) / ( Σᵢ wᵢ )

where the wᵢ values are weights that allow us to express how much we care relatively about each data point.

There are many nice things about this definition that fuel its general ubiquity. For one, it is symmetric, as it equally penalizes predictions for undershooting vs. overshooting the true value. And for the calculus nerds in the audience, it is differentiable. For the non-calculus nerds, it is enough to know that the squaring makes it “smooth”, which generally makes things easier when you are trying to solve equations. Taking an absolute value instead, for example, would still be symmetric but would not be smooth, as the absolute value function takes a sharp turn at zero.

But — the same squaring that makes things smooth also makes big errors loom even larger — potentially driving the overall sum to be big even when most of the individual errors are small. An error of 2 becomes 4 when squared — an error of 3 becomes 9, and so on. In this way, mean squared error is mean! It harps on your biggest mistakes, perhaps distorting the overall picture.
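To make that concrete, here is a minimal NumPy sketch of the weighted MSE above (illustrative only, not code from the whitepaper), with toy numbers chosen so that a single large miss supplies essentially all of the score:

```python
import numpy as np

def weighted_mse(x, y, w=None):
    """Weighted mean squared error: sum of w_i * (x_i - y_i)^2, divided by the sum of the weights."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    return np.sum(w * (x - y) ** 2) / np.sum(w)

# Four near-misses and one 30-unit miss: the squared terms are 1, 1, 0.25, 0.25, and 900,
# so the single big miss supplies almost all of the 180.5 average.
x_true = [10.0, 10.0, 10.0, 10.0, 40.0]
y_pred = [11.0, 9.0, 10.5, 9.5, 10.0]
print(weighted_mse(x_true, y_pred))  # 180.5
```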

Of course, I’m not concerned that the computer algorithms searching for models that minimize mean squared error are going to get their feelings hurt by this harsh grading system. But there is still an important effect here. If we minimize mean squared error to select a model for predicting something like closing auction sizes where outliers are heavy and difficult to predict, we’ll probably end up with a model that does better on particular outliers by coincidence, but largely fails when presented with new situations. This is like a student who has memorized the answers to a practice test, but flunks the real one because they don’t actually understand the material.

Issues like this are a good reason to explore other error metrics, such as mean absolute percentage error (MAPE). This error is defined by the formula:

MAPE = ( Σᵢ wᵢ |xᵢ - yᵢ| / xᵢ ) / ( Σᵢ wᵢ )

Since the true values xᵢ are in the denominator, MAPE will have a more reserved response to errors than MSE in cases when the true values are large. A large absolute error in the numerator can now be tempered by a large denominator, somewhat muting the effects of unpredictably large values.

However, we now have a new problem — cases where the true values are unusually small! When the denominator xᵢ turns out to be surprisingly small, MAPE will add a hysterically large value to its error sum. MAPE is a metric that quite literally sweats the small stuff.
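The same kind of sketch for weighted MAPE (again purely illustrative) shows both behaviors at once: a big absolute miss on a huge close reads as modest, while a much smaller miss on a tiny close explodes:

```python
import numpy as np

def weighted_mape(x, y, w=None):
    """Weighted mean absolute percentage error: sum of w_i * |x_i - y_i| / x_i, divided by the sum of the weights."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    return np.sum(w * np.abs(x - y) / x) / np.sum(w)

# A 400,000-share miss on a huge close reads as a modest 40% error...
print(weighted_mape([1_000_000.0], [600_000.0]))  # 0.4
# ...while a 400-share miss on a tiny close reads as a 400% error.
print(weighted_mape([100.0], [500.0]))            # 4.0
```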

And this is where I felt stuck, staring at my whiteboard one day as I explained to Yuqian what the problem was. MSE cares too much about big mistakes, and MAPE cares too much about small answers. Neither is well-suited to closing auction size predictions, as big mistakes are probably unavoidable and small answers represent times when we couldn’t successfully trade much anyway.

Yuqian stared at the formulas for a moment. And then she wrote this:

( Σᵢ wᵢ |xᵢ - yᵢ| / (xᵢ + γ*yᵢ) ) / ( Σᵢ wᵢ )
“What does it mean?” I asked.

Yuqian shrugged. “It has the properties you want though,” she said. “The xᵢ in the denominator will control things when xᵢ is big and yᵢ is small, and the yᵢ in the denominator will control things when yᵢ is big and xᵢ is small.”

I stared at it for a moment. It was certainly appealing. “I think I can make it mean something,” I said. I wrote:

( Σᵢ wᵢ |(xᵢ + γ*yᵢ) - (1+γ)yᵢ| / (xᵢ + γ*yᵢ) ) / ( Σᵢ wᵢ )
“Imagine that we make a prediction yᵢ for how big we think the close will be without our trading,” I explained. “But we want to trade an amount γ*yᵢ. Say γ is like 0.1 or something. If we suppose that we succeed and our activity just adds to the close size, then the true answer xᵢ becomes xᵢ + γ*yᵢ, and our prediction for the close size with our activity included becomes (1+γ)yᵢ. The absolute error of our prediction is still |xᵢ-yᵢ|. So what I’ve written here is just MAPE in a world where we assume we traded.”
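In code, the quantity described in that explanation is just MAPE computed on the adjusted values xᵢ + γ*yᵢ and (1+γ)yᵢ. Here is a minimal NumPy sketch based on that description (the function name and the default γ of 0.1 are illustrative choices, not taken from the whitepaper):

```python
import numpy as np

def mape_assuming_we_traded(x, y, w=None, gamma=0.1):
    """MAPE in a world where we traded gamma * y_i on top of our prediction:
    the true close becomes x_i + gamma * y_i, the prediction becomes (1 + gamma) * y_i,
    the absolute error is still |x_i - y_i|, and the denominator gains a gamma * y_i term."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    return np.sum(w * np.abs(x - y) / (x + gamma * y)) / np.sum(w)

# The extra gamma * y_i in the denominator tempers the small-true-value blowup from before...
print(mape_assuming_we_traded([100.0], [500.0]))             # 400 / 150 ≈ 2.67 instead of MAPE's 4.0
# ...while barely changing the verdict on a big, well-behaved day.
print(mape_assuming_we_traded([1_000_000.0], [600_000.0]))   # ≈ 0.38 vs. MAPE's 0.40
```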

We both stared at the new formula for a moment. “I think it might work,” I finally said.

As it turns out, it works quite well! Using this metric, we are ultimately able to train models that give stable improvements over baseline prediction methods for closing auction sizes. We are even able to retrofit existing machine learning packages to work with our new metric, instead of needing to rewrite from scratch. More details can be found in our whitepaper, but suffice it to say the results so far are promising, and we have plenty of ideas for how we might improve them further.
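To give a flavor of what retrofitting standard tooling can look like (a generic illustration with scikit-learn and made-up toy data, not the actual packages or pipeline from the whitepaper), a metric like this can be wrapped as a custom scorer and handed to off-the-shelf model selection:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def mape_assuming_we_traded(x_true, y_pred, gamma=0.1):
    """Unweighted version of the metric above, in the (y_true, y_pred) signature scikit-learn expects."""
    x_true, y_pred = np.asarray(x_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(x_true - y_pred) / (x_true + gamma * y_pred))

# Lower is better, so tell scikit-learn to negate the score internally.
new_metric_scorer = make_scorer(mape_assuming_we_traded, greater_is_better=False)

# Toy stand-in for per-stock features and (positive, heavy-tailed) close sizes.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(500, 4))
y = 1000.0 * X[:, 0] + rng.lognormal(sigma=1.5, size=500)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"max_depth": [2, 3], "n_estimators": [50, 100]},
    scoring=new_metric_scorer,
    cv=3,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)  # hyperparameters chosen by the new metric
```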

But I’m skipping ahead. As Yuqian and I stared at the whiteboard, we didn’t know any of that yet. We just hoped that the extra denominator might give our models the freedom they needed to learn through mistakes.

“What should we call our new metric?” I asked.

“MARY,” Yuqian declared.

Unbidden, an ingrained phrase from early years of Catholic school rose to the top of my mind. Hail Mary, full of grace.

“I like it,” I said. After all, it seemed like a fitting name for a metric with a little extra grace baked in.
