Artificial Intelligence and the Continuing Quest for a Better PB

Last year I wrote a blog post on research I was doing on using a machine learinng technique called case-based reasoning (CBR) to predict a challenging but achievable personal best time for a marathon runner, and a pacing plan to go with it. The post described a paper entitled “Running with Cases:
A CBR Approach to Running Your Best Marathon”
that was presented at the International Conference on Case-Based Reasoning (ICCBR), which was held in Trondheim, Norway last year. Needless to say there aren’t many research papers that combine case based reasoning with marathon running but it went down well with the assembled attendees and ultimately received the conference’s coveted best paper award (yay!).

I’ve recently returned from this year’s ICCBR, which was held in a very hot and sticky Stockholm, Sweeden, where I presented a update on this research, entitled “An Analysis of Case Representations for Marathon Race Prediction & Planning,” and including a new PB prediction approach and new and improved results and I’d like to try to summarise this latest research in this blog post.

CBR is a fairly intuitive machine learning/problem solving technique. To solve some new problem we look for similar problems that have been solved in the past and adapt their solutions. For example, suppose we want to predict the price of a house going on the market, let’s say a 3-bed detached house with a large sunny garden, then we might look to the recent prices listed for similar properties in the same or a nearby location. This is essentially what a CBR system will do. It will have a so-called case base of other properties or cases (described in terms of features such as bedrooms, size, location, price etc.) and when presented with a new property it will look for similar properties, based on the available features, and adapt their prices as required. For example, if a similar property has only 2 bedrooms, then its price might be increased slightly given that the target property is a 3-bed. A CBR system like this will usually select the k (perhaps 5 or 10) most similar properties from its case base, adapt their prices, and combine the adapted prices (perhaps by taking an average) to predict a price for the target property.

That’s case based reasoning in a nutshell. But property pricing is not marathon running, so how can we use similar ideas to predict marathon PBs? To do this our cases refer to previous marathons for particular runners. In fact, each case is made up of two marathons for some runner, a non-PB race and a PB race. Each race consists of a finish-time and a set of 5km-segment paces; the data we used to generate these cases is drawn from the public results provided by big-city marathon web sites (London, New York, Berlin etc.).

Now, let’s suppose we have a new runner whose PB we wish to predict. This new runner presents with a recent marathon result (finish-time and 5km-segment paces) and we compare this race to the non-PB races for runners of the same gender (and running the same course) in our case base to select the k most similar runners. These runners will have run similar finish-times, with similar pacing in their non-PB races, and each of them have gone on to run a particular PB. Then we get the average of these PB times (weighted based on the similarity scores of the similar cases) to predict a PB time for the new runner.

The approach worked surprisingly well when tested on historical race data. For example, we were able to predict PB times that were off by as little as 2% compared to the actual PB times of the runners, at least for the fastest runners. However, for less able runners the error rates began to creep up steadily. They grew to 3% for 3-hour finishers and then up to 6% for 4-hour finishers, thus offering some room for improvement as a target for future work.

And so this brings us to our recent research. One obvious shortcoming of the above approach is that it relies on just a pair of marathon races to represent the progression of a runner from non-PB to PB. In reality, many marathoners will have run multiple (>2) races, which begs the question: why not capture all of this experience in each of the cases, to provide a richer representation of a runners marathon history and PB progression? Assuming that the data is available, then this is certainly feasible although there are a number of technical challenges with the approach. For example, if different runners have completed different numbers of marathons, then this will lead to cases with different numbers of non-PB races, which will complicate the similarity and matching process. The bottom line is that it’s much more straightforward to have a fixed case size.

A key insight was to recognise that past races could be categorised in different ways, to reflect the type of race the runner ran, good, bad or indifferent. Just as the PB race is the fastest race for a runner, another race might be a personal worst (PW). In the end, and in addition to the PB and PW races, we identified the following different types of races, which we refer to as landmark races:

• PPB — the previous PB for the runner;
• mNPB — a pseudo race which is the mean/average of the non-PB races;
• MR/LR — the most/least recent races for therunner;
• MV/LV— the races with the most/least pacing variation.

In this way, we can represent each runner using a fixed set of landmark races (PW, PPB, mNPB, MR, LR, MV, LV); see Figure 1. It is worth noting how the same race might appear as a landmark race more than once. For example, a runner’s PW race might also be their MV race; if they really struggled then their pacing probably varied a lot.

Now we can use the same CBR technique as before, selecting k similar cases — but this time based on the richer case representation — and adapting their finish-times to predict a new PB for the target runner. Of course we need to use the same categorisation to represent the history of the target runner and, realistically, this means that we need to have at least 2 or 3 previous marathons per runner, instead of just a single one — otherwise each category will correspond to the same race — but we can live with that.

How well do these new cases work when it comes to making PB predictions? Again, as per our previous work we evalauate this using historical race data, this time from Berlin, London, and New York. I won’t go into the precise details here — you will find a full account of the evaluation methodology in my previous post on this topic — instead I will get straight to the results.

Do these richer, multi-race, cases lead to more accurate PB predictions?Are more landmark races in our cases better than fewer landmark races? In general, yes, more landmark races mean lower PB prediction errors, as shown in Figure 2, for male and female runners. For example, when we include just a single landmark race in our cases (alongside the PB race) then the average prediction error is about 0.06 (or 6%), but when we include 4 landmark races then on average the prediction error drops to about 2–3%. And when we include all of the available landmark races then the prediction error falls to 2% across the board.

This is encouraging. It suggests that richer cases do help us to make more accurate PB predictions. But is it really true that adding more landmark races always leads to better predictions? Not exactly. Figure 3 shows the average error associated with 64 different combinations of landmark races (in descending order of prediction error, for Berlin). For reasons of clarity, the x-axis only labels a subset of the combinations, but we can see that sometimes simpler representations beat richer representations. For example, one of the best performing case representations (MR_LR_PPB) just includes the more/least recent races and the previous personal best, and yet it produces a more accurate PB prediction (lower error) that richer representations, such as MR_LR_PW_PPB, wher we also include the PW landmark race.

This tells us that not all landmark races are created equally. When it comes to making PB predictions some types of races seem to be more useful than others. In fact we can quantify exactly how useful by calculating the different between the error rates produced by representations when a given landmark race is included versus when it is excluded.

The results of this are presented in Figure 4. They tell us, for example, that when we include the most varied (MV) landmark race in our cases then there is a 10–12% benefit — that is prediction error improves by 10–12% — compared to when MV is excluded.

Different landmark races are associated with different benefits and not every landmark race helps with prediction error. For example, the least varied (LV) race has a negative benefit, which means that including LV tends to disimprove prediction error. It’s similar for the PW race. It seems that the LV and PW races are just not helpful when it comes to understanding our PB capabilities. Perhaps the finish-time associated with our worst race is simply not that predictive of our best time. But why doesn’t the LV race help? Especially when we are told that running an evenly paced race is likely to be a good racing strategy. One explanation might be that recreational runners running an evenly paced race are not really pushing themselves and so LV races tend to be less representative of what they are truly capable of. Other races are much more useful. For example, including the previous personal best race (PPB) improves prediction error by a whopping 30–40% on average.

Where does all of this lead? We have described a way to categorise the marathon history of a runner using different types of landmark races, some of which are helpful when it comes to PB prediction, while others hinder accurate predictions. One of our best performing case representations (MR_LR_PPB) only requires information on a runner’s most and least recent races and their previous personal best. So let’s use this combination to generate PB predictions for runners of different ability levels, as we did last year, and compare them to last year’s results.

These results are presented in Figure 5. The prediction error for the simple case representations used last year are shown as the graphs with the outline markers and we can see how prediction error grows steadily for slower and slower runner. In contrast, the results for the MR_LR_PPB cases, shown as the graphs with filled markers, show a much improved set of results for all three cities. The MR_LR_PPB prediction error remains low and stable across all finish-times. For example, previously the prediction error for runners capable of a 4-hour finish-time came in at about 6%, but now, with the MR_LR_PP case representation, it remains at close to 2%

This means we have a way to reliably predict challenging but achievable PB times for marathon runners of all abilities, using only information from their most and least recent races, and their previous personal best.

And that brings us up to date with this research. But there is still much to do. One of the obvius next steps, for example, is to consider whether we can make cross-city/course predictions. In other words, can we use a runner’s marathon history in one city to predict their finish-time in another? It will also be interesting to look at other endurance sports such as cycling or triathalons. If anyone knows of good data sources please leave as a comment.