Racing — Photo by Marvin Ronsdorf

From Duels to Sport Racing

CBTW
L’Actualité Tech — Blog CBTW
Oct 31, 2019 · 9 min read


Or how small and seemingly limited predictions end up depicting the whole result

In March of this year, our data science team started working on a new project: an exciting one, with tons of data, on a subject rarely tackled in data science: sport racing! The goal is to develop a tool that helps bettors make the right choices by identifying the best runners and the outsiders in a race. How? By predicting, for each runner, his probability of finishing at any given rank. To achieve this, we have at our disposal data on the runners (age, various statistics, prizes won…) and on the event (the particular race) itself: time of day, temperature, nature of the track…
In this article, I will show you how the Datawok team managed to achieve expert-level scores on this problem.

Our objective

The initial objective can be expressed quite simply: given data on the runners participating in a race (also called an event in this article) and data on the race itself, can we predict, for each runner, his probability of finishing at each rank at the end of the race?

More formally, let:

  • N the number of initial runners in a particular race (this value is dependent on the race)
  • K the number of features used to describe each runner (this value is always the same)
  • G the number of features used to describe the event (this value is always the same)

The following table, for example, shows the results of two different events.
For each runner we have K=2 features (Age and Prizes won), and the event has G=1 feature (Track length).

[Table: example results of two events, with runner features Age and Prizes won and event feature Track length]
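As a concrete illustration, here is a minimal sketch of what such a table could look like as a pandas DataFrame. The column names follow the example above, but every value (including the runner names) is invented for illustration:

```python
import pandas as pd

# Hypothetical race records: one row per runner per race.
# K=2 runner features (age, prizes_won), G=1 event feature (track_length).
# All values below are invented for illustration only.
records = pd.DataFrame({
    "event_id":     [0, 0, 0, 0, 1, 1, 1],
    "runner":       ["John", "Anna", "Pete", "Mary", "Liz", "Tom", "Sam"],
    "age":          [5, 4, 6, 5, 7, 4, 5],
    "prizes_won":   [12000, 8500, 15000, 3000, 9000, 4000, 7500],
    "track_length": [2400, 2400, 2400, 2400, 1800, 1800, 1800],
    "final_rank":   [1, 3, 2, 4, 2, 1, 3],   # target: finishing position
})
print(records)
```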

Given those parameters, the following solutions could be implemented:

  • predicting, from all the features of all the runners in a race together with the features of the race itself, a matrix of shape N,N where the value i,j is the probability that runner i finishes the race at rank j
  • or predicting a single probability for a given runner/rank couple

In both cases, the question is: what is considered a sample? A race? A runner reaching a rank?

First models and first limitations

Both of these solutions come with inherent limitations:

  • In the first case, both input and output have shapes that depend on N (the number of runners), which varies a lot from one race to another. This means that a padding of some sort must be applied to the input and output data. Another issue with this representation is that the input depends on the order in which you arrange the features of the various runners, so you also have to come up with a rather arbitrary ordering of the runners
  • The second case seems at first a bit better: the size and shape of our samples don't depend on the initial number of runners in the race. Nonetheless, it gives a prediction for a runner/rank couple without any information on that runner's competitors. A solution could be to add a fixed-size array of features representing the other competitors, but this would ultimately mean working with aggregates (averages, max, min, etc.) and would inevitably hinder the quality of the final prediction for a given runner/rank couple. Assessing the quality of a set of runners from fixed-size aggregates didn't seem to be a convenient solution in our precise case

All this being said, we now need a solution that satisfies the following constraints:

  • Both the input samples and the output predictions must have a constant dimension.
  • The final prediction for a given runner/rank couple must be produced knowing the data on the runner's competitors, and preferably without requiring any prior aggregation.

Duels is all you need

Although the previously mentioned approaches could still provide decent results despite their limitations, we took this as an opportunity to experiment with a new paradigm and came up with the following solution: each race is now considered as N*(N-1) duels between all possible couples of competitors (distinguishing the duel A vs B from the duel B vs A). A duel, i.e. the relative ranking of two competitors in a race, is now our sample.

Each sample is now of known and constant size: K features to describe the first runner, K more to describe the second, and G features to describe the whole event. The output is the probability that the first runner loses against the second.

Learning from those features the order of specific runner couples (our target) allows us to estimate a matrix of shape N,N where the value i,j is the probability that runner i finishes the race after runner j. Unlike the matrix mentioned earlier, we are not estimating probabilities on runner/rank couples here, but instead the probability that runner i finishes the race after runner j.

In this new paradigm, we both:

  • manipulate samples of constant size (independent of the initial value of N) and outputs of constant size
  • produce, for each runner, a prediction against every other competitor, given both of their features

Bear in mind that at this point, a sample is a tuple (A, B, G, Y) with:

  • A being the features of one runner in a race
  • B being the features of one competitor in the race
  • G being the features that describe the race itself
  • Y being 1 if A loses the duel and 0 otherwise (this is equivalent to predicting whether the final rank of A is greater than the final rank of B)

The first race of the example table (where Event Id is 0) can in this context be represented as 12 duels (4 runners, each facing 3 competitors), and each duel becomes a sample. Note that this also artificially increases the number of samples: with R races and an average of 10 runners per race, we end up with around R*100 samples.

This “duel” matrix, as we could call it, would look like the following:

[Matrix: predicted duel probabilities — cell i,j is the probability that runner i loses against runner j]

Again, each cell of the matrix (green for high values, light yellow for lower ones) is a probability predicted by our model: the probability that the runner i in row would lose against the runner j in column. The model is not specifically trained so that f(A,B) = 1 - f(B,A), where A and B are runners and f(A,B) is the probability that runner A loses against runner B. We therefore decided to use (f(A,B) + 1 - f(B,A))/2, the mean of the two predictions, as the predicted result of the duel between A and B.
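In code, combining the two directed predictions could look like the following minimal sketch; `predict_lose_proba` is a stand-in for the trained model, and its exact signature is an assumption made for illustration:

```python
import numpy as np

def duel_matrix(runner_features, event_features, predict_lose_proba):
    """Build the N x N "duel" matrix.

    Cell (i, j) is the symmetrized probability that runner i loses
    against runner j. `predict_lose_proba(a, b, g)` is assumed to
    return the model's raw probability f(a, b) that runner `a`
    loses against runner `b` in an event described by `g`.
    """
    n = len(runner_features)
    m = np.full((n, n), np.nan)  # the diagonal stays undefined
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            f_ij = predict_lose_proba(runner_features[i], runner_features[j], event_features)
            f_ji = predict_lose_proba(runner_features[j], runner_features[i], event_features)
            # Mean of the two directed predictions, as described above.
            m[i, j] = (f_ij + (1.0 - f_ji)) / 2.0
    return m
```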

Let’s sum up:

  1. Each race consists of N records, one per runner
  2. Each record contains data about both the race and the runner, and also provides the final rank of the runner in this particular race
  3. The records are grouped by race, and each race of N runners is transformed into N*(N-1) duels

To illustrate this process, let’s see how the first race in our example would be represented as duels:

[Table: the first race of the example represented as 12 duel samples]
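A minimal sketch of this transformation, assuming the record layout of the earlier DataFrame example (the column names are ours, not the project's actual schema):

```python
import itertools
import pandas as pd

RUNNER_FEATURES = ["age", "prizes_won"]   # the K runner features
EVENT_FEATURES = ["track_length"]         # the G event features

def races_to_duels(records: pd.DataFrame) -> pd.DataFrame:
    """Turn per-race records into N*(N-1) duel samples per race."""
    duels = []
    for _, race in records.groupby("event_id"):
        rows = race.to_dict("records")
        # Ordered pairs: the duel (A, B) is distinct from (B, A).
        for a, b in itertools.permutations(rows, 2):
            sample = {f"a_{f}": a[f] for f in RUNNER_FEATURES}
            sample.update({f"b_{f}": b[f] for f in RUNNER_FEATURES})
            sample.update({f: a[f] for f in EVENT_FEATURES})
            # Y = 1 if runner A loses the duel (finishes after B).
            sample["y"] = int(a["final_rank"] > b["final_rank"])
            duels.append(sample)
    return pd.DataFrame(duels)
```

Applied to the 4-runner race above, this yields the 12 duels mentioned earlier.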

This is how our dataset was built. From 11K races dated between 2017 and July 2018, we produced over 1 million samples (the duels) to train on. From 5K races dated between June 2018 and June 2019, another 0.5 million samples were produced to test the model.

The big picture

We had to do a lot of feature engineering and data cleansing. Many of our runner features were computed from their pre-race history (number of times they ranked 5th or better in the past 12 months, number of disqualifications in the past 6 months, etc.). However, since this article is mainly about how we represented our data, we will not go through each and every step of the machine learning part. We went through the typical steps of a data science project: business understanding, feature engineering/cleansing, modeling, and evaluation, until reaching satisfying results.

We used an XGBoost classifier on our data, and the metric used was the classic accuracy. Note that a model that would predict 0 or 1 randomly on each pair would have an accuracy of 0.5, since for each sample A against B with a target of 1 there is exactly one sample B against A with a target of 0.
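Training and scoring such a classifier on the duel samples could look like the sketch below; the hyperparameters are illustrative, not the ones we actually used, and the real train/test sets were split by date range rather than randomly:

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# `records` and races_to_duels() come from the sketches above.
duels = races_to_duels(records)
feature_cols = [c for c in duels.columns if c != "y"]
X_train, X_test, y_train, y_test = train_test_split(
    duels[feature_cols], duels["y"], test_size=0.3, random_state=0)

# Illustrative hyperparameters only.
model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
print("duel accuracy:", accuracy_score(y_test, model.predict(X_test)))
```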

Speaking of metrics… reaching a given score when predicting the correct order of a pair of competitors is only part of the solution. We still need a way to go from those small estimations to the final runner/rank matrix. Given a model that predicts the relative ranking of a pair of runners (either A finishing after B or the opposite), we can predict all the duels involving a specific runner in a race. This gives us his probabilities of being beaten by each of the other competitors: an array of probabilities whose i-th value is the estimated probability of losing against the i-th competitor. Returning to our previous example, let's see the estimations made by our model for John, who came first in the race:

[Table: John's predicted probabilities of losing each of his three duels]

We can now compute the probability for John to reach each rank by simply simulating all his duels M times. For example, with the previous table, a simulation consists of drawing 3 random numbers between 0 and 1 (one per competitor). If the drawn value is above the predicted probability, the duel is considered a win. Reaching the first rank means losing no duel; reaching the n-th rank means losing exactly n-1 duels.

Keeping this last example, we can run a simulation by generating 3 random numbers and counting the duels lost:

[Table: three example simulations of John's duels and the resulting ranks]

In this case, 3 simulations were made. In 67% of them John reached the first rank, and in 33% of them we estimated that he would be on the second rank. In our case, M (the number of simulations per runner) was empirically set to 4000.
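A minimal sketch of this simulation for a single runner, given his vector of duel-loss probabilities (the probabilities below are hypothetical):

```python
import numpy as np

def rank_distribution(lose_probas, n_sims=4000, seed=None):
    """Estimate a runner's probability of reaching each rank.

    `lose_probas[i]` is the predicted probability of losing the duel
    against the i-th competitor. Each simulation draws one uniform
    number per duel; a draw above the probability counts as a win,
    and the simulated rank is 1 + the number of duels lost.
    """
    rng = np.random.default_rng(seed)
    n = len(lose_probas)
    draws = rng.random((n_sims, n))
    lost = (draws <= np.asarray(lose_probas)).sum(axis=1)
    # Index k of the result is the probability of finishing at rank k+1.
    return np.bincount(lost, minlength=n + 1) / n_sims

# Hypothetical duel-loss probabilities for a runner facing 3 competitors:
print(rank_distribution([0.2, 0.35, 0.1], n_sims=4000, seed=0))
```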

Applying this method to each runner gives us the rows of our matrix: each row represents a runner and each column a rank. One can easily convince oneself that the probabilities in each row sum to 1 (each simulation assigns the runner exactly one rank), but the probabilities in each column do not.

On the business side, this matrix is used to produce rankings by choosing a runner for each column (i.e. each rank) according to the aforementioned probabilities: the higher a runner's probability of being at a rank, the likelier he is to be picked for that rank. Hence the need for a column-wise normalization.
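The normalization itself is a one-liner in NumPy; in this sketch, `rank_matrix` stands for the N x N runner/rank matrix produced by the simulations:

```python
import numpy as np

def normalize_columns(rank_matrix):
    """Rescale each column so the probabilities over runners sum to 1
    for every rank (the rows already sum to 1 by construction)."""
    return rank_matrix / rank_matrix.sum(axis=0, keepdims=True)
```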

The following matrix results from applying the simulation method exposed earlier to the first matrix we saw (the “duel” matrix). In this matrix, a cell i,j is the probability for runner i to finish the race at rank j:

[Matrix: runner/rank probabilities — cell i,j is the probability that runner i finishes at rank j]

The results

The KPIs measure how well the algorithm predicts k runners among the n top ranks; they can also take into account whether the order among these k runners is respected. We know that a score of 0.5 when predicting the order of a pair (a duel) results in a random ranking, and that a score of 1 results in a perfect ranking. But between those two extreme values, we have no way to assess the quality of the final ranking from the model's score alone.

As we said in the first paragraph, we achieved expert-level scores (in terms of KPIs) on this problem. And the most fascinating fact, in my opinion, is that we managed to perform really well on the KPIs with an accuracy of… 0.65 when predicting the correct order of a duel. Isn't that surprising? With 0.35 of the pairs misclassified, we still get globally good results: as if, by asking our model to make N*(N-1) predictions for N runners, we spread the error instead of accumulating it!

The author

Daoud Chami
Data Scientist & NLP specialist at Datawok
