Liam Pauling
5 min read · Nov 20, 2017

Using machine learning to accurately predict horse race duration

I specialise in trading in-play horse racing markets, and a few of my algorithms depend on knowing how much of the race is left. I currently take the average time based on race type, course and distance, but there are a lot of other factors that impact the likely winning time, especially over longer distances. With machine learning being the big buzzword at the moment, I thought this would be a great example to learn the basics whilst solving a relatively simple problem.

The data

The first step in ML is getting accurate data, and of course this ended up being much trickier than expected. It’s possible to purchase historical horse racing data; however, the BHA have a nice fancy website powered by an API which exposes the data I require.

The issue is that the API employs very strict throttling at the IP address level. By implementing random sleeps in between scrapes and using a mix of EC2 instances and serverless functions, I was able to get a year’s worth of data after a few days (still scraping the rest).
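As a rough illustration, the scraping loop boils down to something like the sketch below. The endpoint, parameters and response format here are placeholders rather than the actual BHA API; the random sleep between requests is the important part:

import time
import random

import requests

# placeholder endpoint, not the real BHA API
BASE_URL = 'https://api.example-racing.com/results'


def fetch_day(date):
    """Fetch one day of results, then back off with a random sleep."""
    response = requests.get(BASE_URL, params={'date': date})
    response.raise_for_status()
    # random sleep to stay under the per-IP throttle
    time.sleep(random.uniform(5, 15))
    return response.json()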

Features

What impacts the winning time? A lot. However, the main features available before the race starts are the following:

  • Course
  • Distance
  • Race type (flat/jumps/chase etc.)
  • Track type (turf/awt etc.)
  • Going (fast/hard/soft etc.)
  • Race Class (1–6, with 1 typically involving more expensive horses and thus quicker times)

Average Time Distribution

This is of course ignoring the actual runners and focusing on just the race. Plotting the distribution of average times, it is obvious that race time does not follow a normal distribution. I will be using a random forest, where this is not a problem; if I were using linear regression, this would need to be normalised.
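A quick way to see this is a histogram of the average times. This is a minimal sketch, assuming the same 'export.csv' and 'avg_time' column used in the training code further down:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

race_data_df = pd.read_csv('export.csv')

# distribution of average race times; the long right tail (jumps races)
# is what breaks the normality assumption
sns.histplot(race_data_df['avg_time'], bins=50, kde=True)
plt.xlabel('Average time (s)')
plt.show()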

Obviously distance is likely to be the most important feature, so here is a graph simply plotting distance against race times.

Distance v Average Time

As you can see it is almost a straight line, as logic would suggest, but it is not perfect, especially at the longer distances (jumps) where going and class start to heavily impact the race time.
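The scatter itself is nothing more than the following, a sketch using the same dataframe and column names as the training code below:

import pandas as pd
import matplotlib.pyplot as plt

race_data_df = pd.read_csv('export.csv')

# near-linear relationship, with the spread widening at jumps distances
fig, ax = plt.subplots()
ax.scatter(race_data_df['distance_value'], race_data_df['avg_time'], s=5)
ax.set_xlabel('Distance')
ax.set_ylabel('Average time (s)')
plt.show()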

Extraction

I will be using Python and sklearn and therefore cannot provide strings such as ‘Flat’, ‘Jump’, etc. This of course opens up a can of worms; look up one-hot encoding, which I used when dealing with race type and track type.
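With pandas this is largely a one-liner. The sketch below assumes the raw export has 'race_type' and 'track_type' columns holding the string labels:

import pandas as pd

race_data_df = pd.read_csv('export.csv')

# one-hot encode the categorical columns so sklearn gets numeric input;
# 'race_type' becomes race_type_flat, race_type_jumps, ... and likewise
# for track_type
race_data_df = pd.get_dummies(
    race_data_df,
    columns=['race_type', 'track_type'],
)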

Machine Learning

Random Forest

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

— Wikipedia

Ironically, this is very similar to how betting markets work (think wisdom of the crowd).

Regression: the output variable takes continuous values.

Classification: the output variable takes class labels.

What does this mean? Basically we have a regression problem, as race time is not classified by group but is a continuous value. See below for the code required to create a very simple model using just the distance as a starting point:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor


race_data_df = pd.read_csv('export.csv')

n = race_data_df[['avg_time', 'distance_value']]

# split data into two
train_data, test_data = train_test_split(n, test_size=0.33)

x_train = train_data.drop("avg_time", axis=1)
y_train = train_data['avg_time']

x_test = test_data.drop(["avg_time"], axis=1)
y_test = test_data['avg_time']

# create and fit the forest
forest = RandomForestRegressor(n_estimators=100)
forest.fit(x_train, y_train)

# predict
y_pred = forest.predict(x_test)

# print the R^2 score, scaled to a percentage
forest_score = round(forest.score(x_test, y_test) * 100, 2)
print(forest_score)

# plot predicted v measured
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, edgecolors=(0, 0, 0))
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=1)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
Random Forest results for just the distance

The score is roughly 99.11, which isn’t bad; the graphed predictions also demonstrate that it is much more accurate at shorter distances. Including the course, race type, race class and track type, I get the following:

Predicted results
  • Flats score: 99.5
  • Jumps score: 95.0
  • Both: 99.3

Much better, but jumps is still not great; this is hopefully where using the ‘going’ can help. Annoyingly, the BHA provides unstructured data when it comes to going, and it typically looks like this:

Straight Course: Good to Soft Round Course: Good to Soft

Luckily I have access to another data source (Timeform) where the going is more structured.
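The going then just needs mapping onto an ordinal key before it can be fed to the forest. The sketch below is illustrative: the 'going' and 'going_key' column names and the exact encoding are assumptions, although the firm-to-heavy ordering is the standard UK scale:

import pandas as pd

race_data_df = pd.read_csv('export.csv')

# illustrative ordinal encoding of the going, firm through heavy
GOING_KEYS = {
    'Hard': 1,
    'Firm': 2,
    'Good to Firm': 3,
    'Good': 4,
    'Good to Soft': 5,
    'Soft': 6,
    'Heavy': 7,
}
race_data_df['going_key'] = race_data_df['going'].map(GOING_KEYS)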

Prediction results with going
  • Flats score: 99.4
  • Jumps score: 96.5
  • Both: 99.5

A big improvement on the jumps, but it looks to be having an adverse effect on the flat predictions, although this could be a data quality issue as there are now a few outliers. This is probably where I should split the model into two, as they are very different race types, but I’m happy with 99.5 for now.

Deployment

To me this is probably the most interesting part of the ML process, but oddly there is very little on the internet regarding it. Out of all of the documentation for sklearn there is a single page on how to persist a model with pickle and deploy it.
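In short, persisting the model is just a pickle dump. This sketch assumes the fitted 'forest' from the training code above and the same file path loaded by the API below:

import pickle

# save the trained forest so the Flask app can load it at startup
with open('racetimes/racetime_forest_v2', 'wb') as f:
    pickle.dump(forest, f)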

There is also a great thread on Hacker News, but it is more about the pipeline process when it comes to keeping models up to date.

In order to use this in production I need it wrapped up in an API, and this is where Flask and Zappa come in. Flask because you can create an API very easily:

import json
import pickle

from flask import Flask, request


app = Flask(__name__)

forest = pickle.load(open('racetimes/racetime_forest_v2', 'rb'))


@app.route('/predict-race-times', methods=['POST'])
def index():
    data = json.loads(
        request.data.decode()
    )

    # build the feature array for each requested race
    p = []
    for r in data.get('request'):
        p.append(
            [
                r['course_id'],
                r['distance'],
                r['race_class'],
                r['race_type_flat'],
                r['race_type_jumps'],
                r['track_type_turf'],
                r['track_type_awt'],
                r['course_id'] * r['distance'],
                r['going_key'],
                r['racetype_key'],
            ]
        )

    # predict
    output = forest.predict(p)

    return json.dumps(
        {'response': output.tolist()}
    )

And Zappa to allow deployment to AWS Lambda (serverless).
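Once deployed, calling it is a simple POST. The URL below is a placeholder and the values are made up, but the field names mirror the Flask handler above:

import json

import requests

payload = {
    'request': [
        {
            'course_id': 12,
            'distance': 3200,
            'race_class': 4,
            'race_type_flat': 0,
            'race_type_jumps': 1,
            'track_type_turf': 1,
            'track_type_awt': 0,
            'going_key': 5,
            'racetype_key': 2,
        }
    ]
}

# placeholder URL for the deployed Lambda endpoint
response = requests.post(
    'https://example.execute-api.eu-west-2.amazonaws.com/dev/predict-race-times',
    data=json.dumps(payload),
)
print(response.json()['response'])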

Testing

Here are the results from a quick random test comparing predicted and actual times from Saturday:

Wetherby 2017–11–18 12:30

  • Predicted 306.4s / Actual 299.6s

Wolves 2017–11–18 20:15

  • Predicted 119.6s / Actual 119.57s