[Data Collection] Project: Hong Kong Horse Racing Prediction, Part I

Jun 22, 2023

Introduction

This project aims to predict horse racing results at the Hong Kong Jockey Club (HKJC). In addition to conventional factors like jockey statistics, odds, and past race outcomes, we explore the efficacy of a more targeted approach. Our aim is to develop a predictive model that considers the individual performance of each horse and jockey. To accomplish this, we introduce a ranking system specifically tailored to assess the capabilities of both horses and jockeys. The goal is to evaluate whether this ranking system can effectively forecast the final standings of each race.

HKJC website: https://www.hkjc.com/home/english/index.aspx

Background

The Hong Kong Jockey Club holds a prominent position as one of the world’s leading racing administrators. With two race tracks located at Sha Tin and Happy Valley, it organizes an extensive calendar of nearly 700 horse races annually. The popularity of horse racing in Hong Kong is evident from the substantial revenue it generates, with horse racing bets contributing a staggering CAD$23 billion in the 2020–21 fiscal year. This mature and transparent gambling industry provides a rich source of data that we can delve into for insightful analysis and predictions.

Hypothesis and Methodology

A review of publicly available studies shows that most of them use odds and final placement as variables for predicting racing results. These models have not performed satisfactorily, which leads to the hypothesis that the following issues may be limiting their accuracy:

Variation in Racing Combinations: Each racing event features a unique combination of participating horses and jockeys, rendering the place and odds of other races irrelevant for accurate prediction. Therefore, these variables should not be considered in the prediction process.
Influence of Distance: The performance of horses varies depending on the race distance. Some horses may excel in long-distance races, while others may not. Therefore, it is crucial to account for the distance factor when making predictions.
Impact of Randomness: In cases where the difference in finish times is minimal (e.g., 0.01s to 0.3s), the outcome can be greatly affected by random factors. Thus, using final placement as a sole variable for modeling may not be ideal.
Effects of Suspension: Horses that have been suspended for an extended period may experience a decline in performance upon their return, which could impact their future results.

To address these problems, we propose the following improvements:

Ranking System: A ranking system will be implemented to assess the abilities of both horses and jockeys. This system will assign meaningful scores to each participant in every race, similar to ranking systems used in games like League of Legends. The rankings will reflect the skills and performance of the individuals, regardless of the specific race in which they participate.
Focus on Long-Distance Race Prediction: To mitigate the impact of randomness, only data from long-distance races will be used for prediction. This approach aims to reduce the influence of small time differences.
Time Factor: A time factor will be introduced that gradually decreases the score of a horse for each race meeting it does not take part in. This factor accounts for the consistency and regularity of a horse’s participation, so that the rankings reflect its current form.
Consider the Time Gap to the Winner: In addition to using final placement as a scoring factor, the gap between each horse's finish time and the winner's finish time will also be considered. This provides a more nuanced evaluation of performance, capturing how close each horse came to the winner.

Process

1. Data Collection

Extract race data from the HKJC website by accessing the history data using the exact date in the URL.
Store the collected data, which is approximately 20MB in size, in an Excel worksheet.

2. Data Cleaning and Transformation

Remove special races and overseas races from the raw data.
Exclude races with disqualified or void participants to maintain data quality and focus on the main analysis.

3. Data Modeling (1)

Generate a score for each horse and jockey.
Deduct a penalty from scores if a horse or jockey did not participate on a specific racing date.
For each race, allocate a percentage of the participants' scores to a pool, then redistribute the pool based on final placement and the time gap behind the winner.

Benefits of this process:

Adjust scores for horses and jockeys that haven’t participated for a long period to account for potential performance changes.
Allow scores to increase significantly for horses that win against competitors with higher scores, and vice versa.
Keep scores close to the winner's when finishing times are similar.
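
The scoring idea in Data Modeling (1) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the project's actual formula: the function names, the starting score of 100, the 10% stake, the fixed penalty, and the weighting `exp(-gap / tau) / place` are all hypothetical choices that simply combine placement and time gap in one weight.

```python
import math

def apply_inactivity_penalty(scores, participants, penalty=1.0):
    """Deduct a fixed penalty (hypothetical value) from everyone absent on a race date."""
    for name in scores:
        if name not in participants:
            scores[name] = max(0.0, scores[name] - penalty)
    return scores

def update_scores(scores, results, stake=0.1, tau=1.0, initial=100.0):
    """Redistribute a score pool for one race.

    results: list of (name, place, finish_time_in_seconds).
    stake, tau and initial are illustrative parameters, not values from the project.
    """
    winner_time = min(t for _, _, t in results)

    # Each entrant pays a fraction of its current score into the pool.
    pool = 0.0
    for name, _, _ in results:
        current = scores.setdefault(name, initial)
        contribution = stake * current
        scores[name] = current - contribution
        pool += contribution

    # Weight falls with worse placement and with a larger gap behind the winner.
    weights = {
        name: math.exp(-(t - winner_time) / tau) / place
        for name, place, t in results
    }
    total = sum(weights.values())

    # Pay the pool back out in proportion to the weights (total score is conserved).
    for name, weight in weights.items():
        scores[name] += pool * weight / total
    return scores

scores = {"Horse A": 100.0, "Horse B": 100.0, "Horse C": 100.0}
apply_inactivity_penalty(scores, participants={"Horse A", "Horse B"})
update_scores(scores, [("Horse A", 1, 100.0), ("Horse B", 2, 100.5)])
print(scores)
```

Because the pool grows with the scores of the entrants, a win against higher-rated opponents pays out more than a win against equals, and a small time gap keeps the runner-up's payout close to the winner's.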

4. Data Modeling (2)

Keep records of the horse’s speed and score at each race result.
Utilize these scores for training the machine learning model.

5. Training Model

Set two prediction goals: place and return.
Analyze the correlation between scores and actual results.
Train the model using the parameters in the ranking system to achieve accurate predictions.

6. Data Visualization

Use Power BI to visualize the historical performance of the races and display the predicted outcomes.

Data Collection

To scrape data from the HKJC website, we can use the Selenium library in Python. Here are the steps involved:

Step 1. Understanding the URL of HKJC. Each race’s data can be accessed through a specific URL that includes the race date, racecourse, and race number. We can use Selenium to loop through each day, starting from 2010.

https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2023/05/24&Racecourse=HV&RaceNo=4
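
As a rough sketch of Step 1, the URL can be assembled from its three query parameters, following the pattern of the example above. The helper names `build_result_url` and `iter_dates` are hypothetical. Most calendar dates will not have a race meeting, so a real loop would also need to detect and skip pages without results.

```python
from datetime import date, timedelta

BASE_URL = (
    "https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx"
)

def build_result_url(race_date, racecourse, race_no):
    """Build the results URL for one race, e.g. racecourse "HV" and race number 4."""
    return (
        f"{BASE_URL}?RaceDate={race_date:%Y/%m/%d}"
        f"&Racecourse={racecourse}&RaceNo={race_no}"
    )

def iter_dates(start, end):
    """Yield every calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)

print(build_result_url(date(2023, 5, 24), "HV", 4))
```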

Step 2. Save all race dates in an Excel file called RaceDateList.

race_date_list Sample

Step 3. Loop over all dates and all races, merge the results into one table, and export it as a file called RacePlaceData.

We also need a table that stores the course, distance, and location. These details are not part of the results table, so we extract the text above the table and save it as a file called RaceInformation.
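
One possible shape for the per-race scraping step is sketched below. The function names are hypothetical, and the CSS selector `div.performance table` is only a guess at the page layout; it should be checked against the live page in the browser's inspector. The Selenium import sits inside the scraping function so that the pure helper can be used without Selenium installed.

```python
def rows_to_records(header, rows, race_date, racecourse, race_no):
    """Turn raw table cells into dicts tagged with the race they came from."""
    records = []
    for cells in rows:
        if len(cells) != len(header):
            continue  # skip malformed or spacer rows
        record = dict(zip(header, cells))
        record.update(
            {"RaceDate": race_date, "Racecourse": racecourse, "RaceNo": race_no}
        )
        records.append(record)
    return records

def scrape_result_table(driver, url, table_css="div.performance table"):
    """Load one race page and return (header, rows) as lists of cell text.

    table_css is an assumed selector, not one confirmed against the site.
    """
    from selenium.webdriver.common.by import By

    driver.get(url)
    tables = driver.find_elements(By.CSS_SELECTOR, table_css)
    if not tables:
        return [], []  # no results table, e.g. no race on this date
    lines = [
        [cell.text.strip() for cell in tr.find_elements(By.CSS_SELECTOR, "th, td")]
        for tr in tables[0].find_elements(By.TAG_NAME, "tr")
    ]
    lines = [line for line in lines if line]
    return (lines[0], lines[1:]) if lines else ([], [])

# Typical use (requires a local browser driver):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   header, rows = scrape_result_table(driver, url)
#   records = rows_to_records(header, rows, "2023/05/24", "HV", 4)
#   driver.quit()
```

The records from every race can then be collected in one list and exported, for instance with `pandas.DataFrame(records).to_excel(...)`.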

*Class is the grade of the race; Class 1 is the highest grade.

Data Cleaning

Once the data has been extracted from HKJC, it is essential to perform data cleaning to ensure the dataset is suitable for analysis.

1. Special incidents: we need to remove data points related to special incidents, such as disqualifications or withdrawals. These incidents can be identified by the remarks in the “Pla” field.

2. In addition, it is important to exclude international races and races that do not fall into the desired class categories. This keeps the dataset focused on local races and relevant classes.

3. To further refine the dataset, it is important to remove columns that are not required for our analysis. In this case, we can remove the “Running Position,” “Dr”, and “LBW” columns.

The cleaning process is best done in Excel, with the result saved as .xlsx.
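
For readers who prefer a script, the three cleaning steps could be sketched roughly as follows. This is an illustration under several assumptions: each row is a dict that already carries `RaceDate`, `Racecourse`, `RaceNo`, and `Class` fields (hypothetical names, with `Class` joined in from RaceInformation), any non-numeric value in “Pla” is treated as a special incident, and the whole race is dropped when one occurs, in line with the process outline. The set of allowed classes is left to the caller because the desired classes are not listed here.

```python
DROP_COLUMNS = {"Running Position", "Dr", "LBW"}

def race_key(record):
    """Identify the race a row belongs to (field names are assumptions)."""
    return (record["RaceDate"], record["Racecourse"], record["RaceNo"])

def clean_results(records, allowed_classes):
    """Apply the three cleaning steps to a list of row dicts."""
    # Step 1: find races containing a non-numeric placing (special incident).
    incident_races = {
        race_key(r) for r in records if not str(r["Pla"]).strip().isdigit()
    }
    cleaned = []
    for r in records:
        if race_key(r) in incident_races:
            continue
        # Step 2: keep only races in the desired classes.
        if r.get("Class") not in allowed_classes:
            continue
        # Step 3: drop columns that are not needed for the analysis.
        cleaned.append({k: v for k, v in r.items() if k not in DROP_COLUMNS})
    return cleaned
```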

Data Transformation

This is the final data structure we want to construct. The Jockey and Horse tables hold the latest state of each participant and are used for the BI report.