How Covid-19 prevented me from being a millionaire

Sébastien Cararo · Published in Analytics Vidhya · Dec 6, 2020

I created a sports betting algorithm that allowed me to multiply my bankroll sixfold in just two months — before Covid-19 suddenly stopped all sports leagues.

NB: This article is intentionally condensed and presents only the key information. For exhaustive details, please refer to the full paper I wrote [1], which contains all construction steps and mathematical foundations. In this article I only develop the NBA program, as the other sports’ programs are also described in the paper.

Project’s genesis

Figure 1 : Data Science Venn Diagram

As a 22-year-old sports enthusiast, I have always been fascinated by sports analytics and have spent years studying the many statistics available in the NBA.

I thought about such a project after attending multiple Data Science courses at university, but also after discovering several publications and scientific papers on the subject. I found it very interesting to model human dynamics with mathematical functions such as Neural Networks and other Machine Learning models.

How does it work?

The idea of the pipeline can be summarized as follows: we seek to create a Machine Learning estimator of winning probabilities, compare these probabilities with the market odds, and then infer profitable betting strategies from this comparison.

After researching the subject, I realized that many similar projects settle for creating the most accurate model possible. However, most of the time they cannot exploit their model to yield a profit in sports betting, because the bookmakers’ margin still offsets the good accuracy of their models.

I then tried to figure out how to counter this, started creating my own models, and finally came up with a way to identify biases in the bookmakers’ odds. The whole pipeline can be summarized with the following diagram, whose main steps are briefly described in this article:

Figure 2 : Descriptive pipeline of the project (extracted from the paper)

The steps that I will describe in this article are the following ones:

I — Data collection : creating a webscraper algorithm in Python

II — Variables computation : creating predictive variables to feed the ML models

III — Principal Components Analysis (PCA): Reducing variance in the fitting process by creating a new set of orthogonal components

IV — Fitting several ML models, with different hyperparameter settings

V — Creating profitable betting strategies thanks to an original heuristic (personal creation)

VI — What about the results?

I — Data collection : creating a webscraper algorithm in Python

The first step was to collect raw historical data. For this I created a Python web scraper to collect data from www.oddsportal.com, and published the package on my GitHub page (GitHub page [2], repository [3]). The package is user-friendly and allows anybody to scrape any league in practically any sport.

The only raw data I needed to start from was a historical dataset containing the result of each game, along with the team names and the closing odds proposed by the bookmaker Bet365. Closing odds are the odds available just before the beginning of the match. It is very important to have this particular data, because odds can fluctuate a lot between opening and closing time, mainly to adapt to player injuries and to gamblers staking on one team or the other.
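To give an idea of what this step looks like, here is a minimal, hypothetical scraping sketch in Python. It is not the actual scrapeOP package: the CSS selectors and cell positions are placeholders that would need to be adapted to the site's real markup (the pages are JavaScript-rendered, hence the Selenium assumption).

```python
# Minimal, hypothetical sketch of an oddsportal-style scraping loop.
# This is NOT the scrapeOP package: the CSS selectors and cell positions
# below are placeholders to adapt to the site's actual markup.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_results_page(url: str) -> pd.DataFrame:
    driver = webdriver.Chrome()   # the pages are JavaScript-rendered
    rows = []
    try:
        driver.get(url)
        # "tr.game-row" is a placeholder selector for one game per table row
        for game in driver.find_elements(By.CSS_SELECTOR, "tr.game-row"):
            cells = game.find_elements(By.TAG_NAME, "td")
            rows.append({
                "teams": cells[0].text,                     # e.g. "Boston Celtics - Miami Heat"
                "score": cells[1].text,                     # e.g. "112:105"
                "closing_odds_home": float(cells[2].text),  # Bet365 closing odds
                "closing_odds_away": float(cells[3].text),
            })
    finally:
        driver.quit()
    return pd.DataFrame(rows)
```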

Figure 3 : Screenshot of the web scraping algorithm running
Figure 4 : Scraped table overview

II — Variables computation : creating predictive variables to feed the ML models

After removing pre-season, All-Star and playoff games, the next step was to create predictive variables on which to fit the different Machine Learning models. I coded a total of 133 features for this pipeline.

To summarize, the predictive variables can be grouped into: record variables (W/L), performance variables (average points scored, points conceded, etc.), market-related variables (e.g. ROI of the team) and ranking variables (Elo).
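As an illustration of this step, the sketch below computes a handful of such features with pandas. The column names (date, team, win, points_for, points_against) are assumptions rather than the exact schema of the original pipeline; the shift(1) ensures a game's own result never leaks into its features.

```python
# Illustrative computation of a few record/performance features with pandas.
# Column names ("date", "team", "win", "points_for", "points_against") are
# assumptions, not the exact schema used in the original pipeline.
import pandas as pd

def add_rolling_features(games: pd.DataFrame, window: int = 10) -> pd.DataFrame:
    games = games.sort_values("date").copy()
    grouped = games.groupby("team")
    # Record variable: win rate over the last `window` games (shifted to avoid leakage)
    games["win_rate_10"] = grouped["win"].transform(
        lambda s: s.shift(1).rolling(window).mean()
    )
    # Performance variables: average points scored / conceded over the same window
    games["avg_pts_for_10"] = grouped["points_for"].transform(
        lambda s: s.shift(1).rolling(window).mean()
    )
    games["avg_pts_against_10"] = grouped["points_against"].transform(
        lambda s: s.shift(1).rolling(window).mean()
    )
    return games
```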

III — Principal Components Analysis (PCA): Reducing variance in the fitting process by creating a new set of orthogonal components

After computing and checking the 133 predictive variables, we can first have a look at Pearson’s correlation matrix.

As a general statement, we can notice that many of the variables are essentially uncorrelated, which might be a good sign, as it would mean that they capture different statistical patterns in the dataset. It could also mean that there is noise in the dataset.

Figure 6 : Pearson’s correlation matrix of the training set

I then applied Principal Component Analysis (PCA) in order to filter out possibly noisy information in the dataset, but also to reduce variance in the fitting process. Indeed, the newly created principal components are all orthogonal, which enhances the ML models’ performance.

NB : PCA has to be fitted solely on the training set, in order to avoid data leakage into testing and future predictions.

PCA is a very powerful technique to enhance the signal-to-noise ratio (SNR), because we can then select the first components to fit the models on, capturing most of the information in a limited number of predictors. This is typically a huge benefit for our study because many variables are, in fact, correlated.

In the following figure, we can identify a clear “elbow” pattern in the Cumulative Proportion of Variance Explained (Cumulative PVE) graph, which might indicate that the majority of the information is contained in the first 15 PCs (i.e. there is not much variance left in the last PCs). However, we will perform cross-validation when building the models (with nPCs = 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 80) in order to compare the results obtained with different sets of PCs. The final goal is to determine the optimal number of PCs such that the bias/variance tradeoff is as good as possible in terms of accuracy.

Figure 7 : Cumulative PVE graph over the training set
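The sketch below shows what this step could look like with scikit-learn (the original pipeline is written in R): the scaler and the PCA are fitted on the training seasons only, then applied to the test season. The data here are random placeholders standing in for the 133 features.

```python
# PCA fitted on the training set only, then applied to the test set (no leakage).
# X_train / X_test are random placeholders standing in for the 133 features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(11070, 133))   # ~9 training seasons of games
X_test = rng.normal(size=(1230, 133))     # 2018/2019 test season

scaler = StandardScaler().fit(X_train)    # scaling statistics from the training set only
pca = PCA().fit(scaler.transform(X_train))

cumulative_pve = np.cumsum(pca.explained_variance_ratio_)  # curve of Figure 7
n_pcs = 15                                # one of the values tried in cross-validation
Z_train = pca.transform(scaler.transform(X_train))[:, :n_pcs]
Z_test = pca.transform(scaler.transform(X_test))[:, :n_pcs]
```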

IV — Fitting several ML models, with different hyperparameter settings

I chose to fit multiple types of models on the newly created principal components, with different hyperparameter settings. The four types of models I used were Random Forests, Support Vector Machines (SVM), averaged Neural Networks (avNNet), and Discriminant Analysis (LDA/QDA).

These models were trained to predict the outcome of the game in a binary way: either the home team (‘H’) or the away team (‘A’) wins. I also fitted the models to output estimated probabilities of victory. All models were trained using accuracy as the metric, even though various criteria such as Gini, entropy or Area Under the Curve (AUC) could have been used in this binary classification setting.
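For illustration, here is a rough scikit-learn analogue of this model comparison (the original models were fitted in R; MLPClassifier merely stands in for avNNet, and the hyperparameters are arbitrary). It reuses Z_train and Z_test from the PCA sketch above, with placeholder outcome labels.

```python
# Rough scikit-learn analogue of the model comparison; hyperparameters are
# illustrative and y_train / y_test are placeholder 'H'/'A' outcome labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
y_train = rng.choice(["H", "A"], size=Z_train.shape[0], p=[0.59, 0.41])
y_test = rng.choice(["H", "A"], size=Z_test.shape[0], p=[0.59, 0.41])

models = {
    "Random Forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "Neural Network (stand-in for avNNet)": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
}

for name, model in models.items():
    model.fit(Z_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(Z_test))
    # Estimated probability that the home team wins, used later by the betting step
    p_home = model.predict_proba(Z_test)[:, list(model.classes_).index("H")]
    print(f"{name}: accuracy = {accuracy:.4f}")
```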

The testing set for all models was composed of the very last season at the time (the 2018/2019 season, 1230 games), meaning I kept 9 seasons for training the models. The results presented below show that the avNNet models slightly outperform the others in forecasting accuracy. The best models are highlighted in dark green and the best model of each type is highlighted in light green.

NB : During the testing period, Bet365’s favorite won 67.64% of the time. We can see below that my best models outperform this baseline over the testing period.

Figure 8 : Comparison between models accuracies over 2018–2019 season (testing set)
Figure 9 : Accuracies as a function of the number of PCs used for regression

V — Creating profitable betting strategies thanks to an original heuristic (personal creation)

As described earlier, the literature shows a lot of very powerful models which take many parameters as input and require a lot of computing power for the variables and the models. However, few of them are capable of yielding a good betting strategy, because even if the estimator accurately estimates the chances of victory, one also has to cope with the bookmakers’ margin.

[EDIT 17/02/2021] Without going too deep into the details, for confidentiality reasons: I created a metric that allows us to identify biases in the bookmakers’ odds estimation.

Once we have identified our profitable strategies, we simply compute predictions for future games, identify the games that belong to profitable strategies, and send the upcoming bets by email. In case of a conflict between strategies (i.e. strategy A says to bet Home while strategy B says to bet Away), we pick the prediction of the best strategy.
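Since the actual heuristic is confidential, here is only a generic “value betting” sketch of the general idea: remove the bookmaker’s margin from the closing odds, compare the resulting implied probabilities with the model’s probabilities, and flag games where the estimated edge exceeds a threshold. Column names and the threshold are assumptions, and this is not the strategy used in the article.

```python
# Generic value-betting sketch, NOT the confidential heuristic from the article.
# It compares the model's probability with the margin-free probability implied
# by the closing odds and flags games with a sufficient estimated edge.
import pandas as pd

def select_bets(games: pd.DataFrame, edge_threshold: float = 0.03) -> pd.DataFrame:
    g = games.copy()
    # Raw implied probabilities sum to more than 1: the excess is the bookmaker's margin
    imp_home, imp_away = 1.0 / g["odds_home"], 1.0 / g["odds_away"]
    overround = imp_home + imp_away
    g["fair_p_home"] = imp_home / overround          # margin-free market estimate
    # Model edge over the market, for each side of the bet
    g["edge_home"] = g["model_p_home"] - g["fair_p_home"]
    g["edge_away"] = (1 - g["model_p_home"]) - (1 - g["fair_p_home"])
    g["bet"] = "none"
    g.loc[g["edge_home"] > edge_threshold, "bet"] = "H"
    g.loc[g["edge_away"] > edge_threshold, "bet"] = "A"
    return g[["odds_home", "odds_away", "model_p_home", "fair_p_home", "bet"]]
```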

Voilà! We only need to bet to make profits.

The output in RStudio looked like this (it is now sent by email as a .csv):

Figure 12 : Output in RStudio

VI — What about the results?

NB: The pipeline has been evolving throughout the period, so the first bets were placed with an algorithm far less sophisticated than the final version presented in this article. One can even notice that the profits grow higher and higher as the weeks pass.

Thanks to this pipeline, I was able to follow the betting instructions derived from the algorithm and earn 856€ on NBA games. The NBA is by far my best program, which is delightful considering that there are a lot of games over the 6-month season. Here are a few plots that show the path to earning this amount:

Figure 13 : Cumulative earnings and marginal earnings for NBA games

Then we can have a look at the results including the other leagues’ betting record.

As of the 13th of March, 2020, the application of the described strategy, coupled with a staking plan under which I staked more on well-performing leagues such as the NBA and the Italian football league Serie A, had generated a net profit of 1,284.33€ from an initial bankroll of 250€ between the 31st of December and the 13th of March, with a Return On Investment per game (ROI) of 7.42%, which perfectly fitted my expectations — this is definitely a highly competitive ROI:

Figure 14 : Profit evolution since January 2020 (€) as a function of the total stakes placed
Figure 15 : Overall betting statistics
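As a small illustration of the bookkeeping behind these figures, the sketch below computes per-game profit, the cumulative profit curve and the per-game ROI from a hypothetical bet log (the stakes, odds and outcomes are made up, not taken from my records).

```python
# Small sketch of the bookkeeping behind the profit and ROI figures, computed
# from a hypothetical bet log (stakes, odds and outcomes are made up).
import pandas as pd

bets = pd.DataFrame({
    "stake": [10, 10, 20, 15],            # euros staked on each game (illustrative)
    "odds":  [1.80, 2.10, 1.65, 1.95],    # closing odds of the selection
    "won":   [True, False, True, True],   # did the selection win?
})

# Profit: stake * (odds - 1) when the bet wins, minus the stake when it loses
bets["profit"] = bets["stake"] * (bets["odds"] - 1) * bets["won"] - bets["stake"] * (~bets["won"])
bets["cumulative_profit"] = bets["profit"].cumsum()          # profit-evolution curve
roi_per_game = bets["profit"].sum() / bets["stake"].sum()    # ROI per game
print(f"ROI per game: {roi_per_game:.2%}")
```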

The full paper contains exhaustive information about the significance of these betting results, established in particular via Monte Carlo simulations.

Conclusion

As a general conclusion, we can be pretty satisfied with this pipeline. It combined my appetite for data analytics with my passion for sports. I also hope this article can be helpful to people who want to start their own data-driven betting strategies.

Feel free to contact me for further information/help regarding programmatic sports betting. I have also built similar pipelines for various sports and leagues (76 different programs built so far! — see the full paper and extensive documentation about the project [1]).

Don’t hesitate to leave any feedback/questions/claps or to contact me for more information. I am available for commercial discussions: personal use, commercial use, property rights.

Contact : sebcararo@hotmail.fr

Another article : I. A turnkey Python code for creating a stock forecasting dataset | by Sébastien Cararo | Analytics Vidhya | Dec, 2020 | Medium

References :

[1] Full paper : Paper_Exploiting_bookmakers_biases.pdf (seb943.github.io)

[2] GitHub page : Seb943 (github.com)

[3] ScrapeOP repository : Seb943/scrapeOP: A python package for scraping oddsportal.com (github.com)

Software used:

I coded everything from scratch in R 3.6.0. The web scraping program was developed in Python 3.6.2 using the Pyzo environment.

Justification of the title: in 73 days the bankroll was multiplied by 1534.33/250, so in 366 days it could have been multiplied by r = (1534.33/250)^(366/73) ≈ 8926.64. The final bankroll would have been 250*r ≈ 2,231,660€. Obviously, the title was meant to be catchy, not a real estimation.

