“The Road to Glory: Predicting the ICC World Cup 2023 Champion”

Paul Andrew
9 min read · Sep 17, 2023


The 2023 ICC Men’s Cricket World Cup marks the 13th installment of this prestigious quadrennial One Day International (ODI) cricket tournament, which sees national men’s teams vying for supremacy under the aegis of the International Cricket Council (ICC). This edition is set to grace India as the host nation, with matches spanning from October 5 to November 19, 2023. Notably, the original schedule, planned for February to March 2023, had to be postponed due to the global COVID-19 pandemic.

A total of ten teams are gearing up to participate in this tournament, including the reigning champions from 2019, England. What distinguishes this edition is that it will be the first time that India plays sole host to the Men’s Cricket World Cup. In the past, India had collaborated with other nations on the Indian subcontinent to co-host the event in 1987, 1996, and 2011.

The grand finale of this cricket extravaganza is slated to unfold at the Narendra Modi Stadium in Ahmedabad on November 19, 2023. Additionally, the semifinals will be held at the iconic Wankhede Stadium in Mumbai and the historic Eden Gardens in Kolkata.

Objective:

The main objective of predicting the ICC World Cup 2023 winner using machine learning is to leverage data-driven techniques to make accurate forecasts about which cricket team is most likely to emerge as the tournament champion, helping fans, analysts, and teams gain insights into performance and outcomes.

Data:

Data related to ICC World Cup winners consists of historical records of past World Cup tournaments, including details such as the winning team, year, location, and possibly statistical information about the matches and players. This data serves as a valuable resource for analysis, prediction, and understanding the tournament's history and trends.

I collected the results of ODI matches played since the 2015 World Cup from HowStat.com. I chose not to include earlier matches because I consider recent results more indicative of current form than older ones. The remaining data files come from the Cricbuzz website. With a dataset this limited, the model’s accuracy may not be exceptionally high, but I believe it gives a reasonably good sense of the trends.

Environment and tools:

  1. Google Colab
  2. Numpy
  3. Pandas
  4. Seaborn
  5. Matplotlib
  6. Scikit-learn

I followed the general machine learning workflow step-by-step:

  1. Data Collection
  2. Data processing
  3. Data analysis and visualization
  4. Exploratory data analysis
  5. Feature Engineering
  6. Model Selection
  7. Model Training
  8. Model Evaluation
  9. Model Interpretation
  10. Final Prediction

Program:

Let’s dive into the code. The complete project is available on GitHub.

1. Data Collection

As noted in the Data section, the match results (ODIs played since the 2015 World Cup) come from HowStat.com, and the remaining data files come from the Cricbuzz website.

2. Data processing

I started by importing the required libraries for data analysis.

After that, I loaded the dataset containing the ODI match results following the 2015 ODI World Cup and the dataset containing the recent rankings of teams.
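A minimal sketch of the loading step. The column names (`date`, `Team_1`, `Team_2`, `Winner`, `Margin`, `Ground`, `Position`, `Team`, `Rating`) and every row are assumptions made up for illustration; with the real files you would pass the CSV paths to `pd.read_csv` instead of the in-memory strings.

```python
import io

import pandas as pd

# Made-up rows standing in for the real CSV files (not actual match results).
results_csv = io.StringIO(
    "date,Team_1,Team_2,Winner,Margin,Ground\n"
    "2021-01-10,India,Australia,India,5 wickets,Ground A\n"
    "2021-02-14,England,New Zealand,England,20 runs,Ground B\n"
    "2022-03-05,Australia,England,Australia,3 wickets,Ground C\n"
)
rankings_csv = io.StringIO(
    "Position,Team,Rating\n"
    "1,Team X,120\n"
    "2,Team Y,115\n"
)

results = pd.read_csv(results_csv)    # one row per ODI match
rankings = pd.read_csv(rankings_csv)  # one row per ranked team

print(results.head())
print(rankings.head())
```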

3. Data analysis and visualization

I analyzed the World Cup dataset by creating visualizations based on various aspects, including the number of titles won by each team, their recent ICC ODI rankings, and their win percentages in past World Cups. For visualization purposes, I utilized Matplotlib and Seaborn.

I analyzed the team statistics by selecting the top 5 favorites of the World Cup, then I proceeded to analyze and visualize their performance against other teams in ODIs and World Cups.
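One of the quantities described above, the win percentage per team, can be sketched as follows. The data is made up and the column names are assumptions; the resulting Series could then be passed to a Matplotlib or Seaborn bar plot.

```python
import pandas as pd

# Made-up results for illustration only.
results = pd.DataFrame({
    "Team_1": ["India", "Australia", "India", "England"],
    "Team_2": ["Australia", "England", "England", "India"],
    "Winner": ["India", "Australia", "England", "India"],
})

# Matches played = appearances as Team_1 plus appearances as Team_2.
played = pd.concat([results["Team_1"], results["Team_2"]]).value_counts()

# Matches won, aligned to the same team index (0 for teams with no wins).
won = results["Winner"].value_counts().reindex(played.index, fill_value=0)

win_pct = (won / played * 100).round(1).sort_values(ascending=False)
print(win_pct)
```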

4. Exploratory data analysis

Following that, I merged the details of the teams participating this year with their past results.

I removed columns such as the date of the match, margin of victory, and the ground on which the match was played. These features didn’t appear to be important for our prediction.
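A sketch of this filtering step, under the assumption that a match is kept only when both sides are among the ten 2023 participants (the notebook may apply a different rule). The rows and column names are made up for illustration.

```python
import pandas as pd

# The ten teams taking part in the 2023 World Cup.
worldcup_teams = [
    "India", "Australia", "England", "New Zealand", "Pakistan",
    "South Africa", "Sri Lanka", "Bangladesh", "Afghanistan", "Netherlands",
]

# Made-up rows for illustration only.
results = pd.DataFrame({
    "date": ["2021-01-10", "2021-02-14", "2022-03-05"],
    "Team_1": ["India", "Zimbabwe", "Pakistan"],
    "Team_2": ["Australia", "Ireland", "Sri Lanka"],
    "Winner": ["India", "Ireland", "Pakistan"],
    "Margin": ["5 wickets", "20 runs", "3 wickets"],
    "Ground": ["Ground A", "Ground B", "Ground C"],
})

# Keep matches between two participating teams.
both_in = results["Team_1"].isin(worldcup_teams) & results["Team_2"].isin(worldcup_teams)

# Drop the columns that are not used as model inputs.
df = results[both_in].drop(columns=["date", "Margin", "Ground"]).reset_index(drop=True)
print(df)
```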

5. Feature engineering

Feature engineering is the process of creating new, meaningful features or modifying existing features in a dataset to improve the performance of machine learning models. It is a crucial step in the data preprocessing pipeline. The goal of feature engineering is to provide the model with more relevant and informative input data, which can lead to better model performance and predictions.

Feature engineering is critically important in machine learning as it directly impacts the model’s ability to learn and make accurate predictions. Well-engineered features can significantly improve model performance, often more than the choice of the algorithm itself. In short, effective feature engineering can be the key to success in many machine learning tasks.

Advantages of feature engineering:

  1. Improved Model Performance: Feature engineering enhances the model’s ability to make accurate predictions by creating or selecting relevant features.
  2. Dimensionality Reduction: It helps in reducing the number of features, which can lead to faster model training and improved model generalization.
  3. Enhanced Interpretability: Carefully engineered features make model predictions more understandable, aiding in the interpretation of results and insights into the data.

Next, I created the target label for the model: if Team-1 won the match, I assigned it label 1; if Team-2 won, I assigned it label 2.

feature selection
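The labelling rule can be sketched like this. The column names are assumptions, the rows are made up, and the sketch presumes tied and no-result matches have already been removed.

```python
import numpy as np
import pandas as pd

# Made-up matches for illustration only.
df = pd.DataFrame({
    "Team_1": ["India", "Australia", "England"],
    "Team_2": ["Australia", "England", "India"],
    "Winner": ["India", "England", "England"],
})

# Label 1 if Team_1 won, label 2 if Team_2 won.
df["winning_team"] = np.where(df["Winner"] == df["Team_1"], 1, 2)

# The winner's name is no longer needed once the label exists.
df = df.drop(columns=["Winner"])
print(df)
```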

Then I converted the Team-1 and Team-2 columns from categorical variables into numeric inputs using the pandas function pd.get_dummies (one-hot encoding). It creates a new dataframe of zeros and ones with one indicator column per team in each of the two slots; in each row, a one marks the teams that played that particular game.

I then split the data into a training set (80%) and a test set (20%).

Train/Test Split
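A minimal sketch of the encoding and the 80/20 split. The `Team_1`/`Team_2` prefixes, the `random_state`, and the rows are illustrative assumptions rather than the notebook's exact settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labelled matches for illustration only.
df = pd.DataFrame({
    "Team_1": ["India", "Australia", "England", "India", "Pakistan",
               "Australia", "England", "Pakistan", "India", "Australia"],
    "Team_2": ["Australia", "England", "India", "Pakistan", "Australia",
               "India", "Pakistan", "England", "England", "Pakistan"],
    "winning_team": [1, 2, 1, 1, 2, 1, 2, 2, 1, 1],
})

# One indicator column per (slot, team) pair, filled with 0/1.
X = pd.get_dummies(df[["Team_1", "Team_2"]], prefix=["Team_1", "Team_2"], dtype=int)
y = df["winning_team"]

# Hold out 20% of the matches for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Each row of `X` contains exactly two ones: one for the team in the first slot and one for the team in the second.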

6. Model Selection

Model selection in machine learning is like choosing the right tool for a job. Imagine you have a set of data and you want to make predictions or find patterns in it. Model selection is the process of deciding which machine learning algorithm or method you should use to do that job effectively.

It’s a bit like picking the right type of vehicle for a trip. If you’re going off-road, you might choose a rugged SUV. If you’re commuting in a city, a compact car is more efficient. Similarly, in machine learning, you pick a model (algorithm) that suits the characteristics of your data and the specific problem you’re trying to solve.

Model selection involves trying out different algorithms, comparing how well they perform on your data, and selecting the one that gives the best results. Just like choosing the right vehicle can make your journey smoother, picking the right model can make your machine learning task more accurate and efficient.

I trained Naive Bayes, Support Vector Machine, Random Forest, and K-Nearest Neighbours models.

Random Forest outperformed all other algorithms with 74.9% training accuracy and 67.3% test accuracy.

Random forest algorithm
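The comparison of the four model families could look roughly like the sketch below. It runs on synthetic matches (not the real dataset), and the specific estimators and hyperparameters, such as `GaussianNB` for Naive Bayes, are assumptions; the accuracies it prints will not match the figures quoted above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic matches: teams with a higher "strength" win more often.
rng = np.random.default_rng(0)
teams = ["A", "B", "C", "D", "E", "F"]
strength = {team: i for i, team in enumerate(teams)}
team_1 = rng.choice(teams, size=400)
team_2 = rng.choice(teams, size=400)
s1 = np.array([strength[t] for t in team_1])
s2 = np.array([strength[t] for t in team_2])
noise = rng.normal(scale=2.0, size=400)
y = np.where(s1 - s2 + noise > 0, 1, 2)  # 1 = Team_1 wins, 2 = Team_2 wins

X = pd.get_dummies(pd.DataFrame({"Team_1": team_1, "Team_2": team_2}), dtype=float)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model and record (training accuracy, test accuracy).
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```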

7. Model Training

Random Forest is a powerful machine learning algorithm that works like a group decision-maker. It builds multiple decision trees, each trained on a random sample of the data, and combines their predictions to make more accurate and robust predictions. Each tree “votes” on the outcome, and the most popular choice becomes the final prediction. Random Forest handles complex data well, can cope with missing values, and is less prone to overfitting than a single decision tree. It’s commonly used for classification and regression, and it’s known for its accuracy and versatility.

The popularity of the Random Forest model is explained by its various advantages:

  • Accurate and efficient when running on large databases
  • Averaging over multiple trees reduces the variance of a single tree
  • Resistant to overfitting
  • Can handle thousands of input variables without variable deletion
  • Can estimate what variables are important in classification
  • Provides effective methods for estimating missing data
  • Maintains accuracy when a large proportion of the data is missing

Decision boundary from random forests (as more trees are added)
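To make the "group decision" idea concrete, here is a small sketch on synthetic data. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes; the sketch reproduces the forest's prediction from its individual trees and also reads the per-feature importances mentioned in the list above. The dataset and hyperparameters are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data, for illustration only.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Class probabilities from every individual tree: shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# The forest's probability is the mean over trees; its prediction is the most likely class.
avg_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[avg_proba.argmax(axis=1)]

# Impurity-based importance of each input feature (sums to 1).
importances = forest.feature_importances_
print(importances.round(3))
```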

8. Model Evaluation

Let’s continue. I added the ICC rankings of the teams, giving priority to the higher-ranked team in each fixture.

Next, I added new columns with the ranking position of each team and sliced the dataset to the first 45 games, since the league stage has 45 games in total.

Then I added the teams to a new prediction dataset, ordered by each team’s ranking position.

After that, I created dummy variables for the prediction dataset and added the columns that were missing compared to the model training dataset.
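The steps above can be sketched as follows. The ranking positions, fixtures, and training column names are made up for illustration, and the column names `first_position` and `second_position` are hypothetical. The key step is `reindex`, which gives the prediction frame exactly the columns the model was trained on and fills the missing ones with zeros.

```python
import pandas as pd

# Made-up ranking positions and fixtures for illustration only.
rankings = pd.DataFrame({
    "Team": ["Australia", "India", "England"],
    "Position": [1, 2, 3],
})
fixtures = pd.DataFrame({
    "Team_1": ["England", "Australia", "India"],
    "Team_2": ["India", "England", "Australia"],
})

# Attach each side's ranking position.
position = rankings.set_index("Team")["Position"]
fixtures["first_position"] = fixtures["Team_1"].map(position)
fixtures["second_position"] = fixtures["Team_2"].map(position)

# Put the better-ranked team (smaller position number) in the Team_1 slot.
swap = fixtures["first_position"] > fixtures["second_position"]
pred = pd.DataFrame({
    "Team_1": fixtures["Team_1"].where(~swap, fixtures["Team_2"]),
    "Team_2": fixtures["Team_2"].where(~swap, fixtures["Team_1"]),
})

# Hypothetical list of columns the model saw during training.
train_columns = [
    "Team_1_Australia", "Team_1_England", "Team_1_India",
    "Team_2_Australia", "Team_2_England", "Team_2_India",
]

# One-hot encode, then align with the training columns (missing ones become 0).
pred_X = pd.get_dummies(pred, prefix=["Team_1", "Team_2"], dtype=int)
pred_X = pred_X.reindex(columns=train_columns, fill_value=0)
print(pred_X)
```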

9. Model Interpretation

Finally, the following code retrieves the predicted result of each match in the league stage.

For the results, please feel free to refer to the corresponding notebook. The four teams marching to the semi-finals are Australia, India, England, and Pakistan.

I then wrapped the steps above in a function so they can be repeated for the knockout matches. This is the final function used to predict the winner of the ICC Cricket World Cup 2023.
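A reusable prediction function along these lines might look like the sketch below. The function name `predict_match`, its parameters, the toy training data, and the ranking positions are all hypothetical; the notebook's own function may be organised differently.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def predict_match(model, train_columns, team_a, team_b, positions):
    """Return the predicted winner of a single match (hypothetical helper)."""
    # The better-ranked team (smaller position number) goes in the Team_1 slot.
    if positions[team_a] > positions[team_b]:
        team_a, team_b = team_b, team_a
    row = pd.DataFrame([{"Team_1": team_a, "Team_2": team_b}])
    X = pd.get_dummies(row, prefix=["Team_1", "Team_2"], dtype=int)
    # Align with the training columns; unseen columns are filled with 0.
    X = X.reindex(columns=train_columns, fill_value=0)
    label = model.predict(X)[0]  # 1 = Team_1 wins, 2 = Team_2 wins
    return team_a if label == 1 else team_b


# Toy training data, made up for illustration only.
train = pd.DataFrame({
    "Team_1": ["Australia", "Australia", "India", "India", "England", "Australia"],
    "Team_2": ["India", "England", "England", "Australia", "India", "India"],
    "winning_team": [1, 1, 1, 2, 2, 2],
})
X_train = pd.get_dummies(
    train[["Team_1", "Team_2"]], prefix=["Team_1", "Team_2"], dtype=int
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, train["winning_team"])

positions = {"Australia": 1, "India": 2, "England": 3}  # made-up positions
winner = predict_match(model, list(X_train.columns), "England", "India", positions)
print("Predicted winner:", winner)
```

The same function can be called once per semi-final and then once more for the final, feeding each round's winners into the next.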

I ran the function for semi-finals prediction.

Hence, the two finalists are Australia and India, which is unsurprising: they are considered the favorites to win this year, and they are the first- and second-ranked teams in the ICC rankings.

10. Final Prediction

Finally, I ran the main function to predict the final.

According to this model, India is likely to win this World Cup.

In conclusion, using machine learning for cricket World Cup winner prediction can provide valuable insights. However, it’s a challenging task due to the dynamic nature of sports. Effective feature engineering and model selection are crucial. While it can enhance predictions, it doesn’t guarantee accuracy due to unpredictable factors.

The complete project is available on GitHub.
