How India bottled the 2022 T20 World Cup | Part 2: Data Preparation and Clustering

Anudeep Ayinaparthi
6 min read · Dec 19, 2022


Welcome to part two of this four-part series, where I try to dissect India’s ultimately disappointing T20 World Cup campaign. In the previous article (which you can read here), I discussed my motivations for completing this project. Briefly, I want to investigate how a team that had dominated bilateral T20s for so long somehow managed to fumble the bag when it actually mattered.

To complete my analysis, I am using a clustering method to understand the types of T20 matches that are played, and I will then use these clusters to understand India’s tactics before and during the World Cup.

In this article, I will present the data that I have used, and the preprocessing I completed to ensure that the data is ready for modelling. I will then outline how I used this data to train the clustering model.

Data Source

I have obtained ball-by-ball match data for T20 matches from this databank of cricket data. Below is an image of this data. The match_id column uniquely identifies each cricket match. The remaining columns give metadata about the match (season, start date, venue, teams) and ball-by-ball data.

Table showing snapshot of ball-by-ball match data

Data Preparation

The notebook for this section can be found here.

Date Range

The first step was to select the date range that I wanted to use. I chose to only include matches that occurred after the end of the 2016 T20 World Cup, up to and including the final of the 2022 T20 World Cup.

Completed Matches

Next, I selected matches that have been completed and that have not been abandoned (e.g. rained off).

Teams

I wanted to focus on the top 12 teams according to the latest ICC rankings. The following teams met this criterion at the time of writing:

Top 12 T20 Teams (ICC Rankings)

During the specified time period, these 12 teams played 555 matches.

Creating Match-Level Data

Creating Relevant Features

Two features need to be derived: total_runs and wicket. total_runs indicates the number of runs scored off a delivery and is the sum of runs_off_bat and extras. wicket indicates whether a delivery resulted in a wicket, and is derived as follows:

# total_runs is the sum of runs off the bat and extras
df['total_runs'] = df['runs_off_bat'] + df['extras']

# Create wicket column indicating if a wicket occurred on that delivery
# Where wicket_type is null, set wicket column to 0
df.loc[df['wicket_type'].isnull(), 'wicket'] = 0
# Impute remaining rows with 1 to indicate that a wicket did occur
df['wicket'] = df['wicket'].fillna(1)

Aggregation

I then aggregate on match_id and batting_team to get the following features for each innings:

  1. Number of balls in the innings
  2. Number of runs in the innings
  3. Number of wickets in the innings
  4. Start date

I then develop another table that aggregates only the powerplay overs in each innings (the first six overs, i.e. deliveries bowled before over seven). This table contains the same features listed above.
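Here is a rough sketch of the two aggregations in pandas. This is illustrative only: the column names follow the data shown earlier, and I assume ball is encoded as over.ball with overs counted from zero, so values below 6.0 fall in the powerplay.

# Hypothetical sketch: innings-level aggregation, keeping the innings
# number so the two innings can be separated later
innings = (
    df.groupby(['match_id', 'innings', 'batting_team'])
      .agg(ball=('ball', 'count'),            # deliveries in the innings
           total_runs=('total_runs', 'sum'),  # runs in the innings
           wicket=('wicket', 'sum'),          # wickets in the innings
           start_date=('start_date', 'first'))
      .reset_index()
)

# Powerplay-only aggregation: deliveries bowled before over seven
# (assumes 'ball' is encoded as over.ball with overs counted from zero)
powerplay = (
    df[df['ball'] < 6.0]
      .groupby(['match_id', 'innings', 'batting_team'])
      .agg(total_runs_powerplay=('total_runs', 'sum'),
           wickets_powerplay=('wicket', 'sum'))
      .reset_index()
)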

After joining the match-level aggregated table and the powerplay-level aggregated table, I create two separate dataframes: one containing data on the first innings of each match, the other containing data on the second innings. I then join these two tables together to create a table where each row has data on both innings of a match.
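A minimal sketch of this reshaping, reusing the tables from the sketch above and assuming the ball-by-ball data carries an innings number (1 or 2):

# Join the innings-level and powerplay-level tables
merged = innings.merge(powerplay, on=['match_id', 'innings', 'batting_team'])

# Split by innings number and join back so each row covers a whole match
first = merged[merged['innings'] == 1].drop(columns='innings')
second = merged[merged['innings'] == 2].drop(columns='innings')
matches = first.merge(second, on='match_id', suffixes=('_inn1', '_inn2'))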

As a final feature-engineering step, I create a feature that indicates whether the team batting first won the match. I also complete a final filtering step, removing matches that were severely rain-affected.
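Under the same naming assumptions as the sketches above, the win indicator might be derived like this (ignoring ties and rain-adjusted targets for simplicity):

# 1 if the team batting first scored more than the chasing team, else 0
matches['batting_team_win'] = (
    matches['total_runs_inn1'] > matches['total_runs_inn2']
).astype(int)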

Final Table

The final table that will be used for modelling contains:

  1. match_id: unique match key
  2. batting_team: team that batted first
  3. ball: number of deliveries in the first innings
  4. total_runs: total runs in the first innings
  5. wicket: total wickets in the first innings
  6. start_date: date of the match
  7. total_runs_powerplay: number of runs in the first innings powerplay
  8. wickets_powerplay: number of wickets in the first innings powerplay
  9. Features 2–8 are repeated for the second innings

Now that the data has been aggregated to match-level, I can continue to prepare the data, with a focus on ensuring the data is fit to train a clustering model.

Preprocessing Data for Clustering

The notebook for this section can be found here.

For effective clustering analysis, the data needs to be prepared so that the following criteria are met:

  1. The data is symmetrical about the mean (normally distributed)
  2. The variance and the mean of all the features are similar.

From the following figures, it is clear that some of the features have an unsymmetrical distribution.

Histograms displaying unsymmetrical distribution of features

These features need to be transformed so that they have a more symmetrical distribution. There are several methods available to complete this transformation:

  1. Log Transform
  2. Box-Cox Transform
  3. Cube-root Transform

I apply each of the above transformations to the features, and from a visual analysis I determine that the Box-Cox transform yields the best results.
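For illustration, the three candidate transformations can be compared like this. It is a sketch: Box-Cox requires strictly positive inputs, so an offset of one is added to cover features that can be zero (such as wickets), and the column name is a placeholder.

import numpy as np
from scipy import stats

feature = df_model['total_runs'] + 1    # shift so all values are positive

log_t = np.log(feature)                 # log transform
boxcox_t, _ = stats.boxcox(feature)     # Box-Cox (fits lambda itself)
cuberoot_t = np.cbrt(feature)           # cube-root transform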

The next step is to scale the features so that the variance and means of each feature are similar to one another. To achieve this, I use scikit-learn’s StandardScaler.
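StandardScaler lives in scikit-learn’s preprocessing module; a minimal sketch of the scaling step, assuming df_transformed holds the Box-Cox-transformed features:

from sklearn.preprocessing import StandardScaler

# Rescale every feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_transformed)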

The final table used for modelling contains the following features, each transformed and scaled:

  1. Wickets (Innings 1)
  2. Wickets (Innings 2)
  3. Wickets in Powerplay (Innings 1)
  4. Wickets in Powerplay (Innings 2)
  5. Total Runs (Innings 1)
  6. Total Runs (Innings 2)
  7. Total Runs in Powerplay (Innings 1)
  8. Total Runs in Powerplay (Innings 2)
  9. Batting Team Wins

Clustering

To complete the cluster modelling, I will be using the K-means clustering algorithm. This method is effective for data with low dimensionality, and it gives me the flexibility to specify the number of clusters I would like it to return.

Finding the Ideal Number of Clusters

The first step is to use the elbow method to find the ideal range of clusters to use. To do this, I fit a k-means model to the data with an increasing number of clusters, from one to ten. I then plot the inertia of each k-means model and examine the plot to find the ideal number of clusters. Further information on using the elbow method can be found here.
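A sketch of that loop, assuming X_scaled is the scaled feature matrix from the previous section:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means for k = 1..10 and record each model's inertia
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

# Plot inertia against k; the 'elbow' suggests the ideal cluster count
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()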

Elbow Plot

From the above plot it looks like the ideal number of clusters is four.

I then train the K-means algorithm on the prepared dataset, specifying that I would like the matches to be segmented into four clusters (a sketch of this final fit follows below). At this point, the modelling is complete, and next I will analyse the clusters.
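The final fit itself is short (again assuming X_scaled from above):

# Fit the final model with four clusters and label each match
final_km = KMeans(n_clusters=4, random_state=42, n_init=10)
cluster_labels = final_km.fit_predict(X_scaled)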

Summary

In this article I have outlined my process of obtaining and preparing the data that I will use to create my clustering model. I filter the data so that I only focus on matches that occur after the 2016 T20 World Cup and feature teams in the top 12 of the ICC T20 rankings. I then manipulate the data to obtain match-level and powerplay-level aggregated data. The next step was to preprocess the data so that the features are symmetrically distributed, with similar means and variances. Finally, I use the elbow method to determine the ideal number of clusters — this analysis determined that four clusters is the best number of segments.

In the next article I will analyse the segments that the modelling returns.
