# Projected Scores: Reliving childhood memories through Data Science

**Introduction:**

Back to cricket again. I had been running out of ideas for analytics and modelling exercises, partly because of a lack of data and partly because I could not come up with anything creative. The availability of ball-by-ball data on cricsheet.org, covering the IPL, ODIs, Tests, the Big Bash and even the PSL, changed that. Separately, I have always admired score and win predictions, especially the data-driven ones that started appearing during the 2015 World Cup and continued through to the most-watched Women’s World Cup.

**Problem statement:**

Combining the two, I decided to predict the final score partway through any future IPL match, similar to the projected scores shown during an ODI.

**Data:**

Before getting into the details, note that only first-innings data was used, since a score projection is typically made only for the first innings: in the second innings the chasing side is batting towards a known target. Cricsheet.org provided ball-by-ball data for all matches up to IPL 10. Each ball has data points such as the striker, non-striker, bowler, runs scored, extra runs, type of extra, mode of dismissal and batsman dismissed.

Apart from these ball-level details, there were match-level details, including the winning team, player of the match, the teams playing, ground, unique match id, toss, umpires, city, match date, season and margin of win. Of the 636 matches played so far, only 625 could be collated, cleaned and used for the final analysis. These were split into train and test data: the train data has matches from ’08-’16 and the test data has the ’17 matches. Here is a quick view of the data after collating and structuring:

Sample ball-by-ball dataset from cricsheet.org
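The season-based train/test split described above can be sketched in a few lines. This is only an illustration: the rows and the field names `match_id` and `season` are assumptions, not the actual cricsheet.org schema or the code from the repository.

```python
# Hypothetical rows; "match_id", "season" and "runs" are assumed field names.
rows = [
    {"match_id": 1, "season": 2008, "runs": 4},
    {"match_id": 2, "season": 2016, "runs": 1},
    {"match_id": 3, "season": 2017, "runs": 6},
    {"match_id": 3, "season": 2017, "runs": 0},
]

TEST_SEASON = 2017  # the post trains on '08-'16 and tests on '17

train = [r for r in rows if r["season"] < TEST_SEASON]
test = [r for r in rows if r["season"] == TEST_SEASON]

# Splitting on season keeps every ball of a match on one side of the split,
# so no match contributes to both training and testing.
train_matches = {r["match_id"] for r in train}
test_matches = {r["match_id"] for r in test}
print(len(train), len(test), train_matches.isdisjoint(test_matches))
```

Splitting by season rather than at random also mirrors how the model would be used: trained on past seasons, then applied to matches it has never seen.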

The raw data downloaded from cricsheet.org, and the code used to collate and structure it, create the train dataset and build the models, can be found in my GitHub account https://github.com/.

**Feature Engineering:**

As seen above, the cleaned data was first structured as a ball-level dataset. It was then aggregated to an “over” level, with some feature engineering along the way.

The engineered features include a home-ground variable, the IPL first-innings strike rate of every batsman featuring in each over (be it two or three), and the balls faced by each of those batsmen.

They also include the strike rate and economy rate of the bowler of that over (again based on IPL first-innings data), the runs from extras and the wickets lost up to the point of prediction, and other details of the current over. The response variable is the final first-innings score, repeated against each over of that match. Here is a snapshot of the dataset grouped at an over level:

Feature Engineered variables for modelling
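To make the ball-to-over aggregation concrete, here is a minimal sketch in plain Python. The tuple layout, the function name `to_over_level` and the output field names are all hypothetical; the real dataset has many more features (strike rates, home ground and so on) that are left out here.

```python
from collections import defaultdict

# Hypothetical ball-level rows:
# (match_id, over, batsman_runs, extra_runs, wicket_fell)
balls = [
    (1, 1, 4, 0, 0), (1, 1, 0, 1, 0), (1, 1, 1, 0, 1),
    (1, 2, 6, 0, 0), (1, 2, 0, 0, 1), (1, 2, 2, 0, 0),
]

def to_over_level(balls):
    """Collapse ball-level rows into one row per (match, over) with running totals."""
    per_over = defaultdict(lambda: {"runs": 0, "extras": 0, "wickets": 0})
    for match_id, over, bat_runs, extra_runs, wicket in balls:
        cell = per_over[(match_id, over)]
        cell["runs"] += bat_runs + extra_runs
        cell["extras"] += extra_runs
        cell["wickets"] += wicket

    # The response variable: total innings score for each match.
    final_score = defaultdict(int)
    for (match_id, _), cell in per_over.items():
        final_score[match_id] += cell["runs"]

    rows = []
    running = defaultdict(lambda: {"runs": 0, "extras": 0, "wickets": 0})
    for match_id, over in sorted(per_over):
        cell, total = per_over[(match_id, over)], running[match_id]
        for key in total:
            total[key] += cell[key]  # totals up to and including this over
        rows.append({
            "match_id": match_id, "over": over,
            "runs_so_far": total["runs"],
            "extras_so_far": total["extras"],
            "wickets_so_far": total["wickets"],
            "final_score": final_score[match_id],
        })
    return rows

over_rows = to_over_level(balls)
for row in over_rows:
    print(row)
```

Every over of a match carries the same `final_score`, which is what lets a model learn to project the final total from the state of the innings at any over.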

**Missing value Imputations:**

A few basic imputations were required, most importantly for the Inf values in the bosr variable, which were replaced with the mean of all the other values. The other two imputations were for NAs in the balls faced and strike rates of the second, third, fourth and fifth batsman slots. These were filled with zeroes, since in reality there was no batsman facing balls in those slots.
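The two imputations above could look roughly like this. It is a sketch under assumptions: bosr is taken to be a bowler's strike rate (balls per wicket), where Inf would presumably arise for a bowler with no wickets, and the helper names and the `b3sr` example values are made up for illustration.

```python
import math

def impute_inf_with_mean(values):
    """Replace non-finite entries (Inf, and also NaN) with the mean of the finite ones."""
    finite = [v for v in values if math.isfinite(v)]
    mean = sum(finite) / len(finite)
    return [v if math.isfinite(v) else mean for v in values]

def fill_missing_with_zero(values):
    """Replace missing entries (None here, NA in the original data) with 0."""
    return [0.0 if v is None else v for v in values]

# Illustrative values only, not taken from the real dataset.
bosr = [18.0, float("inf"), 24.0, 12.0]
b3sr = [None, 130.5, None]

bosr_clean = impute_inf_with_mean(bosr)
b3sr_clean = fill_missing_with_zero(b3sr)
print(bosr_clean)
print(b3sr_clean)
```

The mean is computed over the finite values only, since including an Inf in the average would make the mean itself infinite.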

**Modelling:**

Since the response variable is numeric, we need a regression model. In this exercise we used random forest and extreme gradient boosting (xgBoost), though neural networks could also be used. To start with, a random forest was fitted with all variables, which, as always, helped in selecting the important ones. A subset of them was selected for the final xgBoost model, as shown below:

Variable Importance Plot

As you can see, the batsmen’s strike rates (b1sr, b2sr) unexpectedly came out as very important variables. Home ground, the bowler’s economy rate and the bowler’s strike rate are also reasonably important. So are wickets and overs, the “resources” a team typically has, as in the Duckworth-Lewis method. Surprisingly, the number of extras and the runs from extras also come out as important.

These variables are considered important because the graph on the left shows the increase in mean squared error when the corresponding variable is removed. In layman’s terms, removing b1sr increases the mean squared error by almost 100%, which implies a large drop in accuracy. For classification problems, the analogous graph shows the decrease in accuracy when each variable is removed.

So the important variables, from wickets.ever down to b1f, were kept; beyond these the effect on the mean squared error was negligible. The model was trained on data from ’08-’16 and tested on all matches in 2017. The prediction for one of those matches (SRH vs RCB, the first match of IPL 2017) is shown below:
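One common way to measure "increase in error when a variable is removed" is permutation importance: shuffle one column so it no longer carries information, then see how much the mean squared error grows. The sketch below illustrates that idea only. Whether the plot above was computed exactly this way is an assumption, and `toy_model` is a hand-written stand-in rather than a fitted random forest.

```python
import random

def mse(model, X, y):
    """Mean squared error of model predictions against targets."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, column, repeats=20, seed=0):
    """Average increase in MSE after shuffling one column of X."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    increases = []
    for _ in range(repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:column] + [value] + row[column + 1:]
                  for row, value in zip(X, shuffled)]
        increases.append(mse(model, X_perm, y) - baseline)
    return sum(increases) / repeats

def toy_model(row):
    # Stand-in model: depends on column 0 (a strike rate), ignores column 1.
    return 100 + 0.5 * row[0]

# Illustrative rows: [batsman strike rate, extras so far]
X = [[110.0, 3.0], [125.0, 1.0], [140.0, 4.0],
     [95.0, 2.0], [160.0, 0.0], [130.0, 5.0]]
y = [toy_model(row) for row in X]

imp_sr = permutation_importance(toy_model, X, y, column=0)
imp_extras = permutation_importance(toy_model, X, y, column=1)
print(imp_sr, imp_extras)
```

A variable the model relies on shows a large increase in error when shuffled, while one the model ignores shows none, which is the ranking such a plot conveys.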

SRH vs. RCB actual score prediction

The x-axis in the above graph is overs and the y-axis is the final score predicted at the end of each over. The actual first-innings score in this match was 207. As you can see, the prediction was not close in any over except the 15th (to an extent) and the 20th.

The latter is merely because the Runs.ever variable at the end of the 20th over equals the response variable. In practice there is also no point in predicting at the end of the 20th over, since the innings is already complete.

Therefore, a stronger model, xgBoost, was trained on the same train dataset. The only difference is that the categorical variables, i.e., home.ground and ground, were converted into dummy variables of 1s and 0s, since xgBoost works only on numeric variables.
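Dummy (one-hot) encoding replaces a categorical column with one 0/1 column per category. A minimal sketch follows; the `one_hot` helper and the ground names are purely illustrative. In pandas, `pandas.get_dummies` does the same job.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per distinct value."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded

# Illustrative over-level rows with a categorical "ground" column.
rows = [
    {"over": 6, "ground": "Eden Gardens"},
    {"over": 6, "ground": "Wankhede"},
]

encoded = one_hot(rows, "ground")
for row in encoded:
    print(row)
```

The set of levels must come from the training data and be reused for the test data, otherwise the two would end up with different columns.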

**XgBoost model:**

XgBoost also gives variable importance, similar to random forest. As we see below, the important variables are almost the same, though their ranking changes:

Variable Importance Plot — XgBoost

Next, an xgboost model was trained on the same important variables, with gbtree as the booster and mean absolute error as the evaluation metric. A comparison of the two models is shown below:

Model Comparison in terms of score: RF & XgBoost

The x-axis is again overs and the y-axis is the final score predicted at the end of each over. The results now look better, as one might expect from xgBoost. From the 6th over onwards, the xgBoost error is only around -10 except for one or two instances, and even where it is not, the xgBoost prediction is closer to 207 than the random forest prediction across all the overs. The error in the initial overs is expected to be high, given how little of the match has been played and how little data is available at that point. A comparison of the two models over by over, across the whole test set, would therefore be more insightful.

**Model Accuracy:**

Over-wise error comparison across models on the entire test dataset

In the 20th over, both models had a root mean square error (RMSE) of around 5. The lower the RMSE, the better the accuracy, but an RMSE value cannot be judged as good or bad on its own. Typically, models are compared on their RMSE and the one with the lower value is chosen.
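An over-wise RMSE like the one plotted above can be computed by grouping the test-set predictions by over. The sketch below assumes each record is a tuple of over, actual final score and predicted final score; the function name and the numbers are illustrative, not results from the models in this post.

```python
import math
from collections import defaultdict

def rmse_by_over(records):
    """records: iterable of (over, actual_final_score, predicted_final_score)."""
    squared_errors = defaultdict(list)
    for over, actual, predicted in records:
        squared_errors[over].append((actual - predicted) ** 2)
    return {over: math.sqrt(sum(errs) / len(errs))
            for over, errs in sorted(squared_errors.items())}

# Illustrative predictions for two matches at overs 6 and 20.
records = [
    (6, 207, 180), (6, 150, 159),
    (20, 207, 204), (20, 150, 154),
]

over_rmse = rmse_by_over(records)
for over, value in over_rmse.items():
    print(over, round(value, 2))
```

Running the same function on each model's predictions gives two curves that can be compared over by over, with the lower curve indicating the better model at that stage of the innings.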