IPL 2022 Win Predictor (2nd Innings): Data Science Project

Predicting the 2nd-innings win probability using Logistic Regression.

Apr 15, 2022

Over the past decade, the popularity of cricket and the Indian Premier League (IPL) has witnessed an exponential increase in India. As an avid Data Science enthusiast, working with IPL data has always been an exciting task for me.

Currently, I am working on an end-to-end project in which the user submits the target score and a few other inputs, and the model predicts the probability that the batting team chases down the target. The probabilities are calculated with Logistic Regression, a classification algorithm that outputs class probabilities directly, and are then displayed as percentages.

Moreover, analyzing the IPL data can provide valuable insights into team and player performance, strategies, and trends. It can also help to identify potential opportunities and challenges for the teams and the league as a whole. Therefore, working with IPL data is not only fun but also highly informative and insightful.

Data Set

The IPL Data Set, covering the years 2008 to 2021, was obtained from Kaggle. The link to access this dataset is https://www.kaggle.com/datasets/vora1011/ipl-2008-to-2021-all-match-dataset.

This comprehensive dataset comprises two CSV files. The first file contains information on every match played in the IPL, including the winning and losing teams, the venue, toss details, total runs scored, date, league stage, and much more. On the other hand, the second CSV file provides ball-by-ball information for each match, with columns such as innings, ball number, total run, and more, that can be used for a detailed analysis of the matches.

Using this dataset, we can derive valuable insights into the performance of individual players, team strategies, the impact of the toss on match outcomes, and more. Furthermore, the ball-by-ball data can provide detailed insights into match dynamics, allowing for a more nuanced understanding of the sport. Overall, the IPL Data Set is a treasure trove of information that can be used for in-depth analysis and exploration of one of the most popular cricket leagues in the world.

Preprocessing the data

Data preprocessing is a crucial step in any Machine Learning project, as it involves transforming raw data into a format that can be easily used by models. After collecting the IPL dataset, the next step is to preprocess the data as per the model requirements.

The preprocessing stage involves several important tasks, such as removing null values, converting columns into binary format (0s and 1s), and selecting only the required columns for modeling. Additionally, we may need to transform the data into a standardized format, such as normalizing or scaling the values.
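The three tasks above can be sketched with pandas as follows. This is a minimal illustration on a few made-up rows, not the project's actual code: the column names (`ID`, `Team1`, `Team2`, `WinningTeam`, `Umpire1`) and the derived `team1_won` column are assumptions chosen for the example and may not match the Kaggle CSV exactly.

```python
import pandas as pd

# Hypothetical toy rows; real column names in the Kaggle file may differ.
matches = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Team1": ["Mumbai Indians", "Chennai Super Kings", "Mumbai Indians", "Delhi Capitals"],
    "Team2": ["Chennai Super Kings", "Delhi Capitals", "Delhi Capitals", "Mumbai Indians"],
    "WinningTeam": ["Mumbai Indians", None, "Delhi Capitals", "Mumbai Indians"],
    "Umpire1": ["A", "B", "C", "D"],  # not needed for modelling
})

# 1. Select only the columns required for modelling.
match_df = matches[["ID", "Team1", "Team2", "WinningTeam"]]

# 2. Remove rows with null values (for example, matches without a result).
match_df = match_df.dropna().reset_index(drop=True)

# 3. Convert an outcome into binary format: 1 if Team1 won, otherwise 0.
match_df["team1_won"] = (match_df["WinningTeam"] == match_df["Team1"]).astype(int)

print(match_df)
```

Selecting columns before dropping nulls matters here: a null in an unused column would otherwise remove a row that is perfectly usable for the model.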

By preprocessing the data, we can ensure that our models can make accurate predictions and extract meaningful insights. This process can also help to address issues such as data redundancy, inconsistencies, and errors, which can affect the quality of our analyses. Overall, proper data preprocessing is essential for achieving reliable and accurate results in IPL data analysis.

Matches Data Set

Here is the head of raw data of matches played:

Here is the head of required data for modelling from matches csv file:

Required Columns from Match Data set and creating new MatchDF

Ball-by-ball Data Set

Here is the head of raw data of ball-by-ball details:

Here, the required columns are extracted from this data and used in the final DataFrame.

Final Data Set

After combining the two tables based on their unique match ID, here is the final IPL dataset:

The final dataset is the result of combining the two CSV files that contain information on IPL matches played from 2008 to 2021. By using the unique match ID as a key, we can merge the two tables to create a comprehensive dataset that contains both match-level and ball-by-ball information.
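A merge of this kind might look like the sketch below. The frames are tiny made-up stand-ins, and the column names (`ID`, `City`, `WinningTeam`, `innings`, `ballnumber`, `total_run`) are assumptions for illustration; only the idea of joining on the match ID comes from the project.

```python
import pandas as pd

# Hypothetical match-level table: one row per match.
match_df = pd.DataFrame({
    "ID": [1, 2],
    "City": ["Mumbai", "Chennai"],
    "WinningTeam": ["MI", "CSK"],
})

# Hypothetical ball-by-ball table: many rows per match.
balls = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "innings": [2, 2, 2, 2, 2],
    "ballnumber": [1, 2, 3, 1, 2],
    "total_run": [1, 4, 0, 6, 2],
})

# Attach the match-level columns to every delivery of that match.
# validate="m:1" makes pandas raise an error if a match ID is duplicated in match_df.
final_df = balls.merge(match_df, on="ID", how="inner", validate="m:1")

print(final_df)
```

An inner join keeps only deliveries whose match survived preprocessing, so matches dropped earlier do not reappear with missing values.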

The final dataset may contain additional columns that were created during the data preprocessing stage, such as binary columns or standardized values. We may also have removed certain columns or rows that were deemed irrelevant or problematic for analysis.

The final dataset can be used for a wide range of analyses, from descriptive statistics and data visualization to Machine Learning and predictive modeling. By utilizing this dataset, we can gain a deeper understanding of the IPL and its players, identify trends and patterns, and make data-driven decisions.

Final DataFrame with new columns: crr, nrr, balls_left, wicket_left, runs_left
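One plausible way to derive these columns from ball-by-ball data is sketched below. It assumes that `crr` is the current run rate and `nrr` the required run rate (the two rates the model uses), that an innings has 120 legal balls, and that every row is a legal delivery; wides and no-balls would need extra handling. The input column names (`target`, `total_run`, `is_wicket`) are hypothetical.

```python
import pandas as pd

# Hypothetical second-innings deliveries for a single match (legal balls only).
chase = pd.DataFrame({
    "ID": [1] * 6,
    "target": [180] * 6,
    "total_run": [1, 4, 0, 6, 0, 1],
    "is_wicket": [0, 0, 1, 0, 0, 0],
})

# Running totals per match.
chase["current_score"] = chase.groupby("ID")["total_run"].cumsum()
chase["balls_bowled"] = chase.groupby("ID").cumcount() + 1

# Match situation after each delivery.
chase["runs_left"] = chase["target"] - chase["current_score"]
chase["balls_left"] = 120 - chase["balls_bowled"]
chase["wicket_left"] = 10 - chase.groupby("ID")["is_wicket"].cumsum()

# Run rates are expressed in runs per over (6 balls).
chase["crr"] = chase["current_score"] * 6 / chase["balls_bowled"]
chase["nrr"] = chase["runs_left"] * 6 / chase["balls_left"]

print(chase[["runs_left", "balls_left", "wicket_left", "crr", "nrr"]])
```

Note that `nrr` is undefined when `balls_left` reaches 0, so the final delivery of an innings would have to be dropped or handled separately.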

Modelling

After extracting and preprocessing the required data, we can easily build a predictor using the scikit-learn library in Python. The basic approach used here involves using the current run rate and required run rate as input variables and predicting the probability of achieving the target score using the logistic regression algorithm.

The steps followed for modeling the predictor are as follows:

Splitting the data into training and testing sets using the model_selection module in scikit-learn.
Encoding the categorical columns (such as the batting team, bowling team, and venue) as binary indicator columns using OneHotEncoder, since logistic regression requires numeric inputs.
Building a logistic regression model to predict the probability of achieving the target score.
Using the model in a pipeline with OneHotEncoder and logistic regression as the steps.
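The steps above could be put together roughly as follows. This is a sketch under stated assumptions rather than the project's code: the data is randomly generated, the team and city names are placeholders, the label rule is invented so the example has something to learn, and wrapping the encoder in a `ColumnTransformer` is one common way to one-hot encode only the categorical columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the final dataset (placeholder names and values).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "batting_team": rng.choice(["Team A", "Team B", "Team C"], n),
    "bowling_team": rng.choice(["Team A", "Team B", "Team C"], n),
    "city": rng.choice(["City X", "City Y"], n),
    "runs_left": rng.integers(1, 200, n),
    "balls_left": rng.integers(6, 121, n),
    "wicket_left": rng.integers(1, 11, n),
    "crr": rng.uniform(4, 12, n),
})
X["nrr"] = X["runs_left"] * 6 / X["balls_left"]
# Invented label: the chase succeeds more often when the required rate is low.
y = (X["nrr"] + rng.normal(0, 2, n) < 9).astype(int)

# Step 1: train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Step 2: one-hot encode the categorical columns, pass numeric columns through.
categorical = ["batting_team", "bowling_team", "city"]
numeric = ["runs_left", "balls_left", "wicket_left", "crr", "nrr"]
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("numeric", "passthrough", numeric),
])

# Steps 3 and 4: logistic regression as the final step of a pipeline.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Column 0 is the probability of losing, column 1 the probability of winning.
proba = pipe.predict_proba(X_test)
print(proba[:5])
```

Keeping the encoder inside the pipeline means the same encoding learned on the training set is applied automatically at prediction time.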

To assess the predictor, we check its accuracy and compare it with other models, such as the Random Forest Classifier. Although the Random Forest Classifier achieves higher accuracy than the logistic regression model, it is not preferred for this kind of predictor because it produces stiff probabilities, that is, estimates that tend to sit near the extremes instead of shifting gradually as the match situation changes. Such probabilities are not very helpful in a win-probability display. Therefore, the logistic regression model is preferred here.
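A comparison like this can be set up as in the sketch below, which fits both models on the same split and reports accuracy together with the share of predicted probabilities that are close to 0 or 1. The data is synthetic and the 0.05/0.95 thresholds are arbitrary choices for illustration, so the numbers it prints say nothing about the real IPL data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features: runs_left, balls_left, wicket_left.
rng = np.random.default_rng(42)
n = 600
runs_left = rng.integers(1, 200, n)
balls_left = rng.integers(6, 121, n)
wicket_left = rng.integers(1, 11, n)
X = np.column_stack([runs_left, balls_left, wicket_left])
# Invented label based on the required run rate plus noise.
y = ((runs_left * 6 / balls_left) + rng.normal(0, 2, n) < 9).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    win_proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        # Share of test rows with a near-certain prediction either way.
        "share_extreme": float(np.mean((win_proba < 0.05) | (win_proba > 0.95))),
    }

for name, metrics in results.items():
    print(name, metrics)
```

Looking at the spread of probabilities alongside accuracy makes the trade-off explicit: a model can classify well and still be a poor choice when the probability itself is the product.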

By building a robust and accurate predictor, we can make better decisions when it comes to predicting the outcomes of IPL matches, providing valuable insights for teams and fans alike.

Deploying with Streamlit and Heroku

After building the predictor, the next step is to deploy the project using Streamlit and Heroku. The following steps were followed for deployment:

Serializing the trained pipeline with pickle so that the web app can load it without retraining.
Using the pickled pipeline and the dataset to create a Streamlit website that allows users to input relevant information such as batting team, bowling team, venue, stadium, target, current score, wickets fallen, and overs completed.
Converting the predicted probabilities into percentage format for better user experience.
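The core of these steps, without the web layer, might look like the following sketch. It pickles a stand-in pipeline, restores it, turns a set of form-style inputs into a one-row DataFrame, and formats the result as percentages. The helper `build_input`, the feature names, and the synthetic training data are all hypothetical; in a Streamlit app the arguments would come from widgets such as `st.selectbox` and `st.number_input`. The helper assumes whole completed overs between 1 and 19.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

CATEGORICAL = ["batting_team", "bowling_team", "city"]
NUMERIC = ["runs_left", "balls_left", "wicket_left", "crr", "nrr"]


def build_input(batting_team, bowling_team, city, target, score, wickets, overs):
    """Turn raw form inputs into the one-row DataFrame the pipeline expects."""
    balls_left = 120 - overs * 6
    runs_left = target - score
    row = {
        "batting_team": batting_team,
        "bowling_team": bowling_team,
        "city": city,
        "runs_left": runs_left,
        "balls_left": balls_left,
        "wicket_left": 10 - wickets,
        "crr": score / overs,
        "nrr": runs_left * 6 / balls_left,
    }
    return pd.DataFrame([row])[CATEGORICAL + NUMERIC]


# Stand-in for the trained pipeline, fitted on random synthetic rows.
rng = np.random.default_rng(7)
n = 300
train = pd.DataFrame({
    "batting_team": rng.choice(["Team A", "Team B"], n),
    "bowling_team": rng.choice(["Team A", "Team B"], n),
    "city": rng.choice(["City X", "City Y"], n),
    "runs_left": rng.integers(1, 200, n),
    "balls_left": rng.integers(6, 115, n),
    "wicket_left": rng.integers(1, 11, n),
    "crr": rng.uniform(4, 12, n),
})
train["nrr"] = train["runs_left"] * 6 / train["balls_left"]
labels = (train["nrr"] + rng.normal(0, 2, n) < 9).astype(int)

pipe = Pipeline([
    ("preprocess", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ("numeric", "passthrough", NUMERIC),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(train[CATEGORICAL + NUMERIC], labels)

# Serialize and restore the pipeline (in memory here; a file works the same way).
restored = pickle.loads(pickle.dumps(pipe))

# Predict for one match situation and present the result as percentages.
row = build_input("Team A", "Team B", "City X", target=180, score=90, wickets=3, overs=10)
loss_p, win_p = restored.predict_proba(row)[0]
win_pct = round(win_p * 100)
print(f"Team A: {win_pct}% | Team B: {100 - win_pct}%")
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code, and keep the scikit-learn version on the server the same as the one used for training.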

The final output is a web application that can be accessed through the Heroku platform. The application provides users with a user-friendly interface to input the required information and get the predicted probability of achieving the target score.

By deploying the predictor using Streamlit and Heroku, we make it accessible to a wider audience and enable them to make data-driven decisions in the context of IPL matches. This is a valuable tool for cricket enthusiasts.

Here is the final output of the published website:

Conclusion

Building a machine learning model is always an exciting and rewarding experience, and developing a prediction model is no exception. Throughout this project I learned a lot, and with the help of online content the process went smoothly.

In conclusion, this project required a lot of data processing, which proved to be as essential as the modeling itself. I learned how to preprocess data, merge tables, split data into train and test sets, and build a pipeline with OneHotEncoder and logistic regression. Furthermore, I compared the accuracy of logistic regression and the random forest classifier and found that logistic regression was more suitable for this predictor.

Overall, this project taught me valuable skills that I can use in future machine learning projects. I look forward to applying these techniques to other domains and exploring the exciting world of data science further.

Bhupesh Singh Rathore — Portfolio
