Data Analysis on IPL Data

ProlayBanik
5 min read · Sep 17, 2020


As a cricket lover, I was eagerly waiting for the start of IPL 2020 which, as we all know, is the best tournament in the world. So I thought I would introduce myself by analysing some IPL match data that I found on Kaggle.

What’s within?

This data set contains the details of IPL matches up to season 10. The analysis covers the following:

  1. The number of matches per season
  2. The team that won by the most runs
  3. The team that won by the most wickets
  4. The cities where the most matches were held
  5. The team with the most wins
  6. Whether the toss winner is also the match winner
  7. The teams with the most toss wins
  8. The players with the most Man of the Match awards
  9. A visual representation of the number of matches won by runs with respect to the toss winner

So, I will try to answer these questions by analysing the IPL match data.

First of all, I opened a Jupyter notebook (Google Colab works as well), imported the pandas, numpy, matplotlib and seaborn libraries, and loaded the data set into a variable named details. This creates a copy of the whole data set in memory and leaves the original file unchanged.

Fig:1- Importing libraries and top 5 sample rows

Here, details.head() shows the first five rows of the data frame and details.tail() shows the last five; five is the default for both. There is a column named id, which has been used as the index of our data frame.
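The loading step can be sketched roughly as follows. This is only an illustration: the three CSV rows are made up, the reduced set of column names is an assumption about the Kaggle file, and with the real data you would pass the path of the downloaded CSV to read_csv instead of the in-memory text.

```python
import io

import pandas as pd

# Made-up, three-row stand-in for the Kaggle file (hypothetical values).
# With the real data, pass the path of the downloaded CSV instead.
sample_csv = io.StringIO(
    "id,season,city,winner\n"
    "1,2017,City X,Team A\n"
    "2,2017,City Y,Team B\n"
    "3,2017,City Z,Team C\n"
)

# index_col="id" turns the id column into the row index.
details = pd.read_csv(sample_csv, index_col="id")

print(details.head(2))  # first two rows; head() with no argument returns five
print(details.tail(1))  # last row; tail() also defaults to five
```

Because read_csv copies the data into memory, nothing done to details afterwards touches the file on disk.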

We can check the shape of our dataset with details.shape (it is an attribute, so it takes no parentheses): we have 636 rows and 18 columns. We can also print general information about the dataset. As the snapshot below shows, it is a pandas DataFrame with 636 entries indexed from 0 to 635 and 18 columns, listed with their data types and non-null counts, and it takes 89.6 KB of memory.

Fig:2- Shape and Information
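A small sketch of these two checks, on a made-up frame rather than the real 636-row data (the column names are assumed from the Kaggle file):

```python
import pandas as pd

# Made-up rows for illustration only.
details = pd.DataFrame(
    {
        "season": [2016, 2017, 2017],
        "city": ["City X", "City Y", None],
    }
)

# shape is an attribute holding a (rows, columns) tuple, so no parentheses.
rows, cols = details.shape
print(rows, cols)

# info() prints the index range, each column's dtype and non-null count,
# and the memory usage of the frame.
details.info()
```

Writing details.shape() would raise a TypeError, because a tuple cannot be called.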

We can describe the dataframe to check the count, mean, standard deviation, min, max and the 25%, 50% and 75% quartile values. Here it has been done in two ways: first for the important numeric facts of the dataframe, and second for all the columns. In the second output NaN marks a value that does not apply: there is no mean or std for umpire3 because it is categorical (qualitative) data, and the same goes for city, date, team1, team2 and so on. The top row of that output also tells us that most matches were played in Mumbai, that Mumbai Indians won the most tosses, that the most common toss decision is to field first, and that most results are normal, i.e. the Duckworth-Lewis (D/L) method was not applied.

Fig:3- Describe the dataframe
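The two ways of describing a frame probably correspond to describe() with and without include="all"; that mapping is my assumption, since the screenshot is not reproduced here. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; one categorical and one numeric column.
details = pd.DataFrame(
    {
        "city": ["City X", "City X", "City Y"],
        "win_by_runs": [0, 10, 35],
    }
)

# Default: numeric columns only (count, mean, std, min, quartiles, max).
numeric_summary = details.describe()

# include="all": every column; adds unique/top/freq for categorical columns
# and leaves NaN wherever a statistic does not apply.
full_summary = details.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In full_summary the top row holds the most frequent value of each categorical column and freq holds how often it occurs, which is how the "most matches in Mumbai" kind of fact can be read off.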

We can also look at the standard deviation, because outliers inflate it. Now compare the mean and the median (50%) of each column. If they are equal or nearly equal, the distribution is roughly symmetric and strong outliers are unlikely. If mean > median the distribution is positively skewed, and if mean < median it is negatively skewed. We can also use the quartiles to check for skewness and outliers.
IQR (Inter Quartile Range) = Q3 - Q1 = 75% - 25%
Lower limit = Q1 - 1.5*IQR
Upper limit = Q3 + 1.5*IQR
Any value beyond these limits is an outlier.
We could also look at the differences 25%-min, 50%-25%, 75%-50% and max-75% to understand the symmetry of the distribution.
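As a worked example of the quartile rule, here is a minimal sketch on a made-up column of winning margins (the values are invented, and the name win_by_runs is assumed from the Kaggle file). The lower fence is Q1 - 1.5*IQR and the upper fence is Q3 + 1.5*IQR.

```python
import pandas as pd

# Made-up winning margins with one obviously extreme value.
win_by_runs = pd.Series([0, 0, 5, 10, 15, 20, 25, 146])

q1 = win_by_runs.quantile(0.25)
q3 = win_by_runs.quantile(0.75)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the two fences are flagged as outliers.
outliers = win_by_runs[(win_by_runs < lower_limit) | (win_by_runs > upper_limit)]

print("mean:", win_by_runs.mean(), "median:", win_by_runs.median())
print("limits:", lower_limit, upper_limit)
print(outliers.tolist())
```

Here the mean is well above the median, which points to a positively skewed column, and the single large margin is the only value outside the fences.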

Now, as we have seen, there are a few null values. To locate them we can use details.isnull(), and to count the null values in each column we can use details.isnull().sum(). We can also visualize them with a heatmap, which is an excellent way to view missing data in a large dataset.

Fig:4- sum of null values & its visualization
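The null check can be sketched like this on a made-up frame (column names assumed from the Kaggle file):

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing city and an entirely empty umpire3 column.
details = pd.DataFrame(
    {
        "city": ["City X", None, "City Y"],
        "umpire3": [np.nan, np.nan, np.nan],
    }
)

# isnull() gives a True/False mask with the same shape as the frame.
null_mask = details.isnull()

# Summing the mask column by column counts the nulls, since True counts as 1.
null_counts = null_mask.sum()
print(null_counts)
```

The same null_mask is what a heatmap would draw; with seaborn installed, passing it to seaborn.heatmap should show each missing cell in a contrasting colour.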

So the umpire3 column has the most null values, and for the purpose of this analysis we can remove it by executing details.drop('umpire3', axis=1, inplace=True). Here axis=1 means we are dropping a column, and inplace=True modifies the data frame itself instead of returning a new one. Similarly, we can delete the rows that contain null values by executing details.dropna(axis=0, inplace=True), which is useful for our data analysis.

Fig:5- Removed Umpire3 column and rows with null values
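A minimal sketch of the clean-up on made-up rows. Unlike the inplace=True calls in the text, this variant assigns the results to new variables (the names without_umpire3 and cleaned are mine), so the original frame stays available for comparison.

```python
import numpy as np
import pandas as pd

# Made-up rows; umpire3 is empty everywhere, city is missing once.
details = pd.DataFrame(
    {
        "city": ["City X", None, "City Y"],
        "winner": ["Team A", "Team B", "Team A"],
        "umpire3": [np.nan, np.nan, np.nan],
    }
)

# Drop the empty column first: calling dropna() while umpire3 is still
# present would remove every row, because each row has a null there.
without_umpire3 = details.drop("umpire3", axis=1)

# Then drop the remaining rows that contain any null value.
cleaned = without_umpire3.dropna(axis=0)

print(details.shape, without_umpire3.shape, cleaned.shape)
```

The order of the two steps matters here, which is one reason to look at the null counts before deciding what to drop.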

Now, to check the number of matches played per season, we can use details['season'].value_counts(). We can also sort the values as required for interpretation, and the same has been depicted graphically. Here we have analysed a single column (season), so this kind of analysis is called univariate analysis.

Fig:6- No of matches played per season
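For illustration, the count per season on a made-up column looks like this:

```python
import pandas as pd

# Made-up season labels, one entry per match.
details = pd.DataFrame({"season": [2017, 2016, 2017, 2015, 2017, 2016]})

# value_counts() orders by frequency, most common season first.
matches_per_season = details["season"].value_counts()

# sort_index() reorders the same counts chronologically.
by_season = matches_per_season.sort_index()

print(matches_per_season)
print(by_season)
```

With matplotlib installed, by_season.plot(kind="bar") is one way to get a bar chart of these counts.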

Now we have to find the team that won by the most runs and the team that won by the most wickets, and in which season.

Fig:7- Team won by maximum runs
Fig:8- Teams won by maximum wickets
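One possible way to find these rows, sketched on made-up data (the teams, seasons and margins are hypothetical; the column names are assumed from the Kaggle file). idxmax() returns only the first row with the maximum, so a comparison against max() is used where several matches may tie.

```python
import pandas as pd

# Made-up results; two matches tie on the largest wicket margin.
details = pd.DataFrame(
    {
        "season": [2015, 2016, 2017, 2017],
        "winner": ["Team A", "Team B", "Team C", "Team D"],
        "win_by_runs": [0, 40, 12, 0],
        "win_by_wickets": [10, 0, 0, 10],
    }
)

# Single row with the largest run margin.
biggest_by_runs = details.loc[details["win_by_runs"].idxmax()]

# All rows that share the largest wicket margin.
max_wickets = details["win_by_wickets"].max()
biggest_by_wickets = details[details["win_by_wickets"] == max_wickets]

print(biggest_by_runs[["season", "winner", "win_by_runs"]])
print(biggest_by_wickets[["season", "winner", "win_by_wickets"]])
```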

The cities that hosted the most matches:

Fig:9- No of matches played in different cities

The teams with the most wins across all seasons up to 2017:

Fig:10- Most number of wins

Now we need to check whether the toss winner is also the match winner, by comparing the toss winner and winner columns of the dataset.

Fig:11- % to show if Toss Winner is Match winner
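The comparison can be sketched as an element-wise equality test between the two columns (made-up rows; the column names toss_winner and winner are assumed from the Kaggle file):

```python
import pandas as pd

# Made-up matches: the toss winner also wins two of the four.
details = pd.DataFrame(
    {
        "toss_winner": ["Team A", "Team B", "Team A", "Team C"],
        "winner": ["Team A", "Team A", "Team A", "Team B"],
    }
)

# True where the toss winner also won the match.
toss_and_match = details["toss_winner"] == details["winner"]

# The mean of a boolean Series is the fraction of True values.
share_percent = toss_and_match.mean() * 100

print(toss_and_match.value_counts())
print(share_percent)
```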

The teams that won the most tosses:

Fig:12- Maximum Toss Winners

The players with the most Man of the Match awards:

Fig:13- Maximum MoM Awards
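The last four rankings (cities, wins, toss winners and Man of the Match awards) can all be produced with the same counting pattern. Below is a rough sketch with a hypothetical helper, top_n, on made-up rows; the column names are assumed from the Kaggle file.

```python
import pandas as pd

# Made-up rows for illustration only.
details = pd.DataFrame(
    {
        "city": ["City X", "City X", "City X", "City Y", "City Z"],
        "winner": ["Team A", "Team A", "Team B", "Team A", "Team C"],
        "toss_winner": ["Team B", "Team B", "Team A", "Team B", "Team C"],
        "player_of_match": ["P1", "P2", "P1", "P1", "P3"],
    }
)


def top_n(frame, column, n=3):
    """Return the n most frequent values of a column with their counts."""
    return frame[column].value_counts().head(n)


for column in ["city", "winner", "toss_winner", "player_of_match"]:
    print(top_n(details, column, n=2))
```

Each result is a Series, so it can be plotted directly or turned into a table with reset_index().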

Finally, we analyse the data for one team, Kolkata Knight Riders: the matches it won by runs, with respect to the toss winner, represented in different plots.

Fig:14- Different Types of plots
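The data behind such plots could be prepared along these lines. This is one reading of the task (matches Kolkata Knight Riders won by a run margin, counted per toss winner), on made-up rows with column names assumed from the Kaggle file; "Team B" is a placeholder opponent.

```python
import pandas as pd

TEAM = "Kolkata Knight Riders"

# Made-up rows for illustration only.
details = pd.DataFrame(
    {
        "toss_winner": [TEAM, "Team B", TEAM, "Team B", "Team B"],
        "winner": [TEAM, TEAM, TEAM, "Team B", TEAM],
        "win_by_runs": [25, 10, 0, 30, 5],
    }
)

# Keep the matches the team won with a positive run margin.
won_by_runs = details[(details["winner"] == TEAM) & (details["win_by_runs"] > 0)]

# Count those wins by who had won the toss.
counts_by_toss = won_by_runs["toss_winner"].value_counts()

print(won_by_runs)
print(counts_by_toss)
```

From here, counts_by_toss.plot(kind="bar") gives a count plot, and won_by_runs holds the individual margins for a box or scatter plot.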

Conclusion:

We have analysed the IPL match data with the help of the explanations and visualizations above, and can conclude that Mumbai Indians have done a great job so far. This kind of analysis can help cricket statisticians and all cricket lovers.

Reference:

  1. https://www.kaggle.com/manasgarg/ipl
