Skmaffan
Feb 8, 2023

End-to-End Data Analysis Project Using Streamlit on the Cricket T20 World Cup 2022

Image Source: https://www.gemini-us.com/industries/how-data-analytics-is-changing-sports/

The web application for this data analysis of the T20 World Cup 2022 can be accessed through this link: https://analysisont20worldcup2022.streamlit.app/

Hello people, I hope everyone is doing well. In this blog, we will analyze data from the T20 World Cup 2022 using Streamlit.

The excitement of the T20 World Cup 2022 has taken the world by storm and left fans on the edge of their seats. The thrilling matches, the nail-biting finishes, and the remarkable performances by players worldwide have captured the imagination of fans everywhere. But what does the data tell us about this year’s T20 World Cup? What insights and trends can we uncover by analyzing the tournament end-to-end?

In this blog, we take a deep dive into the T20 World Cup 2022 using the power of Streamlit. Through interactive visualizations, data analysis, and exploratory insights, we will uncover what made it so memorable. Join us on this exciting journey as we explore the T20 World Cup 2022.

An end-to-end data analysis project typically involves the following steps:

  1. Problem Definition: Define the problem you want to solve and determine the goals of the project.
  2. Data Collection: Gather and organize the data needed for the analysis. This could involve scraping data from the web, accessing databases, or using publicly available data sets.
  3. Data Cleaning: Clean and preprocess the data to remove missing values, incorrect values, and outliers.
  4. Data Exploration: Explore the data to gain insights, and identify patterns and relationships.
  5. Communication: Communicate the results and insights from the analysis in a clear and concise manner, such as through visualizations, reports, or presentations.
  6. Deployment: Deploy the models in a production environment to be used for real-world decision-making.

The full code of this project is on my GitHub; you can access it through this link: https://github.com/AFFANSKM14/Analysis_on_T20WorldCup_using_Streamlit

Problem Definition

The goal is to gain insights into and an understanding of the performance of teams and players during the T20 World Cup 2022. This project is developed entirely in Python.

Data Collection

In this project, we used web scraping to collect the data. Web scraping is a technique used in data collection to extract data from websites. It involves using automated tools, such as web scraping software or scripts, to extract data from websites and save it in a structured format for further analysis. However, it is important to be mindful of ethical and legal considerations, such as respecting website terms of use and privacy policies, and ensuring that the data collected is accurate and reliable.

The data for this project is collected from stats.espncricinfo.com. On this website, you can search for the T20 World Cup 2022, and after clicking it you will get a page like this.

In this project, web scraping is done using two Python libraries: requests and BeautifulSoup.

requests: The requests module in Python is a popular library for sending HTTP requests. It allows you to send HTTP requests and handle the responses from the server.

BeautifulSoup: The BeautifulSoup module in Python is a library used for web scraping and parsing HTML and XML documents. It provides a convenient way to extract and navigate the structure of web pages, allowing you to easily access and extract data from the page.

The function used in the code for the above two modules is:

import requests
from bs4 import BeautifulSoup

def beautifulsoup_func(url):
    requests_object = requests.get(url)   # url specifies the location of the webpage
    source_code = requests_object.text
    soup = BeautifulSoup(source_code, 'html.parser')
    return soup

Now after accessing the webpage we need to extract data from tables.

Table elements in HTML are used to present tabular data, such as spreadsheets, on a web page. Tables are created using a set of HTML tags, including <table>, <tr>, <td>, and <th>.

Here’s a brief overview of the main table elements in HTML:

  • <table>: This tag defines a table and contains the entire table structure.
  • <tr>: This tag defines a table row. Each row in a table is enclosed with a <tr> tag.
  • <td>: This tag defines a table cell. Table cells contain the data that makes up the rows and columns of a table.
  • <th>: This tag is used to define a header cell in a table. Header cells are typically used to provide labels for the columns or rows in a table.
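To make these tags concrete, here is a small self-contained sketch (the HTML is illustrative, not taken from ESPNcricinfo) showing how BeautifulSoup navigates them:

from bs4 import BeautifulSoup

# an illustrative table, not real ESPNcricinfo markup
html = """
<table>
  <tr><th>Team</th><th>Result</th></tr>
  <tr class="data1"><td>India</td><td>Won</td></tr>
  <tr class="data1"><td>England</td><td>Won</td></tr>
</table>
"""

demo_soup = BeautifulSoup(html, 'html.parser')
headers = [th.text for th in demo_soup.find_all('th')]        # ['Team', 'Result']
rows = [[td.text for td in tr.find_all('td')]
        for tr in demo_soup.find_all('tr', class_='data1')]   # [['India', 'Won'], ['England', 'Won']]
print(headers, rows)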

First, we will extract the table headers from the webpage with the following code:

def get_tableheader():
    table_header = soup.find_all('th')          # soup is the parsed page returned by beautifulsoup_func
    header = [h.text for h in table_header]
    return header

We are going to extract three tables in total:

  1. Match Summary
  2. Batting Summary
  3. Bowling Summary

The webpage containing the match summary table is here; within the match summary table, the webpages for both the batting and bowling summaries are embedded in the scorecard column through hyperlinks.

The href attribute in HTML is used to specify a hyperlink reference in a web page. It is most commonly used with the <a> tag to create links to other web pages or resources. The href attribute specifies the URL of the resource that the link points to.

Here is the code snippet for collecting the match summary data into a DataFrame object and storing the hyperlinks, which will be used later for scraping the batting and bowling summary tables.

import pandas as pd

soup = beautifulsoup_func(url)   # requesting and parsing the url with requests and BeautifulSoup

header = get_tableheader()       # getting the table headers

table_row = soup.find_all('tr', class_='data1')   # parsing the table rows

summary_df = pd.DataFrame(columns=header)   # creating an empty dataframe with the headers as columns

# collecting the scorecard hyperlinks once; these are used later for
# scraping the batting and bowling summary tables
href = [str(link.get('href')) for link in soup.find_all('a', class_='data-link')
        if 'match' in str(link.get('href'))]

# iterating through the table data and appending each row to the dataframe
for row in table_row:
    data = row.find_all('td')
    row_data = [d.text for d in data]
    summary_df.loc[len(summary_df)] = row_data

Similarly, we will extract the data for the batting and bowling summaries by iterating through the hrefs collected in the code above.
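Since the snippet above stops at collecting the hrefs, here is a simplified sketch of how that iteration could look. The base URL, the use of pandas.read_html, and the table positions on the scorecard page are assumptions for illustration; the GitHub repository contains the project's actual parsing code.

import pandas as pd
from io import StringIO

base_url = 'https://www.espncricinfo.com'   # the scraped hrefs are relative paths

batting_tables, bowling_tables = [], []
for link in href:
    scorecard_soup = beautifulsoup_func(base_url + link)
    # read_html parses every <table> on the page into a DataFrame;
    # which tables hold the batting and bowling data is an assumption here
    tables = pd.read_html(StringIO(str(scorecard_soup)))
    batting_tables.append(tables[0])
    bowling_tables.append(tables[1])

batting_df = pd.concat(batting_tables, ignore_index=True)
bowling_df = pd.concat(bowling_tables, ignore_index=True)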

Data Transformation and Enrichment

Sometimes the data we collect from the internet is not sufficient or in the right shape.

Here, data transformation means renaming the table headers, and data enrichment means adding extra columns to the table. For example, in the batting and bowling summary tables we may not know which country a batter or bowler belongs to. In this case, adding an extra Country column enriches the data and enables more meaningful insights.

batting_df.rename(columns={'\xa0': 'Dismissal', 'BATTING': 'Batter', 'R': 'Runs', 'B': 'Balls', 'M': 'Minutes'}, inplace=True)

bowling_df.rename(columns={'BOWLING': 'Bowler', 'O': 'Overs', 'R': 'Runs', 'M': 'Maidens', 'W': 'Wickets'}, inplace=True)
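The enrichment step itself is not shown above; a minimal sketch of it, assuming a player-to-team mapping built while scraping the scorecard pages (the dictionary below is hypothetical), could look like this:

# player_country is a hypothetical dict mapping each player to their team,
# assumed to have been built while scraping the scorecard pages
player_country = {'Virat Kohli': 'India', 'Sam Curran': 'England'}   # illustrative entries

batting_df['Country'] = batting_df['Batter'].map(player_country)
bowling_df['Country'] = bowling_df['Bowler'].map(player_country)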

Data Cleaning

Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and correcting or removing inaccuracies, inconsistencies, and errors in data. This step is crucial in data analysis, as even small errors in the data can lead to incorrect results or insights.

Some common data-cleaning tasks include:

  • Removing duplicate records: This involves identifying and removing duplicate records from the data.
  • Handling missing values: This involves either removing records with missing values or imputing missing values based on the other data in the record or the overall data set.
  • Correcting inconsistent data: This involves identifying and correcting inconsistent data, such as correcting inconsistent date formats or converting text to a standard format.
  • Removing outliers: This involves identifying and removing extreme values in the data that may be errors or may not be representative of the overall data.

Here is a code snippet of the data cleaning done in this project:

import numpy as np

# replacing empty Margin entries with NaN values
match_summary['Margin'] = match_summary.Margin.replace('', np.nan)

# dropping the unnamed index column
batting_summary = batting_summary.drop("Unnamed: 0", axis=1)
bowling_summary = bowling_summary.drop("Unnamed: 0", axis=1)
match_summary = match_summary.drop("Unnamed: 0", axis=1)

# converting Minutes and SR into integer and float respectively;
# '-' marks missing values in the scraped data
batting_summary['Minutes'] = batting_summary['Minutes'].str.replace('-', '1')
batting_summary['SR'] = batting_summary['SR'].str.replace('-', '0')

batting_summary['Minutes'] = batting_summary['Minutes'].astype(int)
batting_summary['SR'] = batting_summary['SR'].astype(float)

After collecting and cleaning the data, the three tables look like this:

match_summary.head()
batting_summary.head()
bowling_summary.head()

Data Exploration and Communication

Data exploration, also known as data discovery, is the process of analyzing and summarizing data to identify patterns, relationships, and trends.

Communication is an important aspect of data analytics as it helps to effectively convey the results and insights of the analysis to the relevant stakeholders. Good communication in data analytics involves clearly and concisely presenting the results and insights in a way that is easily understandable to the intended audience.

First, we will calculate the percentage of matches won by batting first. If the percentage of wins by batting first is high, we can say that dew had an effect in the second innings.

def bat_first_win_pct(df):
    bat_count = 0
    bat_team = []    # teams that won by batting first
    bowl_count = 0
    bowl_team = []   # teams that won by bowling first
    for i, j in enumerate(df['Margin']):
        if type(j) in (list, tuple, dict, str):   # skip NaN margins (no-result matches)
            if 'runs' in j:          # a margin in runs means the team batting first won
                bat_count += 1
                bat_team.append(df['Winner'][i])
            elif 'wickets' in j:     # a margin in wickets means the chasing team won
                bowl_count += 1
                bowl_team.append(df['Winner'][i])
            else:
                pass
    total_count = bat_count + bowl_count
    bat_pct = str(round((bat_count / total_count) * 100, 2)) + '%'
    bowl_pct = str(round((bowl_count / total_count) * 100, 2)) + '%'

    return bat_team, bowl_team, bowl_pct, bat_pct
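A short usage sketch, assuming match_summary is the cleaned match summary DataFrame from the data cleaning step:

bat_team, bowl_team, bowl_pct, bat_pct = bat_first_win_pct(match_summary)
print(f'Won batting first: {bat_pct}, won chasing: {bowl_pct}')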

After using the above function, we found that about 55% of the matches were won by batting first, so we can say that winning the toss did not create much of a bias.

Now, let's take a look at the teams that are good at defending totals and the teams that are good at chasing them. This kind of analysis tells us how strong their batting and bowling lineups are.

Here’s the code snippet which we used in this project to determine which teams are stronger in defending and chasing.

def top_teams_on_chasing_defending(bat_team, bowl_team):
    temp1 = list(set(bat_team))    # unique teams that won batting first
    temp2 = list(set(bowl_team))   # unique teams that won bowling first
    top5_def = []   # win counts while defending
    team1 = []      # teams corresponding to top5_def
    top5_chs = []   # win counts while chasing
    team2 = []      # teams corresponding to top5_chs
    for t in temp1:
        c = bat_team.count(t)
        if c > 1:   # keep only teams with more than one win while defending
            team1.append(t)
            top5_def.append(c)
    for t in temp2:
        c = bowl_team.count(t)
        if c > 1:   # keep only teams with more than one win while chasing
            team2.append(t)
            top5_chs.append(c)
    return top5_def, top5_chs, team1, team2

Here is the output of the above code, plotted with the help of the Plotly library.
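For reference, a minimal sketch of how such bar charts can be drawn with Plotly Express; the app's actual styling may differ:

import plotly.express as px

top5_def, top5_chs, team1, team2 = top_teams_on_chasing_defending(bat_team, bowl_team)

# one bar chart per skill: wins while defending and wins while chasing
fig_def = px.bar(x=team1, y=top5_def, title='Wins while defending a total',
                 labels={'x': 'Team', 'y': 'Wins'})
fig_chs = px.bar(x=team2, y=top5_chs, title='Wins while chasing a total',
                 labels={'x': 'Team', 'y': 'Wins'})
fig_def.show()
fig_chs.show()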

Looking at the above two bar charts, we can clearly say that India and New Zealand were good at defending targets, while England was good at chasing them.

Relationship between numerical attributes of batting and bowling summary tables

We have seven numerical attributes in the batting summary table: batting_position, Runs, Balls, Minutes, 4s, 6s, and SR (strike rate).

To determine the relationship between two variables, we are using Spearman's correlation. Spearman's correlation is a non-parametric statistical measure used to assess the strength and direction of the relationship between two variables. It is based on the rank order of the values of the variables, rather than their raw values. It ranges from -1 to 1, where -1 indicates a strong negative relationship, 1 indicates a strong positive relationship, and 0 indicates no relationship. Unlike Pearson's correlation, Spearman's correlation can be used for ordinal as well as continuous data, and it is robust against outliers.

import seaborn as sns

def corr(df):   # df refers to the dataframe
    corr = df.corr(method='spearman')
    return sns.heatmap(corr, annot=True)

From the above heatmap, we can see that all correlations involving the batting_position attribute are negative: as the batting position increases (the batter comes in lower down the order), the other attributes tend to decrease. Apart from batting_position, every other pair of attributes is positively correlated.

Negative relationships between attributes in the bowling summary:

  1. As the economy rate increases, the number of dot balls (0s) and wickets decreases.

Positive relationships between attributes in the bowling summary:

  1. As the economy rate increases, the number of 4s and 6s conceded increases.
  2. As the runs given by the bowler increase, the number of 4s and 6s conceded increases.
  3. As the number of overs bowled increases, wickets, dot balls (0s), and runs all increase.

Impact Player of the tournament

An impact player in the T20 cricket format refers to a player who can significantly impact the outcome of a match through his skill and performance. This can be achieved through various means such as quick runs in key moments of the game.

For this analysis, a player is considered the impact player of the tournament if he has a strike rate above 180 while having faced a minimum of 80 balls.

import numpy as np

# Impact player: aggregate each batter's totals and average strike rate
Impact_Player = batting_summary.groupby(['Batter'], as_index=False) \
    .agg({'Runs': 'sum', 'Balls': 'sum', 'Minutes': 'sum', 'SR': 'mean'})

filtered_values = np.where((Impact_Player['SR'] >= 180) & (Impact_Player['Balls'] > 80))
filtered_df = Impact_Player.loc[filtered_values]

After running the above code, we got only one player: Surya Kumar Yadav from India. To plot his performance we will use a radar plot; a radar plot can describe many attributes of a single feature.

import plotly.express as px

# batter_metric holds the per-match batting_summary rows of the selected player
# (the exact name string depends on the scraped data)
batter_metric = batting_summary[batting_summary['Batter'] == 'Surya Kumar Yadav']

Innings = list(range(1, len(batter_metric) + 1))
runs = list(batter_metric['Runs'])
SR = list(batter_metric['SR'])

chart_df = pd.DataFrame(
    {'innings': Innings, 'runs': runs, 'strike_rate': SR, 'Balls': list(batter_metric['Balls'])})
fig = px.line(chart_df, x='innings', y='runs', title='Runs vs SR vs Balls', animation_group='runs')
fig.add_scatter(x=chart_df['innings'], y=chart_df['strike_rate'], mode='lines', name='Strike Rate')
fig.add_scatter(x=chart_df['innings'], y=chart_df['Balls'], mode='lines', name='Balls')
Radar Plot of performance of Surya Kumar Yadav in the tournament

As we can see from the above radar plot, Surya Kumar Yadav scored above 200 runs while facing relatively few balls and keeping his strike rate (SR) above 180.

Top 10 Performers in batting and bowling in each category

The categories for batters are Most Runs, Most 4s, Most 6s, Best Strike Rate, Most Balls Faced, and Most Minutes Spent at the Crease.

The categories for bowlers are Most Wickets, Most 4s Conceded, Most 6s Conceded, Best Economy, Most Dot Balls, Most Maidens, Most Runs Conceded, and Most Extras.

def batting_stats(attr):
    if attr != 'SR':
        # counting stats are simply summed per batter
        overall_data = batting_summary.groupby(['Country', 'Batter'], as_index=False).agg({attr: 'sum'})
        sorted_data = overall_data.sort_values(by=attr, ascending=False)
        top_10_bat_data = sorted_data[:10]
    elif attr == 'SR':
        # strike rate is averaged, with a 20-ball minimum to filter out tiny samples
        overall_data = batting_summary.groupby(['Country', 'Batter'], as_index=False).agg(
            {attr: 'mean', 'Balls': 'sum'})
        filtered_data = overall_data[overall_data['Balls'] > 20]
        sorted_data = filtered_data.sort_values(by=attr, ascending=False)
        top_10_bat_data = sorted_data[:10]

    return top_10_bat_data


def bowling_stats(attr):
    if attr != 'ECON' and attr != 'WD':
        overall_data = bowling_summary.groupby(['Country', 'Bowler'], as_index=False).agg({attr: 'sum'})
        sorted_data = overall_data.sort_values(by=attr, ascending=False)
        top_10_bowl_data = sorted_data[:10]
    elif attr == 'WD':
        # wides are summed together with no-balls to cover extras
        overall_data = bowling_summary.groupby(['Country', 'Bowler'], as_index=False).agg({attr: 'sum', 'NB': 'sum'})
        sorted_data = overall_data.sort_values(by=attr, ascending=False)
        top_10_bowl_data = sorted_data[:10]
    elif attr == 'ECON':
        # economy is averaged and sorted ascending (lower is better),
        # with a 15-over minimum to filter out tiny samples
        overall_data = bowling_summary.groupby(['Country', 'Bowler'], as_index=False).agg({attr: 'mean', 'Overs': 'sum'})
        filtered_data = overall_data[overall_data['Overs'] > 15]
        sorted_data = filtered_data.sort_values(by=attr, ascending=True)
        top_10_bowl_data = sorted_data[:10]
    return top_10_bowl_data
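For example, the top run scorers and top wicket takers can be pulled like this:

top_run_scorers = batting_stats('Runs')        # uses the renamed batting columns
top_wicket_takers = bowling_stats('Wickets')   # uses the renamed bowling columns
print(top_run_scorers)
print(top_wicket_takers)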

Individual Performance of a Player

Now we will look at the match-by-match performance of a particular player.

def Batter_Perf(option):
    df = batting_summary[batting_summary['Batter'] == option]
    return df

def Bowler_Perf(option1):
    df = bowling_summary[bowling_summary['Bowler'] == option1]
    return df

Deploying our Findings into Web Application using Streamlit

Streamlit is an open-source Python library that makes it easy to build beautiful and interactive web-based applications for data science and machine learning. It allows developers to create custom UI elements and interactive visualizations with just a few lines of code, eliminating the need for JavaScript or other front-end web development skills. It’s designed to make it simple for data scientists and machine learning engineers to share their work and insights with others, without having to worry about the complexities of web development.

Here are some of the Streamlit functions used to create this web application:

streamlit.columns is a function in the Streamlit library used to lay out components within a Streamlit app. It creates a row of columns within the app, and each column acts as a container into which components can be placed.

streamlit.set_page_config is a function in the Streamlit library used to set configuration options for the current page of a Streamlit app.

streamlit.header is a function in the Streamlit library that creates a header element within a Streamlit app. A header is typically used to provide a title or descriptive text at the top of a page or section of the app.

streamlit.pyplot is a function in the Streamlit library that integrates the popular matplotlib library for plotting into Streamlit. The streamlit.pyplot function creates a visual plot within the Streamlit app, based on the matplotlib syntax.

streamlit.image is a function in the Streamlit library that allows you to display an image within a Streamlit app.
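Here is a minimal sketch of how these functions fit together; the layout and the image path are illustrative, not the app's exact code (team1 and top5_def are reused from the defending/chasing analysis above):

import streamlit as st
import matplotlib.pyplot as plt

st.set_page_config(page_title='T20 World Cup 2022 Analysis', layout='wide')
st.header('Analysis of the T20 World Cup 2022')

col1, col2 = st.columns(2)
with col1:
    fig, ax = plt.subplots()
    ax.bar(team1, top5_def)             # teams good at defending totals
    ax.set_ylabel('Wins while defending')
    st.pyplot(fig)                      # render the matplotlib figure
with col2:
    st.image('screenshots/app.png')     # hypothetical image path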

The web application for this data analysis of the T20 World Cup 2022 can be accessed through this link: https://2022-t20worldcupexplorer.streamlit.app/

Here are some screenshots of the web app:

THANK YOU FOR READING THIS BLOG, HOPE YOU HAVE ENJOYED IT