Practical Data Analysis with Pandas and SQL
Let us perform data analysis with both pandas and SQL and understand them practically.
Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool,
built on top of the Python programming language.
SQL stands for Structured Query Language. SQL lets you access data from RDBMS(Relational Database Management System) and can be used in data analysis.
Both Pandas and SQL are used in data analytics extensively.
In this blog, we will use pandas and SQL to analyze IPL(Indian Premiere League) data. Let us answer some of the questions through both SQL and pandas.
Understanding of the dataset:
Each row represents a ball bowled in IPL from 2008 to 2022. It has all the information like ‘batsman’, ‘bowler’, ‘batting_team’, ‘bowling_team’, etc.
match_id — This column uniquely identifies each IPL match.
runs_off_bat — This column gives the information on the number of runs scored by the batsman(excluding extras).
extras — Number of runs given as extras.
Each question is solved using SQL first and then using pandas.
1. How many balls are bowled in the history of IPL?
Since each row represents a ball bowled, we can count the number of rows to get the total number of balls bowled.
Query: select count(*) from ipl_ball_data;
Output: 225954
len(df)
Output: 225954
2. How many distinct players have batted in IPL?
Column ‘batsman’ has the name of the batsman. Unique values in this column give the required result.
Query: select count(distinct(batsman)) from ipl_ball_data;
Output: 605
df['batsman'].nunique()
Output: 605
3. How many sixes have AB de Villiers hit?
We can filter by the columns ‘batsman’ and ‘runs_off_bat’ to get the required result.
select count(*)
from ipl_ball_data
where batsman= 'AB DE VILLIERS' and runs_off_bat = 6;
output: 253
df[(df['batsman'] == 'AB de Villiers') & (df['runs_off_bat'] == 6)]['runs_off_bat'].count()
Output: 253
Get a list of the number of sixes scored in each venue (top 10)
SQL:
select venue, count(*) as num_of_sixes
from ipl_ball_data
where runs_off_bat = 6
group by venue
order by num_of_sixes desc
limit 10;
Pandas:
df[df['runs_off_bat']==6].groupby(['venue'])['runs_off_bat'].count().sort_values(ascending = False).head(10)
4. Get a list of sixes hit by AB De Villiers in each venue
We can use the columns ‘venue’ and ‘runs_off_bat’ and can use group by function.
SQL:
select venue , count(*) as num_sixes
from ipl_ball_data
where batsman= 'AB DE VILLIERS' and runs_off_bat = 6
group by venue
order by num_sixes desc;
Pandas:
df[(df['batsman'] == 'AB de Villiers') & (df['runs_off_bat'] == 6)].groupby(['venue'])['runs_off_bat'].count().sort_values(ascending =False).reset_index().rename(columns= {'runs_off_bat': 'sixes'})
5. Get a list of runs scored by Virat Kohli versus each team
We can get the information about the opposition in the column ‘bowling_team’.
SQL:
select bowling_team, sum(runs_off_bat) as runs
from ipl_ball_data
where batsman = 'V Kohli'
group by bowling_team
order by runs desc;
Pandas:
df[df['batsman']=='V Kohli'].groupby(['bowling_team'])['runs_off_bat'].sum().sort_values(ascending = False).reset_index().rename(columns={'runs_off_bat': 'runs'})
6. Get a list of the top 10 Highest team totals
Total runs include both runs_off_bat and extras.
SQL:
select sum(runs_off_bat + extras) as total_runs, batting_team
from ipl_ball_data
group by batting_team, match_id
order by total_runs desc
limit 10;
Pandas:
df['total_runs'] = df['runs_off_bat'] + df['extras']
df.groupby(['match_id', 'batting_team'])['total_runs'].sum().sort_values(ascending =False).head(10).droplevel(0).reset_index()
7. Get a list of the top 10 run scorers
Adding the values of runs_off_bat for each batsman will give the required result.
SQL:
select batsman, sum(runs_off_bat) as runs
from ipl_ball_data
group by batsman
order by runs desc limit 10;
Pandas:
df.groupby(['batsman'])['runs_off_bat'].sum().sort_values(ascending=False).reset_index().head(10)
8. Get a list of 2nd, 3rd and 4th highest run-getters.
SQL: Here we need to sort the batsman with their total_runs in descending order but should display from 2nd row, so we can use offset (offset 1 — means leave the first row, offset 2 — means leave the first 2 rows).
select batsman, sum(runs_off_bat) as total_runs
from ipl_ball_data
group by batsman
order by total_runs desc
limit 3 offset 1;
Pandas:
df.groupby('batsman')['runs_off_bat'].sum().sort_values(ascending =False).reset_index().loc[1:3]
9. Get the list of batsmen who have scored more than 1 century, with the number of centuries
SQL:
with cte as(
select sum(runs_off_bat) as Total_runs, batsman
from ipl_ball_data
group by match_id , batsman
having Total_runs>=100)
select batsman ,count(*) as num_of_100s
from cte group by batsman
having num_of_100s>1
order by num_of_100s desc;
Pandas:
batsman_runs = df.groupby(['match_id' , 'batsman'])['runs_off_bat'].sum().sort_values(ascending = False).droplevel(0).reset_index()
batsman_runs[batsman_runs['runs_off_bat']>=100].groupby('batsman')['runs_off_bat'].count().sort_values(ascending = False)
10. Get a list of bowlers with the highest dismissals against MS Dhoni.
select bowler, count(*) as num_wickets
from ipl_ball_data
where batsman = 'MS Dhoni' and
wicket_type in ('caught' , 'bowled' , 'lbw' , 'stumped' , 'caught and bowled' , 'hit wicket')
group by bowler
order by num_wickets desc limit 10;
11. Get a list of the highest run-getters in each season
This result can be achieved using windows and CTE in SQL and drop_duplicates in pandas
SQL:
with x as (
select batsman, season, sum(runs_off_bat) as runs
from ipl_ball_data
group by batsman, season
order by runs desc),
y as (select *, rank() over(partition by season order by runs desc)as rk from x)
select batsman , season, runs from y where rk = 1;
Pandas:
df.groupby(['batsman' , 'season'])['runs_off_bat'].sum().sort_values(ascending = False).reset_index().drop_duplicates(subset = ['season']).sort_values(by ='season').reset_index(drop=True)
12. Get a list of the top 10 wicket-takers
Bowlers will get the wicket’s credit only if the method of dismissal is one of the ‘caught’, ‘bowled’, ‘lbw’, ‘stumped’, ‘caught and bowled’, and ‘hit_wicket’.
SQL:
with cte as
(select *,
IF(wicket_type in ('caught' , 'bowled' , 'lbw' , 'stumped' , 'caught and bowled' , 'hit wicket'),
1,0) as bowler_wk from ipl_ball_data)
select bowler, count(*) as wickets
from cte where bowler_wk = 1
group by bowler
order by count(*) desc
limit 10;
Pandas:
def is_bowler_wiket(wicket_type):
if wicket_type in ['caught' , 'bowled' , 'lbw' , 'stumped' , 'caught and bowled' , 'hit wicket']:
return 1
else:
return 0
df['bowler_wicket'] = df['wicket_type'].apply(is_bowler_wiket)
df.groupby('bowler')['bowler_wicket'].sum().sort_values(ascending = False).head(10)
Thank You….