Data Visualization using Matplotlib and Seaborn
What is Data Visualization?
In today's world, data is generated everywhere, and its volume grows every day. We can see this in real time on our phones: social media activity, emails, and bank transactions keep increasing day by day. Is it possible to view such massive data in a structured form? Yes, through data visualization, which is the graphical representation of data. In the big data world, there are several data visualization tools capable of analysing massive datasets to support decision making.
Today, we are going to implement data visualization using the Bank Marketing dataset from the UCI Machine Learning Repository.
Let’s start…
Download the dataset from the link below.
Load the data into a DataFrame:
import pandas as pd

df = pd.read_csv(r'\Dataset\bank-full.csv', sep=';')
df
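If the file is not at hand, the same call can be tried on a few made-up rows. The sketch below assumes a semicolon-separated layout like the one used above; the column subset and the sample values are invented for illustration, and `sample_df` is a hypothetical name.

```python
import io

import pandas as pd

# Hypothetical sample in a semicolon-separated layout (values are made up).
sample_csv = io.StringIO(
    'age;job;balance;loan\n'
    '58;management;2143;no\n'
    '44;technician;29;no\n'
    '33;blue-collar;2;yes\n'
)

# sep=';' tells pandas that fields are separated by semicolons, not commas.
sample_df = pd.read_csv(sample_csv, sep=';')

# Quick sanity checks after loading: table size and column types.
print(sample_df.shape)
print(sample_df.dtypes)
```

Checking the shape and dtypes right after loading is a cheap way to catch a wrong separator, which would otherwise leave everything in a single column.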
Bank marketing dataset
Matplotlib
Histogram
A histogram is a chart that groups numeric data into bins and displays the bins as adjacent columns. Histograms are used to depict the distribution of a dataset: how often values fall into each range.
From the histogram of the balance column, we can see that roughly 39,000 people fall into the bin around a balance of 0, whereas about 5,000 people fall into the bin around 15,000.
df['balance'].plot(kind='hist')
Histogram of the balance column
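To see the numbers behind the bars, the bin counts can also be computed directly. This is a minimal sketch using `numpy.histogram` on made-up balances; it assumes ten equal-width bins, which is the default for `plot(kind='hist')`.

```python
import numpy as np
import pandas as pd

# Hypothetical balances (made up for illustration).
balances = pd.Series([0, 5, 10, 20, 50, 150, 450, 950, 1000])

# Ten equal-width bins between the minimum and maximum value.
counts, edges = np.histogram(balances, bins=10)

# Print each bin range together with the number of values that fall in it.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:7.1f} to {right:7.1f}: {n}")
```

Because the bins are equally wide, a few very large balances stretch the range and push most values into the first bin, which is why the real plot is dominated by the bar near 0.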
PieChart
Let's determine which job category holds the largest share of the total balance, used here as a rough proxy for being highly paid.
First, group the data by job and sum the balance for each group:
df_group = df.groupby(['job'])['balance'].sum()
Plot the grouped data as a pie chart:
df_group.plot.pie(figsize=(10,20), autopct="%.2f")
Management holds the largest share of the total balance, followed by blue-collar jobs.
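The percentages printed on the slices by `autopct` are each group's share of the grand total. As a rough sketch on made-up rows (the `toy` frame and its values are hypothetical), the same shares can be computed without plotting:

```python
import pandas as pd

# Hypothetical rows (values are made up for illustration).
toy = pd.DataFrame({
    'job': ['management', 'management', 'blue-collar', 'technician'],
    'balance': [3000, 1000, 2500, 1500],
})

# Total balance per job, as in the grouping step above.
totals = toy.groupby('job')['balance'].sum()

# Each group's share of the grand total, in percent, largest first.
shares = (totals / totals.sum() * 100).round(2).sort_values(ascending=False)
print(shares)
```

Having the shares as numbers makes it easier to rank groups whose slices look similar in size.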
Another interesting question: what proportion of people have taken a loan?
df['loan'].value_counts(normalize=True).plot(kind='pie', autopct="%.1f")
Around 16% of the people have taken a loan.
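The `normalize=True` argument is what turns raw counts into fractions. A small sketch on made-up flags (the `loan` series here is hypothetical, not the real column) shows both forms side by side:

```python
import pandas as pd

# Hypothetical loan flags (made up for illustration).
loan = pd.Series(['no', 'no', 'yes', 'no', 'no'], name='loan')

counts = loan.value_counts()                    # absolute counts per value
fractions = loan.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print((fractions * 100).round(1))               # the same fractions as percentages
```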
Count
import matplotlib.pyplot as plt

plt.figure(figsize=(20,10))
plt.xticks(rotation=90)
df.job.value_counts().plot(kind='barh')
Which job is the most common? Clearly, blue-collar jobs account for the largest number of people.
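One detail worth knowing: a horizontal bar chart draws the first row at the bottom, so the descending order returned by `value_counts()` puts the largest bar last when reading from the top. A possible variation, sketched on made-up jobs, sorts the counts in ascending order so the most common job ends up at the top. The `Agg` backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical job column (made up for illustration).
jobs = pd.Series(['blue-collar', 'management', 'blue-collar',
                  'technician', 'blue-collar', 'management'])

# Ascending order, so the largest count is plotted last, i.e. at the top.
ordered = jobs.value_counts().sort_values()

fig, ax = plt.subplots(figsize=(8, 4))
ordered.plot(kind='barh', ax=ax)
ax.set_xlabel('Number of people')
ax.set_ylabel('Job')
fig.tight_layout()
```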
BarChart
Determine which education level is associated with the highest total balance.
df_loan = df.groupby(['education'])['balance'].sum().reset_index()
df_loan.plot.bar(x='education', y='balance')
From the plot, the secondary and tertiary education levels hold the highest total balances.
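A group total depends on how many people are in the group as well as on how much each of them holds. As a complementary check, sketched here on made-up rows (the `toy` frame is hypothetical), the sum, mean, and count can be compared in one table:

```python
import pandas as pd

# Hypothetical rows (values are made up for illustration).
toy = pd.DataFrame({
    'education': ['secondary', 'secondary', 'secondary', 'tertiary', 'primary'],
    'balance': [500, 700, 600, 1500, 400],
})

# Total, average, and group size for each education level.
summary = toy.groupby('education')['balance'].agg(['sum', 'mean', 'count'])
print(summary)
```

In this toy data the largest total and the largest average belong to different groups, which is why it can help to look at both before drawing conclusions.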
Seaborn
Catplot
This function provides access to several axes-level functions that show the relationship between a numerical and one or more categorical variables using one of several visual representations.
First, filter the dataset to the people whose balance is greater than 100:
greater_100_balance = df[df['balance'] > 100]
greater_100_balance
Rows with balance > 100
Plot the balance across the months for each marital status, as below. With kind='bar', each bar shows the mean balance for that month.
import seaborn as sns
sns.catplot(x='month', y='balance', col='marital', data=greater_100_balance, kind='bar')
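The bar heights in this kind of plot are group means, so they can be reproduced with a plain groupby. The sketch below uses made-up rows and the same filter as above; `toy`, `filtered`, and `mean_balance` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows (values are made up for illustration).
toy = pd.DataFrame({
    'marital': ['married', 'married', 'single', 'single', 'married', 'single'],
    'month': ['may', 'may', 'may', 'jun', 'jun', 'may'],
    'balance': [200, 400, 150, 900, 500, 50],
})

# Same filter as above: keep only balances greater than 100.
filtered = toy[toy['balance'] > 100]

# Mean balance for every (marital status, month) pair, one value per bar.
mean_balance = filtered.groupby(['marital', 'month'])['balance'].mean()
print(mean_balance)
```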
Pairplot
Plot pairwise relationships in a dataset.
By default, this function will create a grid of Axes such that each numeric variable in data will be shared across the y-axes across a single row and the x-axes across a single column. The diagonal plots are treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column.
It is also possible to show a subset of variables or plot different variables on the rows and columns.
sns.pairplot(df)
Countplot
Show the counts of observations in each categorical bin using bars.
sns.countplot(x='housing', data=df)
The plot shows how many people have a housing loan and how many do not.
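To split each bar by a second categorical column, `countplot` takes a `hue` argument, and `pandas.crosstab` gives the same counts as a table. The sketch below assumes made-up `housing` and `loan` values; the `Agg` backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical flags (values are made up for illustration).
toy = pd.DataFrame({
    'housing': ['yes', 'yes', 'no', 'yes', 'no'],
    'loan': ['no', 'yes', 'no', 'no', 'no'],
})

# Counts for every combination of the two columns.
table = pd.crosstab(toy['housing'], toy['loan'])
print(table)

# The same counts as grouped bars: one bar per loan value within each housing value.
ax = sns.countplot(x='housing', hue='loan', data=toy)
```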
Scatterplot
The relationship between x and y can be shown for different subsets of the data using the hue, size, and style parameters. These parameters control what visual semantics are used to identify the different subsets. It is possible to show up to three dimensions independently by using all three semantic types, but this style of plot can be hard to interpret and is often ineffective. Using redundant semantics (i.e. both hue and style for the same variable) can be helpful for making graphics more accessible.
sns.scatterplot(x='age', y='balance', data=df)
People around the age of 50 tend to have higher balances. The peak in balance appears between the ages of 50 and 60.
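The redundant semantics mentioned above can be tried by passing the same column to both `hue` and `style`. This is only a sketch on made-up rows; the choice of `marital` as the grouping column is an assumption for illustration, and the `Agg` backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical rows (values are made up for illustration).
toy = pd.DataFrame({
    'age': [25, 32, 41, 50, 58, 36],
    'balance': [120, 800, 1500, 2300, 1900, 400],
    'marital': ['single', 'married', 'married', 'divorced', 'married', 'single'],
})

# Colour and marker shape both encode marital status, so groups stay
# distinguishable even when colour alone is hard to tell apart.
ax = sns.scatterplot(x='age', y='balance', hue='marital', style='marital', data=toy)
```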
Relplot
relplot is the figure-level interface for relational plots and draws a scatter plot by default. As with scatterplot, the hue, size, and style parameters control which visual semantics are used to identify different subsets of the data.
sns.relplot(x='day', y='balance', data=df, hue='month')
The balance recorded on the 3rd of June is higher than the balances seen in November.
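A reading like this is easy to get wrong by eye, so it can be worth confirming with a quick lookup. The sketch below uses made-up rows (the `toy` frame and its values are hypothetical) to find the row with the largest balance and the maximum per month:

```python
import pandas as pd

# Hypothetical rows (values are made up for illustration).
toy = pd.DataFrame({
    'day': [3, 3, 17, 21],
    'month': ['jun', 'nov', 'nov', 'may'],
    'balance': [9000, 1200, 2500, 800],
})

# Row holding the single largest balance.
top = toy.loc[toy['balance'].idxmax()]
print(top['day'], top['month'], top['balance'])

# Largest balance seen in each month, for comparison.
monthly_max = toy.groupby('month')['balance'].max()
print(monthly_max)
```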
Jointplot
Draw a plot of two variables with bivariate and univariate graphs.
sns.jointplot(x=df['day'], y=df['balance'])
The illustration above was produced using jointplot.