Python Visualization 101

Muturi David
9 min read · Mar 26, 2020


Photo by Clay Banks on Unsplash

Data visualization is usually the culmination of a data science project, because the narrative of the project is carried by its visualizations. At the end of the day, your project should deliver actionable insights that help the business scale or solve an existing problem. In my career as a data scientist, I have learned that if a project cannot be translated into feasible, implementable actions, it is a failed project, regardless of the fancy code and algorithms used. Clear and catchy visualization is the way into your boss’s head.

If you want business executives and other consumers of your insights to be excited and eager to act on your project, then delivering visuals that capture both their minds and their emotions is not optional. As the common adage goes, a picture is worth a thousand words. People are quick to discard what they don’t understand. A good visualization is therefore mandatory for every successful data science project, because it helps people connect with your work.

Data visualization is simply the transformation of your data into visual forms such as graphs, maps and summary tables. It makes big data and complex data science projects easier to understand and much more interesting. Through data visualization, it’s easy to observe and explore patterns, trends and correlations that might otherwise go undetected.

There are over 20 common types of visualization charts, in both 2D and 3D, that one can use to tell the story in the data. Python offers several great graphics libraries packed with features for creating beautiful and informative charts. With Python’s visualization modules, one can easily create basic visualizations and interactive visualizations, and even combine them into a dashboard.

In this blog, I hope to give you a smooth introduction to visualization in Python. We shall look at the three most common visualization packages in Python, namely Pandas, Matplotlib and Seaborn, working in a Jupyter Notebook. We shall use students data, which you can access here.

Enjoy!

Importing the packages to be used

Before loading the packages, it’s a good habit to set your working directory. I recommend setting it to the folder that contains the resources (datasets) for your project. Use cd <path to your folder> to change your directory.
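In a notebook, cd works as a Jupyter magic command. In a plain Python script, the same step can be sketched with the standard os module. The folder below is a temporary directory standing in for your project folder, so treat the path as a placeholder rather than a real location.

```python
import os
import tempfile

# Hypothetical project folder: a temporary directory stands in for it here.
project_dir = tempfile.mkdtemp()

previous_dir = os.getcwd()   # remember where we started
os.chdir(project_dir)        # same effect as `cd <path to your folder>`
current_dir = os.getcwd()    # confirm the change
print(current_dir)

os.chdir(previous_dir)       # switch back when done
```

Relative file names passed to read functions are resolved against whatever os.getcwd() returns at that moment.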

Now we import all the packages we require using the import statement.

## Importing the packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Importing the profiling package
from pandas_profiling import ProfileReport
# Setting up the graphical environment
%matplotlib inline
sns.set_style('whitegrid')
plt.style.use('fivethirtyeight')
  • Pandas: Pandas is the module we need to manipulate data and work with data frames. It also offers some basic visualization functionality through the plot function.
  • Numpy: Numpy is a powerful Python package that offers a wide range of functionality for numerical and arithmetic operations.
  • Matplotlib: Matplotlib is a low-level visualization library in Python for creating static, animated and interactive charts.
  • Seaborn: Seaborn is a visualization library built on Matplotlib that provides a high-level interface for attractive and informative charts.
  • Warnings: Warnings is used to manage warning messages in Python. In our case, we chose to silence them by writing warnings.filterwarnings('ignore').
  • Pandas Profiling: This is like a magical package to me. It’s very helpful for carrying out quick EDA (exploratory data analysis). We shall use Pandas Profiling to generate an HTML summary report, including descriptive statistics, from our data. If you don't have it installed, you can install it with the pip package manager by running pip install pandas-profiling

Loading the Dataset

With Pandas, you can load data in several common formats using the read functions. In our case, we shall use the read_csv() method to read our students data, since it’s in CSV format.

This is hypothetical data on students’ lifestyle and academic performance at a certain university in Kenya.
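A minimal sketch of the loading step follows. The article does not name the CSV file, so the example reads a few made-up rows from an in-memory string instead. Only the column names are taken from the charts later in this post, and the values are invented for illustration.

```python
import io

import pandas as pd

# Made-up rows standing in for the students file (assumption: the real file
# has many more rows and columns; only these column names come from the post).
sample_csv = io.StringIO(
    "Age,ApproxHeight,ApproxWeight,YearofStudy\n"
    "19,165,58,1\n"
    "21,172,66,3\n"
    "20,158,52,2\n"
)

# With the real dataset you would pass the path to your CSV file instead.
df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first rows, to confirm the load worked
```

Checking shape and head() right after loading is a cheap way to catch a wrong delimiter or a missing header row before plotting anything.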

Using Pandas profiling to perform EDA

Pandas Profiling requires just one line of code to generate a comprehensive report with statistical summaries that help you understand the dataset, its variables and the relationships between them. The HTML report is interactive and can easily be shared with others, even if they don’t know how to code.

Let’s generate a report from our data. Just type the code below.

ProfileReport(df)

With that one line of code, you get a full report covering missing values in the dataset, statistics for the numerical variables, frequency summaries for the categorical variables, and the relationships between variables.
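To make the contents of the report more concrete, here is a rough sketch of the same kinds of summaries computed with plain Pandas. It is not how Pandas Profiling works internally, and the small data frame is made up. Only the column names come from this post.

```python
import pandas as pd

# Made-up sample with one missing height (values are illustrative only).
df = pd.DataFrame({
    "Age": [19, 21, 20, 22],
    "ApproxHeight": [165.0, None, 158.0, 172.0],
    "Previous_Exam_MeanGrade": ["B", "A", "B", "C"],
})

# Missing values per column
missing_per_column = df.isna().sum()

# Descriptive statistics for the numerical columns
numeric_summary = df.describe()

# Frequency summary for a categorical column
grade_frequencies = df["Previous_Exam_MeanGrade"].value_counts()

# Pairwise correlations between the numerical columns
correlations = df.select_dtypes("number").corr()

print(missing_per_column)
print(numeric_summary)
print(grade_frequencies)
print(correlations)
```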

Pandas Visualization

As mentioned, Pandas is rich in data manipulation and analysis functions. Among these is the plot function, which you can use to make beautiful graphs. Let us look at the most common plots.

1. Scatter Plot

A scatterplot is a type of data display that shows the relationship between two numerical variables. To produce one using Pandas, use the plot function and set kind='scatter'.

df.plot(x='Age',y='ApproxHeight',kind='scatter',
title='Students Age v/s Height',figsize=(10,6))

Alternatively, you can use the following code to produce the same Pandas chart.

df.plot.scatter(x='Age',y='ApproxHeight',
title='Students Age v/s Height',figsize=(10,6))

2. Line Graph

A line graph is common in time series, where it’s used to visualize the change of a variable over time. Let’s produce a line chart for age, approximate height and approximate weight from our data. Use loc to select our variables of interest, then call the plot function with kind='line'.

plot_data=df.loc[:,["Age","ApproxHeight","ApproxWeight"]]
plot_data.plot(kind='line',figsize=(12,6),
title="Sample Line graphs from Students data")

3. Histogram

A histogram is an approximate representation of the distribution of numerical data. Producing one in Pandas is as easy as the code below.

df['Expense_Accommodation'].hist(bins=10)

You can also draw multiple histograms when you want to visualize the distributions of two or more numerical variables in one chart. Just add subplots=True and define your layout. For instance, let’s visualize both accommodation and total semester expenses.

plot_data=df.loc[:,["Expense_Accommodation","Expense_Semester"]]
plot_data.plot(subplots=True, layout=(1,2),figsize=(12,5),bins=20,kind='hist')

4. Bar Charts

A bar graph presents categorical data with rectangular bars whose heights are proportional to the values they represent. Let us create one from the previous grade variable.

df['Previous_Exam_MeanGrade'].value_counts().sort_index().plot(kind='bar',figsize=(7,5))

You can also make it horizontal by specifying kind='barh'.
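As a small sketch of the horizontal variant, the example below reuses the value_counts pattern from above on a handful of made-up grades. The axis labels are my own additions for readability.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up grades; only the column name comes from the post.
df = pd.DataFrame({"Previous_Exam_MeanGrade": ["B", "A", "B", "C", "A", "B"]})

# Count students per grade and order the grades alphabetically
grade_counts = df["Previous_Exam_MeanGrade"].value_counts().sort_index()

# kind='barh' draws the bars horizontally, so the counts run along the x-axis
ax = grade_counts.plot(kind="barh", figsize=(7, 5))
ax.set_xlabel("Number of students")
ax.set_ylabel("Previous exam mean grade")
```

Horizontal bars tend to read better when the category labels are long, since the labels do not need to be rotated.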

Matplotlib Library

Making plots using Matplotlib is slightly different from Pandas. Matplotlib gives you much more flexibility for customizing your charts, at the expense of writing more code. We shall reproduce with Matplotlib the same charts we made with Pandas.

1. Scatter Plots

To create a scatter plot in Matplotlib we use the scatter function. We first create a figure and an axis using plt.subplots, then give our plot a title and labels.

# create a figure and axis
fig, ax = plt.subplots(figsize=(11,5))
# scatter Age against ApproxHeight
ax.scatter(df['Age'], df['ApproxHeight'])
# set a title and labels
ax.set_title('Students Dataset')
ax.set_xlabel('Age')
ax.set_ylabel('Height')

2. Line Charts

Line charts in Matplotlib are produced in the same manner as a scatter plot. In our case, we use the plot function to draw the lines. We want the same age, approximate weight and approximate height chart as in Pandas, so we shall use a for loop to draw the 3 lines one at a time.
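The loop described above could look roughly like the sketch below. The data frame holds made-up values, and plotting against the row index is an assumption, since the data has no time column.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values; only the column names come from the post.
df = pd.DataFrame({
    "Age": [19, 21, 20, 22, 23],
    "ApproxHeight": [165, 172, 158, 170, 168],
    "ApproxWeight": [58, 66, 52, 70, 64],
})

columns = ["Age", "ApproxHeight", "ApproxWeight"]

# create a figure and axis
fig, ax = plt.subplots(figsize=(12, 6))

# draw one line per column, using the row index on the x-axis
for column in columns:
    ax.plot(df.index, df[column], label=column)

# set a title, labels and a legend so the lines can be told apart
ax.set_title("Sample line graphs from students data")
ax.set_xlabel("Row index")
ax.set_ylabel("Value")
ax.legend()
```

Passing label= inside the loop is what lets ax.legend() name each line without any extra bookkeeping.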

3. Histogram

fig, ax = plt.subplots()
ax.hist(df['Expense_Accommodation'])
ax.set_title('Accommodation Expenses Distribution')
ax.set_xlabel('Accommodation Expenses')
ax.set_ylabel('Frequency')

Seaborn

Seaborn is just fantastic. It is built on Matplotlib and it allows us to make prettier charts with fewer lines of code. You are going to love it.

1. Scatter plot

In Seaborn, you specify the size of your charts using Matplotlib’s figure function. To make a scatter plot, we just use Seaborn’s scatterplot function. Note the arguments: x is the variable on the x-axis, y is the variable on the y-axis, data is the dataset to plot, and hue is the grouping variable. We call set_title after creating the chart to give our graph a title.

plt.figure(figsize=(11,5))
sns.scatterplot(x='ApproxHeight',y='Expense_Semester',
data=df,hue="YearofStudy").set_title("Scatter plot of semester expenses against approximate height, by year of study")

2. Regression plots

Regression plots are used to visualize the relationship between two numerical variables as determined through regression. regplot draws a scatterplot of two variables, x and y, then fits the regression model y ~ x and plots the resulting regression line with a 95% confidence interval for that regression.

Let's make a regression plot to visualize the relationship between approximate weight and approximate height.

plt.figure(figsize=(10,5))
sns.regplot(x="ApproxHeight",y='ApproxWeight',data=df)
regression plot

3. Histogram with a density plot

With Seaborn, it’s pretty easy to make a histogram with a density plot. Just use the distplot function with kde=True.

Histogram with a density plot

4. Barplot

Bar plots in Seaborn are also very direct and easy to plot. Let us visualize the year of KCSE variable.
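The original code for this chart is not included here, so the sketch below is only one way to do it. The column name KCSE_Year is hypothetical, because the post does not give the real name, and the values are made up. It uses countplot, which counts the rows in each category and draws the counts as bars.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# "KCSE_Year" is a hypothetical column name and the years are made up.
df = pd.DataFrame({"KCSE_Year": [2015, 2016, 2016, 2017, 2016, 2015]})

plt.figure(figsize=(8, 5))

# one bar per year, with height equal to the number of students
ax = sns.countplot(x="KCSE_Year", data=df)
ax.set_title("Number of students by year of KCSE")
ax.set_ylabel("Number of students")
```

If you already have the counts, or want a mean per category instead, sns.barplot with explicit x and y columns is the alternative.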

5. Box Plots

A box plot shows the distribution of data based on a five-number summary: minimum, first quartile, median, third quartile and maximum. Let’s plot a box plot of the approximate weight distribution grouped by year of study. We just need to use the sns.boxplot function.
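A possible version of that box plot is sketched below, together with the five-number summary computed directly in Pandas so you can compare the numbers with the boxes. The weights are made up. The column names are the ones used elsewhere in this post.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up weights for three years of study (illustrative values only).
df = pd.DataFrame({
    "YearofStudy": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "ApproxWeight": [55, 60, 65, 58, 62, 70, 61, 66, 75],
})

# Five-number summary per year: min, first quartile, median, third quartile, max
summary = df.groupby("YearofStudy")["ApproxWeight"].describe()[
    ["min", "25%", "50%", "75%", "max"]
]
print(summary)

plt.figure(figsize=(10, 5))

# one box per year of study
ax = sns.boxplot(x="YearofStudy", y="ApproxWeight", data=df)
ax.set_title("Approximate weight by year of study")
```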

That’s pretty much what you need to know to get started with making impressive charts in Python. But it would be nice to show how to create some other useful, though less popular, charts in Python.

Bonus

Heatmaps

Heatmaps are cool charts used to visualize a numerical variable using colour scales. Wikipedia defines a heatmap as a graphical representation of data where the individual values contained in a matrix are represented as colours.

Let us use seaborn to create a heatmap to visualize the correlations between our numerical variables.

plt.figure(figsize=(10,6))
## selecting the data with variables of interest
plt_data=df.loc[:,['Age','ApproxHeight','ApproxWeight',
'Expense_Semester','Expense_Accommodation']]
## Creating a heatmap
sns.heatmap(plt_data.corr(),annot=True)

Each cell represents the correlation between the two intersecting variables. The colour scale is shown by the bar on the right; the darker the colour, the lower the correlation.

Seaborn Pairplots

Pair plots come in handy when carrying out EDA. They allow us to visualize the distribution of each variable and the relationship between each pair of variables in a single matrix chart of histograms and scatterplots.

You just need one line of code to produce one.

sns.pairplot(plt_data)

You can get the entire script here.

Recommendation

To enrich your data science skillset, I highly recommend you check out this cool site https://perfectresearchconsultancy.com/
