Fundamentals of Data Visualization in Python

Azam Sayeed
Published in Analytics Vidhya
5 min read, Sep 28, 2019
  • Data visualization is the graphical or pictorial representation of data. It lets high-level decision-makers read analytics at a glance, grasp difficult concepts, and spot new patterns with ease.

Example: with K-means clustering, it is much easier to understand the cluster labels an algorithm has assigned when the data is plotted. It is not practical to identify clusters just by inspecting the raw data.

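The famous Anscombe's Quartet illustrates the importance of data visualization: four datasets with nearly identical summary statistics look completely different once plotted. Python code for it is readily available.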
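
To make the point concrete, here is a minimal sketch that hard-codes the four Anscombe datasets (typed in by hand rather than loaded from a file) and computes the same summary statistics for each using only the standard library. The helper name `pearson` and the `summary` dictionary are ours, introduced just for this illustration.

```python
from statistics import mean

# Anscombe's quartet: four small (x, y) datasets
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# (mean of x, mean of y, correlation) for each dataset
summary = {name: (mean(xs), mean(ys), pearson(xs, ys))
           for name, (xs, ys) in quartet.items()}

for name, (mx, my, r) in summary.items():
    print(f"{name}: mean(x)={mx:.2f}  mean(y)={my:.2f}  r={r:.3f}")
```

All four rows come out almost the same (means of about 9 and 7.5, correlation of about 0.82), yet a scatter plot of each dataset shows four very different shapes. That gap is exactly what a plot reveals and a table of statistics hides.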

Matplotlib in Python

  • An open-source Python library for visualization
  • Creates graphs and plots from a Python script, lets you save plots to the local system, and provides an object-oriented API

plt.savefig('ScatterPlot.png')

  • It has a module called pyplot, which provides simple functions for building up a plot (lines, images, text, labels, etc.)
  • Supports a wide range of graphs

Line plot, Bar plot, Scatter Plot, Histogram

Image Plot, Box plot, Violin Plot, Stream plot, Quiver plot, Area Plot, Pie Plot, and Donut Plot

  • Easy integration with pandas and NumPy.
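
As a rough sketch of the object-oriented API and of saving a plot to disk, the snippet below builds a figure through explicit Figure and Axes objects and writes it to a file. The function name `save_scatter`, the sample points, and the temporary output folder are assumptions made for this example, not part of the original walkthrough.

```python
import os
import tempfile

import matplotlib.pyplot as plt


def save_scatter(path):
    """Draw a small scatter plot with the object-oriented API and save it."""
    fig, ax = plt.subplots()        # explicit Figure and Axes objects
    ax.scatter([1, 2, 3, 4], [2, 4, 1, 3])
    ax.set_title("Scatter Plot")
    fig.savefig(path, dpi=100)      # file format is inferred from the extension
    plt.close(fig)                  # release the figure once it is saved
    return path


# Hypothetical output location: the system temp folder, so nothing is left
# behind in the working directory
out_path = save_scatter(os.path.join(tempfile.gettempdir(), "ScatterPlot.png"))
print(out_path)
```

Calling methods on `fig` and `ax` instead of the `plt.*` functions does the same drawing, but makes it explicit which figure and which axes each call affects. That becomes handy once a script holds more than one plot.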

Line Plot

import numpy as np
import matplotlib.pyplot as plt
# display plots inline, below the notebook cell
%matplotlib inline
#defining the dataset
x=np.arange(0,10,0.1)
y=3*x+5
#plotting the datapoints
plt.plot(x,y)
plt.show()

Customizing Line Plots (compare with the basic line plot)

import numpy as np
import matplotlib.pyplot as plt
# display plots inline, below the notebook cell
%matplotlib inline
#defining the dataset
x=np.arange(0,10,1)
y=3*x+5
#plotting the datapoints
plt.plot(x,y,linewidth =2.0 , linestyle =":",color ='y',alpha =0.7, marker ='o')
plt.title("Line Plot Demo")
plt.xlabel("X-Axis")
plt.ylabel("Y-Axis")
plt.legend(['line1'], loc='best')
plt.grid(True)
plt.show()

Figure Size

...
y=3*x+5
#changing the figure
fig=plt.figure(figsize=(10,5))
#plotting the datapoints
...
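
Since the snippet above is abbreviated, here is one possible complete version as a sketch. The `x` and `y` values are borrowed from the earlier line plot example as an assumption; the only point being shown is that the figure is created with `figsize` before anything is plotted.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed data, reused from the earlier line plot example
x = np.arange(0, 10, 1)
y = 3 * x + 5

# figsize is (width, height) in inches; create the figure before plotting
fig = plt.figure(figsize=(10, 5))
plt.plot(x, y)
plt.title("Line Plot on a 10 x 5 inch figure")

# Read the size back to confirm it was applied
width, height = fig.get_size_inches()
print(width, height)
```

In a notebook the figure is displayed when the cell finishes; in a plain script, add `plt.show()` at the end.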

Subplots

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x=np.arange(0,10,1)
y1=2*x+5
y2=3*x+10
plt.subplot(2,1,1) #A
# B - plt.subplot(1,2,1) #(number of rows, number of columns, index of this plot)
plt.plot(x,y1)
plt.title('Graph1')
plt.subplot(2,1,2) #A
# B - plt.subplot(1,2,2)
plt.plot(x,y2)
plt.title('Graph2')
plt.show()
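
The same two panels can also be created in one call with `plt.subplots`, which returns the figure together with an array of axes. This is a sketch of layout B (one row, two columns) from the comments above; the `figsize` value is an arbitrary choice for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 10, 1)
y1 = 2 * x + 5
y2 = 3 * x + 10

# 1 row, 2 columns: the panels sit side by side
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
axes[0].plot(x, y1)
axes[0].set_title("Graph1")
axes[1].plot(x, y2)
axes[1].set_title("Graph2")
fig.tight_layout()   # keeps titles and tick labels from overlapping
```

With this form there is no need to keep track of which subplot is currently active, because each call names the axes it draws on.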

Bar plot

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
data = {'apples':20,'Mangoes':15, 'lemon':30,'Oranges':10}
names =list(data.keys())
values =list(data.values())
plt.subplot(3,1,1)
#fig =plt.figure(figsize =(10,5))
plt.bar(names,values,color ="orange")
plt.title("Bar Graph Demo")
plt.xlabel("Fruits")
plt.ylabel("Quantity")
plt.subplot(3,1,3)
plt.barh(names,values,color ="orange")
plt.title("Bar Graph Demo")
plt.xlabel("Quantity")  # barh draws the values along the x-axis
plt.ylabel("Fruits")
plt.show()

Scatter Plots

import matplotlib.pyplot as plt
%matplotlib inline
# dataset; note: a, y1 and y2 must all have the same length
a=[10,20,30,40,50,60,70,80]
y1=[2,3,5,6,1,4,5,3]
y2=[1,2,3,4,5,5,1,3]
plt.scatter(a,y1)
plt.scatter(a,y2)
plt.show()

Customizing Scatter Plots (compare with the basic plot)

import matplotlib.pyplot as plt
%matplotlib inline
# dataset; note: a, y1 and y2 must all have the same length
a=[10,20,30,40,50,60,70,80]
y1=[2,3,5,6,1,4,5,3]
y2=[1,2,3,4,5,5,1,3]
plt.scatter(a,y1, c='g',s=300,edgecolors='y',marker='o',alpha=0.5)
plt.scatter(a,y2, c='y',s=400,edgecolors='b',marker='3',alpha=1)
plt.legend(['y1','y2'],loc='best')
plt.xlabel("X-Axis")
plt.ylabel("Y-Axis")
plt.grid(True)
plt.show()

Histogram

  • A histogram is a graphical display of data using bars of different heights. Each bar groups the numbers that fall into one range, and a taller bar means more data falls in that range. A histogram shows the shape and spread of continuous sample data.

import matplotlib.pyplot as plt
%matplotlib inline
numbers = [10,90,12,16,19,12,20,26,28,30,38,35,34,45,60,68,64,62,70,78,75,79,85,94,95]
plt.hist(numbers,bins=[0,20,40,60,80,100], color='#FFF233',edgecolor='#000000')
plt.title("Histogram Demo")
plt.grid(True)
plt.xlabel("Range of values")
plt.ylabel("Freq of values")
plt.show()
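
To see where the bar heights come from, the counts can be computed directly. This sketch uses `np.histogram` on the same numbers and bin edges; it is only a check of the binning, not a replacement for the plot.

```python
import numpy as np

numbers = [10, 90, 12, 16, 19, 12, 20, 26, 28, 30, 38, 35, 34, 45, 60,
           68, 64, 62, 70, 78, 75, 79, 85, 94, 95]
bin_edges = [0, 20, 40, 60, 80, 100]

# Each bin covers [left, right), except the last one, which also
# includes its right edge
counts, edges = np.histogram(numbers, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")
```

The counts are 5, 7, 1, 8 and 4, which are the heights of the five bars. Note that the value 20 lands in the 20-40 bin, not the 0-20 bin, because of the half-open intervals.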

Box Plot and Violin Plot

  • A box plot summarizes a dataset efficiently and shows its key statistics at a glance, such as the quartiles, the median, and any outliers.
  • A violin plot is useful for large amounts of data, where showing each individual data point is not practical; it shows the shape of the distribution instead.

import matplotlib.pyplot as plt
%matplotlib inline
#data
total = [20,4,1,30,20,12,20,70,32,10]
order =[10,3,2,15,17,2,30,44,2,1]
discount = [30,10,20,5,10,20,50,60,20,45]
data = list([total, order, discount])
print(data)
plt.boxplot(data,showmeans =True)
plt.title("Box plot Demo")
plt.grid(True)
plt.show()

import matplotlib.pyplot as plt
%matplotlib inline
#data
total = [20,4,1,30,20,12,20,70,32,10]
order =[10,3,2,15,17,2,30,44,2,1]
discount = [30,10,20,5,10,20,50,60,20,45]
data = list([total, order, discount])
print(data)
plt.violinplot(data,showmeans =True, showmedians=True)
plt.title("Violin plot Demo")
plt.grid(True)
plt.show()
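
The numbers a box plot draws can be reproduced by hand. The sketch below does it for the `total` list, assuming Matplotlib's default whisker rule of 1.5 times the interquartile range; the variable names are ours.

```python
import numpy as np

total = [20, 4, 1, 30, 20, 12, 20, 70, 32, 10]

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = np.percentile(total, [25, 50, 75])
iqr = q3 - q1

# Default rule (whis=1.5): points further than 1.5 * IQR from the box
# are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in total if v < lower_fence or v > upper_fence]

print(q1, median, q3, outliers)
```

For this list the box runs from 10.5 to 27.5 with the median at 20, and the value 70 falls above the upper fence of 53, so it shows up as a separate point above the first box.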

Pie Chart, Donut Chart

import matplotlib.pyplot as plt
%matplotlib inline
#prepare the dataset
label=['Dog','Cat','Wolf','Lion']
sizes=[50,45,60,80]
plt.pie(sizes,labels =label)
plt.title("Pie Chart Demo")
plt.show()
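
The size of each wedge is simply that value's share of the total. A quick sketch of the arithmetic, using the same labels and sizes (the `shares` dictionary is introduced here only for illustration):

```python
label = ['Dog', 'Cat', 'Wolf', 'Lion']
sizes = [50, 45, 60, 80]

total = sum(sizes)
# Percentage of the whole pie taken by each wedge, rounded to one decimal
shares = {name: round(100 * size / total, 1) for name, size in zip(label, sizes)}
print(shares)
```

These are the same percentages that `autopct='%1.1f%%'` prints on the wedges in the customized version: 21.3, 19.1, 25.5 and 34.0.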

Customizing Pie Charts

import matplotlib.pyplot as plt
%matplotlib inline
#prepare the dataset
label=['Dog','Cat','Wolf','Lion']
sizes=[50,45,60,80]
#add colors
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
plt.pie(sizes,labels =label, colors =colors,autopct='%1.1f%%' ,shadow=True ,startangle = 90, explode=(0,0.1,0,0))
plt.title("Pie Chart Demo")
plt.show()

#Donut plot
import matplotlib.pyplot as plt
%matplotlib inline
group_names = ["GroupA","GroupB","GroupC"]
group_size=[20,30,50]
size_centre = [5]
#colors
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
pie1 =plt.pie(group_size, labels = group_names,radius =1.5,colors =colors)
pie2 = plt.pie(size_centre, radius=1.0, colors=['w'])  # white disc on top hides the centre of the pie
plt.show()
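
An alternative to covering the centre with a white pie is to give the wedges a width, so that they are drawn as a ring in the first place. This is a sketch using the `wedgeprops` argument of `pie`; the ring thickness of 0.4 and the title are arbitrary choices for the example.

```python
import matplotlib.pyplot as plt

group_names = ["GroupA", "GroupB", "GroupC"]
group_size = [20, 30, 50]
colors = ['#ff9999', '#66b3ff', '#99ff99']

fig, ax = plt.subplots()
# width is the thickness of the ring, measured inward from the outer radius
ax.pie(
    group_size,
    labels=group_names,
    colors=colors,
    radius=1.0,
    wedgeprops=dict(width=0.4, edgecolor='w'),
)
ax.set_title("Donut Plot Demo")
```

Because the hole is really empty here, this version also works on a non-white background, where a white centre disc would be visible.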

Area Plots

  • Similar to a line plot; the only difference is that the area under the line is filled with color.

import matplotlib.pyplot as plt
%matplotlib inline
#dataset
x=range(1,17)
y=[1,4,6,8,4,5,3,8,8,8,4,1,5,6,8,7]
plt.stackplot(x,y)
plt.show()

A few customizations

import matplotlib.pyplot as plt
%matplotlib inline
#dataset
x=range(1,17)
y=[1,4,6,8,4,5,3,8,8,8,4,1,5,6,8,7]
plt.stackplot(x,y, colors =['green'], alpha =0.5)  # colors expects a list, one entry per series
plt.plot(x,y, color='g')
plt.grid(True)
plt.show()
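
`stackplot` is really meant for several series stacked on top of one another. A minimal sketch with two made-up series (the names `online` and `in_store` and their values are invented for this illustration):

```python
import matplotlib.pyplot as plt

x = range(1, 9)
# Hypothetical series, made up for this example
online = [1, 4, 6, 8, 4, 5, 3, 8]
in_store = [2, 2, 3, 3, 4, 4, 5, 5]

fig, ax = plt.subplots()
# Each series is drawn on top of the previous one, so the upper edge of
# the chart is the running total of both
layers = ax.stackplot(x, online, in_store,
                      labels=["online", "in store"],
                      colors=["green", "orange"], alpha=0.5)
ax.legend(loc="upper left")
ax.grid(True)
```

With a single series, as in the examples above, the result is just a filled line plot; with several, the thickness of each colored band shows that series' contribution at each x value.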

More Examples using pandas

DataSet: https://www.dropbox.com/s/v3ux6vy7ajvltz0/Customerdata.csv?dl=0

1. Build a bar plot for the dataset. x-axis: Contract type, y-axis: count

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
customer= pd.read_csv(r'Customerdata.csv')
grp=customer.Contract.value_counts()
x=grp.keys()
y=grp.values
print(type(grp.values))
plt.bar(x,y,color ="orange")
plt.title("Distribution of Contract in dataset")
plt.xlabel("Contract Type of Customer")
plt.ylabel("count")
plt.show()

2. Build a Histogram. x-axis: Monthly Charges Incurred, y-axis: count

3. Build a scatter plot of TotalCharges (x-axis) vs. Tenure (y-axis).

NOTE: The kernel may hang while rendering this scatter plot. If it does, restart the kernel and make sure you are not plotting too much data at once.
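
One way to keep the amount of plotted data down is to draw a random sample of the rows. The sketch below uses a synthetic DataFrame as a stand-in for Customerdata.csv; the column names `tenure` and `TotalCharges`, the row count, and the sample size of 500 are all assumptions for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for Customerdata.csv; column names are assumptions
rng = np.random.default_rng(0)
tenure = rng.integers(1, 73, size=5000)
customer = pd.DataFrame({
    "tenure": tenure,
    "TotalCharges": tenure * rng.uniform(20, 120, size=5000),
})

# Plot a random subset instead of every row; random_state makes the
# sample repeatable between runs
subset = customer.sample(n=500, random_state=0)

fig, ax = plt.subplots()
ax.scatter(subset["TotalCharges"], subset["tenure"], s=10, alpha=0.5)
ax.set_xlabel("TotalCharges")
ax.set_ylabel("tenure")
```

It may also be worth checking `customer.dtypes` on the real file. If a numeric column was read in as text, converting it first with `pd.to_numeric(column, errors='coerce')` could help, since plotting thousands of distinct strings is much slower than plotting numbers.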

4. Build a box plot. x-axis: Payment Method of Customer, y-axis: Monthly Charges incurred. There are 3 payment methods: Electronic Check, Mailed Check, and Bank transfer.

Try it out yourself :)

Hint :

# the label strings must match the values in the PaymentMethod column exactly
a=customer[customer['PaymentMethod']=='Electronic Check']
b=customer[customer['PaymentMethod']=='Mailed Check']
c=customer[customer['PaymentMethod']=='Bank transfer']
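
If you get stuck, here is one possible shape for the solution, shown on a tiny made-up DataFrame rather than the real file. The column name `MonthlyCharges`, the label strings, and the values are assumptions; the idea is simply to pass `boxplot` one list of charges per payment method.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Tiny made-up frame standing in for Customerdata.csv
customer = pd.DataFrame({
    "PaymentMethod": ["Electronic Check", "Mailed Check", "Bank transfer"] * 4,
    "MonthlyCharges": [70, 30, 55, 85, 25, 60, 90, 35, 50, 75, 20, 65],
})

# One list of charges per payment method, in a fixed (alphabetical) order
methods = sorted(customer["PaymentMethod"].unique())
groups = [
    customer.loc[customer["PaymentMethod"] == m, "MonthlyCharges"].tolist()
    for m in methods
]

fig, ax = plt.subplots()
ax.boxplot(groups)                       # boxes are placed at x = 1, 2, 3
ax.set_xticks(range(1, len(methods) + 1))
ax.set_xticklabels(methods)
ax.set_xlabel("Payment Method of Customer")
ax.set_ylabel("Monthly Charges")
```

Building the groups in a loop over the unique values avoids hard-coding the label strings, so the same code keeps working if the real column uses slightly different spellings.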
