Fundamentals of Data Visualization in Python
- Data visualization is the graphical or pictorial representation of data. It lets decision makers see analytics at a glance, grasp difficult concepts, and identify new patterns with ease.
Ex: K-means clustering. The cluster labels assigned by the algorithm are much easier to understand in a plot; it is not practical to make sense of the clusters just by reading the raw data.
The famous Anscombe's Quartet example shows the importance of data visualization: four datasets with nearly identical summary statistics look completely different once plotted. Python code for it is readily available.
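As a minimal sketch of the Anscombe's Quartet idea, the snippet below hard-codes the four classic datasets, prints their summary statistics, and draws each one in its own panel. The variable names and the 2x2 layout are choices made for this illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Summary statistics: mean of x, mean of y, and the x-y correlation.
stats = {}
for name, (x, y) in quartet.items():
    xa, ya = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    stats[name] = (xa.mean(), ya.mean(), np.corrcoef(xa, ya)[0, 1])
    mx, my, r = stats[name]
    print(f"{name}: mean x={mx:.2f}, mean y={my:.2f}, corr={r:.3f}")

# One scatter panel per dataset; shared axes make the shapes comparable.
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
    ax.scatter(x, y)
    ax.set_title(f"Dataset {name}")
fig.tight_layout()
```

The printed statistics are almost the same for all four datasets, while the panels show a linear trend, a curve, a line with one outlier, and a vertical cluster with one extreme point.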
Matplotlib in Python
- Open-source Python library for visualization.
- Creates graphs and plots from a Python script, provides an object-oriented API, and can save plots to the local file system:
plt.savefig('ScatterPlot.png')
- It has a module called pyplot, which offers simple functions for adding plot elements (lines, images, text, labels, etc.).
- Supports a wide range of graphs:
Line plot, Bar plot, Scatter plot, Histogram
Image plot, Box plot, Violin plot, Stream plot, Quiver plot, Area plot, Pie plot, and Donut plot
- Integrates easily with Pandas and NumPy.
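To make the object-oriented API and plot saving mentioned above concrete, here is a small sketch. The sine data and the temporary output folder are assumptions for the example; only the file name ScatterPlot.png comes from the notes above.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

# Object-oriented style: work with explicit Figure and Axes objects.
x = np.linspace(0, 10, 50)
fig, ax = plt.subplots()
ax.scatter(x, np.sin(x))
ax.set_title("Scatter Plot")

# Assumed output location: a temporary directory, so no existing file is overwritten.
out_path = os.path.join(tempfile.mkdtemp(), "ScatterPlot.png")
fig.savefig(out_path, dpi=150)
print("saved to", out_path)
```

fig.savefig does the same job as plt.savefig, but acts on the figure you name rather than on whichever figure is currently active.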
Line Plot
import numpy as np
import matplotlib.pyplot as plt
# allows the plot to be displayed below the cell in the notebook
%matplotlib inline

# defining the dataset
x = np.arange(0, 10, 0.1)
y = 3 * x + 5

# plotting the datapoints
plt.plot(x,y)
plt.show()
Customizing Line plots (Compare with basic line plot)
import numpy as np
import matplotlib.pyplot as plt
# allows the plot to be displayed below the cell in the notebook
%matplotlib inline

# defining the dataset
x = np.arange(0, 10, 1)
y = 3 * x + 5

# plotting the datapoints with custom width, style, color, transparency and marker
plt.plot(x, y, linewidth=2.0, linestyle=":", color='y', alpha=0.7, marker='o')
plt.title("Line Plot Demo")
plt.xlabel("X-Axis")
plt.ylabel("Y-Axis")
plt.legend(['line1'], loc='best')
plt.grid(True)
plt.show()
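When there is more than one line, passing a label to each plot call keeps the legend in sync with the lines. The sketch below assumes a second, made-up line (y = 2x + 1) purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 10, 1)

fig, ax = plt.subplots()
# Each line carries its own label, so the legend does not depend on call order.
ax.plot(x, 3 * x + 5, linestyle=":", marker="o", color="y", label="y = 3x + 5")
ax.plot(x, 2 * x + 1, linestyle="--", marker="s", color="b", label="y = 2x + 1")
ax.set_title("Line Plot Demo")
ax.set_xlabel("X-Axis")
ax.set_ylabel("Y-Axis")
legend = ax.legend(loc="best")
ax.grid(True)
```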
Figure Size
...
y=3*x+5
# changing the figure size (width, height in inches)
fig = plt.figure(figsize=(10, 5))
# plotting the datapoints
...
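Since the snippet above is abbreviated, here is a complete, hedged version of the same idea. It reuses the line data from the earlier example and checks the size that Matplotlib actually applied.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 10, 0.1)
y = 3 * x + 5

# figsize is (width, height) in inches.
fig = plt.figure(figsize=(10, 5))
plt.plot(x, y)

width, height = fig.get_size_inches()
print(f"figure is {width} x {height} inches at {fig.dpi} dpi")
```

The size in pixels is the size in inches multiplied by the dpi, so a larger figsize or a higher dpi both produce a bigger image.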
Subplots
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

x = np.arange(0, 10, 1)
y1 = 2 * x + 5
y2 = 3 * x + 10

# plt.subplot(nrows, ncols, index)
plt.subplot(2, 1, 1)  # A: two rows, one column, first plot
# B - plt.subplot(1, 2, 1)  # one row, two columns, first plot
plt.plot(x, y1)
plt.title('Graph1')

plt.subplot(2, 1, 2)  # A: second plot
# B - plt.subplot(1, 2, 2)
plt.plot(x, y2)
plt.title('Graph2')
plt.show()
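Layout B (side by side) can also be written with plt.subplots, which creates the figure and all axes in one call. This is a sketch of an alternative style, not a replacement for the plt.subplot calls above.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 10, 1)
y1 = 2 * x + 5
y2 = 3 * x + 10

# One row, two columns: the axes come back as a pair.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(x, y1)
ax1.set_title("Graph1")
ax2.plot(x, y2)
ax2.set_title("Graph2")
fig.tight_layout()  # keeps titles and tick labels from overlapping
```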
Bar plot
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

data = {'apples': 20, 'Mangoes': 15, 'lemon': 30, 'Oranges': 10}
names = list(data.keys())
values = list(data.values())

# vertical bars in the top slot of a 3-row layout
plt.subplot(3, 1, 1)
#fig = plt.figure(figsize=(10, 5))
plt.bar(names, values, color="orange")
plt.title("Bar Graph Demo")
plt.xlabel("Fruits")
plt.ylabel("Quantity")

# horizontal bars in the bottom slot; the axes swap roles
plt.subplot(3, 1, 3)
plt.barh(names, values, color="orange")
plt.title("Bar Graph Demo")
plt.xlabel("Quantity")
plt.ylabel("Fruits")
plt.show()
Scatter Plots
import matplotlib.pyplot as plt
%matplotlib inline

# dataset (note: a, y1 and y2 must be the same size)
a = [10, 20, 30, 40, 50, 60, 70, 80]
y1 = [2, 3, 5, 6, 1, 4, 5, 3]
y2 = [1, 2, 3, 4, 5, 5, 1, 3]

plt.scatter(a, y1)
plt.scatter(a,y2)
plt.show()
Customizing Scatter plots (compare with the basic scatter plot)
import matplotlib.pyplot as plt
%matplotlib inline

# dataset (note: a, y1 and y2 must be the same size)
a = [10, 20, 30, 40, 50, 60, 70, 80]
y1 = [2, 3, 5, 6, 1, 4, 5, 3]
y2 = [1, 2, 3, 4, 5, 5, 1, 3]

# c = color, s = marker size, alpha = transparency
plt.scatter(a, y1, c='g', s=300, edgecolors='y', marker='o', alpha=0.5)
# marker '3' is the tri_left marker
plt.scatter(a, y2, c='y', s=400, edgecolors='b', marker='3', alpha=1)
plt.legend(['y1','y2'],loc='best')
plt.xlabel("X-Axis")
plt.ylabel("Y-Axis")
plt.grid(True)
plt.show()
Histogram
- A histogram is a graphical display of data using bars of different heights. Each bar groups the numbers that fall into one range (bin), and a taller bar means more data falls in that range. A histogram shows the shape and spread of continuous sample data.
import matplotlib.pyplot as plt
%matplotlib inline

numbers = [10,90,12,16,19,12,20,26,28,30,38,35,34,45,60,68,64,62,70,78,75,79,85,94,95]
plt.hist(numbers,bins=[0,20,40,60,80,100], color='#FFF233',edgecolor='#000000')
plt.title("Histogram Demo")
plt.grid(True)
plt.xlabel("Range of values")
plt.ylabel("Freq of values")
plt.show()
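To see the numbers behind the bars, the same binning can be done with np.histogram. This sketch reuses the data and bin edges from the histogram above.

```python
import numpy as np

numbers = [10, 90, 12, 16, 19, 12, 20, 26, 28, 30, 38, 35, 34,
           45, 60, 68, 64, 62, 70, 78, 75, 79, 85, 94, 95]
bins = [0, 20, 40, 60, 80, 100]

# Every bin includes its left edge and excludes its right edge,
# except the last bin, which includes both edges.
counts, edges = np.histogram(numbers, bins=bins)

for low, high, count in zip(bins[:-1], bins[1:], counts.tolist()):
    print(f"{low:>3} to {high:<3}: {count}")
```

A value that sits exactly on an edge, such as 20 or 60, is counted in the bin that starts at that edge.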
Box Plot and Violin Plot
- A box plot summarizes the distribution of the data and makes it easy to read off the quartiles, the median, and any outliers.
- A violin plot is useful for large amounts of data, where showing each individual data point is not practical.
import matplotlib.pyplot as plt
%matplotlib inline

# data
total = [20, 4, 1, 30, 20, 12, 20, 70, 32, 10]
order = [10, 3, 2, 15, 17, 2, 30, 44, 2, 1]
discount = [30, 10, 20, 5, 10, 20, 50, 60, 20, 45]
data = [total, order, discount]
print(data)

# one box per list; showmeans adds a marker for the mean
plt.boxplot(data, showmeans=True)
plt.title("Box plot Demo")
plt.grid(True)
plt.show()
import matplotlib.pyplot as plt
%matplotlib inline

# data
total = [20, 4, 1, 30, 20, 12, 20, 70, 32, 10]
order = [10, 3, 2, 15, 17, 2, 30, 44, 2, 1]
discount = [30, 10, 20, 5, 10, 20, 50, 60, 20, 45]
data = [total, order, discount]
print(data)

# one violin per list, with the mean and median marked
plt.violinplot(data, showmeans=True, showmedians=True)
plt.title("Violin plot Demo")
plt.grid(True)
plt.show()
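The quartiles and outliers that the box plot draws can be checked by hand. The sketch below does this for the total list, assuming Matplotlib's default whisker rule of 1.5 times the interquartile range.

```python
import numpy as np

total = [20, 4, 1, 30, 20, 12, 20, 70, 32, 10]

# Quartiles: the bottom of the box, the median line, and the top of the box.
q1, median, q3 = np.percentile(total, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in total if v < lower_fence or v > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

Here the value 70 lies above the upper fence, which is why it shows up as a separate point in the first box.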
Pie Chart, Donut Chart
import matplotlib.pyplot as plt
%matplotlib inline

# prepare the dataset
label = ['Dog', 'Cat', 'Wolf', 'Lion']
sizes = [50, 45, 60, 80]

plt.pie(sizes, labels=label)
plt.title("Pie Chart Demo")
plt.show()
Customization
import matplotlib.pyplot as plt
%matplotlib inline

# prepare the dataset
label = ['Dog', 'Cat', 'Wolf', 'Lion']
sizes = [50, 45, 60, 80]

# add colors
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']

# autopct prints the percentage on each slice; explode pulls the 'Cat' slice out
plt.pie(sizes, labels=label, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90, explode=(0, 0.1, 0, 0))
plt.title("Pie Chart Demo")
plt.show()
#Donut plot
import matplotlib.pyplot as plt
%matplotlib inline

group_names = ["GroupA", "GroupB", "GroupC"]
group_size = [20, 30, 50]
size_centre = [5]

# colors
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']

# outer pie holds the data; a smaller white pie on top creates the hole
pie1 = plt.pie(group_size, labels=group_names, radius=1.5, colors=colors)
pie2 = plt.pie(size_centre, radius=1.0, colors=['w'])
plt.show()
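Another way to get a donut is to give the wedges a width, so no second white pie is needed. This is a sketch of that alternative; the width of 0.4 and the white edge color are arbitrary choices.

```python
import matplotlib.pyplot as plt

group_names = ["GroupA", "GroupB", "GroupC"]
group_size = [20, 30, 50]
colors = ['#ff9999', '#66b3ff', '#99ff99']

fig, ax = plt.subplots()
# width < radius leaves a hole in the middle; edgecolor separates the wedges.
ax.pie(group_size, labels=group_names, colors=colors,
       wedgeprops=dict(width=0.4, edgecolor="w"))
ax.set_title("Donut Chart Demo")
```

Because the hole is really empty here, the chart still looks right on a non-white background.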
Area Plots
- Similar to a line plot; the only difference is that the area under the line is colored.
import matplotlib.pyplot as plt
%matplotlib inline

# dataset
x = range(1, 17)
y = [1, 4, 6, 8, 4, 5, 3, 8, 8, 8, 4, 1, 5, 6, 8, 7]

plt.stackplot(x, y)
plt.show()
A few customizations
import matplotlib.pyplot as plt
%matplotlib inline

# dataset
x = range(1, 17)
y = [1, 4, 6, 8, 4, 5, 3, 8, 8, 8, 4, 1, 5, 6, 8, 7]

# colors expects a list with one color per series
plt.stackplot(x, y, colors=['green'], alpha=0.5)
plt.plot(x,y, color='g')
plt.grid(True)
plt.show()
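For a single series, fill_between gives the same picture and makes the "area under the line" idea explicit. This is a sketch of an alternative to stackplot that uses the same data as above.

```python
import matplotlib.pyplot as plt

x = list(range(1, 17))
y = [1, 4, 6, 8, 4, 5, 3, 8, 8, 8, 4, 1, 5, 6, 8, 7]

fig, ax = plt.subplots()
# Shade the region between y and the baseline at 0.
ax.fill_between(x, y, color="green", alpha=0.5)
# Draw the line itself on top of the shaded area.
ax.plot(x, y, color="g")
ax.grid(True)
```

stackplot becomes the better choice once several series have to be stacked on top of each other.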
More Examples using pandas
DataSet: https://www.dropbox.com/s/v3ux6vy7ajvltz0/Customerdata.csv?dl=0
1. Build a bar plot for the dataset. x-axis: Contract type, y-axis: count
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

customer = pd.read_csv(r'Customerdata.csv')
# count how many customers hold each contract type
grp = customer.Contract.value_counts()
x = grp.keys()
y = grp.values
print(type(grp.values))

plt.bar(x, y, color="orange")
plt.title("Distribution of Contract in dataset")
plt.xlabel("Contract Type of Customer")
plt.ylabel("count")
plt.show()
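If the CSV is not at hand, the same steps can be tried on a tiny made-up DataFrame. In the sketch below the contract values are invented stand-ins, and pandas' own plot helper is used in place of plt.bar.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for Customerdata.csv; only a Contract column is mimicked.
customer = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month",
                 "Two year", "Month-to-month", "One year"],
})

# value_counts returns a Series: index = category, value = number of rows.
grp = customer["Contract"].value_counts()
print(grp)

# Series.plot draws with Matplotlib and returns the Axes it used.
ax = grp.plot(kind="bar", color="orange", rot=0)
ax.set_title("Distribution of Contract in dataset")
ax.set_xlabel("Contract Type of Customer")
ax.set_ylabel("count")
```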
2. Build a Histogram. x-axis: Monthly Charges Incurred, y-axis: count
3. Build a scatter plot of TotalCharges (x-axis) vs. Tenure (y-axis).
NOTE: The kernel may hang while drawing this scatter plot. If it does, restart the kernel and plot a smaller subset of the data.
4. Build a box plot. x-axis: Payment Method of Customer, y-axis: Monthly Charges incurred. There are 3 ways of payment: Electronic Check, Mailed check and Bank transfer.
Try it out yourself :)
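Hint:
# filter the rows for each payment method (use the DataFrame name from read_csv
# and the exact spelling of the values found in the CSV)
a = customer[customer['PaymentMethod'] == 'Electronic Check']
b = customer[customer['PaymentMethod'] == 'Mailed Check']
c = customer[customer['PaymentMethod'] == 'Bank transfer']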
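One possible way to turn the hint into a box plot is sketched below on made-up data. The MonthlyCharges column name, the sample values, and the exact spelling of the payment methods are all assumptions, so adjust them to match the real CSV.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample; the real data comes from Customerdata.csv.
customer = pd.DataFrame({
    "PaymentMethod": ["Electronic Check", "Mailed Check", "Bank transfer",
                      "Electronic Check", "Mailed Check", "Bank transfer",
                      "Electronic Check", "Bank transfer"],
    "MonthlyCharges": [70.5, 20.0, 55.2, 89.9, 25.3, 60.1, 75.0, 49.8],
})

methods = ["Electronic Check", "Mailed Check", "Bank transfer"]
# One list of monthly charges per payment method, in the same order as `methods`.
data = [
    customer.loc[customer["PaymentMethod"] == m, "MonthlyCharges"].tolist()
    for m in methods
]

fig, ax = plt.subplots()
ax.boxplot(data)
# Boxes are drawn at positions 1..N, so label those ticks with the method names.
ax.set_xticks(range(1, len(methods) + 1))
ax.set_xticklabels(methods)
ax.set_xlabel("Payment Method of Customer")
ax.set_ylabel("Monthly Charges")
```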