Getting started with Data visualization using Matplotlib

SHRIKRISHNA MISHRA
Published in Analytics Vidhya
10 min read, Apr 13, 2020
Photo by Webaroo on Unsplash

Exploring and analyzing the data is one of the most important parts of any data science project. Exploring the data statistically gives you insights, but visualizing it can show you something that was obvious all along and that you still couldn’t see.

Matplotlib is a broad and robust plotting library for Python. In this article I’m only going to scratch the surface of it. This is more of a tutorial than an article.

This tutorial contains:

  1. Set up and download the data
  2. Line
  3. Histogram
  4. Bar
  5. Pie
  6. Scatter

Set up and download the data

The packages required for this tutorial are pandas, numpy, matplotlib and seaborn. You can install them from the command line using the pip command.

pip install matplotlib
pip install pandas
pip install numpy
pip install scikit-learn
pip install seaborn

Download the data from Here.

Open a Jupyter notebook and import the required packages.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  # seaborn styling makes the graphs look better; try without it and see the difference yourself
# magic command that shows the plots inside the notebook (a comment on the same line can break it)
%matplotlib inline

Line

data= np.random.randint(0,10,size=5)
plt.plot(data)
Simple line plot

A line is rendered with the plot() function.

  1. It accepts a single iterable as its only required parameter.
  2. The values are plotted along the y-axis.
  3. The x-axis values default to consecutive integers starting from zero.
  4. By default the color of the line is blue.
data= [50, 60, 39 ,56, 78,80, 48, 29, 89, 38]
labels=['A', 'B','C' ,'D', 'E', 'F', 'G','H', 'I','J']
plt.figure(figsize=(5,3))
plt.plot(data)
plt.xticks(ticks=np.arange(10), labels=labels)  # ticks: list or ndarray of positions; labels: list of strings
plt.xlabel("Characters", fontsize=16)
plt.ylabel("Values", fontsize=16);
plt.title("Line Plot", fontsize=20);
Output for above code snippet

Ticks on the x and y axes can be controlled with the xticks() and yticks() functions provided by the matplotlib.pyplot module. Both also accept labels, so we can customize the text shown at each tick. Note: the labels of the ticks are different from the axis labels set with xlabel() and ylabel().
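To make the difference between tick labels and axis labels concrete, here is a small sketch. The three values and the names fig, ax and line are made up for illustration, and it uses plt.subplots() instead of the plain plt functions only so that the results can be inspected afterwards.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 3))
(line,) = ax.plot([50, 60, 39])        # only the y-values are given

# x defaults to consecutive integers starting from zero
print(line.get_xdata())

ax.set_xticks([0, 1, 2])               # where the ticks sit
ax.set_xticklabels(['A', 'B', 'C'])    # what is written at each tick
ax.set_xlabel("Characters")            # the name of the whole axis

print(ax.get_xticks())
print(ax.get_xlabel())
```

The tick labels describe single positions on the axis, while xlabel names the axis as a whole.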

Color and style

styles = ['solid', 'dashed', 'dashdot', 'dotted', 'None']

for i, sty in zip(np.arange(10), styles):
    plt.plot(np.arange(10)+i, ls=sty, linewidth=i+5)
Output for above code snippet
  1. color: we can set the color of the line by assigning a value to the ‘color’ keyword or its short form ‘c’. Any hexadecimal color code is a valid value. There are 8 frequently used colors that we can mention directly by name: ‘red’, ‘blue’, ‘magenta’, ‘black’, ‘yellow’, ‘green’, ‘cyan’ and ‘white’. Instead of the name you can use its abbreviation, which is the first letter of the name, except for black, for which we use ‘k’ to avoid the ambiguity between ‘blue’ and ‘black’. Even if we don’t mention the color, matplotlib assigns a different color to each line in a graph according to a color cycle. We can change the color cycle or define our own custom cycle. Check out the documentation Here if you want to learn more about cycles.
  2. linestyle: matplotlib provides the named line styles ‘solid’, ‘dashed’, ‘dashdot’ and ‘dotted’, which are shown in the graph, plus ‘None’, which draws no line at all.
  3. linewidth: optional numeric argument (a float, in points) to customize the width of the line.
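The points about named colors, abbreviations and the color cycle can be checked with a short sketch. It assumes the default matplotlib settings; the variable names are made up, and to_hex() from matplotlib.colors is used only to print what each color specification resolves to.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex
import numpy as np

fig, ax = plt.subplots()

# No color given: each new line takes the next color of the current cycle
auto_lines = [ax.plot(np.arange(5) + i)[0] for i in range(3)]
auto_colors = [ln.get_color() for ln in auto_lines]
cycle = plt.rcParams['axes.prop_cycle'].by_key()['color']
print(auto_colors)
print(cycle[:3])

# Three ways to ask for the same color: full name, abbreviation, hex code
for spec in ('black', 'k', '#000000'):
    ax.plot(np.arange(5), color=spec)
    print(spec, '->', to_hex(spec))
```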
styles = [':','-.','-','--',' '] #linestyle can also be provided as abbreviations
for i, sty in zip(np.arange(10), styles):
    plt.plot(np.arange(10)+i, ls=sty, linewidth=i+5)
Output for the above code snippet
# There is one more method to control the line style 
plt.plot(np.arange(10), linestyle=(3, (1, 10, 5, 1)), linewidth=5);
plt.plot(np.arange(10)+1, linestyle=(3, (1, 2, 4, 1)));
Output for the above code snippet

This form of line style is called a dash tuple. It gives the programmer more refined control. We need to provide an offset and an on-off sequence with an even number of entries. For example, linestyle=(3, (1, 2, 4, 1)) means (offset, (1pt on, 2pt off, 4pt on, 1pt off)). An on-off sequence of (1, 1) gives a dotted line.
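As a rough sketch of how dash tuples can be put to use, the snippet below draws four lines from a small dictionary of patterns. The pattern names and values are made up for illustration; only the (offset, (on, off, ...)) form comes from the text above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
fig, ax = plt.subplots(figsize=(5, 3))

# (offset, (on, off, on, off, ...)), always an even number of on/off entries
patterns = {
    'dots': (0, (1, 1)),                    # 1 on, 1 off
    'long dashes': (0, (8, 3)),             # 8 on, 3 off
    'dash-dot': (0, (6, 2, 1, 2)),          # 6 on, 2 off, 1 on, 2 off
    'shifted dash-dot': (3, (6, 2, 1, 2)),  # same pattern with an offset of 3
}
for i, (name, pattern) in enumerate(patterns.items()):
    ax.plot(x + i, linestyle=pattern, linewidth=2, label=name)

ax.legend()
```

Keeping the patterns in one dictionary makes it easy to reuse the same style for the same kind of data across several plots.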

Practical example

iris_data = pd.read_csv('iris_with_cluster.csv')
plt.figure(figsize=(10,6))
plt.plot(iris_data['sepal_length'], linestyle="solid", linewidth=2, color='r', solid_joinstyle='miter', label="Sepal length")
plt.plot(iris_data['sepal_width'], linestyle="-.", linewidth= 2, color= "black", label="Sepal Width")
plt.plot(iris_data['petal_length'],linestyle='dashed', linewidth=1, color="green", dash_capstyle= "butt", label="Petal length")
plt.plot(iris_data['petal_width'], linestyle=(0,(1,1,1,1)), linewidth=2, color="magenta", label="Petal Width")
plt.legend(loc="upper left",ncol=2, fontsize=12)
plt.ylabel("In cm", fontsize=16)
plt.title("Iris data line plot", fontsize=20);

solid_joinstyle: controls how the segments of a solid line are joined at each data point. Available options are ‘miter’, ‘round’ and ‘bevel’ (cut corners).

dash_capstyle: controls how the ends of each dash are drawn on a dashed line. Available options are ‘butt’, ‘round’ and ‘projecting’.

There is much more to the line plot than this. As I said in the beginning, this is just the surface; if you want to learn more, go to the official documentation of matplotlib.

Histogram

A histogram is used to visualize the frequency distribution of the data. It’s the same thing we calculate in statistics, i.e. how many data points fall within each interval.

A histogram is generated with the hist() function:

  1. It accepts a dataset.
  2. It divides the range of the dataset into equal intervals (bins) and counts the values that fall into each one.
  3. It plots the intervals on the x-axis and the frequencies on the y-axis.
Output
data = np.random.randn(1000)
plt.hist(data)

By providing the bins parameter we can customize the number of intervals. A larger number of bins shows more detail.

Output
data = np.random.randn(1000)
plt.hist(data, bins=20)

Too many bins can distort the plot as well.
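hist() also returns the counts and the bin edges it computed, which is a handy way to see what the bins parameter does. The sketch below is illustrative only: the random data is seeded just to make it repeatable, and the variable names are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)          # seeded only to make the sketch repeatable
data = rng.standard_normal(1000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
counts10, edges10, _ = ax1.hist(data)            # default number of bins
counts20, edges20, _ = ax2.hist(data, bins=20)   # twice as many, narrower bins

print(len(counts10), len(edges10))      # n bins always need n + 1 edges
print(np.diff(edges20))                 # all intervals have the same width
print(counts10.sum(), counts20.sum())   # every point lands in exactly one bin
```

Whatever the number of bins, the counts always add up to the number of data points; only how they are split between the intervals changes.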

data= np.random.randn(10000)
plt.hist(data,bins=70,histtype='step',cumulative=True,label='cdf <')
plt.hist(data,bins=70,histtype='step',cumulative=-1,label='cdf >')
plt.legend(loc='upper center', fontsize=16)
Output

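By providing the cumulative parameter we can get the cumulative distribution of the data. It accepts a boolean or -1: True plots the ‘less than’ (empirical) graph, in which each bin also counts all the bins for smaller values, and -1 reverses the direction and plots the ‘greater than’ (reversed empirical) graph.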
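A toy check of what the two settings of cumulative return, using four made-up values chosen so that each one falls into its own bin:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt

data = [1, 2, 3, 4]          # one value per bin when bins=4
fig, ax = plt.subplots()

plain, _, _ = ax.hist(data, bins=4, histtype='step')
running_up, _, _ = ax.hist(data, bins=4, histtype='step', cumulative=True)
running_down, _, _ = ax.hist(data, bins=4, histtype='step', cumulative=-1)

print(plain)         # counts per bin
print(running_up)    # running total accumulated from the smallest bin upwards
print(running_down)  # running total accumulated from the largest bin downwards
```

With cumulative=True the last bin holds the total number of points; with cumulative=-1 the first bin does.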

histtype: By providing the histtype parameter we can control the bin style. Available options are ‘bar’ (the default), ‘barstacked’, ‘step’ and ‘stepfilled’.

You can provide the color argument in the same way as for the line plot.

Practical example

plt.figure(figsize= (10,6))
plt.hist(iris_data['sepal_length'], bins=25, histtype='step', label='sepal length')
plt.hist(iris_data['sepal_width'], bins=25, histtype='step', label='sepal width')
plt.hist(iris_data['petal_length'], bins=25, histtype='step', label='petal length')
plt.hist(iris_data['petal_width'], bins=25, histtype='step',label='petal width')
plt.legend(loc='upper right', fontsize=16)
plt.title("Iris data Histogram plot", fontsize=20)
plt.xlabel('Measure in cm', fontsize=16)
plt.ylabel('Frequency', fontsize=16)

Now we will move on to bar plots.

Bar plot

A bar plot is also called a column chart. It works the same way as a line chart, but instead of plotting a point it draws a bar with a height equal to the y-coordinate. It looks somewhat like a histogram, but it’s not.

It is plotted with the bar() function:
1. It takes two required arguments.
2. The first is the x-coordinate values. Mostly it’s np.arange() over the length of the data to be plotted.
3. The second is the y-coordinate values, i.e. the heights of the bars.
4. The taller the bar, the greater the value.

Output
plt.bar(np.arange(20), np.random.randint(0,40,20))
plt.xticks(ticks=np.arange(20));

Color and ticks all work the same as described above.

Output
labels=['A', 'B', 'C', 'D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T']
plt.barh(np.arange(20), np.random.randint(0,20,20))
plt.yticks(ticks=np.arange(20), labels=labels)

With the barh() function we can plot horizontal bars. Note: we don’t need to swap the passed parameters. The length of the bar represents the magnitude of the value. From here on I will show some more advanced bar plots with the help of a practical example.

Practical example

plt.figure(figsize=(14,6))
length= 40
sepal_length=iris_data['sepal_length'][:40]
sepal_width=iris_data['sepal_width'][:40]
petal_length = iris_data['petal_length'][:40]
petal_width=iris_data['petal_width'][:40]
plt.bar(np.arange(length), sepal_length,label='sepal length' )
plt.bar(np.arange(length), sepal_width, label='sepal width', bottom=sepal_length)
plt.bar(np.arange(length), petal_length, label='petal length', bottom=sepal_length+sepal_width)
plt.bar(np.arange(length), petal_width, label='petal width', bottom=sepal_length+sepal_width+petal_length)
plt.legend(loc='best', fontsize=14, ncol=4)
plt.xticks(ticks=np.arange(length))
plt.title('Stacked bar chart', fontsize=20)
plt.ylabel('Measure in cm', fontsize=16)
Output

A stacked bar chart is plotted with consecutive bar() calls in the same figure. We can use it to visualize multiple datasets and compare them. In a stacked bar chart we plot the values of one dataset on top of the other. The x and y coordinate values are the same as before. By providing the ‘bottom’ keyword argument you make sure that one plot doesn’t overshadow the one plotted before it: it tells matplotlib the y-coordinate from which each bar should start. ‘label’ acts the same as in the line or histogram plots.
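The effect of ‘bottom’ is easiest to see on a tiny example. The numbers and variable names below are made up and are not the iris data; the sketch only shows that every bar of the second call starts where the bar below it ends.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up numbers, not the iris data
first = np.array([3.0, 5.0, 2.0])
second = np.array([1.0, 2.0, 4.0])
x = np.arange(len(first))

fig, ax = plt.subplots()
lower = ax.bar(x, first, label='first')
upper = ax.bar(x, second, bottom=first, label='second')  # start where 'first' ends
ax.legend()

for low, up in zip(lower, upper):
    # height of the lower bar, start of the upper bar, top of the whole stack
    print(low.get_height(), up.get_y(), up.get_y() + up.get_height())
```

For a third layer, ‘bottom’ would be the sum of the first two datasets, exactly as in the iris example above.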

We can make the same bar plot horizontally. Just replace bar() with barh() and the bottom argument with left. Using this visualization we can compare the ranges of the values. Here the plot shows that the petal width values are the smallest. We may have to normalize them before using them to train a model.

data=iris_data.groupby('cluster').mean()
group_width=0.8
x=data.shape[0]
plt.figure(figsize=(10,6))
plt.bar(np.arange(x)- group_width/4, data['sepal_length'], width=group_width/4, label='Sepal length' )
plt.bar(np.arange(x)-group_width/2, data['sepal_width'], width=group_width/4, label='Sepal width')
plt.bar(np.arange(x) , data['petal_length'], width= group_width/4, label='Petal length')
plt.bar(np.arange(x)+group_width/4, data['petal_width'], width=group_width/4, label='Petal width')
plt.legend(loc='upper left', fontsize=14, ncol=2)
plt.ylabel('Mean length/Width(in cm)', fontsize=14)
plt.xlabel('Cluster', fontsize=14)
plt.title('Grouped bar chart', fontsize=18)
Output

A grouped bar chart is almost the same as a stacked one, but instead of showing the values on top of one another it shows them side by side. The x coordinates of each bar() call are different. bar() centers every bar on its x value by default, and each bar here is group_width/4 = 0.2 wide. In the first call, x = np.arange(x) - group_width/4 centers the sepal_length (blue) bar of the first group at 0 - 0.8/4 = -0.2, so it spans -0.3 to -0.1. In the second call, x = np.arange(x) - group_width/2 centers the sepal_width (orange) bar at 0 - 0.8/2 = -0.4, so it spans -0.5 to -0.3. The petal_length (green) bar spans -0.1 to 0.1 and the petal_width (red) bar spans 0.1 to 0.3. The next group’s orange bar is centered at 1 - 0.8/2 = 0.6 and spans 0.5 to 0.7. The space between two groups is 1 - group_width = 0.2. The width of each bar should be 1/4 of the group width so that the bars don’t overlap. Both bar charts show that petal width has the lowest range of values and sepal length has the highest, which is as expected, so our data looks correct. The ‘width’ argument controls the width of the bar.
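The position arithmetic can be wrapped in a small helper. This is only a sketch under stated assumptions: grouped_positions is a hypothetical function, the values are made up, and it passes align='edge' to bar() so that each x value is the left edge of its bar and every group is centered on its tick, which differs slightly from the code above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

def grouped_positions(n_groups, n_bars, group_width=0.8):
    """Left edges of every bar, one row per series (hypothetical helper)."""
    bar_width = group_width / n_bars
    centers = np.arange(n_groups)
    # series k starts k bar-widths to the right of the group's left edge
    lefts = np.array([centers - group_width / 2 + k * bar_width
                      for k in range(n_bars)])
    return lefts, bar_width

# Made-up values: 4 series (rows) in 2 groups (columns)
values = np.array([[5.0, 6.0], [3.4, 2.9], [1.5, 4.9], [0.2, 1.7]])
lefts, bar_width = grouped_positions(n_groups=2, n_bars=4)

fig, ax = plt.subplots()
for k, (left, heights) in enumerate(zip(lefts, values)):
    ax.bar(left, heights, width=bar_width, align='edge', label=f'series {k}')
ax.set_xticks(np.arange(2))
ax.legend()

print(np.round(lefts[:, 0], 2))   # left edges of the bars in the first group
```

With this layout the gap between two neighbouring groups is still 1 - group_width.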

Try to draw the same plots as horizontal bar charts. You just need to replace bar() with barh() and, in the stacked bar chart, the ‘bottom’ argument with the ‘left’ argument. The output should look like the diagram below.

Stacked bar chart (horizontal)
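If you get stuck on the exercise, here is a minimal sketch of the horizontal version. It uses made-up numbers standing in for two of the iris columns, so it shows the idea rather than reproducing the figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up numbers standing in for two of the iris columns
first = np.array([5.1, 4.9, 4.7])
second = np.array([3.5, 3.0, 3.2])
y = np.arange(len(first))

fig, ax = plt.subplots()
ax.barh(y, first, label='first')
stacked = ax.barh(y, second, left=first, label='second')  # 'left' plays the role of 'bottom'
ax.set_yticks(y)
ax.set_xlabel('Measure in cm')
ax.legend()
```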

Pie chart

  • It represents the parts of a whole.
    - It displays only a single dataset.
    - Each value is drawn as a wedge whose angle is its percentage of the 360 degrees of the circle.
    - Each wedge is shown in a different color.
    - The percent values add up to 100%.
    - The size of each wedge is proportional to its value’s share of the total of the values.
    - Each value is represented as a slice of the dataset.
Output
wedges=[38, 45, 56 ,78]
plt.pie(wedges);

A pie chart is plotted with pie(). Its one required argument is the dataset.
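The relation between the values, the percentages and the angles of the wedges is plain arithmetic. The sketch below works it out by hand for the same four values, without matplotlib; the names percents and angles are made up for illustration.

```python
wedges = [38, 45, 56, 78]
total = sum(wedges)

for value in wedges:
    share = value / total      # part of the whole, between 0 and 1
    percent = 100 * share      # the same share as a percentage
    angle = 360 * share        # angle of the wedge in degrees
    print(f"{value:>3}: {percent:6.2f}%  {angle:7.2f} degrees")

percents = [100 * v / total for v in wedges]
angles = [360 * v / total for v in wedges]
print(sum(percents), sum(angles))
```

Because only the shares matter, multiplying every value by the same factor would give exactly the same chart.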

labels=['Highly unsatisfied', 'unsatisfied','satisfied','Highly satisfied']
plt.pie(wedges, labels=labels, rotatelabels=True, labeldistance=1.2, autopct='%0.2f%%');
Output
  • ‘labels’ argument accepts a list. By default each label is drawn next to its wedge.
  • ‘rotatelabels’ argument accepts a bool. If True, each label is rotated to the angle of its wedge.
  • ‘labeldistance’ argument accepts a float value. It’s the distance of the labels from the center of the circle, relative to the radius (radius=1).
  • ‘autopct’ argument accepts a format string. It shows the values as percentages on the wedges inside the circle.

In this plot, if two wedges are not much different in size, it is hard to tell them apart with the naked eye. For this, matplotlib provides the ‘explode’ argument.

explosion = [0.3, 0, 0, 0]
plt.pie(wedges, labels= labels, autopct="%0.2f%%", explode=explosion, radius=1.2)
plt.savefig('pie3')
Output

‘explode’ argument accepts an array of floats, one per wedge. Each value is the fraction of the radius by which that wedge is pushed out from the center; a value of 0.0 does not explode the wedge. ‘radius’ argument also accepts a float value. It controls the radius of the circle.

Practical example

data= iris_data.groupby('cluster').count()
data['count'] = data['sepal_length']
wedges = data['count']
labels= ['cluster 0', 'cluster 1', 'cluster 2', 'cluster 3', 'cluster 4', 'cluster 5', 'cluster 6']
explosion = [0, 0.2, 0, 0,0, 0.2,0]
plt.figure(figsize=(5,6))
plt.pie(wedges, autopct="%0.2f%%", radius=1.5, explode=explosion, startangle=90)
plt.legend(labels, loc=[1.2,0])
plt.title('Percentage of flowers in different clusters',pad=30, fontsize=16)
  • ‘startangle’ argument provides the angle, measured counterclockwise from the x-axis, from which the first wedge is plotted.

You can see that the edge of the first wedge is now at a 90 degree angle, i.e. pointing straight up.

  • ‘pad’ argument in the title() function provides the padding, in points, between the title and the top of the axes. It applies to all the plots in the tutorial.
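To see what ‘startangle’ changes, the sketch below draws the same four equal wedges twice and looks at where the first wedge begins. The data and the variable names are made up; theta1 is the attribute of a wedge that holds its starting angle in degrees.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(6, 3))

default_wedges = ax1.pie([1, 1, 1, 1])[0]                 # first wedge starts on the x-axis
rotated_wedges = ax2.pie([1, 1, 1, 1], startangle=90)[0]  # first wedge starts straight up
ax2.set_title('startangle=90', pad=20)                    # pad moves the title away from the axes

# theta1 is the angle (in degrees) at which a wedge begins
print(default_wedges[0].theta1, rotated_wedges[0].theta1)
```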

Scatter plot

The scatter plot is one of the most used plots in matplotlib. It displays the individual points, and their order is insignificant. It shows the relationship between two variables. In a regression problem, we look at the scatter plot to decide which curve can be fitted.

Output

It is plotted with the scatter() function:

  • It takes two required arguments.
  • The first is the x-axis values.
  • The second is the y-axis values.
x= np.linspace(1, 50, 30)  # start at 1, because np.log(0) is -inf
y= np.log(x)
plt.scatter(x, y)
Output

This plot is the same as the one above even though we shuffled the points. It shows that the order is insignificant for a scatter plot.

import random
points = list(zip(x, y))
random.shuffle(points)
points = np.array(points)
plt.scatter(points[:,0],
points[:,1])

This graph is plotted with a small amount of noise. The logarithmic curve shows that it can be a good fit to the data points.

x= np.linspace(1, 50, 30)  # start at 1, because np.log(0) is -inf
y= np.log(x)+np.random.randn(30)*0.3
plt.scatter(x, y)
plt.plot(x, np.log(x), c= 'r')
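One possible way to back up the visual impression with numbers is to fit the curve. The sketch below is an illustration rather than part of the plots above: it assumes the model y = a*log(x) + b, starts x at 1 to avoid log(0), seeds the noise only for repeatability, and uses np.polyfit, which fits a straight line in log(x).

```python
import numpy as np

rng = np.random.default_rng(1)               # seeded only for repeatability
x = np.linspace(1, 50, 30)                   # starts at 1 because log(0) is -inf
y = np.log(x) + rng.standard_normal(30) * 0.3

# y = a*log(x) + b is a straight line in log(x), so a degree-1 fit is enough
a, b = np.polyfit(np.log(x), y, deg=1)
residuals = y - (a * np.log(x) + b)

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"spread of residuals = {residuals.std():.2f}")
```

If the logarithmic curve is a good fit, a should come out close to 1, b close to 0, and the spread of the residuals close to the noise level of 0.3.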

Practical example

iris_data['sepal_area'] = iris_data['sepal_length']*iris_data['sepal_width']
iris_data['petal_area'] = iris_data['petal_length']*iris_data['petal_width']
x= iris_data['sepal_area']
y= iris_data['petal_area']
plt.figure(figsize=(8,6))
plt.scatter(x,y, c=iris_data['cluster'], s=(iris_data['cluster']+1)*15, cmap='hsv')  # +1 so that cluster 0 does not get size 0
plt.xlabel('Sepal area', fontsize=14)
plt.ylabel('Petal area', fontsize=14)
plt.title('Iris data clusters', fontsize=20)

By providing the ‘s’ argument you can control the size of each point. The length of iris_data[‘cluster’] is the same as the number of points, so scatter() can give every point its own size. The size adds an extra dimension to the plot. The ‘c’ argument also takes an array of numbers, which gives a color to each point: the numbers are scaled and looked up in the colormap given by the ‘cmap’ argument. Other options for cmap that I use frequently are ‘rainbow’, ‘seismic’, ‘twilight’, ‘twilight_shifted’ and ‘plasma’. There are many more options available; check out the documentation.
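Here is a small self-contained sketch of ‘s’, ‘c’ and ‘cmap’ working together. The six points and their cluster numbers are made up, and the colorbar is added only to show which color each number was mapped to.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up points in three made-up clusters
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.5, 3.5, 3.0, 5.5, 5.0])
cluster = np.array([0, 0, 1, 1, 2, 2])

fig, ax = plt.subplots()
points = ax.scatter(x, y,
                    c=cluster,              # numbers, mapped to colors through cmap
                    s=(cluster + 1) * 30,   # one size per point; +1 keeps cluster 0 visible
                    cmap='plasma')
fig.colorbar(points, ax=ax, label='cluster')

print(points.get_sizes())       # the sizes, one per point
print(points.get_array())       # the numbers that were mapped to colors
print(points.get_cmap().name)   # the colormap in use
```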

I think you can now plot the basic plots with confidence. For more customized plots, check out the documentation of matplotlib; it’s amazing.
