Visualization using Python Matplotlib

What is data visualization?

Sam Yang
6 min read · Dec 17, 2021

A complete data analysis workflow includes data acquisition (or collection), data enhancement (or pre-processing), data analysis (i.e. feature detection or extraction), and visualization and rendering. Although it comes last, visualization is an important part of the workflow: it is not only a way to present results but also a tool for exploring and analyzing the data. The simplest form of data visualization is a chart. Because many chart types are available, it is important to choose the one that fits your data and the question you are asking. In this article, we use Python as an example to show how to generate various types of charts and how they differ.

Introducing Python Matplotlib

Python is probably one of the most popular and powerful tools for data analysis. The Matplotlib library is used to create both static and interactive 2D plots in Python. To start, install the library. There are several ways to do this; we show how to install it with the pip command, and other methods are easy to find with a quick web search.

pip install matplotlib

After installation, import the module before using it. The Python command for the import is:

import matplotlib.pyplot as plt

pyplot is Matplotlib's plotting module, and we give it the alias plt so that we can refer to it concisely in the Python code that follows.
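A quick way to confirm that the installation worked is to import the package and print its version. This is only a sanity check and not part of the original walkthrough; the variable name installed_version is illustrative.

```python
import matplotlib

# If this import succeeds without an ImportError, the package is installed.
installed_version = matplotlib.__version__
print("Matplotlib version:", installed_version)
```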

Line chart

Assume we have collected daily sales data for three weeks, as shown in the table below. First, we plot the data with a line chart, the simplest and most common chart type.

To label the x-axis ticks with weekdays instead of numbers, we pass my_xticks to plt.xticks in the following Python code. plt.plot calls the plot() function of the imported module plt.

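import matplotlib.pyplot as plt
my_xticks = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
# x positions 0..4, one per weekday
x = list(range(5))
sales1 = [50, 59, 65, 40, 35]
sales2 = [40, 32, 51, 30, 43]
sales3 = [40, 58, 60, 70, 45]
# draw the three sales series as three lines on the same axes
plt.plot(x, sales1, x, sales2, x, sales3)
# replace the numeric tick labels with weekday names
plt.xticks(x, my_xticks)
plt.show()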
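The same chart can also be built with axis labels and a legend so that each line is identifiable. The following is a minimal sketch using Matplotlib's object-oriented interface (plt.subplots); the series names, the dictionary weekly_sales, and the axis label text are assumptions added for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line to open a window
import matplotlib.pyplot as plt

weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri"]
x = list(range(len(weekdays)))
# Series names are illustrative; the numbers are the sales data used above.
weekly_sales = {
    "Sales 1": [50, 59, 65, 40, 35],
    "Sales 2": [40, 32, 51, 30, 43],
    "Sales 3": [40, 58, 60, 70, 45],
}

fig, ax = plt.subplots()
for name, values in weekly_sales.items():
    # label= is what ax.legend() later displays for this line
    ax.plot(x, values, marker="o", label=name)
ax.set_xticks(x)
ax.set_xticklabels(weekdays)
ax.set_xlabel("Weekday")
ax.set_ylabel("Sales")
ax.legend()
```

To view the result, remove the matplotlib.use("Agg") line and call plt.show(), or save the figure with fig.savefig("line_chart.png").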

Bar Chart

The line chart is good for showing the relationship between two variables, e.g. sales across weekdays. However, it is less effective for comparing the weeks against each other on a given day. A bar chart serves this purpose better. In the Python code below, we use the plt.bar function to plot the bar chart (whereas we used plt.plot for the line chart in the previous example). Note that the code also imports NumPy, another extremely useful Python module. We define x as a NumPy array so that we can add a scalar to every element at once (e.g. x+0.2), which a plain Python list does not support; this is how the three sets of bars are shifted to sit side by side.

import matplotlib.pyplot as plt
import numpy as np
my_xticks = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
x = np.arange(5)
sales1 = [50, 59, 65, 40, 35]
sales2 = [40, 32, 51, 30, 43]
sales3 = [40, 58, 60, 70, 45]
# shift each series by one bar width so the bars sit side by side
plt.bar(x-0.2, sales1, width=0.2)
plt.bar(x, sales2, width=0.2)
plt.bar(x+0.2, sales3, width=0.2)
plt.xticks(x, my_xticks)
plt.show()
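The reason for using a NumPy array for x can be checked directly. The sketch below shows that adding 0.2 to a plain list raises a TypeError, while the same operation on a NumPy array shifts every element. It also computes the bar offsets from the number of series instead of hard-coding -0.2, 0 and +0.2; the names n_series, offsets and bar_positions are illustrative.

```python
import numpy as np

positions_list = [0, 1, 2, 3, 4]
try:
    positions_list + 0.2            # a list cannot be added to a float
    list_addition_worked = True
except TypeError:
    list_addition_worked = False

x = np.arange(5)                    # array([0, 1, 2, 3, 4])
width = 0.2
n_series = 3
# Centre the group of bars on each tick: offsets are -0.2, 0.0, +0.2.
offsets = (np.arange(n_series) - (n_series - 1) / 2) * width
# Adding a scalar to the array shifts all five positions at once.
bar_positions = [x + offset for offset in offsets]

print(list_addition_worked)
print(bar_positions)
```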

Scatter Plot

A scatter plot is similar to a line plot except that it draws each value as an unconnected dot. Scatter plots are used in similar situations to line plots, i.e. to show the relationship between two variables such as sales and weekdays. They are sometimes called correlation plots because they show how two variables are correlated. Experienced users often use the size, shape or color of the dots to represent a third or even a fourth variable. In the Python code below, we use plt.scatter to draw the scatter plot. For each sales series, we pass a different value for s (the marker size) to plt.scatter. We also add a legend to distinguish the data sets Sales 1, Sales 2 and Sales 3 by calling plt.legend(['Sales 1', 'Sales 2', 'Sales 3']). For readability, we store the labels in a Python list called legends and pass it to plt.legend. The corresponding Python code and plot are shown below.

import matplotlib.pyplot as plt
import numpy as np
my_xticks = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
x = np.arange(5)
sales1 = [50, 59, 65, 40, 35]
sales2 = [40, 32, 51, 30, 43]
sales3 = [40, 58, 60, 70, 45]
legends = ['Sales 1', 'Sales 2', 'Sales 3']
# s sets the marker size, so each series gets a different dot size
plt.scatter(x, sales1, s=20)
plt.scatter(x, sales2, s=50)
plt.scatter(x, sales3, s=100)
plt.xticks(x, my_xticks)
# legend entries are matched to the series in the order they were plotted
plt.legend(legends)
plt.show()
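The idea of encoding a third variable in the dots can be sketched as follows. Here both the size (s) and the color (c) of each dot follow a per-day value. The visitors array is hypothetical data invented for this illustration and does not come from the sales table; the colormap name is also an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line to open a window
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(5)
sales1 = [50, 59, 65, 40, 35]
# Hypothetical third variable, made up for illustration only.
visitors = np.array([120, 150, 180, 90, 80])

fig, ax = plt.subplots()
# s and c accept one value per point, so each dot is sized and colored
# according to the third variable.
points = ax.scatter(x, sales1, s=visitors, c=visitors, cmap="viridis")
ax.set_xticks(x)
ax.set_xticklabels(["Mon", "Tue", "Wed", "Thu", "Fri"])
# The colorbar explains how color maps to the third variable.
fig.colorbar(points, ax=ax, label="Visitors (made-up data)")
```

As before, call plt.show() with an interactive backend or fig.savefig(...) to see the figure.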

Histogram Plot

Histogram plots are column charts in which each column represents a range of values. The height of a column corresponds to how many values fall in that range. To make a histogram, the data is sorted into bins and the number of data points in each bin is counted. Unlike the charts above, a histogram needs only a single variable: it shows how the values of that variable are distributed. We use the NumPy function randint to generate a random series of numbers and store them in the array x. The arguments 10, 90 and size=200 generate 200 random integers from 10 up to, but not including, 90. To view the distribution, we use the plt.hist function to plot the histogram. We can set the number of bins by varying the bins parameter of hist. For comparison, we choose bins=5 and bins=20, and plot them side by side using plt.subplot. The arguments 1,2,1 in the subplot call indicate 1 row and 2 columns of subplots, and the last 1 selects the first subplot as the current one. We also introduce the function plt.xlim to set the x-axis range from 0 to 100; note that it applies only to the current subplot, which in the code below is the second one.

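import matplotlib.pyplot as plt
import numpy as np
# 200 random integers in the range 10..89 (the upper bound 90 is excluded)
x = np.random.randint(10,90,size=200)
plt.subplot(1,2,1)
plt.hist(x, bins=5)
plt.subplot(1,2,2)
plt.hist(x,bins=20)
# applies to the current (second) subplot only
plt.xlim([0,100])
plt.show()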
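The binning and counting behind a histogram can be inspected without drawing anything by calling numpy.histogram, which returns the count per bin and the bin edges. The sketch below is an illustration under two assumptions: it uses a seeded generator (np.random.default_rng) instead of np.random.randint so that the run is repeatable, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)     # seeded so the run is repeatable
x = rng.integers(10, 90, size=200)      # 200 integers in [10, 90)

# np.histogram returns (counts per bin, bin edges).
counts5, edges5 = np.histogram(x, bins=5)
counts20, edges20 = np.histogram(x, bins=20)

print("5 bins :", counts5)
print("20 bins:", counts20)
# Every value lands in exactly one bin, so the counts always sum to 200,
# and n bins are delimited by n + 1 edges.
print(counts5.sum(), counts20.sum(), len(edges5), len(edges20))
```

With fewer bins each column aggregates more values, which smooths the picture but hides detail; with more bins the columns are noisier but show finer structure.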

Distribution Plot

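A distribution plot is also called a density plot. A density plot is a smoothed, continuous version of a histogram. matplotlib.pyplot has no dedicated function for it, so we use another module called seaborn. Older seaborn releases provided seaborn.distplot for this purpose; it has been deprecated since seaborn 0.11 in favor of seaborn.histplot and seaborn.kdeplot, so the code below uses seaborn.histplot with kde=True and stat='density', which draws the histogram together with a smoothed density curve. The Python code and plots are shown below.

import matplotlib.pyplot as plt
import numpy as np
import seaborn
x = np.random.randint(10,90,size=200)
plt.subplot(1,2,1)
# kde=True overlays the smoothed density curve on the histogram
seaborn.histplot(x, bins=5, kde=True, stat='density')
plt.subplot(1,2,2)
seaborn.histplot(x, bins=20, kde=True, stat='density')
plt.xlim([0,100])
plt.show()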
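To see what "smoothed version of a histogram" means, the density curve can be approximated by hand: place a small Gaussian bump on every observation and average the bumps. The following is a minimal NumPy sketch of that idea (a Gaussian kernel density estimate). The bandwidth of 5.0 and the evaluation grid are arbitrary assumptions; seaborn chooses its own bandwidth, so its curve will not match this one exactly.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.integers(10, 90, size=200)

bandwidth = 5.0                          # arbitrary smoothing width (assumption)
grid = np.linspace(-20, 120, 701)        # points at which the curve is evaluated
# Standardized distance from every grid point to every observation.
z = (grid[:, None] - x[None, :]) / bandwidth
# One Gaussian bump per observation, averaged over all observations.
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(x) * bandwidth * np.sqrt(2 * np.pi))

dx = grid[1] - grid[0]
area = density.sum() * dx                # a density should integrate to about 1
print("area under the curve:", area)
```

A larger bandwidth gives a smoother curve, much as fewer bins give a coarser histogram.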

Boxplot

Quartiles are the three cut points that divide sorted data into four parts, each containing roughly the same number of observations. For example, the third quartile marks the threshold for the top 25 percent of students in an exam. A box plot is a visual representation of the five-number summary: the minimum, the first quartile (Q1), the median (Q2), the third quartile (Q3) and the maximum. The box spans Q1 to Q3, with a line at the median. The whiskers are the two lines extending from the box toward the lowest and highest values; by default, Matplotlib ends each whisker at the most extreme data point that lies within 1.5 times the interquartile range (Q3 - Q1) of the box. Another use of the box plot is to identify outliers. An outlier is an observation that differs markedly from the rest of the data; points beyond the whiskers are drawn individually as outliers. We use the same kind of random data set as in the histogram and density plots above. matplotlib.pyplot provides the boxplot function for this purpose. The Python code is straightforward, and the resulting figure is also shown here.

import matplotlib.pyplot as plt
import numpy as np
x = np.random.randint(10,90,size=200)
plt.boxplot(x)
plt.show()
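The numbers a box plot summarizes can be computed directly with numpy.percentile. The sketch below uses a small made-up sample (not the random data above) that contains one obvious outlier, and applies the common 1.5 x IQR rule to flag it. The variable names are illustrative, and the quartile values depend on the interpolation method, so other tools may report slightly different quartiles for the same data.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])   # small made-up sample

# Quartiles with NumPy's default (linear) interpolation.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                                        # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
# Anything outside the fences is flagged as an outlier.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("min:", data.min(), "Q1:", q1, "median:", median, "Q3:", q3, "max:", data.max())
print("outliers:", outliers)
```

For this sample the quartiles are 3.25, 5.5 and 7.75, the upper fence is 14.5, and the only value flagged as an outlier is 100.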
