Exploratory Data Visualization Using Matplotlib

Data visualization is a vital part of the embedded data scientist’s toolbox. Although it is very easy to create a visualization, producing a good one is far more difficult.

Payal Kumari

Published in

Geek Culture

9 min read · Jun 2, 2022

What to expect

This article focuses on developing the skills required to start exploring your own data and creating effective visualizations. Data visualization can be done with various tools such as Tableau, Power BI, and Python. As mentioned in my previous article, data analytics means analyzing datasets in order to make decisions about the information they contain, and it helps improve the business by supporting the conclusions it needs to reach.

In this article, you will learn how to visualize data with the help of the matplotlib library, to further determine which products require extra attention in order to boost the overall sales volume of the organization.

Data is everywhere. You use and create data every day, but not all of it is correct. Every time you use your phone, look something up online, make a credit card purchase, or listen to music, you produce data. You rely on data to determine whether something is true or false, but you rarely see this data in its raw state, and interpreting rows and columns of numbers can be difficult. As a result, you usually rely on data visualization to reveal patterns and trends in data more readily.

What is data visualization and why is it important?

Data visualization is the translation of data into graphical representations like charts and graphs to communicate the data’s significance. However, while this method simplifies the process of understanding data, it can also be used to bend the truth and misrepresent information.

Matplotlib

Matplotlib is a low-level Python package used for data visualization. If you want to create complicated interactive visualizations for the web, it is probably not the best option, but it is straightforward to use for bar charts, line charts, and scatterplots. The library is built on NumPy arrays and includes numerous plot types such as line charts, bar charts, histograms, and so on. It offers a lot of flexibility at the expense of writing more code.

Let’s get started. First, you will use the pip command to install this module. If you do not have pip installed, refer to https://pip.pypa.io/en/stable/installation/. I will be using Matplotlib in a Jupyter notebook.

To install Matplotlib, type the command below in the terminal.

pip install matplotlib

Once the installation is complete, you can get started with Matplotlib in a Jupyter notebook and use it to create several graphs.
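Before plotting anything, it can be worth confirming that the installation worked. A minimal check, assuming Matplotlib was installed into the same Python environment that your notebook uses, is to import the package and print its version. The variable name installed_version is only for illustration.

```python
import matplotlib

# __version__ is a string; the exact value depends on your installation
installed_version = matplotlib.__version__
print("Matplotlib version:", installed_version)
```

If the import fails, the notebook is probably running in a different environment from the one where pip installed the package.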

#importing pyplot from matplotlib 
from matplotlib import pyplot as plt

Here, instead of importing pyplot under its full name, you import it and give it a shorter nickname, plt. In the code that follows, you refer to the module by this shorter name, plt, rather than by its full name, pyplot.

Pyplot is a Matplotlib module with a MATLAB-style interface. Matplotlib is intended to be as user-friendly as MATLAB, with the added benefit of being free and open-source. Each pyplot function alters a figure in some way, such as creating a figure, creating a plotting area in a figure, plotting lines in a plotting area, decorating the plot with labels, and so on. Pyplot supports the following plot types: Line Plot, Histogram, Scatter, 3D Plot, Image, Contour, and Polar.
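To make the idea that each pyplot function alters a figure more concrete, here is a small sketch. It assumes a non-interactive backend ("Agg") so that it runs without a display; in a Jupyter notebook you would normally leave that line out. The data values and the names fig, ax, line_count, and title_text are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit this in a notebook
from matplotlib import pyplot as plt

# Each pyplot call below acts on the "current" figure and plotting area.
plt.figure()                     # create a figure
plt.plot([1, 2, 3], [2, 4, 8])   # plot a line in the current plotting area
plt.xlabel("x")                  # decorate the plot with labels
plt.ylabel("y")
plt.title("pyplot tracks the current figure")

fig = plt.gcf()  # the figure pyplot has been modifying
ax = plt.gca()   # the plotting area (axes) inside that figure

line_count = len(ax.lines)   # number of lines drawn so far
title_text = ax.get_title()  # the title set by plt.title
print(line_count, title_text)

plt.close(fig)  # release the figure when done
```

Because pyplot keeps track of the current figure for you, short scripts like the ones in this article never need to name the figure explicitly.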

Now that you have a brief overview of Matplotlib and pyplot, let’s see how to create a simple plot.

# initialize the data as lists of items, separated by commas, inside square brackets
# x-axis values
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
# y-axis values
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color="green", marker="o", linestyle="solid")
# add a title
plt.title("Nominal GDP")
# add a label to the y-axis
plt.ylabel("Billions of $")
# show the plot
plt.show()

Line Chart

As you saw already, you can make line charts using plt.plot. These are a good choice for showing trends, as illustrated below.

variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]
# You can make multiple calls to plt.plot to show multiple series on the same chart
plt.plot(xs, variance, 'g-', label='variance')        # green solid line
plt.plot(xs, bias_squared, 'r-.', label='bias^2')     # red dot-dashed line
plt.plot(xs, total_error, 'b:', label='total_error')  # blue dotted line
# Because you have assigned labels to each series you can get a legend for free (loc=9 means "top center")
plt.legend(loc=9)
plt.xlabel("model complexity")
plt.xticks([])
plt.title("The Bias-Variance Tradeoff")
plt.show()

Line charts are preferable when you need to show small changes. They are also used to compare changes in more than one group during the same time period. The disadvantage of line charts is that they tend to lose clarity when there are too many data points.

Bar Chart

A bar chart is a good choice when you want to show how some quantity varies among some discrete set of items. For instance, the figure below shows how many Academy Awards were won by each of a variety of movies:

movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
num_oscars = [5, 11, 3, 8, 10]
# plot bars with x-coordinates [0, 1, 2, 3, 4], heights [num_oscars]
plt.bar(range(len(movies)), num_oscars)
# add a title
plt.title("My Favorite Movies")
# label the y-axis
plt.ylabel("# of Academy Awards")
# label x-axis with movie names at bar centers
plt.xticks(range(len(movies)), movies)
plt.show()

A bar chart can also be a good choice for plotting a histogram of bucketed numeric values as shown below:

from collections import Counter
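grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]
# Bucket grades by decile, but put 100 in with the 90s
histogram = Counter(min(grade // 10 * 10, 90) for grade in grades)
plt.bar([x + 5 for x in histogram.keys()],  # shift bars right by 5
        list(histogram.values()),           # give each bar its correct height
        10,                                 # give each bar a width of 10
        edgecolor=(0, 0, 0))                # black edges for each bar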
plt.axis([-5, 105, 0 ,5])
plt.xticks([10*i for i in range(11)])
plt.xlabel("Decile")
plt.ylabel("# of Students")
plt.title("Distribution of Exam 1 Grades")
plt.show()

The third argument to plt.bar specifies the bar width. Here we chose a width of 10, to fill the entire decile. We also shifted the bars right by 5, so that, for example, the “10” bar (which corresponds to the decile 10–20) would have its center at 15 and hence occupy the correct range. We also added a black border to each bar to make them visually distinct. The call to plt.axis indicates that we want the x-axis to range from –5 to 105 (just to leave a little space on the left and right), and that the y-axis should range from 0 to 5. And the call to plt.xticks puts x-axis labels at 0, 10, 20, …, 100.
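The bucketing and the shift by 5 can be checked without drawing anything. The sketch below reuses the grades from the example; the helper names decile_bucket and bar_span are hypothetical and exist only to illustrate the arithmetic.

```python
from collections import Counter

grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]

def decile_bucket(grade):
    """Round a grade down to its decile, putting 100 in with the 90s."""
    return min(grade // 10 * 10, 90)

def bar_span(bucket, width=10):
    """Return the (left, right) edges of a bar centered at bucket + width / 2."""
    center = bucket + width / 2
    return (center - width / 2, center + width / 2)

histogram = Counter(decile_bucket(grade) for grade in grades)

# Print each bucket, its student count, and the x-range its bar occupies
for bucket in sorted(histogram):
    print(bucket, histogram[bucket], bar_span(bucket))
```

With a width of 10 and a center at bucket + 5, each bar runs exactly from the start of its decile to the start of the next one, which is why the "10" bar covers 10 to 20.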

Be wise when using plt.axis. When creating bar charts it is considered bad form for your y-axis not to start at 0, since this is an easy way to mislead people.

mentions = [500, 505]
years = [2017, 2018]
plt.bar(years, mentions, 0.8)
plt.xticks(years)
plt.ylabel("# of times I heard someone say 'data science'")
# if you don't do this, matplotlib will label the x-axis 0, 1
# and then add an offset off in the corner (bad matplotlib!)
plt.ticklabel_format(useOffset=False)
# misleading y-axis only shows the part above 500
plt.axis([2016.5, 2018.5, 499, 506])
plt.title("Look at the 'Huge' Increase!")
plt.show()

In the figure below, you use more sensible axes, and the increase looks far less impressive:

# With more sensible axes, the increase looks far less impressive
mentions = [500, 505]
years = [2017, 2018]
plt.bar(years, mentions, 0.8)
plt.xticks(years)
plt.ylabel("# of times I heard someone say 'data science'")
# if you don't do this, matplotlib will label the x-axis 0, 1
# and then add an offset off in the corner (bad matplotlib!)
plt.ticklabel_format(useOffset=False)
# sensible y-axis starts at 0
plt.axis([2016.5, 2018.5, 0, 550])
plt.title("Not So Huge Anymore")
plt.show()

The same chart with a non-misleading y-axis
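One way to quantify how much a truncated y-axis exaggerates a difference is to compare the drawn heights of the two bars, measured from the bottom of the visible axis. The following is a rough sketch of that arithmetic; the function name apparent_ratio is hypothetical.

```python
def apparent_ratio(first, second, y_min):
    """Ratio of the drawn bar heights when the y-axis starts at y_min."""
    return (second - y_min) / (first - y_min)

mentions = [500, 505]

# y-axis starting at 499, as in the misleading chart
misleading = apparent_ratio(mentions[0], mentions[1], 499)

# y-axis starting at 0, as in the sensible chart
honest = apparent_ratio(mentions[0], mentions[1], 0)

print(misleading, honest)
```

With the axis starting at 499, the second bar is drawn six times as tall as the first, even though the underlying value grew by only 1%.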

The disadvantage of the bar chart is that it usually requires additional written or visual explanation, and it can easily be manipulated to give a false impression.

Scatterplots

A scatterplot is a good choice when you are looking to visualize the relationship between two paired datasets.

The example below illustrates the relationship between the number of friends your users have and the number of minutes they spend on the site every day:

friends = [70, 65, 72, 63, 71, 64, 69, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 't']
plt.scatter(friends, minutes)
# label each point
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
                 xy=(friend_count, minute_count),
                 xytext=(5, -5),
                 textcoords='offset points')
plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()

A scatterplot of friends and time on the site

Note: If you are scattering comparable variables, you might get a misleading picture if you let matplotlib choose the scale as shown below:

test_1_grades = [99, 90, 85, 97, 80]
test_2_grades = [100, 85, 60, 90, 70]
plt.scatter(test_1_grades, test_2_grades)
plt.title("Axes Aren't Comparable")
plt.xlabel("test 1 grades")
plt.ylabel("test 2 grades")
plt.show()

The plot below shows the variation more accurately, because the code now includes a call to plt.axis("equal").

test_1_grades = [99, 90, 85, 97, 80]
test_2_grades = [100, 85, 60, 90, 70]
plt.scatter(test_1_grades, test_2_grades)
plt.title("Axes Are Comparable")
plt.xlabel("test 1 grades")
plt.ylabel("test 2 grades")
# a call to plt.axis("equal") more accurately shows that most of the variation occurs on test 2
plt.axis("equal")
plt.show()
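To see why equal axes matter for this data, it may help to compare how widely each set of grades is spread. The sketch below uses the same two lists; the helper name spread is hypothetical.

```python
test_1_grades = [99, 90, 85, 97, 80]
test_2_grades = [100, 85, 60, 90, 70]

def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

test_1_spread = spread(test_1_grades)
test_2_spread = spread(test_2_grades)

print("test 1 spread:", test_1_spread)
print("test 2 spread:", test_2_spread)
```

When Matplotlib picks each axis scale independently, both spreads are stretched to fill the plot, so the larger variation on test 2 is easy to miss. Equal scaling draws one grade point at the same length on both axes.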

The disadvantage of the scatterplot is that it does not give you the exact extent of the correlation. It also does not provide a quantitative measure of the relationship between the two variables.
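If you need a number to go with the picture, one common option is the Pearson correlation coefficient, which is not part of the plotting code above. The following is a minimal pure-Python sketch applied to the friends and minutes data from the earlier example; the function name pearson_correlation is an assumption made for illustration.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sum of products of deviations from the means
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # sums of squared deviations for each variable
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)

friends = [70, 65, 72, 63, 71, 64, 69, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]

r = pearson_correlation(friends, minutes)
print("correlation between friends and minutes:", r)
```

The result always lies between -1 and 1, with values near 0 suggesting little linear relationship. It only captures linear relationships, so it complements the scatterplot rather than replacing it.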

Thank you for reading this story and see you again. Feel free to leave a comment if you have any thoughts, feedback, or suggestions about it!

References

Data Science from Scratch, 2nd Edition (www.oreilly.com)

Examples, Matplotlib 3.5.2 documentation (matplotlib.org)

Seaborn: statistical data visualization, seaborn 0.11.2 documentation (seaborn.pydata.org)

Altair: Declarative Visualization in Python, Altair 4.2.0 documentation (altair-viz.github.io)

Bokeh documentation (docs.bokeh.org)