Exploratory Data Visualization Using Matplotlib

Data visualization is a vital part of the embedded data scientist’s toolbox. Although it is very easy to create a visualization, producing a good one is far more difficult.

Payal Kumari
Geek Culture
9 min read · Jun 2, 2022

What to expect

This article focuses on developing the skills you need to start exploring your own data and creating effective visualizations. Data visualization can be done with various tools, such as Tableau, Power BI, and Python. As mentioned in my previous article, data analytics means analyzing datasets to draw conclusions from the information they contain, which helps a business make better-informed decisions.

In this article, you will learn how to visualize data with the Matplotlib library so that you can determine which products require extra attention in order to boost the overall sales volume of the organization.

Data is everywhere. You use and create data every day, but not all of it is correct. Every time you use your phone, look something up online, make a credit card purchase, or listen to music, you produce data. You rely on data to determine whether something is true or false, but you rarely see this data in its raw state, and interpreting rows and columns of numbers can be difficult. As a result, you usually use a method called data visualization to illustrate patterns and trends in data more readily.

What is data visualization, and why is it important?

Data visualization is the translation of data into graphical representations like charts and graphs to communicate the data’s significance. However, while this method simplifies the process of understanding data, it can also be used to bend the truth and misrepresent information.

Matplotlib

Matplotlib is a low-level Python package used for data visualization. If you want to create complicated interactive visualizations for the web, it is probably not the best option, but it is straightforward to use for bar charts, line charts, and scatterplots. The library is built on NumPy arrays and includes numerous plot types such as line charts, bar charts, and histograms. It offers a lot of flexibility at the expense of writing more code.

Let’s get started. First, you will use the pip command to install this module. If you do not have pip installed, refer to the installation guide at https://pip.pypa.io/en/stable/installation/. I will be using Matplotlib in a Jupyter notebook.

To install Matplotlib, type the following command in the terminal:

pip install matplotlib

After the installation is complete, let’s get started with Matplotlib and Jupyter Notebook. Using Matplotlib, you will create several graphs in Jupyter Notebook.

Rather than importing matplotlib.pyplot and typing its full name every time, you can import the module under a short nickname such as plt. In the code that follows, you then refer to the module by the shorter name, plt, instead of by its entire name.
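The aliased import described above looks like this. The version print is only an optional sanity check that the installation worked; it is an addition for illustration.

```python
# pyplot lives inside the matplotlib package, so its full name is
# matplotlib.pyplot; "plt" is the conventional short alias.
import matplotlib
import matplotlib.pyplot as plt

# Optional sanity check that the installation worked.
print(matplotlib.__version__)
```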

Pyplot is a Matplotlib module with a MATLAB-style interface. Matplotlib is intended to be as user-friendly as MATLAB, but with the added benefit of being free and open-source. Each pyplot function alters a figure in some way, such as creating a figure, creating a plotting area in a figure, plotting lines in a plotting area, or decorating the plot with labels. Pyplot supports the following plot types: Line Plot, Histogram, Scatter, 3D Plot, Image, Contour, and Polar.

Now that you have a brief overview of Matplotlib and pyplot, let’s see how to create a simple plot.
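To make the idea that each pyplot function alters a figure concrete, here is a minimal sketch. The numbers and label text are made up; the point is only that every call acts on the same "current" figure and plotting area.

```python
import matplotlib.pyplot as plt

plt.figure()                      # create a new figure; it becomes the current one
plt.plot([1, 2, 3], [2, 4, 8])    # draw a line in the current plotting area (axes)
plt.xlabel("x")                   # decorate the same axes with labels
plt.ylabel("y")
plt.title("Each call updates the current figure")

ax = plt.gca()                    # handle to the axes that the calls above modified
print(ax.get_title(), len(ax.lines))
# In a plain script (outside Jupyter), add plt.show() to open the figure window.
```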
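A simple plot could be sketched as follows. The monthly sales figures are invented for illustration and are not the data behind the original figure.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly unit sales, invented for illustration.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
units_sold = [120, 135, 128, 150, 170, 165]

plt.figure()
# plt.plot returns a list of line objects; unpack the single line.
line, = plt.plot(months, units_sold, color="green", marker="o", linestyle="solid")
plt.title("Monthly unit sales (made-up data)")
plt.ylabel("Units sold")
# In a plain script (outside Jupyter), add plt.show() to open the figure window.
```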

A simple line chart

Line Chart

As you saw already, you can make line charts using plt.plot. These are a good choice for showing trends, as illustrated below.
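Several lines can share one chart by calling plt.plot repeatedly. The sketch below assumes three hypothetical products with made-up sales numbers; the label argument of each call is what plt.legend uses to build the legend.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly unit sales for three products, invented for illustration.
months = [1, 2, 3, 4, 5, 6]
product_a = [120, 135, 128, 150, 170, 165]
product_b = [90, 95, 110, 105, 120, 140]
product_c = [60, 58, 65, 61, 59, 63]

plt.figure()
# Each plt.plot call adds one line to the same axes.
plt.plot(months, product_a, "g-", label="Product A")    # green solid line
plt.plot(months, product_b, "r-.", label="Product B")   # red dot-dashed line
plt.plot(months, product_c, "b:", label="Product C")    # blue dotted line

plt.legend(loc="upper left")   # build the legend from the labels above
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.title("Monthly unit sales by product (made-up data)")
# In a plain script (outside Jupyter), add plt.show() to open the figure window.
```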

Several line charts with a legend

Line charts are preferable when the changes in the data are small. They are also used to compare changes in more than one group over the same time period. The disadvantage of line charts is that they tend to lose clarity when there are too many data points, as shown above.

Bar Chart
A bar chart is a good choice when you want to show how some quantity varies among a discrete set of items. For instance, the figure below shows how many Academy Awards were won by each of a variety of movies:
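A bar chart of this kind could be sketched as follows. The titles and award counts are placeholders invented for illustration, not real figures.

```python
import matplotlib.pyplot as plt

# Hypothetical titles and award counts, invented for illustration.
movies = ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"]
num_awards = [5, 11, 3, 8, 10]

plt.figure()
# One bar per movie: x positions 0..4, heights are the award counts.
bars = plt.bar(range(len(movies)), num_awards)
# Label each bar with its movie title at the bar centers.
plt.xticks(range(len(movies)), movies)
plt.ylabel("Number of awards")
plt.title("Awards per movie (made-up data)")
# In a plain script (outside Jupyter), add plt.show() to open the figure window.
```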

A simple bar chart

A bar chart can also be a good choice for plotting a histogram of bucketed numeric values as shown below:
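One way to build such a histogram with plt.bar is sketched below. The grades are made up, and bucketing with collections.Counter is an assumption about how the counts could be produced.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical exam grades, invented for illustration.
grades = [88, 92, 79, 85, 61, 5, 84, 81, 100, 68, 74, 76, 12, 93]

# Bucket each grade into its decile (0, 10, ..., 90); a grade of 100 joins the 90s.
histogram = Counter(min(grade // 10 * 10, 90) for grade in grades)

plt.figure()
bars = plt.bar([decile + 5 for decile in histogram.keys()],  # bar centers: decile + 5
               list(histogram.values()),                      # bar heights: counts
               10,                                            # bar width: one decile
               edgecolor="black")                             # black border per bar

plt.axis([-5, 105, 0, 5])                  # x from -5 to 105, y from 0 to 5
plt.xticks([10 * i for i in range(11)])    # tick labels at 0, 10, ..., 100
plt.xlabel("Decile")
plt.ylabel("Number of students")
plt.title("Distribution of exam grades (made-up data)")
# In a plain script (outside Jupyter), add plt.show() to open the figure window.
```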

Using a bar chart for a histogram

The third argument to plt.bar specifies the bar width. Here we chose a width of 10, to fill the entire decile. We also shifted the bars right by 5, so that, for example, the “10” bar (which corresponds to the decile 10–20) would have its center at 15 and hence occupy the correct range. We also added a black border to each bar to make them visually distinct. The call to plt.axis indicates that we want the x-axis to range from –5 to 105 (just to leave a little space on the left and right), and that the y-axis should range from 0 to 5. And the call to plt.xticks puts x-axis labels at 0, 10, 20, …, 100.

Be wise when using plt.axis. When creating bar charts it is considered bad form for your y-axis not to start at 0, since this is an easy way to mislead people.
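The effect is easy to reproduce. The sketch below uses made-up counts that differ by about 1% and draws them twice: once with a truncated y-axis and once with a y-axis that starts at 0.

```python
import matplotlib.pyplot as plt

# Hypothetical yearly counts that differ by about 1%, invented for illustration.
years = ["2020", "2021"]
mentions = [1000, 1012]

plt.figure(figsize=(9, 4))

# Left panel: the y-axis starts at 998, so a ~1% increase looks huge.
plt.subplot(1, 2, 1)
plt.bar(years, mentions)
plt.axis([-0.5, 1.5, 998, 1014])   # [xmin, xmax, ymin, ymax]
plt.title("y-axis starts at 998")
truncated_ylim = plt.gca().get_ylim()

# Right panel: the y-axis starts at 0, so the bars look almost equal.
plt.subplot(1, 2, 2)
plt.bar(years, mentions)
plt.axis([-0.5, 1.5, 0, 1100])
plt.title("y-axis starts at 0")
zero_based_ylim = plt.gca().get_ylim()

plt.tight_layout()
# In a plain script (outside Jupyter), add plt.show() to open the figure window.
```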

A chart with a misleading y-axis

In the figure below, you use more sensible axes, and the difference looks far less impressive:

The same chart with a non-misleading y-axis

The disadvantage of the bar chart is that it usually requires additional explanation, in written or visual form, and that it can easily be manipulated to give a false impression.

Scatterplots

A scatterplot is a good choice when you are looking to visualize the relationship between two paired datasets.

The example below illustrates the relationship between the number of friends your users have and the number of minutes they spend on the site every day:
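A scatterplot along these lines could be sketched as follows. The friend counts, minutes, and user labels are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical users: friend counts and daily minutes on the site (made up).
friends = [12, 35, 48, 20, 55, 41, 8, 27, 60]
minutes = [40, 95, 130, 60, 160, 110, 25, 80, 150]
labels = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]

plt.figure()
points = plt.scatter(friends, minutes)

# Tag each point with its user label, offset slightly so the text
# does not sit on top of the marker.
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
                 xy=(friend_count, minute_count),
                 xytext=(5, -5),
                 textcoords="offset points")

plt.title("Daily minutes vs. number of friends (made-up data)")
plt.xlabel("Number of friends")
plt.ylabel("Daily minutes spent on the site")
# In a plain script (outside Jupyter), add plt.show() to open the figure window.
```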

A scatterplot of friends and time on the site

Note: If you are scattering comparable variables, you might get a misleading picture if you let matplotlib choose the scale as shown below:

A scatterplot with uncomparable axes

The plot below shows the variation more accurately, after one line is added to the code so that both axes use the same scale.
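One way to give both axes the same scale is plt.axis("equal"); treating it as the added line is an assumption, and the test scores below are made up. Removing that call lets Matplotlib choose each axis scale independently again.

```python
import matplotlib.pyplot as plt

# Hypothetical scores on two tests for the same five students (made up).
test_1_scores = [96, 88, 79, 93, 72]
test_2_scores = [94, 80, 58, 86, 65]

plt.figure()
points = plt.scatter(test_1_scores, test_2_scores)
plt.axis("equal")   # one unit on x spans the same length on screen as one unit on y
plt.title("Test scores with equally scaled axes (made-up data)")
plt.xlabel("Test 1 score")
plt.ylabel("Test 2 score")
# In a plain script (outside Jupyter), add plt.show() to open the figure window.
```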

The same scatterplot with equal axes

The disadvantage of the scatterplot is that it does not give you the exact extent of the correlation. It also does not provide a quantitative measure of the relationship between the two variables.
