A simple guide to Data visualization with Python

Part 1: Basic Plots and their Customisations

Simarpreet Singh
Analytics Vidhya
8 min read · Sep 20, 2019


Any data science project involves a series of steps for extracting meaningful information from data in order to provide business value. The roles in a data science project team that are responsible for these steps are depicted in the image below. Visualization means representing data in graphical form, and it is an important part of any data science project.

Python has two popular libraries for creating visualizations: Matplotlib and Seaborn. In this post, I will discuss some basic plots that can be created using the Matplotlib library, and at the end I will cover some graph customization techniques using Matplotlib.

So let’s get started …

Before we begin writing any Python code, we will have to make some necessary imports…
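The import block itself is not reproduced here. A minimal sketch, assuming the conventional pd and plt aliases, could look like this:

```python
# Assumed imports for this post: pandas for loading the data,
# Matplotlib's pyplot interface for plotting.
import pandas as pd
import matplotlib.pyplot as plt
```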

Since we have our imports now, we are all set to import the dataset in Python.
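As a rough sketch of the loading step, the snippet below reads a tiny in-memory CSV so that it runs on its own. The column names and values are assumptions made for illustration, not the real dataset.

```python
import io

import pandas as pd

# Stand-in for a real file: a tiny in-memory CSV with hypothetical columns.
csv_text = io.StringIO(
    "reviews_per_month,number_of_reviews,minimum_nights\n"
    "0.21,9,1\n"
    "0.38,45,1\n"
    "4.64,270,3\n"
    "0.10,1,10\n"
    "0.59,74,3\n"
    "1.33,49,2\n"
)

# With a file on disk, you would pass the path to your own .csv file instead.
data = pd.read_csv(csv_text)
```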

Here, read_csv is a pandas function that reads a .csv file, and data is the pandas DataFrame in which the file’s contents are stored.

Let’s now check what our dataset actually contains. To do this, let’s print the first 5 rows of the dataset using the DataFrame’s head method:
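A small sketch of that call, using a hypothetical six-row DataFrame in place of the real data (the column names are assumed):

```python
import pandas as pd

# Hypothetical stand-in for the dataset; column names are assumptions.
data = pd.DataFrame({
    "reviews_per_month": [0.21, 0.38, 4.64, 0.10, 0.59, 1.33],
    "number_of_reviews": [9, 45, 270, 1, 74, 49],
    "minimum_nights": [1, 1, 3, 10, 3, 2],
})

# head() returns the first 5 rows by default.
first_rows = data.head()
print(first_rows)
```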

This is a predominantly clean dataset. It might still need some preprocessing, but that’s not the focus of this post. So let’s now begin to visualize some of this data using some of the plots in Python’s Matplotlib library.

Scatterplot

A scatterplot shows data as a collection of points and is one of the plots used to detect outliers, i.e. abnormal observations in the data. It is up to the person visualizing the data to define what they consider an abnormal observation.
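A minimal sketch of a scatter plot drawn with Matplotlib’s scatter function. The DataFrame and its column names are hypothetical stand-ins for the dataset, and the choice of which column goes on which axis is an assumption.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the dataset; column names are assumptions.
data = pd.DataFrame({
    "reviews_per_month": [0.21, 0.38, 4.64, 0.10, 0.59, 1.33],
    "number_of_reviews": [9, 45, 270, 1, 74, 49],
})

# x and y name columns of the DataFrame passed through the data parameter.
points = plt.scatter(x="reviews_per_month", y="number_of_reviews", data=data)
```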

Let’s have a look at the parameters of Matplotlib’s scatter function:

x: the variable/dimension to be plotted on the x axis

y: the variable/dimension to be plotted on the y axis

data: the pandas DataFrame into which the dataset was imported

There are numerous other parameters in the scatter function that can be used to customize the plot. But before we move on to the next plot, let us set the labels for the x and y axes. Matplotlib’s xlabel and ylabel functions let us do that.
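The labelling step can be sketched as follows, again on a hypothetical stand-in DataFrame with assumed column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the dataset; column names are assumptions.
data = pd.DataFrame({
    "reviews_per_month": [0.21, 0.38, 4.64, 0.10, 0.59, 1.33],
    "number_of_reviews": [9, 45, 270, 1, 74, 49],
})

plt.scatter(x="reviews_per_month", y="number_of_reviews", data=data)

# Label the axes of the current plot.
plt.xlabel("Reviews/Month")
plt.ylabel("Total Number of Reviews")
```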

The above visualization gives us a scatter plot between Reviews/Month and Total Number of Reviews.

Now, we will move on to our next plot: the histogram.

Histogram

A histogram is used to visualize how frequently the values of a variable occur in the data. It is another plot that is commonly used to detect outliers.

First, we will change the range of the x and y axes so that we can see the plot better. A few lines of code allow us to change the axis ranges and plot the histogram.

The xlim function takes the lower and upper x values to set as the range of the x axis and, similarly, the ylim function takes the lower and upper y values to set as the range of the y axis.

Matplotlib’s hist function allows us to create a histogram, and it offers various options to customize the plot. Again, as above, we set the labels for the x and y axes.
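The steps above can be sketched as follows. The minimum_nights column, its values, the axis ranges and the "Frequency" label are all assumptions chosen for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the dataset; the column name is an assumption.
data = pd.DataFrame({"minimum_nights": [1, 1, 2, 2, 2, 3, 3, 5, 7, 30]})

# Fix the visible range of each axis before plotting.
plt.xlim(0, 40)
plt.ylim(0, 5)

# hist returns the bin counts, the bin edges and the drawn bars.
counts, bin_edges, patches = plt.hist(data["minimum_nights"])

plt.xlabel("Minimum Nights")
plt.ylabel("Frequency")
```

Setting the limits by hand switches off automatic rescaling for that axis, which is why the ranges survive the later call to hist.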

This histogram provides us with the frequencies of the variable Minimum Nights.

Bar Graph

A bar graph is used to represent categorical data using rectangular bars. The height of each bar is proportional to the value it represents.

Below is some Python code to create a bar graph.
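A minimal sketch of such a bar graph. The list names Persons and Heights follow the text; the names and heights stored in them, and the axis labels, are made up for illustration.

```python
import matplotlib.pyplot as plt

# Illustrative values only; the actual names and heights are assumptions.
Persons = ["Person A", "Person B", "Person C"]
Heights = [165, 172, 180]

# One bar per person, with the bar height taken from Heights.
bars = plt.bar(Persons, Heights)

plt.xlabel("Person")
plt.ylabel("Height")
```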

Let’s review the above code:

Persons: a Python list consisting of the names of three different persons.

Heights: again a Python list, consisting of the numerical heights of the persons, which determine the heights of the rectangular bars.

The bar function allows us to create bar graphs with Matplotlib.

Line Graph

A line graph displays data as a series of points connected by line segments. We can plot any two numeric variables to create a line chart. Here is some Python code for creating a line chart.
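A short sketch of a line chart. The names xcoord and ycoord follow the text, while the numbers in them are illustrative assumptions.

```python
import matplotlib.pyplot as plt

# Illustrative coordinates; the actual values are assumptions.
xcoord = [1, 2, 3, 4, 5]
ycoord = [1, 4, 9, 16, 25]

# plot connects consecutive (x, y) points with line segments.
lines = plt.plot(xcoord, ycoord)
```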

xcoord and ycoord are both numerical Python lists. The plot function is used to create line charts.

Box Plot/Whisker Plot

The most commonly used plot for detecting outliers is the box plot, also called a whisker plot. It is used to plot groups of numerical data and is built around three quartiles: Q1 (lower quartile, below which 25% of the data lies), Q2 (median, below which 50% of the data lies) and Q3 (upper quartile, below which 75% of the data lies).

It’s called a ‘box plot’ because the three quartiles are drawn as a box: the box spans Q1 to Q3, and a line inside it marks the median. The two lines extending from the ends of the box are called whiskers, hence the name ‘whisker plot’.

The points that lie beyond the two whiskers are considered ‘outliers’. A few lines of code will create a box plot.
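A minimal sketch of a box plot. The list of values is an assumption; only the extreme value 10000 is taken from the text.

```python
import matplotlib.pyplot as plt

# Illustrative data; only the extreme value 10000 comes from the text.
values = [10, 20, 30, 40, 50, 60, 70, 10000]

# boxplot returns a dict of the drawn artists; "fliers" holds the outlier points.
result = plt.boxplot(values)
outliers = list(result["fliers"][0].get_ydata())
```

By default Matplotlib extends each whisker to the furthest data point within 1.5 times the interquartile range of the box, so 10000 ends up beyond the upper whisker.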

The function to create box plots in Matplotlib is boxplot, and here we have passed a Python list to it. As can be seen from the above plot, the data point with the value 10000 is considered an outlier.

Some Customization Techniques

There are numerous ways in which we can customize our plots. I will discuss a few of them in this section.

Suppose we want to draw a scatter plot as well as a histogram, but not on the same set of axes. The code below helps us achieve that.
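One way this could look with Matplotlib’s axes function is sketched below. The DataFrame, its column names and the position values are assumptions chosen so that the two plots sit side by side.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the dataset; column names are assumptions.
data = pd.DataFrame({
    "reviews_per_month": [0.21, 0.38, 4.64, 0.10, 0.59, 1.33],
    "number_of_reviews": [9, 45, 270, 1, 74, 49],
    "minimum_nights": [1, 1, 3, 10, 3, 2],
})

# [left, bottom, width, height], each as a fraction of the figure size.
ax_scatter = plt.axes([0.05, 0.10, 0.40, 0.80])
plt.scatter(x="reviews_per_month", y="number_of_reviews", data=data)

# A second set of axes in the right half of the same figure.
ax_hist = plt.axes([0.55, 0.10, 0.40, 0.80])
plt.hist(data["minimum_nights"])
```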

Now that we have our desired output, let’s review the code above.

The axes function takes a Python list with the following four parameters:

1st parameter: x position of the lower-left corner of the axes, as a fraction of the figure width.

2nd parameter: y position of the lower-left corner of the axes, as a fraction of the figure height.

3rd parameter: width of the axes.

4th parameter: height of the axes.

The rest of the functions used are as discussed in the sections above.

Another function, plt.axis((min x value, max x value, min y value, max y value)), takes a Python tuple and sets the minimum and maximum of both the x and y axes in one call.
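A small sketch of plt.axis in use. The plotted points and the chosen limits are illustrative assumptions.

```python
import matplotlib.pyplot as plt

# Illustrative data for a simple line chart.
plt.plot([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])

# Set (min x, max x, min y, max y) in a single call; the new limits are returned.
limits = plt.axis((0, 6, 0, 30))
```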

Specifying the position of the axes by hand each time is a lot of manual work 😒, isn’t it?

No worries 😊, Matplotlib gives us an alternative for this. Let’s have a look at it below:

Matplotlib’s subplot function comes to our rescue …

Here’s the code:
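A minimal sketch using subplot, with a hypothetical stand-in DataFrame (assumed column names) and a grid of two rows and one column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the dataset; column names are assumptions.
data = pd.DataFrame({
    "reviews_per_month": [0.21, 0.38, 4.64, 0.10, 0.59, 1.33],
    "number_of_reviews": [9, 45, 270, 1, 74, 49],
    "minimum_nights": [1, 1, 3, 10, 3, 2],
})

# Grid of 2 rows x 1 column; draw in the first (top) cell.
ax_top = plt.subplot(2, 1, 1)
plt.scatter(x="reviews_per_month", y="number_of_reviews", data=data)

# Same grid; draw in the second (bottom) cell.
ax_bottom = plt.subplot(2, 1, 2)
plt.hist(data["minimum_nights"])

# Adjust the padding so the subplots do not overlap.
plt.tight_layout()
```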

The subplot function creates a grid and takes 3 parameters:

1st parameter: the number of rows in the grid

2nd parameter: the number of columns in the grid

3rd parameter: the sequence number of the subplot to draw in (the numbering starts from the top-left corner of the grid)

So, the plt.subplot(2,1,1) statement creates the first subplot in a grid of two rows and one column.

The last statement, plt.tight_layout(), adjusts the spacing so that the subplots and their labels do not overlap.

Now, suppose we want to give a colour and a legend to our histogram. A few statements help us do that.

The color parameter in the hist function allows us to specify the colour, and the label parameter allows us to specify the text of the legend entry.

To specify the location of the legend in the graph, we can use the legend function with the loc parameter. The loc parameter can take various values such as 'upper right', 'lower right', 'upper left' and so on.
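These options can be sketched as follows. The column name, the colour and the legend text are assumptions picked for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the dataset; the column name is an assumption.
data = pd.DataFrame({"minimum_nights": [1, 1, 2, 2, 2, 3, 3, 5, 7, 30]})

# color sets the bar colour; label sets the text shown in the legend.
plt.hist(data["minimum_nights"], color="green", label="Minimum Nights")

# loc places the legend; note the space in 'upper right'.
legend = plt.legend(loc="upper right")
```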

Finally, to save our plots, we can use the plt.savefig("Full path") function, which takes the path of the file in which to save the plot as an image.
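A short sketch of saving a plot. The file name and the use of the system’s temporary directory are assumptions made so the example can run anywhere.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# Illustrative plot to save.
plt.plot([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])

# Hypothetical output location: a PNG file in the system's temporary directory.
out_path = os.path.join(tempfile.gettempdir(), "line_chart.png")

# The file extension decides the image format.
plt.savefig(out_path)
```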

A LAST NOTE: Some Python editors require an explicit call to Matplotlib’s show function to make plots visible, so if you are using an editor where you cannot see your plots after running the code, add a plt.show() statement.

So that’s it for this post. In the next post, we will see how to create some advanced plots using Matplotlib and how to use Python’s Seaborn library to create some interactive visualizations.


Simarpreet Singh
Analytics Vidhya

Data Analyst & Visualization Specialist | Masters in Computer Science | Concordia University | Montreal,Canada