Introduction to data visualization with Pandas

Pankajashree R
Published in We Are Orb
Dec 6, 2017

A picture is worth a thousand words.

Today, the amount of data being generated every second is staggering. Representing data in pictures, charts and graphs makes it easier to comprehend and reveals patterns that raw numbers hide. Most of us prefer to look at a picture rather than a long paragraph of text or a big table of numbers. This is why data visualization has become an important field.


Pandas is a Python library useful for data cleaning, modeling and exploration. Quoting from the official documentation:

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

In this article I’ll demonstrate the basics of visualizing data using this library.

Prerequisites

  • Familiarity with Python
  • Programming environment: Anaconda and Jupyter Notebook are convenient tools for experimenting with data analysis. You can even try it temporarily here. Anaconda already comes with Pandas installed.

Steps involved in data visualization

Data visualization is a fancy term that essentially consists of just these 3 basic steps:

  1. Importing the required libraries (such as matplotlib, seaborn, etc.)
  2. Getting the data ready: normally reading a CSV or JSON file and creating a table (dataframe is the technical word)
  3. Graphical representation (plotting): choose the type of plot and see the magic!

Step 1: The libraries

Pandas visualization, which is built on the matplotlib API, can be used to create decent plots such as bar graphs, histograms, scatter plots, etc. Other visualization libraries such as seaborn and bokeh cater to more advanced needs such as statistical graphics, interactive and live-streaming graphs, maps, etc.

Let’s first master matplotlib and then move on to the advanced libraries.

Open Jupyter notebook and start your data-viz program. For any 2D data-plot, we need to import these 2 packages customarily:

import pandas as pd

import matplotlib.pyplot as plt

matplotlib.pyplot is the package required for the generation of 2D plots.

Step 2: The Data set

For starters, let us use a sample data set consisting of people’s first name, their country and their ages, all stored in a csv file.

Our sample data set — csv file opened in Excel

We have to read this csv file and store it in a dataframe, which is a table with indexed rows; each of its columns is known as a series.

df = pd.read_csv('people-example.csv')

Viewing the dataframe in Pandas. Notice that it is similar to the Excel image above.

You can view your table (i.e., the dataframe) in the Jupyter notebook using the head() method to see the first 5 rows and the tail() method to see the last 5 rows. Pressing Shift+Enter executes the current cell.
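As a minimal sketch of this step, the snippet below builds a small dataframe from inline CSV text instead of a file, so it runs without people-example.csv. The column names and rows are made up for illustration and are not the contents of the article's file.

```python
import io

import pandas as pd

# Made-up stand-in for people-example.csv (hypothetical rows and column names).
csv_text = """First Name,Last Name,Country,age
Asha,Rao,India,24
Ben,Smith,United States,31
Chen,Li,China,28
Dana,Novak,Poland,45
Emil,Berg,Sweden,37
Fatima,Khan,Pakistan,22
Goran,Ilic,Serbia,53
"""

# read_csv accepts a file path or any file-like object.
people = pd.read_csv(io.StringIO(csv_text))

first_rows = people.head()   # first 5 rows by default
last_rows = people.tail(3)   # pass a number to change how many rows you get
ages = people["age"]         # a single column is a pandas Series

print(first_rows)
print(last_rows)
print(type(ages).__name__)
```

In a notebook, writing people.head() as the last line of a cell renders the table without needing print().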

Step 3: Plot the data

Before we plot our data, run %matplotlib inline so that plots are displayed directly below the code cell that produced them.

The plot() method creates a plot of the dataframe, a line graph by default. By default it uses the index (the serial numbers) as the x-axis and the numeric columns (here, age) as the y-axis.

The columns to use for the x and y axes can be specified like so:

df.plot(x='Last Name', y='age')

We can specify other types of plots such as bar, horizontal bar (barh), histogram, etc. For example, df.plot.bar() will display a bar graph for the dataframe.
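The plotting calls above can be tried on a small stand-in table. This is only a sketch: the rows below are made up, and the variable names (people, line_ax, bar_ax, barh_ax) are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows standing in for the people data set (hypothetical values).
people = pd.DataFrame({
    "Last Name": ["Rao", "Smith", "Li", "Novak"],
    "age": [24, 31, 28, 45],
})

# Default: a line graph of the numeric columns with the index on the x-axis.
line_ax = people.plot()

# Pick the column for each axis and draw a bar graph instead.
bar_ax = people.plot.bar(x="Last Name", y="age")

# kind= is an equivalent way to choose the plot type.
barh_ax = people.plot(kind="barh", x="Last Name", y="age")
```

Each call returns a matplotlib Axes object, so titles and labels can be adjusted afterwards, for example with bar_ax.set_ylabel("age").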

Exercise: Using a bigger dataset and an advanced plotting technique

Let’s now use a bigger dataset to have some real fun. The Iris dataset contains characteristics of 3 types of iris plant. It is commonly used for experimenting with data analysis. To know more about the data set, visit here. Using the same 3-step procedure explained above, let’s prepare the dataframe:

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import andrews_curves
df2 = pd.read_csv('iris.csv')

For visualizing multidimensional data (data consisting of many parameters) such as this Iris dataset, Andrews curves and parallel coordinates are common techniques. For this we additionally need to import andrews_curves (or parallel_coordinates) from the pandas.plotting package.

In an Andrews curves plot, each row of data is drawn as one curve, and the curves are colored according to a column we choose. For example, when we do andrews_curves(df2, 'Name') the rows of df2 are grouped by the value of Name.

Andrews plot for the Iris dataset
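If iris.csv is not at hand, the same call can be sketched on synthetic data. Everything below except andrews_curves itself is an assumption made for illustration: the column names f1 to f4, the class labels, and the random values are not Iris measurements.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import andrews_curves

# Synthetic stand-in for iris.csv: 3 classes, 4 numeric columns (not real measurements).
rng = np.random.default_rng(0)
frames = []
for name, center in [("class-a", 1.0), ("class-b", 3.0), ("class-c", 5.0)]:
    values = rng.normal(loc=center, scale=0.3, size=(10, 4))
    frame = pd.DataFrame(values, columns=["f1", "f2", "f3", "f4"])
    frame["Name"] = name
    frames.append(frame)
samples = pd.concat(frames, ignore_index=True)

# One curve is drawn per row; the 'Name' column decides each curve's color.
fig, ax = plt.subplots()
andrews_curves(samples, "Name", ax=ax)
ax.set_title("Andrews curves on synthetic data")
```

Because the three synthetic classes are centered far apart, their curves should form three visibly separate bundles.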

What is special about this plot?

The Iris dataset has 3 different types of plants. That’s why we get three different colors for the lines.

By coloring these curves differently for each class it is possible to visualize data clustering. Curves belonging to samples of the same class will usually be closer together and form larger structures.

Therefore, in this plot, we can easily note that the lines that represent a class (or a type of the plant) are closely spaced and have similar curves.
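Parallel coordinates, the other technique mentioned earlier, shows the same kind of clustering. The following is a rough sketch on synthetic data; the column names, class labels and values are hypothetical and chosen only so that the two classes are clearly separated.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Synthetic data: 2 classes with different centers (not real measurements).
rng = np.random.default_rng(1)
columns = ["f1", "f2", "f3", "f4"]
parts = []
for name, center in [("class-a", 1.0), ("class-b", 3.0)]:
    part = pd.DataFrame(rng.normal(center, 0.3, size=(8, 4)), columns=columns)
    part["Name"] = name
    parts.append(part)
points = pd.concat(parts, ignore_index=True)

# Each row becomes one line across the vertical axes, colored by 'Name'.
fig, ax = plt.subplots()
parallel_coordinates(points, "Name", ax=ax)

# A numeric view of the same separation: the per-class column means.
class_means = points.groupby("Name")[columns].mean()
print(class_means)
```

Lines of the same class run close together across all four axes, which is the visual counterpart of the per-class means printed at the end.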

Code Repository

You can find the code and datasets used in this article here.

Where to go from here — Useful Resources:

The official pandas documentation on visualization: see all the other plots that can be created using pandas.

Kaggle’s datasets — The best place to find open data. Go on and start exploring and publish your results!

Questions? Comments? Leave a note below.

Data Cleansing is important before you visualize your data. Check out this awesome post from our publication to get started with Data Cleansing.

