Day (6) — Data Visualization — How to use Pandas Built-In features

Keith Brooks
4 min read · Mar 6, 2018


This article covers work from the Python for Data Science and Machine Learning Bootcamp course on Udemy by Jose Portilla and helpful tips along the way. This course was very helpful in gaining a base understanding of the topic.

Has the phrase “Ain’t nobody got time for that” ever come to mind when trying to review data? Sometimes there is so much to do that a quick data visualization will have to do. But what should, or could, be used? Well, similar to “There’s an app for that”, Pandas has a set of built-in data visualization features that provide quick and dirty plots for assessing datasets.

Topics:
* How to generate histograms
* How to generate area plots
* How to generate bar plots
* How to generate line plots
* How to generate scatter plots
* How to generate box plots
* How to generate hexbin plots
* How to generate Kernel Density Estimation plots

The Setup:
* The example uses Python 3.6 within a Jupyter notebook with the dependencies below
* Matplotlib 2.1.2
* Numpy 1.14.1
* Pandas 0.20.3

Warning:
* Feel free to review the docs for additional arguments for the methods.

Based on my recent exposure to the seaborn library, the documentation for the pandas plots does not feel as detailed. At least, that is my opinion. That said, seaborn is typically recommended for data visualization, but the plots below are enough for a quick analysis. Since most Pandas plots can be generated directly from the data frame in question, they can be very helpful. Well, let’s get into it. There are two ways to generate plots…

Option A

data_frame['column_name'].plot(kind='plot_type_name')

Option B — recommended

data_frame['column_name'].plot.plot_type_function()
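As a rough sketch of the two options side by side, the snippet below draws the same histogram both ways. The data frame and its column name "a" are made up for illustration and are not from the course. Note that in Option B, plot is an accessor, so it is written without parentheses before the plot-type method.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 random values in a column named "a".
rng = np.random.RandomState(42)
df = pd.DataFrame({"a": rng.randn(100)})

# Option A: pass the plot type as a string through the kind argument.
ax_a = df["a"].plot(kind="hist")

# Option B: call the plot-type method on the .plot accessor.
ax_b = df["a"].plot.hist()
```

In a Jupyter notebook, each call would normally go in its own cell so the figures are displayed separately.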

Histogram plot -> used to display the distribution of a numerical dataset
~
Use the .hist() method with the bins argument to generate a histogram plot. The bins argument specifies the number of buckets the data is split into for review.
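A minimal histogram sketch, assuming a made-up column "a" of normally distributed values (the data and the choice of 30 bins are only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 500 draws from a standard normal distribution.
rng = np.random.RandomState(42)
df = pd.DataFrame({"a": rng.randn(500)})

# Split the values of column "a" into 30 buckets and plot the counts.
ax = df["a"].plot.hist(bins=30)
```

Raising or lowering bins trades detail for smoothness, so it is worth trying a few values.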

Area plot -> used to display quantitative data
~
Use the .area() method to generate an area plot. The area plot may be thought of as a line plot with the area underneath filled in.
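Here is a small area plot sketch. The three columns of random positive values are an assumption for the example, and alpha is an optional argument that makes the filled regions semi-transparent.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10 rows of positive values in three columns.
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(10, 3), columns=["a", "b", "c"])

# Columns are stacked on top of one another by default.
ax = df.plot.area(alpha=0.4)
```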

Bar plot -> used to display categorical data
~
Use the .bar() method to generate a bar plot. The optional stacked argument stacks the columns on top of one another within each bar.
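A short bar plot sketch, assuming a made-up data frame with five labeled rows and three columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: five categories (the index) with three values each.
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(5, 3), columns=["a", "b", "c"], index=list("VWXYZ"))

# stacked=True draws one bar per row, with the columns stacked inside it.
ax = df.plot.bar(stacked=True)
```

Leaving out stacked (or setting it to False) draws the columns side by side instead.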

Line plot -> used to display how values change along an ordered axis, with the points connected by a line
~
Use the .line() method with arguments x (the column to use for the x-axis; the index is used by default), y (the name of the column to plot), figsize (adjusts the display size, optional), and lw (adjusts the line width, optional).
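The arguments above could be combined as in this sketch. The column names "day" and "value" are hypothetical, chosen only to make the example readable.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a random walk over 30 days.
rng = np.random.RandomState(42)
df = pd.DataFrame({"day": np.arange(30), "value": rng.randn(30).cumsum()})

# x and y are column names; figsize is (width, height) in inches; lw is the line width.
ax = df.plot.line(x="day", y="value", figsize=(10, 3), lw=1)
```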

Scatter plot -> used to display data points on horizontal and vertical axes
~
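Use the .scatter() method to display two or three dimensions of information from a dataset. The method accepts arguments for x, y, c (the color of the data points) and cmap (the colormap used when c is a column of values). Note that .scatter() is only available on a data frame, not on a single column.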
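As a sketch of a scatter plot with a third dimension shown as color, assuming three made-up columns "a", "b" and "c" (the coolwarm colormap is just one possible choice):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 rows of random values in three columns.
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.randn(200, 3), columns=["a", "b", "c"])

# Columns "a" and "b" set the position; column "c" sets the color of each point.
ax = df.plot.scatter(x="a", y="b", c="c", cmap="coolwarm")
```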

Box plot -> used to display distribution, central value and variability of data
~
Use the .box() method to generate basic box plots.
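A basic box plot sketch on a made-up data frame, which draws one box per column:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 50 rows of random values in three columns.
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(50, 3), columns=["a", "b", "c"])

# One box per column, showing the median, quartiles and whiskers.
ax = df.plot.box()
```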

Hexbin plot -> essentially scatter plots displayed as hexagons based on data point density
~
Use the .hexbin() method to gain insight into the density of the data from an alternative perspective. The data at the blue end of the spectrum is less dense than the data at the red end. In other words, more data points fall within the red hexagons than within the blue ones.
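A hexbin sketch on made-up data follows. The coolwarm colormap is an assumption here, picked because it runs from blue for low counts to red for high counts; gridsize is an optional argument that controls how many hexagons span the x-axis.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 random points in two columns.
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.randn(1000, 2), columns=["a", "b"])

# Each hexagon is colored by how many points fall inside it.
ax = df.plot.hexbin(x="a", y="b", gridsize=25, cmap="coolwarm")
```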

KDE (Kernel Density Estimation) plot -> used for data smoothing and to display the density distribution of the data
~
I still need more exposure to the reasons for using KDE. However, we can use the .kde() method to generate the plot.
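A minimal KDE sketch on a made-up column of random values. One assumption to flag: pandas relies on the SciPy library for this plot, so SciPy needs to be installed even though it is not in the setup list above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 500 draws from a standard normal distribution.
rng = np.random.RandomState(42)
df = pd.DataFrame({"a": rng.randn(500)})

# Draws a smoothed estimate of the distribution of column "a" (requires SciPy).
ax = df["a"].plot.kde()
```

The curve can be read as a smoothed version of the histogram for the same column.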

Enjoy!

“Very little is needed to make a happy life; it is all within yourself, in your way of thinking.” ~ Marcus Aurelius

Keith Brooks

Hi…This is intended to document my journey on my path to data science. I am a process-oriented individual who loves to try new things and continue to learn.