Visual Exploratory Data Analysis | Pandas Foundation | Part 1

Muhammad Zohaib
Nov 20 · 4 min read
Image source: RDRR.IO

As a discipline, statistics has mostly developed in the past century. This article, written in the context of data science and big data, focuses on the first step in any data science project: exploring the data. Exploratory data analysis, or EDA, is a comparatively new area of statistics. Classical statistics focused almost exclusively on inference, a sometimes complex set of procedures for drawing conclusions about large populations based on small samples.

In 1962, John W. Tukey called for a reformation of statistics in his seminal paper “The Future of Data Analysis”. He proposed a new scientific discipline called data analysis that included statistical inference as just one component.

Knowing now how to inspect data in pandas, let’s discuss exploratory data analysis, or EDA. We’ll apply EDA visually first, using Fisher’s Iris flower data set. It has 150 observations with 4 measurements each (sepal length, sepal width, petal length, and petal width), taken from 3 distinct flower species. We start by importing pandas and matplotlib’s pyplot. Then we load the DataFrame and examine it with head.
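
The original screenshots are not reproduced here, so the following is only a minimal sketch of that setup. The column names (sepal_length, sepal_width, petal_length, petal_width, species) are an assumption, since the article does not show them. To stay runnable on its own, the sketch reads a nine-row excerpt of the Iris data from an in-memory string; with the full data set you would pass the path of your own CSV file to pd.read_csv instead.

```python
import io

import matplotlib.pyplot as plt  # imported up front, as in the article; used by later plots
import pandas as pd

# Nine-row excerpt of Fisher's Iris data (three rows per species), for illustration only.
# The column names are an assumption; the full data set has 150 rows.
IRIS_EXCERPT = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3.0,5.9,2.1,virginica
"""

iris = pd.read_csv(io.StringIO(IRIS_EXCERPT))

print(iris.shape)   # (number of rows, number of columns)
print(iris.head())  # first five rows of the DataFrame
```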


Line Plot

First, let’s plot some data with the DataFrame plot method. We can specify particular column names for the x and y axes.
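
A minimal sketch of such a call, under the same assumptions as before: the column names are hypothetical, and a nine-row excerpt of the Iris data stands in for the full file. The choice of sepal length against sepal width is also just an illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Nine-row excerpt of the Iris data; column names are an assumption.
IRIS_EXCERPT = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3.0,5.9,2.1,virginica
"""
iris = pd.read_csv(io.StringIO(IRIS_EXCERPT))

# With no `kind` argument, DataFrame.plot draws a line plot:
# the points are joined by a single line in row order.
ax = iris.plot(x="sepal_length", y="sepal_width")

# In an interactive session or notebook, drop the "Agg" line and call plt.show().
```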


The result is not all that useful: by default, plot creates a line plot, and joining the points in row order doesn’t make a lot of sense for this unordered four-dimensional data set.

Scatter Plot

We’ll try again, this time specifying kind='scatter'. Let’s also label the axes with units of measurement. The scatter plot is better.
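
As a rough sketch, the scatter version might look like this. The column names and the nine-row excerpt are assumptions made so the example runs on its own; the Iris measurements are recorded in centimetres, which is what the axis labels state.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Nine-row excerpt of the Iris data; column names are an assumption.
IRIS_EXCERPT = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3.0,5.9,2.1,virginica
"""
iris = pd.read_csv(io.StringIO(IRIS_EXCERPT))

# kind="scatter" draws one marker per row instead of a connecting line.
ax = iris.plot(x="sepal_length", y="sepal_width", kind="scatter")

# Label both axes with the unit of measurement.
ax.set_xlabel("sepal length (cm)")
ax.set_ylabel("sepal width (cm)")

# In an interactive session or notebook, drop the "Agg" line and call plt.show().
```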


Box Plot

Individual variable distributions are likely more informative than plotting two variables against each other. For instance, specifying kind='box' for, say, the sepal length makes a box plot.
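
A minimal sketch of a box plot for one column, again assuming hypothetical column names and a nine-row excerpt in place of the full data. The describe call is an addition for illustration: it prints the numbers that the box and its median line are drawn from.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Nine-row excerpt of the Iris data; column names are an assumption.
IRIS_EXCERPT = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3.0,5.9,2.1,virginica
"""
iris = pd.read_csv(io.StringIO(IRIS_EXCERPT))

# Passing only y selects a single column; kind="box" summarizes its distribution.
ax = iris.plot(y="sepal_length", kind="box")
ax.set_ylabel("sepal length (cm)")

# The same summary as numbers: minimum, quartiles, and maximum.
summary = iris["sepal_length"].describe()
print(summary[["min", "25%", "50%", "75%", "max"]])

# In an interactive session or notebook, drop the "Agg" line and call plt.show().
```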


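The resulting box plot shows the range of the data with the whiskers, the interquartile range with the box edges, and the median as the line inside the box. By default, the whiskers reach the most extreme values within 1.5 times the interquartile range of the box edges, so they coincide with the minimum and maximum only when there are no outliers; any outliers are drawn as individual points.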

Histogram

Another common plotting tool for EDA is a histogram. Here we use kind='hist'. Recall that histograms show frequencies of measurements counted within certain bins, or intervals. The result approximates the shape of the probability density function, or PDF, of the sepal length of all the iris flowers. It could be bell-shaped, but it’s hard to tell with bins this wide. We can redraw the histogram to get a better sense of the data.
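
A sketch of the default histogram, assuming the same hypothetical column names and nine-row excerpt as a stand-in for the full data. With so few rows the shape is not meaningful; the point is only the call itself and the default of ten bins.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Nine-row excerpt of the Iris data; column names are an assumption.
IRIS_EXCERPT = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3.0,5.9,2.1,virginica
"""
iris = pd.read_csv(io.StringIO(IRIS_EXCERPT))

# kind="hist" counts the values of the selected column in 10 equal-width bins by default.
ax = iris.plot(y="sepal_length", kind="hist")
ax.set_xlabel("sepal length (cm)")

# Each bar is a rectangle patch; its height is the count in that bin.
counts = [bar.get_height() for bar in ax.patches]
print(counts)

# In an interactive session or notebook, drop the "Agg" line and call plt.show().
```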


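Histogram Options

Other hist arguments can be passed through the DataFrame plot interface. For instance, bins is the number of intervals to use in building the histogram, and range gives the extremes of the bins. density (called normed in older matplotlib releases, where it has since been removed) tells whether to rescale the bar heights so that the total area of the histogram is one. cumulative tells whether to draw the histogram or its cumulative version, which approximates the cumulative distribution function. In fact, keyword arguments that plot does not use itself are passed on to the underlying matplotlib function, so most matplotlib options can be specified this way.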
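
To make the density and cumulative options concrete, here is a small sketch that draws both variants side by side. The column names, the nine-row excerpt, and the particular bins and range values are assumptions chosen for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Nine-row excerpt of the Iris data; column names are an assumption.
IRIS_EXCERPT = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3.0,5.9,2.1,virginica
"""
iris = pd.read_csv(io.StringIO(IRIS_EXCERPT))

fig, (ax_pdf, ax_cdf) = plt.subplots(1, 2, figsize=(10, 4))

# density=True rescales the bars so that their total area is one.
iris.plot(y="sepal_length", kind="hist", bins=10, range=(4, 8),
          density=True, ax=ax_pdf)

# Adding cumulative=True accumulates the bars from left to right,
# so the last bar reaches one.
iris.plot(y="sepal_length", kind="hist", bins=10, range=(4, 8),
          density=True, cumulative=True, ax=ax_cdf)

area = sum(bar.get_height() * bar.get_width() for bar in ax_pdf.patches)
top = max(bar.get_height() for bar in ax_cdf.patches)
print(area, top)

# In an interactive session or notebook, drop the "Agg" line and call plt.show().
```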

Histogram Customization

Using more keyword options improves the histogram. Here we specify 30 bins and a range from 4 to 8. The customized histogram shows at least three distinct pieces in the distribution of sepal lengths. This suggests groups or subpopulations in the data.
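
A sketch of that customized call, with the usual caveats: the column names are assumed, and the nine-row excerpt used here is far too small to show the structure described above, which only appears with the full 150 observations.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Nine-row excerpt of the Iris data; column names are an assumption.
IRIS_EXCERPT = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3.0,5.9,2.1,virginica
"""
iris = pd.read_csv(io.StringIO(IRIS_EXCERPT))

# 30 equal-width bins spanning sepal lengths from 4 cm to 8 cm.
ax = iris.plot(y="sepal_length", kind="hist", bins=30, range=(4, 8))
ax.set_xlabel("sepal length (cm)")

# Left edge of the first bar and right edge of the last bar.
left_edge = ax.patches[0].get_x()
right_edge = ax.patches[-1].get_x() + ax.patches[-1].get_width()
print(left_edge, right_edge)

# In an interactive session or notebook, drop the "Agg" line and call plt.show().
```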


Check the documentation for details. Here is the GitHub link to the notebook (here!).

The Startup

Medium's largest active publication, followed by +732K people. Follow to join our community.

Written by Muhammad Zohaib

I’m a Data Science student. I love to create, learn, and share my skills: learning a new technology, brushing up on current skills, or writing Data Science articles.
