As a discipline, statistics has mostly developed in the past century. This article looks at it in the context of data science and big data, and focuses on the first step in any data science project: exploring the data. Exploratory data analysis, or EDA, is a comparatively new area of statistics. Classical statistics focused almost exclusively on inference, a sometimes complex set of procedures for drawing conclusions about large populations based on small samples.
In 1962, John W. Tukey called for a reformation of statistics in his seminal paper “The Future of Data Analysis”. He proposed a new scientific discipline called data analysis that included statistical inference as just one component.
Knowing now how to inspect data in pandas, let’s discuss exploratory data analysis, or EDA. We’ll apply EDA visually first. We will consider Fisher’s Iris flower data set. There are 150 observations with 4 measurements each: sepal length, sepal width, petal length, and petal width. There are also 3 distinct flower species. We start by importing pandas and matplotlib’s pyplot. We then load the data into a data frame and examine it with head.
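The loading step might look like the sketch below. The article does not show its file or column names, so the snake_case column names here are assumptions, and a nine-row excerpt of the Iris data (three rows per species) is inlined so the example runs on its own. With the full 150-row file on disk, the same read_csv call would take a file path instead.

```python
import io

import matplotlib.pyplot as plt  # imported here as in the article; used for plotting later
import pandas as pd

# Nine-row excerpt of the Iris data, three rows per species.
# Column names are an assumption, not taken from the article.
IRIS_EXCERPT = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3.0,5.9,2.1,virginica
"""

# read_csv accepts a file path or any file-like object.
iris = pd.read_csv(io.StringIO(IRIS_EXCERPT))

print(iris.shape)   # (rows, columns)
print(iris.head())  # first five rows
```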
First, let’s draw some data with the data frame plot method. We can specify particular column names for the x and y axes.
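A minimal sketch of that call, assuming the hypothetical column names sepal_length and sepal_width and a small inlined excerpt of the data so the snippet is self-contained:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import pandas as pd

# Six-row excerpt of the Iris data; column names are assumed.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""
iris = pd.read_csv(io.StringIO(csv_text))

# With no `kind` given, DataFrame.plot draws a line plot,
# joining the points in row order.
ax = iris.plot(x="sepal_length", y="sepal_width")
```

In an interactive session, matplotlib.pyplot.show() would display the figure.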
The result is not all that useful, because by default plot creates a line plot. A line plot joins the points in row order, and the rows of this four-dimensional data set have no meaningful order, so the connecting lines don’t make a lot of sense.
We’ll try again, this time specifying kind='scatter'. Let’s also label the axes with units of measurement. The scatter plot is better.
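The scatter version could be sketched as follows. The column names are again assumptions, and the axis labels use centimetres, the unit in which the Iris measurements are recorded.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import pandas as pd

# Six-row excerpt of the Iris data; column names are assumed.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""
iris = pd.read_csv(io.StringIO(csv_text))

# kind="scatter" draws one unconnected marker per observation.
ax = iris.plot(x="sepal_length", y="sepal_width", kind="scatter")

# Label both axes with the unit of measurement.
ax.set_xlabel("sepal length (cm)")
ax.set_ylabel("sepal width (cm)")
```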
The distribution of an individual variable can be more informative than a plot of two variables against each other. For instance, specifying kind='box' for, say, the sepal length makes a boxplot.
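A boxplot of a single column might be requested as in this sketch, which assumes a column named sepal_length and uses a small inlined excerpt of the data:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import pandas as pd

# Nine-row excerpt of the Iris data; column names are assumed.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3.0,5.9,2.1,virginica
"""
iris = pd.read_csv(io.StringIO(csv_text))

# Passing only `y` selects a single column; kind="box" draws its boxplot.
ax = iris.plot(y="sepal_length", kind="box")
ax.set_ylabel("sepal length (cm)")
```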
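The resulting boxplot shows the range of the data with the whiskers, the interquartile range with the box edges, and the median inside the box. With matplotlib’s default settings, the whiskers reach the minimum and maximum values only when no point lies more than 1.5 times the interquartile range beyond the box; any points farther out are drawn individually as outliers.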
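The numbers behind a boxplot can also be computed directly. The sketch below uses describe on a nine-row excerpt of the data (the column name is an assumption) and derives the whisker limits that follow from matplotlib's default factor of 1.5.

```python
import io

import pandas as pd

# Nine-row excerpt of the Iris data; column names are assumed.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3.0,5.9,2.1,virginica
"""
iris = pd.read_csv(io.StringIO(csv_text))

# describe() reports count, mean, std, min, quartiles, and max.
summary = iris["sepal_length"].describe()
q1, median, q3 = summary["25%"], summary["50%"], summary["75%"]
iqr = q3 - q1  # interquartile range: the height of the box

# Points outside these limits would be drawn as outliers
# (1.5 is matplotlib's default whisker factor).
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

print(summary)
print("IQR:", iqr, "whisker limits:", lower_limit, upper_limit)
```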
Another common plotting tool for EDA is a histogram. Here we use kind='hist'. Recall that histograms show frequencies of measurements counted within certain bins or intervals. The shape of the result approximates the probability density function, or PDF, of the sepal length of all the iris flowers. It could be bell-shaped, but it’s hard to tell with bins this wide. We can redraw the histogram to get a better sense of the data.
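A default histogram could be drawn as sketched here, assuming a column named sepal_length. Only a nine-row excerpt of the data is inlined, so the shape differs from what the full 150 observations would show.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import pandas as pd

# Nine-row excerpt of the Iris data; column names are assumed.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3.0,5.9,2.1,virginica
"""
iris = pd.read_csv(io.StringIO(csv_text))

# kind="hist" counts observations per bin; the default is 10 equal-width bins
# spanning the observed minimum to maximum.
ax = iris.plot(y="sepal_length", kind="hist")
ax.set_xlabel("sepal length (cm)")

# Each bar is a rectangle patch whose height is the count in that bin.
counts = [patch.get_height() for patch in ax.patches]
print(counts)
```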
Other hist arguments can be passed through the data frame plot interface. For instance, bins is the number of intervals to use in building the histogram, and range gives the extremes of the bins. normed tells whether to rescale the histogram so that its total area is one (recent matplotlib versions call this argument density). cumulative tells whether to draw the histogram or its running total, which approximates the cumulative distribution function. In fact, any matplotlib options can be specified using keyword arguments with plot.
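These options can be combined as in the sketch below. It assumes a recent matplotlib, where the rescaling argument is spelled density rather than normed, and a column named sepal_length; the bin count of 8 is chosen only for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Nine-row excerpt of the Iris data; column names are assumed.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3.0,5.9,2.1,virginica
"""
iris = pd.read_csv(io.StringIO(csv_text))

fig, (ax_density, ax_cumulative) = plt.subplots(1, 2, figsize=(8, 3))

# Left: 8 bins between 4 and 8 cm, rescaled so the bar areas sum to one.
iris.plot(y="sepal_length", kind="hist", bins=8, range=(4, 8),
          density=True, ax=ax_density)

# Right: the same bins drawn as a running total, ending at one.
iris.plot(y="sepal_length", kind="hist", bins=8, range=(4, 8),
          density=True, cumulative=True, ax=ax_cumulative)

total_area = sum(p.get_height() * p.get_width() for p in ax_density.patches)
cumulative_heights = [p.get_height() for p in ax_cumulative.patches]
```

Passing ax lets both histograms share one figure, which makes the density and cumulative views easy to compare.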
Using more keyword options improves the histogram. Here we specify 30 bins and a range from 4 to 8. The customized histogram shows at least three distinct peaks in the distribution of sepal lengths. This suggests groups or subpopulations in the data.
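The customized call might look like this sketch, assuming a column named sepal_length. The inlined nine-row excerpt is far too small to reproduce the peaks described above; those only emerge with the full 150 observations.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import pandas as pd

# Nine-row excerpt of the Iris data; column names are assumed.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
7.1,3.0,5.9,2.1,virginica
"""
iris = pd.read_csv(io.StringIO(csv_text))

# 30 equal-width bins spanning exactly 4 cm to 8 cm.
ax = iris.plot(y="sepal_length", kind="hist", bins=30, range=(4, 8))
ax.set_xlabel("sepal length (cm)")

left_edge = ax.patches[0].get_x()
right_edge = ax.patches[-1].get_x() + ax.patches[-1].get_width()
```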
Check the documentation for details. Here is the GitHub link to the notebook (here!).