First step to Statistics (with Iris data)

Nilanjana Mukherjee
Published in Analytics Vidhya
14 min read · Jun 1, 2020

Table of Contents

  • Introduction
  • Statistical Nomenclatures
  • Descriptive Statistics
    – Measures of central tendency
    – Measures of spread
  • Exploratory Data Analysis (EDA)
    – Box Plot
    – Swarm Plot
    – Histogram
    – Kernel Density Estimation (KDE) Plot
    – Violin Plot
    – Empirical Cumulative Distribution Function (ECDF) Plot
    – Pair Plot
    – Quantifying the correlation between two variables
    – Plot summary
  • Normal Distribution
    – Comparison between the Normal CDF and the ECDF of the Iris data

Introduction

Statistical skills are more in demand than ever. If you collect or analyze data of any kind, an understanding of basic statistical concepts may be a necessity. This article informally introduces a few statistical concepts by exploring the well-known Iris data set.

There are plenty of statistics courses available online which are great for a beginner. I will list some of my favorites at the end of this article. But the drawback of a lengthy course is that it is easy to lose focus when the topics get involved. Here, my intent is to introduce some basics that will make you feel at ease.

A basic understanding of Python libraries is needed to follow the code used here.

Here is the list of topics covered in this article:

  1. Descriptive Statistics
  2. Exploratory Data Analysis (EDA)
  3. Normal Distribution

First, we will introduce some statistical nomenclatures and then use the Iris data set to illustrate the underlying concepts.

Iris data: The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters (source: wiki).

Here is how a few rows of this data set look:

Statistical Nomenclatures

Let us start with the difference between a population and a sample. A population includes all of the elements of interest, whereas a sample consists of one or more observations drawn from the population. Samples are used to make statistical inferences about the population.

Now let us introduce the two broad types of statistical methods:

  1. Descriptive Statistics
  2. Inferential Statistics

Descriptive Statistics describes/summarizes the data but is not used for making any generalizations about a population. For example, what are the average petal length and sepal length for different species of Iris?

Inferential Statistics allows generalizations about the population based on samples from that population. It does so by relying on probability theory. For example, using our Iris sample, can we test a hypothesis that long sepals have long petals and vice versa?

Before drawing any inference from the data, it needs to be visualized and analyzed using Descriptive Statistics and Exploratory Data Analysis (EDA).

EDA is the exploration of data to identify patterns or outliers. In Data Science, EDA is performed before the modeling stage to detect anomalies/outliers, to identify influential variables, and to gain insights into the underlying distribution. EDA can be considered a prerequisite for applying formal statistical techniques to the data. Both graphical and non-graphical techniques are used for performing EDA.

This blog is limited to Descriptive Statistics and EDA graphical techniques; we will not cover Inferential Statistics here.

Descriptive Statistics

Descriptive statistics summarizes the data using the following two methods:

  1. Measures of central tendency
  2. Measures of spread

1. Measures of central tendency:

There are three basic ways to quantify this:

  • Mean: Average value of a data set. For instance, the data set [1,2,5,5,2] has mean (1+2+5+5+2)/ 5 = 3.
  • Median: Middle value of the data set. To find the median, the data needs to be arranged from the smallest to the largest value. If there is an even number of items in the data set, then the median is the mean of the two middle numbers. For instance, the data set [1,2,5,7,9] has median 5. For the data set [1,2,4,5], the median is (2+4)/2 = 3.
  • Mode: The item that appears the most. For instance, the data set [1,2,5,5,6,5,9,5] has mode 5.

The mean is not robust to statistical outliers. If some data points are much higher or lower than the other values, the mean will be pulled towards them. In contrast, the median is robust to statistical outliers as it is just the middle value.
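The three measures can be checked with Python's built-in statistics module. This is a minimal sketch that reuses the small data sets from the bullets above; the extra value 100 is a made-up outlier added only to show which measure moves.

```python
import statistics

# Data sets taken from the examples above.
mean_value = statistics.mean([1, 2, 5, 5, 2])            # 3
median_odd = statistics.median([1, 2, 5, 7, 9])          # 5
median_even = statistics.median([1, 2, 4, 5])            # 3.0
mode_value = statistics.mode([1, 2, 5, 5, 6, 5, 9, 5])   # 5

# Illustrative only: append an extreme value and recompute.
with_outlier = [1, 2, 5, 7, 9, 100]
mean_with_outlier = statistics.mean(with_outlier)      # pulled far upwards
median_with_outlier = statistics.median(with_outlier)  # 6.0, barely moves

print(mean_with_outlier, median_with_outlier)
```

Without the outlier, the mean of [1, 2, 5, 7, 9] is 4.8 and the median is 5. Adding 100 moves the mean above 20 while the median only shifts to 6.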

2. Measures of spread

Here are the ways to quantify this:

  • Range: Difference between the largest and the smallest values. For instance, the range of the data set [1,2,5,7,9] is (9 − 1) = 8.
  • Variance: Average of the squared differences from the mean. For the data set [1, 2, 3], the mean is 2. The variance for this data set is [(1 − 2)² + (2 − 2)² + (3 − 2)²]/3 = 0.667.
  • Standard Deviation: Square root of the variance, which gives a typical distance of the data points from the mean in the same units as the data. A large standard deviation means that the data points are more spread out while a small standard deviation means that the data points tend to be closer to the mean. The standard deviation is zero if all entries in the data set have the exact same value.
  • Quartile, Quantile and Percentile: Cut points that divide the sorted data into equal-sized groups. Quartiles divide it into four parts (Q1, Q2 = median, Q3), percentiles divide it into one hundred parts, and quantile is the general term for any such cut point.
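Here is a short sketch of the same measures using the standard library. Note that the variance formula above divides by n (the population form), so the matching functions are pvariance and pstdev; variance and stdev divide by n − 1 instead. The "inclusive" quartile method is our choice for this example, and other conventions can give slightly different cut points.

```python
import statistics

data = [1, 2, 5, 7, 9]
data_range = max(data) - min(data)                 # 9 - 1 = 8

# Population form (divide by n), matching the worked example above.
pop_variance = statistics.pvariance([1, 2, 3])     # 2/3, about 0.667
pop_std = statistics.pstdev([1, 2, 3])             # square root of 2/3

# Sample form (divide by n - 1) gives a larger value for the same data.
sample_variance = statistics.variance([1, 2, 3])   # 1

# Quartiles: three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(data_range, pop_variance, pop_std, sample_variance, (q1, q2, q3))
```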

Exploratory Data Analysis (EDA)

Now that we have some idea about the key measurements from descriptive statistics, we will visualize them on the Iris dataset using different graphical techniques or plots.

We describe the following types of plots encountered often:

  • Box Plot
  • Swarm Plot
  • Histogram
  • Kernel Density Estimation (KDE) Plot
  • Violin Plot
  • Empirical Cumulative Distribution Function (ECDF) Plot
  • Pair Plot

In the Iris data, we have 4 numerical variables and 1 categorical variable.

  • Numerical Variable: Sepal length/width & Petal length/width.
  • Categorical Variable: ‘species’ column with 3 categories (Iris setosa, Iris virginica and Iris versicolor).

1. Box Plot (or box-and-whisker plot)

The first plot we will explore is the box plot. It summarizes the key statistical measures as follows.

The box goes from the first quartile (Q1) to the third quartile (Q3). The line within the box indicates the median value. The two whiskers (lower 25% of scores and upper 25% of scores) go from the quartiles to the minimum and the maximum values, excluding any outliers as defined below.

Interquartile Range (IQR): Middle 50% of scores, which is the range between the 25th and 75th percentile.

Outliers: If a data point is below (Q1 − 1.5×IQR) or above (Q3 + 1.5×IQR), it is considered far from the central values and is termed an outlier.
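The outlier rule can be sketched in a few lines. The helper name iqr_outliers and the sample values are made up for illustration (they are not Iris measurements), and the quartiles use the "inclusive" method of the statistics module, which may differ slightly from what a plotting library computes.

```python
import statistics

def iqr_outliers(values):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Made-up measurements with one unusually small value.
sample = [4.9, 6.0, 6.2, 6.3, 6.4, 6.5, 6.7, 6.9, 7.2]
lower, upper, outliers = iqr_outliers(sample)
print(lower, upper, outliers)
```

Only the value 4.9 falls outside the two fences here, so it is the only point a box plot would draw as an outlier.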

Pic ref: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51

Iris data Box Plot 1: Comparison between the 3 categories for each numerical variable.

X axis: Species Categories, Y axis: Numeric variable values in cm.

Code link : here

We should be able to read the summary statistics (median, quartiles, IQR, maximum and minimum) from the above plots. Iris virginica has the largest measurements of the three species for every variable except sepal width. There is an outlier for Iris virginica sepal length near 5 cm.

Iris data Box Plot 2: Comparison between the 4 numeric variables for each category.

X axis: Numeric variables, Y axis: value in cm.

Code link : here

For Iris setosa, the petal lengths are smaller than the sepal widths.

Box plots can also help us to identify skewed data (we will discuss that with the next plot).

Limitations: The box plots are focused on summary rather than detail. Thus, exact data points cannot be seen from the plot.

2. Swarm Plot (or Beeswarm)

A swarm plot is a type of categorical scatter plot. It shows every data point, with the points adjusted along the categorical axis so that they do not overlap.

Combining the swarm plot with the box plot will give a summary view as well as the data distribution.

Code link : here

Notice that we do not have any observation for Iris virginica sepal lengths between 6.6 cm and 6.7 cm.

Sepal lengths of both Iris virginica and Iris versicolor are slightly right skewed, that is, low values are frequent and the tail is on the higher value side, which makes the mean greater than the median (we will discuss this in detail with the histogram).

Limitations: The swarm plot does not scale well for large datasets since it plots all the data points.

3. Histogram

Histograms plot the frequency of occurrence of numeric values for a variable.

A histogram groups numbers into bins. The height of each bar is the frequency of that bin, so a higher bar represents more observations in the bin.

Here we have plotted histograms for the sepal length variables of each of the species categories.

Code link : here

Before explaining the plots above, let us revisit the notion of skewed data. There can be two types of skew:

Positive/right skewed distribution:

  • Low values are frequent
  • Tail is on the higher value side
  • Higher value tail will make the mean greater than the median

Negative/left skewed distribution:

  • High values are frequent
  • Tail is on the lower value side
  • Lower value tail will make the mean smaller than the median

Let us now revisit the skewness in the Iris dataset via histograms. From the above histograms, we see that the sepal lengths of both Iris virginica and Iris versicolor are slightly right skewed. For a different example of skewed data, see here.
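A quick way to check the direction of skew is to compare the mean with the median. The sketch below uses a hypothetical helper, skew_direction, and two made-up data sets (not Iris measurements). Treat it as a rough rule of thumb rather than a formal skewness statistic.

```python
import statistics

def skew_direction(values):
    """Classify skew by comparing mean and median (a rough rule of thumb)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "none"

# Made-up examples.
right_tailed = [1, 1, 2, 2, 3, 10]   # long tail towards high values
left_tailed = [1, 8, 9, 9, 10, 10]   # long tail towards low values

print(skew_direction(right_tailed))  # right
print(skew_direction(left_tailed))   # left
```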

Limitations: Histograms can be plotted with different numbers of bins. However, the shape of a histogram changes with the choice of the number of bins, which may make it harder to interpret (the ECDF plot described later is a better alternative).
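To see the binning effect without any plotting, here is a small sketch that counts values per bin. The function histogram_counts and the values are made up for illustration; it assumes equal-width bins and at least two distinct values.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # assumes high > low
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Made-up values (not Iris data).
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
coarse = histogram_counts(values, 2)   # [8, 2]
fine = histogram_counts(values, 4)     # [3, 5, 1, 1]
print(coarse, fine)
```

With two bins the data looks like one big lump, while four bins reveal a peak in the second bin and a sparse upper range. The data is the same, only the binning changed.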

It is possible to plot stacked histograms as shown below. However, they may be difficult to read and compare (for this purpose, the box plot is a better solution).

Code here: link

4. Kernel Density Estimation (KDE) Plot

Before talking about the KDE plot, let us introduce the concept of a random variable. A random variable can take any value from a set of possible values.

The random variables can be of two types:

  • A continuous random variable can take any of the infinitely many values in a range. In other words, the set of possible values is an uncountably infinite set. Here, the length and the width variables are continuous random variables.
  • A discrete random variable can take only values from a finite (or countably infinite) set of separate values. Example: the outcome of rolling a six-sided die can take values in the finite set {1,2,3,4,5,6}. The value cannot be 1.2 or 1.3 etc.

Kernel Density Estimation (KDE) is a way to approximate the Probability density function (PDF) of a continuous random variable.

The probability density function is a deterministic function that describes the relative likelihood of a continuous random variable taking values near a given point. The area under the PDF curve is equal to one. The probability that a continuous random variable takes a value within a given interval [a,b] equals the area under the curve between a and b.

Because a continuous random variable can take any of the infinite number of possible values, the probability of the event that the variable is exactly equal to one particular value is zero.

Below is a KDE plot for the Iris virginica petal length.

Code here : link

The total area under the above curve is equal to 1. The probability of a value falling in a range is the area under the curve in that range. For example, the area under the curve between 7 cm and 7.5 cm is very small, and therefore the probability of a value being between 7 cm and 7.5 cm is very low.
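To make the "area equals probability" idea concrete, here is a minimal hand-written KDE with Gaussian kernels. The function names, the sample values and the bandwidth of 0.3 are all assumptions chosen for illustration. They are not the Iris measurements, and plotting libraries pick the bandwidth by their own rules.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density at x with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

def area_under(func, a, b, steps=2000):
    """Approximate the integral of func over [a, b] with the trapezoid rule."""
    h = (b - a) / steps
    total = 0.5 * (func(a) + func(b))
    total += sum(func(a + i * h) for i in range(1, steps))
    return total * h

# Made-up petal-length-like values in cm.
samples = [4.8, 5.0, 5.1, 5.5, 5.6, 5.8, 6.0, 6.1, 6.4, 6.9]
pdf = gaussian_kde(samples, bandwidth=0.3)  # bandwidth chosen by hand

total_area = area_under(pdf, 2.0, 10.0)  # close to 1
tail_area = area_under(pdf, 7.0, 7.5)    # estimated probability of [7, 7.5]
print(total_area, tail_area)
```

The total area comes out very close to 1, and the area between 7 and 7.5 is only a small fraction of it.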

For a discrete random variable, the probability mass function (PMF) takes the role that the PDF plays for a continuous random variable.

5. Violin Plot

A violin plot combines the box plot and the KDE plot discussed before. It includes a kernel density estimation of the underlying data distribution.

The violin plot can show all the key summary statistics as in the box plot. It can also show the sections where the probability of the values is higher or lower. The higher probability sections will be wider while the lower probability sections will be narrower. This means that multimodal distributions can be visualized in a violin plot, but not in a box plot.

Here is the violin plot on Iris data sepal length across different categories.

Code link : here

  • The white dot is the median
  • Middle thick bar is the interquartile range
  • The two thin lines are the whiskers we have seen before in the box plots
  • The surrounding curve is the kernel density estimation of the underlying distribution

Limitations: the violin plot may look misleadingly smooth for a small dataset.

6. Empirical Cumulative Distribution Function (ECDF) Plot

The ECDF may seem complicated at the beginning but with some practice, you will start appreciating the value it delivers. It might very well be your first plot for data visualization.

Before going into any explanation, let us plot the ECDF of the petal_length values for the different species right away.

Code link : here

In the above, we plotted three Empirical Cumulative Distribution Functions (ECDFs) from the petal_length data. Each dot represents one petal length value.

Here, the word ‘empirical’ means that it is based on data. For any given value, the Cumulative Distribution Function (CDF) shows the fraction of data points that are equal to or below that value.

Let us illustrate how we can read an ECDF plot.

Question: What percentage of Iris virginica petal length is 6 cm or less?

Answer: Find the ECDF for Iris virginica. Look for the X axis value 6 cm for that ECDF. Now look for the corresponding Y axis value. We see that the Y axis value is slightly above 0.8. So for our dataset, approximately 82% of Iris virginica petal lengths are equal to or less than 6 cm.

We can read the min/max, range and quartiles (0.25,0.5 and 0.75 values in the Y axis) from the ECDF plots. However, the ECDF plot is not a summary plot. It plots all the data points along with the summary values. An ECDF plot can be more effective than a histogram plot since there is no binning bias.
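Computing an ECDF takes only a sort and a running fraction. Below is a minimal sketch; the function names and the ten values are made up for illustration (they are not the Iris data).

```python
def ecdf(values):
    """Return sorted x values and the matching cumulative fractions y."""
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys

def fraction_at_or_below(values, threshold):
    """Fraction of observations that are <= threshold."""
    return sum(v <= threshold for v in values) / len(values)

# Made-up values in cm.
values = [4.5, 5.1, 5.6, 5.8, 6.0, 6.1, 6.3, 6.7, 5.0, 5.5]
xs, ys = ecdf(values)
share = fraction_at_or_below(values, 6.0)  # 7 of 10 values, so 0.7
print(list(zip(xs, ys)))
print(share)
```

Plotting ys against xs as dots gives the ECDF plot. Reading the plot at x = 6.0 corresponds to calling fraction_at_or_below(values, 6.0).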

Here is a link where you can see other advantages of ECDF plot: Link

7. Pair Plot

A pair plot shows pairwise relationships in a data set. An example is shown below.

Code link : here

In the above, the pair plot is showing:

  • Scatter plots (showing the relationship between two variables) for the off-diagonal plots
  • Histograms for the diagonal plots

For other variations of the pair plot see here.

Quantifying the Correlation between Two Variables

Correlation measures the degree to which two variables are linearly related. Pearson’s correlation coefficient quantifies this with a number between -1 and +1.

  • Correlation coefficient 0 means that there is no linear relationship between the variables
  • Correlation coefficient -1 means that the data points lie exactly on a straight line with a negative slope
  • Correlation coefficient +1 means that the data points lie exactly on a straight line with a positive slope
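Pearson's coefficient can be computed directly from its definition. The sketch below uses a hand-written helper, pearson_r, and made-up data. All three data sets are perfectly linear, so the coefficient has magnitude 1 in each case regardless of how steep the line is.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_steep = pearson_r(x, [10 * v for v in x])          # line with slope 10
r_shallow = pearson_r(x, [0.1 * v for v in x])       # line with slope 0.1
r_negative = pearson_r(x, [-3 * v + 2 for v in x])   # line with slope -3
print(r_steep, r_shallow, r_negative)
```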

Here is a heat map of the Pearson correlation coefficients between the Iris variables:

Code link : here

Variables with coefficient values close to +1 or -1 are highly correlated. For example, the petal lengths and the petal widths have a high positive correlation coefficient (0.96). So across the data set, flowers with longer petals tend to have proportionately wider petals, and vice versa.

Below is a summary of all the plots that we discussed.

Summary of the plots

Normal Distribution

Sometimes data can have a bell-shaped (unimodal) distribution with no skew to the left or to the right, even though the data is non-uniform. This type of distribution is known as the Normal Distribution.

Here are some facts about normal distribution:

  1. The distribution is symmetric: its mean, median and mode have the same value. Half of the data will be less than the mean while the other half will be more than the mean.
  2. If for a normal distribution, the mean = 0, and the standard deviation = 1, then it is called the standard normal distribution.
  3. Many natural processes follow the normal distribution which makes this distribution so popular. Because the distribution is so commonplace, it is called “normal”.
  4. The distribution is also referred to as the Gaussian distribution.

For a normal distribution:

  • About 68% of all data values will fall within +/- 1 standard deviation of the mean.
  • About 95% of all data values will fall within +/- 2 standard deviations of the mean.
  • About 99.7% of all data values will fall within +/- 3 standard deviations of the mean.

This ‘68–95–99.7’ pattern is known as the Empirical Rule for normal distributions. Here is a visual for that (in the plot below, the symbol µ denotes the mean of the distribution).

Pic ref: https://www.softschools.com/math/probability_and_statistics/the_normal_distribution_empirical_rule/
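The three percentages can be reproduced from the normal CDF. This sketch uses NormalDist from Python's statistics module (available from Python 3.8); the helper share_within is our own name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1 = share_within(1)  # about 0.68
within_2 = share_within(2)  # about 0.95
within_3 = share_within(3)  # about 0.997
print(within_1, within_2, within_3)
```

The same shares hold for any mean and standard deviation, because the intervals are measured in units of the standard deviation.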

Comparison between the Normal CDF and the ECDF of the Iris data

The last topic will be a comparison between the Normal CDF and the ECDF for the petal_length variable of iris_virginica.

Look at the ECDF section above for more explanation about the ECDF plot in general.

What is the difference between Normal CDF and ECDF?

The ECDF (Empirical CDF) is drawn from the data, whereas the normal CDF is generated from a theoretical Normal distribution.

We will overlay the theoretical normal CDF with the ECDF for the petal_length of iris_virginica. This comparison gives an idea about how our sample is distributed compared to a theoretical Normal Distribution with the same mean and standard deviation as the petal_length of iris_virginica. In other words, we will check if a normal distribution can serve as a good approximation for our data distribution.

First let us visualize the CDF for a normal distribution.

In the above plot, the mean equals 5.552 and the standard deviation equals 0.55. Here, we plotted a normal CDF with this mean and standard deviation, both computed from the data. The curve reaches 0.5 at the mean 5.552, reflecting the symmetry of the underlying normal distribution about its mean.

Now we will plot the ECDF of the petal_length for iris_virginica superimposed on the previous plot.

Code link : here

In the above plot, the blue curve is the normal CDF that we plotted before, the red dots are for the ECDF of the petal_length for iris_virginica. Here, the ECDF seems to match very well with the normal CDF.
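The comparison can also be done numerically. The sketch below is not the code behind the plot. It uses a synthetic sample (50 random draws with the mean and standard deviation quoted above, not the actual Iris measurements), fits a normal distribution to it, and measures the largest vertical gap between the ECDF and the normal CDF at the observed points.

```python
import random
from statistics import NormalDist, mean, stdev

# Synthetic stand-in for the petal lengths (NOT the Iris measurements):
# 50 draws from a normal distribution with the mean and standard
# deviation quoted in the text. The seed only makes the run repeatable.
rng = random.Random(0)
sample = [rng.gauss(5.552, 0.55) for _ in range(50)]

# Theoretical normal with the sample's own mean and standard deviation.
fitted = NormalDist(mu=mean(sample), sigma=stdev(sample))

xs = sorted(sample)
ecdf_y = [(i + 1) / len(xs) for i in range(len(xs))]
normal_y = [fitted.cdf(x) for x in xs]

# Largest vertical gap between the two curves at the observed points.
max_gap = max(abs(e - t) for e, t in zip(ecdf_y, normal_y))
print(max_gap)
```

A small gap suggests that the normal distribution is a reasonable approximation for the sample. With real data, you would replace the synthetic list with the measured values.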

I hope you liked this article. Please feel free to reach out to me with any constructive feedback. Here are a few introductory statistics courses that I liked:
