Why ECDF is better than a Histogram

Shoaibkhanz
convergeML
Published in
6 min readJan 20, 2019

Every exploratory analysis always has a histogram, we seek answers but histograms are less informative and biased. This may break the hearts of many who have histograms as their first weapon to perform EDA.

In this post, we will learn to draw a histogram and an ecdf using python, and then we will explore why ecdf is a better choice as a first visualization. We will use iris dataset to draw the histogram of setosa’s petal length.

Import all the required libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

Get the iris data and convert it into a dataframe.

data = load_iris()
iris = pd.DataFrame(data= np.c_[data['data'], data['target']],
columns= data['feature_names'] + ['target'])

Subset the data to setosa species and draw a histogram of its petal length with 6 bins.

#subsetting
setosa = iris[iris['target'] == 0]
#histogram
sns.set_style('whitegrid')
_ = plt.hist(setosa['petal length (cm)'], bins = 6)
plt.title('Histogram')
plt.xlabel('setosa petal length (cm)')
plt.ylabel('counts')
plt.show()

You can see the histogram from the code above. Please notice the structure of the histogram has, the bar heights are higher between the interval 1.3 to 1.7.

Imagine the same histogram with 5 bins and notice how the distribution has changed. The way you bin your data can alter the way to understand it. This is called Binning Bias.

What’s the Alternative?

ECDF stands for empirical cumulative distribution function, which you should use more often to understand your data. The algorithm to build an ecdf is illustrated in the code. The output of the custom ecdf function, when applied to setosa dataframe contains 2 arrays or vectors.

  • The 1st array(e.g. setosa_sort) contains the petal length in ascending order.
  • The 2nd array(e.g. setosa_percentiles) contains the percentiles for the respective value of petal length in setosa_sort.
#custom function for ecdf
def empirical_cdf(data):
percentiles = []
n = len(data)
sort_data = np.sort(data)

for i in np.arange(1,n+1):
p = i/n
percentiles.append(p)
return sort_data,percentiles
#use the function on the setosa iris data
setosa_sort, setosa_percentiles = empirical_cdf(setosa['petal length (cm)'])

We now use the setosa_sort and setosa_percentiles ndarrays (numpy n-dimensional arrays) to plot the ecdf, the sorted setosa data goes as the x-axis and their respective percentiles go as an input to the y-axis.

_ = plt.plot(setosa_sort,setosa_percentiles, label='Data',color = 'blue')
plt.title(r'$\bf{Empirical \ CDF}$')
plt.xlabel('setosa petal length (cm)')
plt.ylabel('percentiles')
plt.show()

The plot above is an ecdf for setosa petal length. We may observe from the plot that the median(50th percentile) is around 1.5 cms, histograms don't tell you that. We are going to make the above ecdf a little more detailed.

#data preparation for plotting, capturing important statistics
ymedian = np.empty(len(setosa))
ymedian.fill(0.5)
ymax = np.empty(len(setosa))
ymax.fill(np.max(setosa_percentiles))
ymin = np.empty(len(setosa))
ymin.fill(np.min(setosa_percentiles))
y25 = np.empty(len(setosa))
y25.fill(0.25)
y75 = np.empty(len(setosa))
y75.fill(0.75)
fig,ax = plt.subplots()# Plot the data
setosa_ecdf = ax.plot(setosa_sort,setosa_percentiles, label='Data',color = 'blue')
# Plot the lines
max_line = ax.plot(setosa_sort,ymax, label='Max', linestyle='--',color = '#001c58')
y75_line = ax.plot(setosa_sort,y75, label='75Percentile', linestyle='--',color = '#ff7a69')
med_line = ax.plot(setosa_sort,ymedian, label='Median', linestyle='--',color = '#fb6500')
y25_line = ax.plot(setosa_sort,y25, label='25Percentile', linestyle='--',color = '#4fb4b1')
min_line = ax.plot(setosa_sort,ymin, label='Min', linestyle='--',color = '#ffbb39')
#annotate lines
ax.annotate('Max/$100^{th}$ Percentile',xy = (1,0.96),color = '#001c58')
ax.annotate('$75^{th}$ Percentile',xy = (1,0.71),color = '#ff7a69')
ax.annotate('Median/$50^{th}$ Percentile',xy = (1,0.46),color = '#fb6500')
ax.annotate('$25^{th}$ Percentile',xy = (1,0.21),color = '#4fb4b1')
ax.annotate('Min/$1^{st}$ Percentile',xy = (1.3,0.04),color = '#ffbb39')
#final and important bits
plt.tight_layout()
plt.title(r'$\bf{Empirical \ CDF}$')
plt.xlabel('setosa petal length (cm)')
plt.ylabel('percentiles')
plt.show()

Observe, how we have plotted all the important statistics from the data. This sort of statistics you would have seen in a box and whisker plot.

Outliers in Histogram & ECDF

We are going to generate some random data with an outlier and visualize it on the histogram and on the ecdf.

#data with outlier
outlier_data = np.random.random(100)
outlier_data = np.append(outlier_data,4)
#plotting with an outlier
plt.figure(figsize=(18,10))
plt.subplot(1,2,1)
_ = plt.hist(outlier_data)
plt.xlabel('random values')
plt.ylabel('count')
plt.title(r'$\bf{Histogram \ with \ an \ Outlier}$')
plt.subplot(1,2,2)
d, p = empirical_cdf(outlier_data)
plt.plot(d,p)
plt.xlabel('random values')
plt.ylabel('percentiles')
plt.title(r'$\bf{ECDF \ with \ an \ Outlier}$')
plt.show()

You can see how the outlier can also be spotted in an ecdf as well. Since ecdf does not depend on bins, its structure doesn't change like histograms and it still captures the outliers. I have plotted ecdf to be a line, if we visualize them as points, you will clearly see that the only value after 0.77 is 4 (which is an outlier in this case).

Try yourself: use marker = 'o' to visualize the ecdf as points instead of lines.

Bimodal Data

We will generate bimodal data and then visualize it with a histogram and an ecdf. In fact, the reason the data is called bimodal is that it has 2(bi) peaks(modes). If you build a histogram you will see the 2 peaks, while the question is can we detect it in ecdf?

#generate bimodal data
bimodal = np.append(np.random.normal(0, 1, int(0.5 * 10000)),
np.random.normal(5, 1, int(0.7 * 10000)))
#histogram of bimodal data
plt.figure(figsize=(18,10))
plt.subplot(1,2,1)
_ = plt.hist(bimodal)
plt.xlabel('random normal values')
plt.ylabel('count')
plt.title(r'$\bf{Histogram \ of \ a \ Bimodal \ data}$')
#ECDF of a bimodal data
plt.subplot(1,2,2)
bm_d,bm_p = empirical_cdf(bimodal)
_ = plt.plot(bm_d,bm_p)
plt.xlabel('random normal values')
plt.ylabel('percentiles')
plt.title(r'$\bf{ECDF \ of \ a \ Bimodal \ data}$')
plt.annotate(r'$2^{nd}$'+'Curve --> 2 Peaks \n i.e. Bimodal Data',xy = (0.5,0.5),color = '#001c58')
plt.show()

We just saw how in ecdf we can detect bimodal data as well. I hope by this time you recognize it’s merit.

Just for fun, let us plot Versicolor, Virginica and Setosa species together and you will see how we can compare the distribution among different categories of species.

#get data for other iris species
versicolor = iris[iris['target'] == 1]
virginica = iris[iris['target'] == 2]
#plotting all species of iris
plt.figure(figsize=(18,10))
plt.subplot(1,2,1)
_ = plt.hist(setosa['petal length (cm)'], color = 'red',alpha = 0.4,label = 'setosa')
_ = plt.hist(versicolor['petal length (cm)'], color = 'blue',alpha = 0.4,label = 'versicolor')
_ = plt.hist(virginica['petal length (cm)'], color = 'green',alpha = 0.4,label = 'virginica')
plt.xlabel('random values')
plt.ylabel('count')
plt.title(r'$\bf{Histogram \ with \ all \ species}$')
plt.subplot(1,2,2)
d_set, p_set = empirical_cdf(setosa['petal length (cm)'])
d_ver, p_ver = empirical_cdf(versicolor['petal length (cm)'])
d_vir, p_vir = empirical_cdf(virginica['petal length (cm)'])
plt.plot(d_set,p_set, color = 'red', label = 'setosa')
plt.plot(d_ver,p_ver, color = 'blue', label = 'versicolor')
plt.plot(d_vir,p_vir, color = 'green', label = 'virginica')
plt.xlabel('random values')
plt.ylabel('percentiles')
plt.title(r'$\bf{ECDF \ with \ all \ species}$')
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

How is ECDF different from CDF?

CDF is a theoretical construct, while the empirical CDF is drawn from real data. I found few really good examples of ecdf vs cdf on StackOverflow, I will leave the link in case you are interested to know more.

Summary

I hope this blog post has helped you to understand ecdf and histogram better. Please comment, If you have any constructive feedback.

--

--