Histograms and Kernel Density Estimates

Histograms

David Crompton
May 19, 2015

Histograms are a powerful tool for viewing univariate data and are most helpful when you want to know the overall shape of the data. They are used to answer questions such as: How spread out is my data? Are there any outliers? Are there any unusual features in my data set?

A histogram made using random data and 10 bins

Histograms like the one above are made by separating data into specified ranges or “bins”. We then simply plot the number of data points that fall into each bin. In the histogram above we can see that around 15 data points fell into the 0 bin.
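As a rough sketch of how such a plot could be produced (the data here is hypothetical random numbers, not the data from the figure):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data standing in for the random sample in the figure
rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=100)

# Split the data into 10 equal-width bins and count the points in each
counts, bin_edges, _ = plt.hist(data, bins=10, edgecolor="black")
plt.xlabel("value")
plt.ylabel("count")
plt.show()
```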

The same histogram as above only with 3 bins

One aspect we can change about a histogram is the number of bins. The histogram above uses the same data as the first, but with only 3 bins, and that alters the look of the histogram quite drastically. Other aspects we can change include the width of the bins and their alignment on the x-axis.
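A quick sketch of how the bin count, width and alignment might be controlled (the data and bin edges below are purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(size=100)

# Fewer bins: same data, much coarser picture
plt.hist(data, bins=3, edgecolor="black")

# Explicit bin edges control both the bin width and where the bins sit on the x-axis
edges = np.arange(-3.5, 4.0, 0.5)   # half-unit-wide bins starting at -3.5
plt.figure()
plt.hist(data, bins=edges, edgecolor="black")
plt.show()
```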

Two normalised histograms

Sometimes we want to show multiple histograms on the same plot. This gets tricky when the data sets contain different numbers of points. In that case we normalise both histograms, as in the plot above. Note how the y-axis is no longer a count, but a scale common to both data sets.
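A minimal sketch of putting two differently sized samples on a common scale (both samples here are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
small_sample = rng.normal(0, 1, size=50)      # hypothetical data sets of
large_sample = rng.normal(1, 1.5, size=500)   # very different sizes

# density=True rescales each histogram so its area is 1,
# putting both data sets on a common y-axis scale
plt.hist(small_sample, bins=20, density=True, alpha=0.5, label="small sample")
plt.hist(large_sample, bins=20, density=True, alpha=0.5, label="large sample")
plt.ylabel("normalised frequency")
plt.legend()
plt.show()
```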

Another way of plotting two histograms as a joint plot

We can also plot two histograms as a joint plot, like the one above. This shows us the data in three quantitative dimensions.
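One possible way to build such a joint plot is a 2D histogram, where the colour of each cell acts as the third dimension; the correlated data below is hypothetical:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(scale=0.8, size=1000)   # y loosely correlated with x

# Two dimensions come from x and y, the third from the count in each cell
plt.hist2d(x, y, bins=30, cmap="viridis")
plt.colorbar(label="count")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```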

Kernel Density Estimates (KDEs)

Kernel density estimates are relatively new compared to histograms and only became prominent in the 1990s thanks to increasing computing power. They are very much like histograms, but have two significant advantages. 1) No information is lost to “binning” as it is in histograms, which means a KDE is unique for a given bandwidth and kernel. 2) They are smoother, which makes them easier to feed back into a computer for further calculations.

KDEs work by placing a sharply peaked function (the kernel) on top of each data point on the x-axis.

Tick marks represent data points, with a Gaussian kernel placed on top of each tick

Using a Gaussian as the kernel, this results in something similar to the figure above, with the coloured tick marks representing the data points.

The sum of the gaussians from the figure above

To get the KDE we simply sum the values of all the Gaussians, which produces the plot above. This curve tells us the probability density of seeing a data point at any location along the x-axis.
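A rough sketch of that sum, assuming a Gaussian kernel and an arbitrary bandwidth of 0.4 (both the data and the bandwidth are made up for illustration):

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
data = rng.normal(size=30)
xs = np.linspace(data.min() - 2, data.max() + 2, 500)
bandwidth = 0.4   # assumed bandwidth, chosen for illustration only

# Place a Gaussian of width `bandwidth` on every data point and average them,
# so the total area under the resulting curve is 1
kde_by_hand = norm.pdf(xs[:, None], loc=data, scale=bandwidth).mean(axis=1)

plt.plot(xs, kde_by_hand, label="sum of Gaussians")
plt.plot(data, np.zeros_like(data), "|", markersize=15, label="data points")
plt.legend()
plt.show()
```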

Same data as above only with different bandwidths

We can set the bandwidth of the Gaussian to produce slightly different results in our plots. The figure above shows how the KDE changes with bandwidth. As a rule of thumb, a wider bandwidth suits smooth data sets, while a narrower bandwidth suits data sets with lots of wiggle.
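As an illustration, scipy’s gaussian_kde lets you scale the bandwidth through its bw_method argument; the data and the three values below are arbitrary:

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
data = rng.normal(size=100)
xs = np.linspace(-4, 4, 500)

# Smaller bw_method values follow the data more closely (more wiggle),
# larger values give a smoother curve
for bw in (0.1, 0.3, 1.0):
    kde = gaussian_kde(data, bw_method=bw)
    plt.plot(xs, kde(xs), label=f"bw_method={bw}")
plt.legend()
plt.show()
```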

Same data plotted with a variety of different kernels

We can also use kernels other than the Gaussian. The curves produced by the different kernels are all quite similar, so often the most convenient kernel is selected.
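As a sketch of trying several kernel shapes on the same data, scikit-learn’s KernelDensity supports a handful of kernels; the bandwidth and the particular kernels chosen here are just for illustration:

```python
import numpy as np
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
data = rng.normal(size=100)[:, None]       # sklearn expects a 2D array
xs = np.linspace(-4, 4, 500)[:, None]

# The same data estimated with three different kernel shapes
for kernel in ("gaussian", "tophat", "epanechnikov"):
    kde = KernelDensity(kernel=kernel, bandwidth=0.4).fit(data)
    plt.plot(xs[:, 0], np.exp(kde.score_samples(xs)), label=kernel)
plt.legend()
plt.show()
```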
