(Even) better histograms for noisy data

In 1998, Jeffrey Scargle invented an algorithm to perform optimal binning for photon counting data in gamma-ray observations. He named the algorithm Bayesian Blocks and published an improved version of it in 2012. Soon afterward, Jake Vanderplas implemented it in Python using a dynamic programming solution. Vanderplas has since developed a comprehensive implementation of Bayesian Blocks in his astronomy machine learning package, astroML.

Vanderplas illustrates the benefits of Bayesian Blocks in the astroML package

However, this implementation performs poorly on one pathological type of data: data that contain highly repeated values. Although the astroML implementation of Bayesian Blocks handles small amounts of repetition, it fails when a value's multiplicity is on the same order as the total size of the data. …
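To make "multiplicity on the same order as the total size of the data" concrete, here is a minimal sketch that tallies how often each distinct value occurs and what share of the sample the most common value takes up. The helper name `multiplicity_summary` and the sample values are hypothetical, chosen only for illustration; this is not part of astroML and does not reproduce how astroML handles repeats.

```python
from collections import Counter


def multiplicity_summary(data):
    """Return (sorted distinct values, their counts, largest count / len(data)).

    Hypothetical helper for illustration only; not an astroML function.
    """
    if not data:
        raise ValueError("data must not be empty")
    counts = Counter(data)
    values = sorted(counts)
    multiplicities = [counts[v] for v in values]
    top_share = max(multiplicities) / len(data)
    return values, multiplicities, top_share


# Made-up sample: a single value accounts for 6 of the 10 observations.
sample = [0.0] * 6 + [0.3, 0.7, 1.2, 2.5]
values, multiplicities, top_share = multiplicity_summary(sample)

print(values)          # distinct values, sorted
print(multiplicities)  # how many times each distinct value occurs
print(top_share)       # fraction of the sample taken by the most repeated value
```

A `top_share` close to 1 is the regime the paragraph above describes; a value near `1 / len(data)` means the sample has essentially no repetition.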

About

Jan Florjanczyk

Senior Data Scientist @Netflix
