Bootstrapping a Dense Probability Distribution

From a sparse distribution.

NTTP
Operations Research Bit
Mar 25, 2024


A dense distribution of buildings. Photo by Manson Yim on Unsplash

There are various articles on Medium and elsewhere describing how to identify or classify a probability distribution, and covering the common types of analytical distributions. But what if your data doesn't fit any of these analytic forms? What do you do? You can use kernel methods to smooth a probability distribution from coarse data. But what if you want to use the trick (or method) of the empirical distribution, which we put to good use in our MCarloRisk3D apps? In those apps, we don't assume any distribution shape. We just take daily return data as-is and resample from it before projecting it into the future with Monte Carlo methods. Sampling from a kernel-smoothed distribution is certainly an option. But in our MCR3D applications, since we don't assume a shape and don't use kernel methods, we don't assume any analytic form at all (note that kernel methods typically model the raw data distribution as a weighted sum of analytic forms). As a bonus, we then don't have to determine the parameters of that non-existent analytic form.

Origin of this method

When working with MCR3D, we noticed that the resulting forecasted probability distributions were much smoother than our raw input point distribution (of daily returns). Then we did the following thought experiment: What if we were to “project” the data into the future with a time interval of zero days? Perhaps we could get a more dense distribution of data that resembles our raw sparse data set.

In essence, the procedure described here bootstraps a dense distribution from whatever sparse sample data we have available. Bootstrapping is short for "bootstrap resampling," which is just a sampling process "with replacement" from our original data set, followed by a bit of processing on each sample (each sample being an array or set of data). "With replacement" just means that the same original data point can appear in the output set zero, one, or more than one time. This resampling can be useful when data points are expensive and hard to obtain, say, from real physical tests run on hardware, especially destructive tests.

Without further ado, we start with the code. It is written for Node.js with no libraries required. Output goes to the console, from which you can copy/paste it to a file and then into your favorite stat package for analysis. It should be easy to recode in Python or any other language, since we only use basic features of JavaScript.

// just a mean function so we don't need any libraries
let mean = (array) => {
  let sum = 0
  for (let i = 0; i < array.length; i++) {
    sum += array[i]
  }
  return sum / array.length
}

// put your data here or read it into this array from a file
let rawDistribution = [1, 5, 3, 6, 2, 1, 7, 6, 5, 3, 10]

let meanRaw = mean(rawDistribution)

// you may want to increase this
let nBootstrapTrials = 1000

let sampleMeanArray = []
let augmentedSample = []

// lower = smoother results = fewer peaks or modes in output distribution
// lower values here will also spread out the augmented distribution more
// min = 1, max = ???
const bootPerSample = rawDistribution.length * 3

for (let k = 0; k < nBootstrapTrials; k++) {
  let bootstrapSample = []
  // draw bootPerSample points (with replacement) from the raw data
  for (let i = 0; i < bootPerSample; i++) {
    let sampleIndex = Math.floor(Math.random() * rawDistribution.length)
    bootstrapSample.push(rawDistribution[sampleIndex])
  }

  // future: insert distribution metrics computation per bootstrapSample here

  // storing every sample mean is only needed if we also want a bootstrap
  // estimate of the mean; the latest value is used below to compute meanDelta
  sampleMeanArray.push(mean(bootstrapSample))
  let meanDelta = sampleMeanArray[sampleMeanArray.length - 1] - meanRaw

  // add meanDelta to each value in our original data set to make a new
  // batch of augmented values
  for (let m = 0; m < rawDistribution.length; m++) {
    augmentedSample.push(rawDistribution[m] + meanDelta)
  }
}

// flip to true in case you want to output the raw data as well
if (false)
  for (let i = 0; i < rawDistribution.length; i++)
    console.log("IN ", i, rawDistribution[i])

for (let i = 0; i < augmentedSample.length; i++)
  console.log("OUT ", i, augmentedSample[i])

[Figure: histogram of the raw sparse data]

[Figure: input data stats]

[Figure: augmented data with bootPerSample as in the code above, rawDistribution.length * 3]

[Figure: augmented data stats; they are quite close to those of the original 11 sample case]

[Figure: augmented data with bootPerSample = 4; note the wider min/max spread and more smoothing over individual peaks or modes]

[Figure: output stats for bootPerSample = 4; stdev is larger, kurtosis (tail fatness) is closer to zero, and the 5th to 95th percentile range is wider]

A key variable is bootPerSample. If we make this higher, results seem to match the data better, to the point of maintaining peaks where they were in the original data set. This may not be a good idea if our data is really sparse, because the resulting distribution will have a (smoothed) peak at every data point, and those "peaks" may just be individual samples from some larger unknown distribution shape. If we set bootPerSample lower, the augmented data is spread out more and the probability density function is smoothed more (fewer narrow peaks). The latter effect is similar to using a wider kernel width in a kernel smoothing method.
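To get a feel for this spread effect, here is a rough sketch (our addition, not part of the original script) that wraps the augmentation loop in a hypothetical augment() function and compares the min/max spread of the output for a few bootPerSample values. It assumes the mean() helper and rawDistribution array defined above, so it can simply be appended to the script.

// wrap the augmentation in a function so bootPerSample can be varied
let augment = (data, bootPerSample, nTrials) => {
  let out = []
  let baseMean = mean(data)
  for (let k = 0; k < nTrials; k++) {
    let sample = []
    for (let i = 0; i < bootPerSample; i++) {
      sample.push(data[Math.floor(Math.random() * data.length)])
    }
    let meanDelta = mean(sample) - baseMean
    for (let m = 0; m < data.length; m++) {
      out.push(data[m] + meanDelta)
    }
  }
  return out
}

// lower bootPerSample => larger mean sampling error per trial => wider spread
for (let b of [4, 11, 33]) {
  let aug = augment(rawDistribution, b, 1000)
  console.log("bootPerSample =", b, "min =", Math.min(...aug), "max =", Math.max(...aug))
}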

It may be useful to tune or optimize bootPerSample in some way, but this is likely problem specific. For example, suppose we suspect that our data distribution should have only one peak. Then we could lower bootPerSample until the multiple peaks get smoothed over, leaving a single peak instead of artificial peaks at each sample point. Setting bootPerSample to 1 would be a useful starting point for this single-peak quest; a crude way to count the peaks is sketched below.
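One crude way to automate that single-peak check (again our own sketch, not from the original script) is to histogram the augmented data and count local maxima. It assumes the augmentedSample array from the script above, and the bin count nBins is an arbitrary choice that will affect the answer.

// crude peak counter: histogram the data and count local maxima
// nBins is an arbitrary choice; too many bins will show spurious peaks
let countPeaks = (data, nBins) => {
  let lo = Math.min(...data)
  let hi = Math.max(...data)
  let width = (hi - lo) / nBins || 1   // guard against zero-width bins
  let counts = new Array(nBins).fill(0)
  for (let x of data) {
    let bin = Math.min(nBins - 1, Math.floor((x - lo) / width))
    counts[bin]++
  }
  let peaks = 0
  for (let i = 0; i < nBins; i++) {
    let left = i === 0 ? 0 : counts[i - 1]
    let right = i === nBins - 1 ? 0 : counts[i + 1]
    if (counts[i] > 0 && counts[i] > left && counts[i] >= right) peaks++
  }
  return peaks
}

console.log("peaks in augmented data:", countPeaks(augmentedSample, 20))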

You can also check the statistical properties of the augmented distribution to see how they compare to those of the original data. We put some snapshots of these above so you can get a feel for what is going on. Those stats were computed with the free Gretl software.
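If you would rather stay in Node.js than export to a stats package, a minimal sketch of the moment computations might look like the following. It assumes rawDistribution and augmentedSample from the script above, and uses plain population-style moment formulas, so it will not match Gretl's small-sample corrections exactly.

// basic moment statistics for a data array (population-style estimates)
let stats = (data) => {
  let n = data.length
  let m = data.reduce((a, b) => a + b, 0) / n
  let m2 = 0, m3 = 0, m4 = 0
  for (let x of data) {
    let d = x - m
    m2 += d * d
    m3 += d * d * d
    m4 += d * d * d * d
  }
  m2 /= n; m3 /= n; m4 /= n
  return {
    mean: m,
    stdev: Math.sqrt(m2),
    skewness: m3 / Math.pow(m2, 1.5),
    excessKurtosis: m4 / (m2 * m2) - 3
  }
}

console.log("raw:      ", stats(rawDistribution))
console.log("augmented:", stats(augmentedSample))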

Correctness?

Which of these augmented distributions is "more correct"? That is a problem-dependent question we cannot answer offhand; there may be a good universal answer, but we do not know it at this writing. Yes, the distribution parameters (moments, etc.) of the denser (artificial) distributions differ from those of the 11 point initial sample. But the idea here is that our 11 point initial data set is a sparse sample from some larger population, or even from a larger sample. We are attempting to regenerate the population from the sample. Just because we can compute statistical parameters from a small sample doesn't mean the population parameters are identical to them. We present this resampling method as another tool in your toolbox for processing low count data sets. What you do with it is up to you.

Future metrics analysis and comparison

Bootstrapping is often used to get a range estimate of parameters such as mean, standard deviation, etc, of a sample distribution.

The Matlab documentation gives a good example of how to bootstrap the estimated range of a sample mean:

https://www.mathworks.com/help/stats/bootstrp.html

We could do this parameter range computation in the same loop as the distribution augment resampling and compare the parameter range estimates for the original data to those for the augmented data. This could give us some clue as to how to tune the bootPerSample value to best represent the data. We will do this as time permits, perhaps in another article. See the spot in the code above marked future: for where to put these computations.

At the point in the code marked // future:, you would then aggregate the per-sample computed metrics (mean, stdev, etc.) and build distributions of those parameters. (In this case, a "sample" is a whole array of data, not just one point.)
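As a rough sketch of that idea (ours, not part of the original script), we can sort the per-sample means already collected in sampleMeanArray and read off percentile bounds, giving a bootstrap range estimate of the mean similar in spirit to the Matlab bootstrp example. The same pattern would apply to any other metric computed and stored per bootstrapSample at the // future: marker.

// percentile helper (nearest-rank style, good enough for a sketch)
let percentile = (data, p) => {
  let sorted = [...data].sort((a, b) => a - b)
  let idx = Math.min(sorted.length - 1, Math.floor(p * sorted.length))
  return sorted[idx]
}

// rough 90% bootstrap range estimate of the sample mean
console.log("mean,  5th pct:", percentile(sampleMeanArray, 0.05))
console.log("mean, 95th pct:", percentile(sampleMeanArray, 0.95))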

Discrete data and multi-dimensional distributions

If your data needs to be at discrete levels, you could probably apply Math.round, Math.floor, or a related quantization to the output data, though we haven't tried this yet. A one-line sketch of that idea is below.
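A minimal sketch of that quantization step (our addition, and untested in the sense the article describes; it assumes the augmentedSample array from the script above):

// snap augmented values to integer levels; adjust the rounding or add a
// step size if your discrete levels are not integers
let quantizedSample = augmentedSample.map(x => Math.round(x))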

This method works with multi-dimensional distributions as well, as we demonstrate in our more formal white paper on this topic:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3681437

Summary

So there you have it: Bootstrapping dense data from sparse data. Use with caution, since the generated data is synthetic. But synthetic data may be of use! Maybe it is just as artificial to force your data into one of the common analytic forms? You’re the analyst, you be the judge!

Is this method similar to the pixel density upscaling procedures used on images and videos? Let us know what you think about this analogy in the comments!

Epilogue

And we leave you with a little rhyme:

You don’t need an analytical miracle
If you keep your data somewhat empirical.
