Sampling for Deep Learning with BatchFlow.Sampler

Alexander Koryagin
Data Analysis Center
Mar 12, 2021


What is Sampler and why do we need it?

Probabilistic sampling is an important part of deep learning algorithms. For one, it allows for convenient patch generation from large 2D items (high-resolution images) and 3D items (for instance, MRI and CT data), which is essential, as it leads to larger batches and training-set augmentation. Moreover, sampling from probability distributions is necessary in a large number of probabilistic models, for instance, variational autoencoders.

In this article you will learn how to solve deep-learning-related sampling problems using Sampler from the BatchFlow framework.

Generally speaking, the standard libraries used for data science, NumPy and SciPy, already have some sampling capabilities. Both of these frameworks let you sample from a wide variety of standard distributions, including the normal, exponential and uniform distributions:
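For illustration, here is roughly what such draws look like in plain NumPy and SciPy (the concrete parameters are arbitrary):

    import numpy as np
    from scipy import stats

    # NumPy: direct draws from standard distributions
    normal_points = np.random.normal(loc=0.0, scale=1.0, size=1000)
    uniform_points = np.random.uniform(low=0.0, high=1.0, size=1000)

    # SciPy: distribution objects expose an explicit .rvs sampling method
    exponential_points = stats.expon.rvs(scale=1.0, size=1000)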

Often, though, these distributions need to be combined, mixed, and transformed to be of use in data-science pipelines. That’s where Sampler from the BatchFlow library comes in.

To understand the usefulness of Sampler, imagine you have to train a neural network to detect geobodies, seismic horizons in the example below, on large seismic cubes. Clearly, it is not possible to fit whole cubes into GPU memory, so you need small patches of the cubes for training. When sampling purely at random, it is impossible to control the proportion of positive and negative training examples, and, depending on the size of the geobodies, there is a good chance that the resulting batches won’t contain any examples with geobodies at all.

This means you want good control over the proportion of positive and negative examples in each batch. Suppose also that the right third of a training cube is reserved for testing. On top of that, you need to focus on the central part of the remaining two-thirds of the cube in a “Gaussian” way: as geophysicists have told you, this part of the cube is just more difficult than the other regions.

Using Sampler, you can easily implement this procedure in a small and self-explanatory piece of code:
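A sketch of this procedure is given below; horizon_points (an (N, 2) array of horizon coordinates), cube_shape and the concrete weights are hypothetical placeholders, and the exact truncate signature, with an expr applied to the whole array of sampled points, is an assumption based on the Sampler tutorial:

    import numpy as np
    from batchflow.sampler import NumpySampler as NS, HistoSampler as HS

    # histogram approximation to the horizon point cloud -- our region of interest
    horizon_sampler = HS(np.histogramdd(horizon_points, bins=100))

    # a 2D Gaussian (direct product of two 1D ones), scaled to cube
    # coordinates and shifted to the center of the cube
    center, spread = np.array(cube_shape) / 2, np.array(cube_shape) / 6
    cube_gaussian = (NS('n') & NS('n')) * spread + center

    # mixture: ~80% of points near the horizon, ~20% elsewhere in the cube
    sampler = horizon_sampler & 0.8 | cube_gaussian & 0.2

    # keep the left two-thirds of the cube; the right third is reserved for test
    sampler = sampler.truncate(low=0, high=2 * cube_shape[0] / 3,
                               expr=lambda points: points[:, 0])

    anchors = sampler.sample(64)    # 64 anchor points for training crops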

First of all, we make a HistoSampler object that generates points from a seismic horizon, our region of interest. After some scaling and rescaling operations, a two-dimensional Gaussian is conditioned on the horizon. The next step is to mix our geobody sampler with a second Gaussian distribution over the cube to generate some crops that do not contain the seismic horizon. Importantly, during this step you have complete control over the weight of the Gaussian component in the mixture. Finally, we truncate the sampler to the left two-thirds of the cube, leaving the remaining one-third for testing.

Alternatively, using Sampler, you can easily generate points from probability distributions on arbitrarily complex domains:
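For instance, here is a sketch that restricts a two-dimensional Gaussian to a ring-shaped domain (the radii are arbitrary, and the expr semantics are assumed as above):

    import numpy as np
    import matplotlib.pyplot as plt
    from batchflow.sampler import NumpySampler as NS

    # 2D standard Gaussian as a direct product of two 1D Gaussians
    gaussian_2d = NS('n') & NS('n')

    # keep only points inside the ring 1 <= sqrt(x^2 + y^2) <= 2
    ring = gaussian_2d.truncate(
        low=1, high=2,
        expr=lambda points: np.sqrt((points ** 2).sum(axis=1)))

    # a 2D histogram of the sampled points shows the ring shape
    points = ring.sample(100000)
    plt.hist2d(points[:, 0], points[:, 1], bins=100)
    plt.show()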


By the end of this article you’ll be able to use BatchFlow.Sampler to create complex sampling procedures for deep learning.

Making and using sampler-instances

Making basic sampler-building blocks

The first type of basic sampler is an instance of the class NumpySampler. These building blocks derive their sampling capabilities from NumPy.

We now use the NumpySampler class to make three sampling objects that generate points from (i) the standard normal distribution, (ii) the uniform distribution and (iii) the exponential distribution:
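A minimal version of this step (the distribution names mirror those of numpy.random):

    from batchflow.sampler import NumpySampler

    normal = NumpySampler('normal')            # standard normal distribution
    uniform = NumpySampler('uniform')          # uniform on [0, 1)
    exponential = NumpySampler('exponential')  # exponential with scale 1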

We can also use convenient aliases for each of these samplers:
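Assuming the single-letter aliases from the Sampler tutorial, the same three samplers become:

    from batchflow.sampler import NS   # NS is a short alias for NumpySampler

    normal = NS('n')       # 'n' for 'normal'
    uniform = NS('u')      # 'u' for 'uniform'
    exponential = NS('e')  # 'e' for 'exponential'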

Each sampler has a .sample method that generates the requested number of points. Note that the second dimension of the generated arrays reflects the dimensionality of the sampler:
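For instance, sampling five points from the normal sampler defined above:

    points = normal.sample(5)
    points.shape   # (5, 1): five points from a one-dimensional sampler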

However, a better way to see what each sampler represents is to generate a lot of points and plot a 1D or 2D histogram of the sampled set:
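A sketch of such a check with matplotlib:

    import matplotlib.pyplot as plt

    # 1D histogram of a large sample from the exponential sampler
    plt.hist(exponential.sample(100000)[:, 0], bins=100)
    plt.show()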

Most commonly, objects of interest, be it geobodies on seismic cubes, tumors on MRI scans or trees on aerial photographs, are labelled by clouds of points. Sampling from a histogram approximation to a cloud of points is easy with HistoSampler:
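A sketch, with cloud standing in for a real (N, 2) array of labelled points:

    import numpy as np
    from batchflow.sampler import HistoSampler as HS

    # step 1: histogram approximation to the cloud of points
    histo = np.histogramdd(cloud, bins=10)   # a (counts, bin_edges) tuple

    # step 2: initialise the sampler with the histogram
    cloud_sampler = HS(histo)
    new_points = cloud_sampler.sample(1000)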

Creating a HistoSampler instance takes two steps: making a histogram approximation to the cloud of points with numpy.histogramdd and initialising the HistoSampler instance with it.

While it is nice to be able to produce points from plain-vanilla and dataset-related distributions, the power of BatchFlow.Sampler lies in its ability to combine and modify existing samplers in any imaginable way.

Combining and modifying basic samplers: the algebra of samplers

Sometimes one needs to concentrate a sampler on a specific part of an image. To do so, one must scale a vanilla distribution to pixel coordinates and shift it to the interesting part of the image. Luckily, samplers support arithmetic operations with each other and with NumPy arrays:
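For instance, concentrating a one-dimensional normal sampler around pixel 700 of a hypothetical image axis (the scale and shift are arbitrary):

    from batchflow.sampler import NS

    original = NS('n')              # standard normal
    modified = NS('n') * 50 + 700   # std 50, centered at pixel 700

Arithmetic with NumPy arrays works analogously for multidimensional samplers, presumably coordinate by coordinate.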

Let’s visualise the effect of these operations by plotting histograms of the original and modified samplers on the same plot:
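A sketch of such a comparison for the samplers defined above:

    import matplotlib.pyplot as plt

    plt.hist(original.sample(100000)[:, 0], bins=100, alpha=0.5, label='original')
    plt.hist(modified.sample(100000)[:, 0], bins=100, alpha=0.5, label='modified')
    plt.legend()
    plt.show()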

Imagine that there are several domains of interest in a picture. In this case you may want to produce patches from each domain with a chosen probability. Or, mathematically speaking, you want to generate points from a mixture of distributions. A pair of sampler objects can easily be mixed together: you just need the ‘|’ operator for mixing and the ‘&’ operator for giving a sampler a smaller or larger weight in the mixture:
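For instance, a 70/30 mixture of two Gaussians centered on different image regions (the centers and weights are arbitrary; attaching a weight via sampler & number follows the Sampler tutorial):

    from batchflow.sampler import NS

    left = NS('n') * 30 + 250    # Gaussian around pixel 250
    right = NS('n') * 30 + 750   # Gaussian around pixel 750

    # '&' attaches a weight to each component, '|' mixes them
    mixture = left & 0.7 | right & 0.3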

Another way to combine samplers is to take their direct product. Given two one-dimensional samplers, this operation produces a two-dimensional point generator:
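Here the product is written with the ‘&’ operator, which is assumed (following the tutorial) to concatenate coordinates when both operands are samplers, as opposed to setting a weight when one operand is a number:

    from batchflow.sampler import NS

    # the first coordinate is normal, the second is exponential
    product = NS('n') & NS('e')
    product.sample(3).shape   # (3, 2)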

Sometimes one wants to sample points from a small neighbourhood of an object of interest rather than from the object itself, for instance, from the neighbourhood of a tumorous object on an MRI scan. In this case it is natural to use probability distributions shifted to the center of the object of interest. Yet classical probability distributions, by and large, have unbounded support, while the sampling domain itself is limited, be it an image or an MRI scan. Here the operation of truncating a sampler to a domain comes in handy. All that is needed is that the domain can be described by an algebraic expression of the form low ≤ expr(x) ≤ high:
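A sketch of truncating a Gaussian to a hypothetical 256 x 256 image; when expr is omitted, the bounds are assumed to apply coordinate-wise to the points themselves:

    import numpy as np
    from batchflow.sampler import NS

    # 2D Gaussian around the image center, cut to the image boundaries
    center_gaussian = (NS('n') & NS('n')) * 40 + 128
    inside_image = center_gaussian.truncate(low=np.array([0, 0]),
                                            high=np.array([255, 255]))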

Going further, one can apply any transformation to a sampler. For instance, one can take a Gaussian sampler truncated to an image and make it sample only integer coordinates for further patch sampling:
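Continuing the previous sketch (the apply method is assumed to wrap a sampler with an arbitrary transformation of the sampled points):

    import numpy as np

    # round every sampled point down to integer pixel coordinates
    pixel_sampler = inside_image.apply(
        lambda points: np.floor(points).astype(np.int32))
    pixel_sampler.sample(4)   # four integer anchors for patch sampling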

You now know how to make samplers that define complex distributions supported on nontrivial shapes.

Conclusion

In this article we’ve covered how to create basic sampler instances from point clouds and standard probability distributions. We’ve also learned how these objects can be mixed together, truncated and arithmetically combined to solve difficult sampling tasks in deep learning, for instance, patch sampling for horizon detection on seismic cubes.

Hopefully, you can now solve any probabilistic sampling task that arises in deep learning. For more information, take a look at the Sampler tutorial and subscribe to our blog.

