The ZeoTech Series: Estimation of Samples : Part II

In the previous post, we learnt why and how to transform our real world, high- dimensional-categorical data to a lower dimensional continuous space, ‘Embeddings’. In this post, we shall learn two ways to generate samples from these embeddings.

PART II : Generating samples

The idea of sampling methods is to be able to sample all such points in an appropriate number that best explains the variance in data, such that, it gives a close representation of input data. The two techniques defined below take embeddings of training data as input and outputs samples. Check the previous post to learn how we reached to embeddings from high dimensional categorical input.

1. RBF-based Sampling

With this sampling technique, we are going to exploit embeddings to our advantage, as embeddings can be used to attain a quantifiable value by calculating the Euclidean distance between two embedding vectors. This is then used to find similarity between two training data points. The basic idea is to pick more samples that are closer to the mean and move radially outwards.

Let’s dig into the specifics of RBF:

What is RBF ?

Radial Basis Function is a real-valued function, Φ, whose value only depends on the distance from some other point, c, called a center, so that:

Fig.1 Any function, phi, that satisfies this property is called a radial function. (Also used in neural networks to approximate functions in a linear combination)

The norm is usually Euclidean distance, although other distance functions are also possible. Reference:https://en.wikipedia.org/wiki/Radial_basis_function

What is Gaussian RBF ?

There exist multiple types of RBFs. Here’s Gaussian RBF, where x is the sample picked, μ is the mean (the c vector from above) and Φ is a defined function, check Fig. 2. Gaussian RBF is used to our advantage in the calculation of probability as the output of this function is contained between 0 and 1 for positive inputs. The output value represents the probability of acceptance of sample, x.

Fig. 2 Gaussian Radial Basis Function, β controls the width of the curve, shown below

This definition of probability calculation will be clearer by taking two boundary conditions. If the input (candidate sample, x) is equal to μ (from Fig. 2), then the output of this RBF will be 1, i.e., the probability of selection is 1. As the distance between the input and the prototype grows, the response falls off exponentially towards 0. This ensures more selection of data points that are closer to the mean.

Graph for RBF:

Fig. 3 The shape of the RBF response is a bell curve, as illustrated in the figure. Tuning values of β would determine the proportion of samples to be taken from a low probability region to a high probability region

Algorithm:

1. Calculate the mean (μ) of embeddings for all dimensions.
2. Pick a batch of samples and calculate the mean, μ(z), of embeddings of the batch.
3. Calculate the probability of selection from Gaussian rbf, refer Fig. 2. Taking μ from Step1 and x as μ(z) from Step2.
4. Accept/Reject the batch of samples based on the output value from Step 3 with Roulette wheel selection.

Notes :

  • Roulette wheel selection is an acceptance criterion in which a random value is uniformly sampled from a unit distance [0,1]. This value is then compared to the sample probability selection from Gaussian RBF. If the output of RBF (probability of acceptance) is greater than the random sample from, the candidate sample is selected.
  • Tweaking the values of β in RBF to get the desired ratio of high probability values vs low probability values is a challenge. Otherwise, it results in biased sampling, favoring more of high probability values.

2. Variant of Stratified Sampling

Stratification/Stratified sampling is the process of dividing members of the population into homogeneous subgroups (strata) before sampling. This is followed by applying simple random sampling or systematic sampling within each stratum. The objective is to improve the precision of the sample by reducing sampling errors. Reference:https://en.wikipedia.org/wiki/Stratified_sampling

Stratified Sampling:

Fig. 4 There are 4 strata with 3 members in each stratum. Now, the number of members in each stratum can be different, depending on the dividing policy. Sampling can then be carried out by random sampling a fixed percentage of data from each stratum.

Stratified sampling is performed on a latent space (i.e. the embedding space) and not directly on input data. Like the previous method, this sampling technique also exploits a property of embeddings, which is, attaining a quantifiable figure to find a similarity between two embedding vectors. The algorithm is defined in detail below.

Notations:

μ — Mean of embeddings

M — Number of strata points to be defined

N — Total number of input examples

f — Fraction of data to be taken from each strata

K — Number of samples desired

Algorithm:

1. Calculate mean (μ) of embeddings for all dimensions.
2. For every vector of embedding, calculate Euclidean distance of it from the μ. Check Fig. 5
3. Lay out these distances on a number line with minimum and maximum distances found from Step 2. as its extremes. Check Fig. 6
4. Divide this line of distance into M uniform buckets/strata.
5. Take a fraction, f, of samples from each strata (mini-samples)
6. Combine all such M mini-samples to form a big sample.
Fig. 5 Conversion from Embeddings (left) to distances (right) from μ.
Fig. 6 Distances number line, divided into M buckets of uniform length

Estimation of parameters:

Parameters N and μ are known.

How to define M and f ?

  1. First, fix on a number of samples desired, say, K < N.
  2. Now, fix on a value of the number of strata points, M. The bigger the M, the better the sampling. But as M increases, so does the computational workload. There’s a trade-off between accuracy and computation here.
  3. f*N ~ K. Considering this as an equation, we can calculate the value of f.

Convergence criteria for the above two sampling techniques: The mean(μ) and standard deviation(σ) of the embeddings of the sample-set and the entire data should be of the same range. The closer they are, the better the approximation.

In conclusion, a real-world high dimensional, sparse and categorical data is first transformed to a lower dimensional continuous space, embeddings. These embeddings are then taken as input for sampling techniques. RBF based sampling and variant of stratified sampling are two techniques that were discussed in this post.


Thanks for reading! If you have any feedback or questions, please leave a response below!

ABOUT THE AUTHOR

Shreya Jain is a Senior Data Scientist at zeotap. Based out of Bangalore, Shreya’s core areas of work include Machine Learning modelling, Probabilistic modelling, Neural Networks and Recommendation systems. After her studies at Birla Institute of Technology and Science Pilani, she worked at Samsung Electronics as a Software Engineer for two years before deciding to embark in an exciting data journey with zeotap!