# The ZeoTech Series: Estimation of Samples: Part I

In real-world scenarios, we often come across data with high sparsity and a large number of features with high categorical dimensionality. Due to this complexity, defining a distribution for the data is a challenging task. As the distribution is not a standard one, basic sampling techniques cannot be applied here.

We overcome this challenge by transforming this complex data into a lower-dimensional continuous space called *Embeddings*, followed by sampling techniques that approximate the data distribution.

Part I: Problems and solutions for real-world data

Part II: Generating samples

*Why should I be interested in dataset samples?*

- With the samples (a representation of the dataset), we can quickly quantify various estimates and simply extrapolate those numbers to the entire population.
- The samples are meant to capture the majority of the variance of the data. Hence, training any model on this sampled data would quickly reveal whether something is terribly wrong with the model's architecture.
- In the case of unsupervised learning, it's difficult to gather intuition in the form of correlations about the features. Hence, manually dropping features or inspecting samples helps gain meaningful data insights.
- Hyper-parameter tuning for AutoML requires training on efficient samples.

### PART I : Problems and solutions for real-world data

#### A) Problems associated with real-world data

First things first, the problems associated with widely sparse and high-dimensional data are immense. To make matters worse, categorical inputs with no ordinal relationships amongst themselves are also encountered. Let's call this data DEVIL (Data Everyone Vexes In Life) and learn more about it.

**1. Computational workload**

**High-dimensional input** poses various problems. To start with, the number of possible combinations of the variables, and with it the computational cost, grows exponentially with the number of dimensions. This results in an explosion of **computational workload**, as depicted in Fig. 1 below.
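As a rough illustration (the feature counts below are invented for the sketch, not taken from any real dataset), a few lines of Python show how quickly the space of possible inputs blows up:

```python
# Sketch: how fast the space of possible inputs explodes.
# Assumed setup: 10 categorical features, each with 50 possible values.
n_features = 10
cardinality = 50

# Every feature combination is a distinct cell the data could fall into:
combinations = cardinality ** n_features
print(f"possible inputs: {combinations:.2e}")

# One-hot encoding such an input yields a sparse vector of this width:
one_hot_width = n_features * cardinality
print(f"one-hot input width: {one_hot_width}")
```

With only 500 one-hot columns, the number of distinct cells already approaches 10^17 — no realistic dataset can populate that space densely.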

**2. Dimensional redundancy**

There might exist strong collinearity between features as in most cases, attributes are unlikely to be independent of one another. This introduces **dimensional redundancy**.
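A quick way to spot such redundancy is to check the correlation between feature pairs. A minimal sketch, using made-up feature values and a from-scratch Pearson coefficient:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two hypothetical features that tend to move together:
height = [150, 160, 170, 180, 190]
shoe   = [36, 39, 41, 44, 46]
r = pearson(height, shoe)
print(round(r, 3))  # close to 1.0 -> one feature is nearly redundant
```

A coefficient near ±1 means one of the two features carries almost no extra information.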

The problem of high dimensional data can be taken care of by feature engineering, but we’ll take an algorithmic approach to solve this issue.

**3. Fitting data into a standard distribution**

In higher dimensions (see Fig. 2), data is sparser and more flung out in the outer region of the hypercube than in its center. Hence, **defining a standard probability distribution** (i.e. a hypersphere) for such data (all points in the hypercube), one that would contain most of its points in or around the mean (say, a multivariate Gaussian distribution), gets more difficult as the dimension increases. See the standard probability distribution cheatsheet. As defining a probability distribution for this data is not straightforward, basic sampling techniques cannot be applied here.
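The hypersphere-in-hypercube picture can be made concrete with the closed-form volume of a d-ball; the fraction of the cube a centered "sphere-shaped" distribution can cover collapses with dimension. A small sketch (the dimensions chosen are arbitrary):

```python
import math

# Volume of the hypersphere inscribed in the unit hypercube, as a
# fraction of the cube's volume. V_d(r) = pi^(d/2) r^d / Gamma(d/2 + 1).
def sphere_fraction(d, r=0.5):
    """Volume of a d-ball of radius r, over the unit cube's volume (1)."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

for d in (2, 5, 10, 20):
    print(d, f"{sphere_fraction(d):.2e}")
```

By d = 20 the inscribed sphere holds under one part in ten million of the cube's volume — almost everything lives in the corners, far from the mean.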

*The three points above constitute what's known as the **Curse of Dimensionality**. Long story short, the common theme of these problems is that when the dimensionality increases, the **volume** of the space increases so fast that the available data becomes sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high-dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient. Reference: https://en.wikipedia.org/wiki/Curse_of_dimensionality*

**4. Ineffectiveness of similarity measures**

Referring to Fig. 2, if a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbour. If we approximate a distribution by a hypersphere inscribed in a hypercube, in high dimensions almost all the volume of the hypercube lies outside the hypersphere, i.e. the points essentially become uniformly distant from each other. So, **similarity measures** based on distance functions (like the Euclidean metric) are rendered **less effective**.
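This distance concentration is easy to simulate. The sketch below (point and dimension counts are arbitrary) draws uniform points in the unit hypercube and measures the contrast between the farthest and nearest pair:

```python
import math
import random

random.seed(0)

def distance_contrast(dim, n_points=200):
    """(max - min) / min over all pairwise Euclidean distances of points
    drawn uniformly from the unit hypercube; shrinks as dim grows."""
    pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q)
             for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 2))
```

As the dimension climbs, "nearest" and "farthest" neighbours become nearly indistinguishable, which is exactly why raw Euclidean distance stops being informative.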

Similarity measures are even less effective when dealing with **categorical variables** instead of continuous ones, as the value merely indicates the presence/absence of certain values and offers no insight into the degree of similarity. Combining multivariate features in a distance function further obscures which particular features differed.
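A tiny illustration with invented categorical records: a mismatch count can only jump in whole steps and says nothing about how similar two differing values actually are.

```python
# Hypothetical user records with purely categorical features:
a = {"country": "DE", "device": "ios",     "plan": "free"}
b = {"country": "DE", "device": "android", "plan": "free"}
c = {"country": "IN", "device": "android", "plan": "paid"}

def mismatches(x, y):
    """Hamming-style distance: 1 per differing feature, nothing in between."""
    return sum(x[k] != y[k] for k in x)

print(mismatches(a, b))  # 1
print(mismatches(a, c))  # 3
```

Whether "ios" vs "android" is a small difference or a large one is invisible to such a measure; an embedding can place the two device values at a learned distance instead.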

**5. Insignificance of joint probability distribution**

With high-dimensional features and sparse categorical data, there is a huge number of possible combinations of feature values, so the joint probability of most combinations is too low to gather any meaningful insight. Referring to Fig. 3, the **joint probability distribution** is expressed as a probability mass function, as our case is discrete.

Hence, it makes sense to describe our joint probability distribution in terms of continuous random variables instead of discretizing every variable. So, a continuous probability density function needs to be defined to get significant values.
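A toy simulation (feature counts and sample size invented) makes the sparsity of the joint mass function visible:

```python
import random
from collections import Counter

random.seed(1)

# Toy discrete dataset: 3 features with 10 values each gives 1000
# possible combinations, but we only observe 500 rows.
rows = [(random.randrange(10), random.randrange(10), random.randrange(10))
        for _ in range(500)]
joint = Counter(rows)  # empirical joint probability mass function (counts)

never_seen = 10 ** 3 - len(joint)
print(f"{never_seen} of 1000 combinations never observed")
print(f"largest joint probability: {max(joint.values()) / 500:.3f}")
```

Most cells of the joint mass function are empty or hold a probability too small to be meaningful, and real DEVIL-style data has vastly more than 1000 cells.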

**6. Random sampling limitations on complex data**

Simple random sampling can lead to **under-sampling** or **over-sampling** of data. This is because input data points are not all alike: a larger percentage of samples should be picked from the more frequent type of data points, and a smaller one from the less frequent types. Hence, class imbalance is another reason to move into continuous space, so that input data properties can be exploited (using embeddings, defined below) to pick samples in accordance with their prevalence.
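A small sketch contrasting the two (the population and sample sizes are invented); the stratified variant allocates each group a quota proportional to its share:

```python
import random
from collections import Counter

random.seed(42)

# Hypothetical imbalanced population: 95% of points are of type "a".
population = ["a"] * 950 + ["b"] * 50

# Simple random sampling may over- or under-represent the rare type "b":
srs = random.sample(population, 20)
print("simple random:", Counter(srs))

# Stratified sampling draws from each group in proportion to its size:
def stratified_sample(pop, n):
    sizes = Counter(pop)
    sample = []
    for label, size in sizes.items():
        k = round(n * size / len(pop))          # this stratum's quota
        group = [x for x in pop if x == label]
        sample += random.sample(group, k)
    return sample

print("stratified:   ", Counter(stratified_sample(population, 20)))
```

Here the strata are given by an explicit label; with DEVIL-style data, embeddings are what let us form comparable groups in the first place.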

Because of all the above-mentioned problems associated with high-dimensional and sparse categorical input, we compress the DEVIL into a **lower-dimensional continuous space** called *Embeddings*, which is explained in section B below.

#### B) Solutions to real-world data problems

An **embedding** is a relatively low-dimensional space into which you can translate high-dimensional vectors. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space, quantifiable with the help of a distance metric. Reference: https://en.wikipedia.org/wiki/Embedding

This transformation to continuous space, wherein two data points are comparable, is the basis of the definition of our sampling techniques. We shall read about it in detail in the next post. Let’s continue with embeddings now.

*Visualization of embeddings:*

*I. Tabular form*

In Fig. 5, we see that the left table, with categorical and continuous data, has 6 features, which are reduced/compressed into 3 continuous features in the latent space. The latent space consists of vectors of 3 dimensions: D1, D2 and D3.

*II. Embedding space*

Position (distance and direction) in the vector space can encode semantics in a good embedding. For example, the above visualizations of real embeddings show geometrical relationships that capture semantic relations like the relationship between a country and its capital.
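A toy illustration of this offset idea, with hand-picked 2-D vectors (these are invented for the sketch, not learned embeddings):

```python
# Hand-crafted toy "embeddings" where the country -> capital
# relationship is encoded as a constant offset in the vector space.
emb = {
    "France": (2.0, 1.0), "Paris": (2.0, 3.0),
    "Japan":  (5.0, 1.0), "Tokyo": (5.0, 3.0),
}

def sub(a, b):
    """Component-wise difference of two 2-D vectors."""
    return (a[0] - b[0], a[1] - b[1])

# The "capital minus country" offset is the same for both pairs:
print(sub(emb["Paris"], emb["France"]))  # (0.0, 2.0)
print(sub(emb["Tokyo"], emb["Japan"]))   # (0.0, 2.0)
```

In a well-trained embedding, such directions emerge from the data rather than being placed by hand, but the geometry they exploit is the same.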

*Embedding generation techniques for DEVIL:*

- Non-linear dimensionality reduction techniques: Auto-Encoder (github link), What are auto-encoders?
- Co-occurrence based models (GloVe architecture, github link)
- Linear dimensionality reduction techniques: PCA (unsupervised)
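As a minimal sketch of the linear option above, here is PCA via power iteration in plain Python, compressing 3 correlated (and hence redundant) synthetic features into 1. The data and sizes are invented for illustration; a real pipeline would use a library implementation.

```python
import random

random.seed(0)

# Synthetic redundant data: features 2 and 3 are noisy copies of feature 1.
data = []
for _ in range(300):
    x1 = random.gauss(0, 1)
    data.append([x1, x1 + random.gauss(0, 0.1), -x1 + random.gauss(0, 0.1)])

d = len(data[0])
n = len(data)

# Center each feature, then form the sample covariance matrix.
means = [sum(row[j] for row in data) / n for j in range(d)]
X = [[row[j] - means[j] for j in range(d)] for row in data]
cov = [[sum(r[i] * r[j] for r in X) / (n - 1) for j in range(d)]
       for i in range(d)]

# Power iteration converges to the leading eigenvector, i.e. the
# direction of maximum variance (the top principal component).
v = [1.0] * d
for _ in range(100):
    w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    norm = sum(x * x for x in w) ** 0.5
    v = [x / norm for x in w]

# The 1-D "embedding" of each row is its projection onto that direction.
embedding = [sum(r[j] * v[j] for j in range(d)) for r in X]
print(len(embedding), "rows compressed from 3 features to 1")
```

Because the three features are nearly collinear, one principal component captures almost all of the variance — the same dimensional redundancy discussed in Part A is what makes the compression nearly lossless.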

To learn more about embeddings, check this out.

In summary, we learned about the problems linked with high dimensional, sparse and categorical input and how we overcome these challenges by converting them into a lower dimensional continuous space called embeddings. These embeddings are then going to help us in our pursuit of sampling.

Now, proceed to the second **post** in this series to learn about various sampling techniques.

**ABOUT THE AUTHOR**

Shreya Jain is a Senior Data Scientist at zeotap. Based out of Bangalore, Shreya's core areas of work include Machine Learning modeling, Probabilistic modeling, Neural Networks and Recommendation systems. After her studies at Birla Institute of Technology and Science Pilani, she worked at Samsung Electronics as a Software Engineer for two years before deciding to embark on an exciting data journey with zeotap!

*Originally published at **medium.com** on December 13, 2018.*