# Data Spectrometry or How to Preprocess your Data

This is the first article of a to-be series devoted to figuring out how to tackle a Machine Learning (ML) competition. A few months ago, MLJCUnito gained access to an international ML competition on climate change organized by the University of Toronto: www.projectx2020.com. The competition starts in September 2020, so we decided to gather some useful material on competition dynamics, such as Data Preprocessing and Feature Engineering, and convert it into a lecture-like format.

In this first part, we’d like to tell you about some practical tricks for making **gradient descent** work well; in particular, we’re going to delve into feature scaling. As an introduction, it seems reasonable to sketch an intuition for the concept of *scale*.

# Macro, Meso, Micro-scale in Science

As scientists, we are well aware of the effects of using a specific measurement tool to characterize some quantity and describe reality. As an ideal example, let's consider the **length scale**.

We can identify three different points of view: *microscopic*, *mesoscopic* and *macroscopic*, which are intimately related to the adopted length scale.

We usually deal with the *macroscopic scale* when the observer is in such a position (pretty far, in terms of distance) with respect to the object that she/he can describe its global characteristics. Instead, we refer to the *microscopic scale* when the observer is so close to the object that she/he can describe its atomistic details or elementary parts (e.g. molecules, atoms, quarks). Last but not least, we talk about the *mesoscopic scale* every time we are in between micro and macro.

These definitions are deliberately vague, since delineating a precise and neat explanation would be highly difficult and complex, and it’s actually far from our purposes.

On the other hand, this kind of introduction is quite useful: we should take a few minutes to think about the “active” role of the observer, and about the fact that, to be honest, every length scale has its own specific theory, i.e. there’s no global theory for a multi-scale description of a phenomenon.

# Scaling in Data Science

If our beloved observer (i.e. the scientist) has some kind of “privilege”, namely choosing the right measurement tool (which is nothing but choosing the right scale for describing a phenomenon), we can’t really say the same for a data scientist.

It’s a sort of paradox, but most of the time a data scientist has no control over data retrieval. Because of that, a data scientist is often left alone in front of the data, without even knowing which measurement tool it came from. There’s no way to interact with the length scale, for example.

Is there something that we can do about it? The only thing we can do is assume that the features are independent and scale them so that they are compatible with one another. This procedure is called **feature scaling**, and soon we’ll understand why it is useful even for ML algorithms, such as gradient descent.

If you make sure that features are on similar scales, i.e. features take on similar range of values, then gradient descent can converge more quickly.

More concretely, let’s say we have a problem with two features, where *x₁* is the length of a football field, taking values between *90* and *115* meters, and *x₂* is the radius of a ball, taking values between *10.5* and *11.5* centimeters. If you plot the contours of the cost function *J(ω)*, you might get something similar to the *left plot*: because of this very skewed elliptical shape, if we run gradient descent on this cost function, it may end up taking a long time and oscillating back and forth before reaching the global minimum.

In these settings, as stated previously, a useful thing to do is to scale the features. Generally, the idea is to get every feature into approximately a *-1* to *+1* range. By doing this, we get the *right plot*: gradient descent can find a much more direct path to the global minimum, rather than following a convoluted trajectory.

# Preprocessing Data

In any Machine Learning process, Data Preprocessing is that step in which the data gets transformed, or encoded, to bring it to such a state that now the machine can easily parse it. In other words, the features of the data can now be easily interpreted by the algorithm.

We’re going to dive into Scikit-Learn for this section and exploit its powerful *preprocessing* package.

We’ve been talking about scaling our data; now it’s time to put our hands on code and try to do that. Usually, as previously stated, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust scalers or transformers are more appropriate. (Take a look at the scikit-learn example *Compare the effect of different scalers on data with outliers*: you’ll see the behavior of the different scalers, transformers and normalizers in the presence of outliers.)

# Standardization

Many Machine Learning estimators require *standardization* of datasets; otherwise they might behave badly because the data are far from a Gaussian distribution (with zero mean and unit variance).

Most of the time, we ignore the shape of the distribution and simply transform the data by subtracting each feature's mean value, then scale by dividing each feature by its standard deviation.

*Do you have in mind some models that assume that all features are centered around zero and have variance in the same order of magnitude? Can you think about possible issues related to the objective function in these cases?*

A possible answer: many elements used in the objective function of a learning algorithm (such as the *RBF kernel* of *Support Vector Machines* or the *l1* and *l2* regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and make the estimator unable to learn from the other features as expected.

There’s a fast way to do that on a single array, by means of the *scale* function.
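A minimal sketch of how *scale* might be used (the toy matrix below is illustrative, not from any real dataset):

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# scale() standardizes each column to zero mean and unit variance
X_scaled = preprocessing.scale(X_train)

print(X_scaled.mean(axis=0))  # ~[0. 0. 0.]
print(X_scaled.std(axis=0))   # [1. 1. 1.]
```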

The *preprocessing* module provides a utility class, *StandardScaler*, that computes the mean and standard deviation on a training set, so as to be able to later reapply the same transformation on the test set.

(You should be well aware of what *sklearn.pipeline.Pipeline* is, it’s crucial for strategies’ deployment.)

Now we can use the scaler instance on new data, transforming it in the same way as the training set.
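A minimal sketch of this workflow, again with an illustrative toy matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# fit() learns the per-feature mean and standard deviation
scaler = StandardScaler().fit(X_train)
print(scaler.mean_)   # per-feature means
print(scaler.scale_)  # per-feature standard deviations

X_train_scaled = scaler.transform(X_train)

# The same learned transform can be reapplied to new data
X_new = np.array([[-1., 1., 0.]])
print(scaler.transform(X_new))
```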

It is possible to disable centering or scaling by passing *with_mean = False* or *with_std = False*. The first one might be particularly useful if applied to sparse CSR or CSC matrices to avoid breaking the sparsity structure of the data.

**Scaling Features to a Range**

Another kind of standardization is scaling features to lie between a given minimum and maximum value, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved with *MinMaxScaler* or *MaxAbsScaler*.

Here you can see how to scale a toy data matrix to the *[0,1]* range:

In the same way as above, the same instance of the transformer can be applied to some new test data: the same scaling and shifting will be applied, for consistency.

It’s pretty useful to let the scaler reveal some details about the transformation learned on the training data:
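A sketch of the whole workflow on an illustrative toy matrix: fitting the scaler, reusing it on new test data, and inspecting the learned parameters:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

min_max_scaler = MinMaxScaler()  # default feature_range=(0, 1)
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)  # every column now spans [0, 1]

# The same scaling and shifting is applied to new test data
X_test = np.array([[-3., -1., 4.]])
print(min_max_scaler.transform(X_test))

# Details of the transformation learned on the training data
print(min_max_scaler.scale_)
print(min_max_scaler.min_)
```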

Can you retrieve the explicit formula for *MinMaxScaler*?

Here’s the solution
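For a target range *[min, max]*, the transformation computes, per feature:

```
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
```

With the default range *[0, 1]*, the second line leaves *X_std* unchanged.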

*MaxAbsScaler* works in a similar fashion, but the data will lie in the range *[-1,1]*. It is meant for data that is already centered at zero, or for sparse data.
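A minimal sketch on the same illustrative toy matrix:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])

# Each column is divided by its maximum absolute value,
# so all values end up in [-1, 1]
max_abs_scaler = MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
print(X_train_maxabs)
# [[ 0.5 -1.   1. ]
#  [ 1.   0.   0. ]
#  [ 0.   1.  -0.5]]
```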

**Scaling Data with Outliers**

If our data contain many outliers, scaling using the mean and variance of the data is not likely to work well. In this case, we can use *RobustScaler*.

This scaler removes the median and scales data according to the IQR (InterQuartile Range).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set.

Median and interquartile range are then stored to be used on later data using the *transform* method.
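A minimal sketch, with a toy matrix containing one artificial outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# A small dataset with an obvious outlier in the first column
X = np.array([[  1., -2.,  2.],
              [ -2.,  1.,  3.],
              [  4.,  1., -2.],
              [100.,  0.,  1.]])  # outlier row

scaler = RobustScaler().fit(X)
print(scaler.center_)  # per-feature medians
print(scaler.scale_)   # per-feature interquartile ranges

# The outlier barely affects the statistics used for scaling
X_scaled = scaler.transform(X)
print(X_scaled)
```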

# Non-linear transformations

It’s possible to generalize to non-linear transformations. We are going to talk about two types of transformations: *quantile transforms* and *power transforms*. The main take-home message is that we need *monotonic* transformations to preserve the rank of the values along each feature.

Quantile transforms smooth out unusual distributions and are less influenced by outliers than scaling methods. They do, however, distort correlations and distances within and across features.

Power transforms, instead, are a family of parametric transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible.

**Mapping to a Uniform Distribution**

*QuantileTransformer* provides a non-parametric transformation to map the data to a uniform distribution with values between 0 and 1:
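A sketch along the lines of the scikit-learn documentation, using the iris dataset (the train/test split and the `n_quantiles` value are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

quantile_transformer = QuantileTransformer(n_quantiles=100, random_state=0)
X_train_trans = quantile_transformer.fit_transform(X_train)
X_test_trans = quantile_transformer.transform(X_test)

# Landmarks (min, quartiles, max) of the first feature, before and after
print(np.percentile(X_train[:, 0], [0, 25, 50, 75, 100]))
print(np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100]))
```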

The first feature corresponds to the sepal length in cm. Once the quantile transform is applied, the landmarks (minimum, quartiles, maximum) of that feature closely approach the corresponding percentiles of the uniform distribution.


**Mapping to a Gaussian Distribution**

Many machine learning algorithms prefer or perform better when numerical input variables and even output variables in the case of regression have a Gaussian distribution. Power transforms are a family of parametric, monotonic transforms that aim to map data from any distribution to as close to a Gaussian distribution as possible, in order to stabilize variance and minimize skewness.

*PowerTransformer* provides two transformations, the *Yeo-Johnson* transform:
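The Yeo-Johnson transform is defined piecewise as:

```
x^(λ) = [(x + 1)^λ − 1] / λ              if λ ≠ 0, x ≥ 0
x^(λ) = ln(x + 1)                        if λ = 0, x ≥ 0
x^(λ) = −[(−x + 1)^(2−λ) − 1] / (2 − λ)  if λ ≠ 2, x < 0
x^(λ) = −ln(−x + 1)                      if λ = 2, x < 0
```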

and the *Box-Cox* transform:
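The Box-Cox transform is defined as:

```
x^(λ) = (x^λ − 1) / λ   if λ ≠ 0
x^(λ) = ln(x)           if λ = 0
```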

Box-Cox can only be applied to strictly positive data. In both methods, the transformation is parametrized by $\lambda$, which is determined through maximum-likelihood estimation. Here is an example of using Box-Cox to map samples drawn from a lognormal distribution to a normal distribution:
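A minimal sketch (the random seed and sample size are arbitrary choices):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(304)
pt = PowerTransformer(method='box-cox', standardize=False)

# Lognormal samples: strictly positive and heavily right-skewed
X_lognormal = rng.lognormal(size=(1000, 1))

X_trans = pt.fit_transform(X_lognormal)

# lambdas_ holds the lambda estimated by maximum likelihood per feature;
# the transformed data should be much closer to Gaussian (skewness near 0)
print(pt.lambdas_)
```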


Below are some examples of the two transforms applied to various probability distributions. *Any comments?*

# Normalization

As scientists, we feel much more comfortable with Vector Space Models. *Normalization* is the process of scaling individual samples to have unit norm. This process might be useful if we plan to use a dot-product or some kernel to quantify similarities of pairs of samples.

The *normalize* function provides a quick and easy way to perform this operation on a single array, using L1 or L2 norms:
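A minimal sketch on an illustrative toy matrix:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# Scale each sample (row) independently to unit norm
X_l2 = preprocessing.normalize(X, norm='l2')
X_l1 = preprocessing.normalize(X, norm='l1')

# Every row of X_l2 now has Euclidean norm 1
print(np.linalg.norm(X_l2, axis=1))  # [1. 1. 1.]
```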

The *preprocessing* module also provides a utility class, *Normalizer*, that implements the same operation using the *Transformer* API. This class is suitable for *sklearn.pipeline.Pipeline*.
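A minimal sketch of the class-based version:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = [[1., -1.,  2.],
     [2.,  0.,  0.],
     [0.,  1., -1.]]

# fit() is a no-op here (normalization is stateless), but it keeps the
# Transformer API, so the class drops straight into a Pipeline
normalizer = Normalizer(norm='l2').fit(X)
print(normalizer.transform(X))
print(normalizer.transform([[-1., 1., 0.]]))
```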

# Encoding Categorical Features

In many cases, features are not continuous values but categorical. E.g. a person could have features such as `["from Italy", "from France", "from Germany"]`, `["play sports", "doesn't play sports"]`, `["uses Firefox", "uses Opera", "uses Chrome", "uses Safari", "uses Internet Explorer"]`.

Such features can be efficiently coded as integers; for instance, `["from France", "play sports", "uses Chrome"]` could be `[1, 0, 2]`.

To convert categorical features to such integer codes, we can use the *OrdinalEncoder*. It transforms each feature into a new feature of integers (*0* to *n_categories - 1*):
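A minimal sketch, reusing the categories from the example above (the particular integer codes depend on the alphabetical order of the categories, so they need not match the `[1, 0, 2]` coding sketched earlier):

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['from Italy', "doesn't play sports", 'uses Firefox'],
     ['from France', 'play sports', 'uses Chrome'],
     ['from Germany', 'play sports', 'uses Safari']]

enc = OrdinalEncoder()
enc.fit(X)

# Categories are sorted alphabetically per feature, then mapped to 0..n-1
print(enc.transform([['from France', 'play sports', 'uses Chrome']]))
# [[0. 1. 0.]]
```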

Such an integer representation can be problematic, though: some scikit-learn estimators expect continuous input and would interpret the categories as ordered, which is usually not desired.

There’s another way to convert categorical features to features that can be used with scikit-learn estimators: *one-hot encoding*. It can be obtained with the *OneHotEncoder*, which transforms each categorical feature with `n_categories` possible values into `n_categories` binary features, where one of them is 1 and all the others are 0.

Let’s continue with the example above. The values each feature can take are inferred automatically from the dataset and can be found in the `categories_` attribute:
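A minimal sketch on the same categories as before:

```python
from sklearn.preprocessing import OneHotEncoder

X = [['from Italy', "doesn't play sports", 'uses Firefox'],
     ['from France', 'play sports', 'uses Chrome'],
     ['from Germany', 'play sports', 'uses Safari']]

enc = OneHotEncoder()
enc.fit(X)

# Each feature expands into n_categories binary columns (3 + 2 + 3 = 8 here)
print(enc.transform([['from France', 'play sports', 'uses Chrome']]).toarray())
# [[1. 0. 0. 0. 1. 1. 0. 0.]]

# The categories inferred from the data
print(enc.categories_)
```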

We are done with this brief introduction to Data Preprocessing. I’ve tried to show some of the most useful features of *scikit-learn*, which is definitely one of the cardinal libraries for ML. In the next lecture we are going to dive into Feature Engineering, which I personally consider the most fundamental part of an ML pipeline. Stay tuned for the next article! 👩💻👨💻