playgrdstar
Published in quaintitative
3 min read · Sep 22, 2018

Data Munging — Scale, Transform, Clean in Python

Real-world data can:

  • vary significantly in scale, e.g. shoe sizes versus waist sizes
  • vary in nature, e.g. real-world measurements versus satisfaction scores
  • be quite dirty, since it is often aggregated from a range of sources and subject to human error during data entry

Hence, a fundamental first step before any model can be built, whether in finance or data science, is to munge the data. Common steps include:

  • Scaling, where we scale data so that they are of a similar magnitude
  • Transformation, where we change the nature of values in features/variables, e.g. qualitative or categorical to numeric
  • Cleaning, where we deal with missing or bad data

There are many, many ways to munge data. In this notebook, we will just go through some of the common (and relatively simple) ways of scaling, transforming and cleaning data.

Scaling

First, scaling. Two common ways are to standardise or to normalise the data. The two terms are easily confused, and different folks sometimes use them differently. For the purpose of this post,

  • Standardisation means that the data is transformed so that it has a mean of zero and unit variance. Mathematically, for each data point x, we compute (x - mean(dataset)) / std(dataset), where std is the standard deviation
  • Normalisation means that we rescale the data using the minimum and maximum values of the dataset, so that it falls between 0 and 1. Mathematically, for each data point x, we compute (x - min(dataset)) / (max(dataset) - min(dataset))

As you can tell from the formulas, computing these manually should be a breeze. However, Scikit Learn also has preprocessing functions that do this for us.

# Standardisation
from sklearn.preprocessing import StandardScaler

standardisation = StandardScaler(with_mean=True, with_std=True)
X_num_scaled_stdvar = standardisation.fit_transform(X_numerical)

# Normalisation
from sklearn.preprocessing import MinMaxScaler
scaling = MinMaxScaler(feature_range=(0,1))
X_num_scaled_normalised = scaling.fit_transform(X_numerical)
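
To check that the formulas and the scalers really do the same thing, here is a minimal sketch that computes both by hand with NumPy and compares the results against Scikit Learn. The array X_demo and its values are made up for illustration and are not the X_numerical from the notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: two features on very different scales
# (say, shoe size and waist size in cm).
X_demo = np.array([[7.0, 71.0],
                   [9.0, 86.0],
                   [11.0, 102.0],
                   [8.0, 79.0]])

# Standardisation by hand: subtract the column mean, divide by the column
# standard deviation. np.std defaults to the population standard deviation
# (ddof=0), which is also what StandardScaler uses.
manual_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Normalisation by hand: rescale each column to the [0, 1] range.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual_norm = (X_demo - col_min) / (col_max - col_min)

# The Scikit Learn scalers should give the same numbers.
sk_std = StandardScaler().fit_transform(X_demo)
sk_norm = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)

print(np.allclose(manual_std, sk_std))
print(np.allclose(manual_norm, sk_norm))
```

Note that both scalers work column by column, so each feature is scaled using its own mean, standard deviation, minimum and maximum.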

Transformation

There are other ways to transform data apart from scaling.

And we have actually seen this in action previously, when we did 2nd and 3rd order polynomial regressions.

Say we think the square of weight is a better feature for predicting monthly spending because the relationship is not linear. All we have to do is apply the relevant mathematical operation to the feature/variable, and use the result in the regression. I won’t go into the details here as we already looked at this earlier in this notebook.

import numpy as np
np.power(responses['Weight'].values, 2)
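
For a version that runs on its own, the sketch below squares a feature on a small made-up DataFrame. The name responses_demo and its values are hypothetical stand-ins for the notebook's responses data. It also shows PolynomialFeatures from Scikit Learn as one possible way to generate the same squared term; that class is my assumption here, not necessarily what the earlier polynomial regressions used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical stand-in for the `responses` DataFrame used in the notebook.
responses_demo = pd.DataFrame({"Weight": [55.0, 62.5, 70.0, 81.0]})

# Manual transformation: square the feature and keep it as a new column.
responses_demo["Weight_squared"] = np.power(responses_demo["Weight"].values, 2)

# The same idea via PolynomialFeatures: degree=2 without the bias column
# yields [Weight, Weight^2] for a single input feature.
poly = PolynomialFeatures(degree=2, include_bias=False)
weight_poly = poly.fit_transform(responses_demo[["Weight"]])

print(responses_demo)
print(weight_poly)
```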

Cleaning

We usually clean data because it has missing or wrong entries. An easy way to do this is simply to apply the fillna function in pandas. But we may want a more general way to fill in, or impute, the missing values. The Imputer preprocessor in Scikit Learn offers this functionality (from version 0.20 onwards it is superseded by SimpleImputer in sklearn.impute).

from sklearn.preprocessing import Imputer

impute = Imputer(missing_values='NaN', strategy='mean', axis=1)
impute.fit_transform(Xm_slice[:10])
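
On recent Scikit Learn releases, Imputer is no longer available in sklearn.preprocessing (it was removed in version 0.22). A rough equivalent using SimpleImputer might look like the sketch below. The array X_missing is made up for illustration. One difference to be aware of is that SimpleImputer has no axis argument and always imputes column by column, so the row-wise behaviour of axis=1 is approximated here by transposing.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with missing entries marked as np.nan.
X_missing = np.array([[1.0, 10.0],
                      [np.nan, 20.0],
                      [3.0, np.nan],
                      [5.0, 30.0]])

# SimpleImputer works column by column: each NaN is replaced by the mean
# of the non-missing values in the same column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_missing)

# To mimic row-wise imputation (axis=1 in the old Imputer), one option is
# to transpose, impute, and transpose back.
X_filled_rowwise = SimpleImputer(strategy="mean").fit_transform(X_missing.T).T

print(X_filled)
print(X_filled_rowwise)
```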

Other than the built-in imputation strategies (such as mean), we can also write our own custom functions for imputation.
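
As one illustration of a custom imputation function, the sketch below fills missing values with the median of each group using pandas. The survey DataFrame, its column names and the helper impute_group_median are all hypothetical, and group-wise median is just one of many rules you could choose.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; column names are made up for illustration.
survey = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Weight": [80.0, np.nan, 70.0, 60.0, np.nan],
})


def impute_group_median(df, group_col, value_col):
    """Fill missing values in value_col with the median of each group."""
    filled = df.groupby(group_col)[value_col].transform(
        lambda s: s.fillna(s.median())
    )
    # assign returns a new DataFrame, so the input is left untouched.
    return df.assign(**{value_col: filled})


survey_filled = impute_group_median(survey, "Gender", "Weight")
print(survey_filled)
```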

Transforming Categorical Data

We almost always need to transform categorical data into a numerical format. The two most commonly used preprocessors are LabelEncoder and LabelBinarizer.

LabelEncoder basically transforms each categorical value into a numerical value, e.g. Female, LGBT and Male to 0, 1 and 2 (the classes are sorted before the codes are assigned).

from sklearn.preprocessing import LabelEncoder

lb_make = LabelEncoder()
X_categorical_encoded = lb_make.fit_transform(X_categorical['Gender'])
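
Here is a small self-contained sketch of the same encoding. The gender Series is a made-up stand-in for X_categorical['Gender']. It also shows how to look up which label got which code, and how to map the codes back.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column standing in for X_categorical['Gender'].
gender = pd.Series(["Male", "Female", "LGBT", "Male", "Female"])

encoder = LabelEncoder()
codes = encoder.fit_transform(gender)

# classes_ holds the original labels in sorted order; the position of a
# label in this array is its integer code.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keep in mind that the integer codes suggest an ordering (0 < 1 < 2) that the categories may not actually have, which matters for models that treat the feature as numeric.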

LabelBinarizer converts the categorical value into a binary format. Each category gets its own column, and each row has a 1 in the column of its category and 0 elsewhere.

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
X_categorical_binarised = lb.fit_transform(X_categorical['Gender'])
pd.DataFrame(X_categorical_binarised, columns=lb.classes_).head()
Out:
   Female  LGBT  Male
0       0     0     1
1       0     1     0
2       0     0     1
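
The binarisation can also be tried on a tiny made-up Series (again a hypothetical stand-in for X_categorical['Gender']). The sketch below also shows one quirk worth knowing about: with only two classes, LabelBinarizer produces a single 0/1 column rather than one column per class.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical column with three categories.
gender = pd.Series(["Male", "LGBT", "Male", "Female"])

binarizer = LabelBinarizer()
onehot = binarizer.fit_transform(gender)

# One column per class, in the sorted order stored in classes_.
onehot_df = pd.DataFrame(onehot, columns=binarizer.classes_)
print(onehot_df)

# With only two classes, the result is a single 0/1 column
# rather than one column per class.
two_class = LabelBinarizer().fit_transform(["Male", "Female", "Male"])
print(two_class.shape)
```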

The notebook on this can be found here.


ming // gary ang // illustration, coding, writing // portfolio >> playgrd.com | writings >> quaintitative.com