Data Preprocessing Concepts

Smruti Ranjan Pradhan
Published in AlmaBetter · 7 min read · Jun 6, 2021

The goal of this article is to help you understand the basic step performed before any machine learning process: refining the data and making it ready so that our machine learning algorithms can understand and analyze it in the required format.

What is Data Preprocessing?

A typical dataset might bring to mind some kind of table with a huge number of rows and columns. However, this isn't the case for most real-life data. Data comes in numerous formats, such as structured tables, text, images, audio files and videos.

However, machines are not equipped to understand these complex data forms; they only understand 1s and 0s. So feeding in raw images and asking our machines to do the rest of the task for us might not be a good idea. Therefore we need to preprocess these complex data forms and transform them into a language that computers and machines can understand. In any machine learning process, data preprocessing is the step in which the data gets transformed, or encoded, to bring it to a state that the machine can easily parse. In other words, the features of the data can now be easily interpreted by the algorithm.

What are features?

A given dataset is nothing but a collection of data objects, which are often known as samples or observations. Data objects are described by a number of features that capture the most basic characteristics of an object, such as the count of a given physical entity or the date and time of an event. Features are often called variables or attributes. For example, power, mileage and color could be considered features of a car. However, these can belong to different data types, and we might have to deal with them differently.

Features can be of different types, such as:

  • Categorical : Categorical variables are features that take values from a discrete, countable set. For example, the day of the week can be a categorical variable with values ranging from Sunday to Saturday. Another example could be whether an event occurs (binary) : True or False.
  • Numerical : These are features that are continuous in nature and consist of numbers. For example, the number of hours you sleep in a day, or the speed at which you ride your bicycle.
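To make the distinction concrete, here is a minimal sketch using pandas, with a made-up car dataset (the column names and values are purely illustrative):

```python
import pandas as pd

# 'power' and 'mileage' are numerical features; 'color' is categorical
cars = pd.DataFrame({
    "power":   [110, 150, 95],          # horsepower (numerical)
    "mileage": [18.5, 12.3, 22.1],      # km per litre (numerical)
    "color":   ["red", "blue", "red"],  # categorical
})

print(cars.dtypes)  # numeric columns appear as int64/float64, 'color' as object
```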

Now that we have learnt the basics of different data forms, let's focus on the different steps of data preprocessing.

Dealing with missing values

Because data is often taken from multiple sources, which are normally not too reliable, and in different formats, more than half of our time in a machine learning problem is consumed dealing with data quality issues. It is simply unrealistic to expect the data to be perfect. There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process. Let's go over a few of these issues and methods to deal with them:

It is very common to have missing values in your dataset. They may have appeared during data collection, or perhaps because of a data validation rule, but regardless, missing values must be taken into consideration.

  • Eliminate rows with missing data :
    A simple and sometimes effective strategy. It fails if many objects have missing values. If a feature has mostly missing values, that feature itself can also be eliminated.
  • Estimate missing values :
    If only a reasonable percentage of values are missing, we can also run simple interpolation to fill in those values. However, the most common method of dealing with missing values is to fill them in with the mean, median or mode of the respective feature (see the pandas sketch below).
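Here is a minimal sketch of both strategies using pandas (the DataFrame and its 'age' and 'city' columns are made up purely for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with missing values
df = pd.DataFrame({
    "age":  [25, np.nan, 32, 47, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Pune"],
})

# Strategy 1: eliminate rows (or entire columns) containing missing data
dropped_rows = df.dropna()                  # drop every row that has any missing value
dropped_cols = df.dropna(axis=1, thresh=4)  # drop columns with fewer than 4 non-missing values

# Strategy 2: estimate (impute) the missing values
imputed = df.copy()
imputed["age"]  = imputed["age"].fillna(imputed["age"].median())     # numeric -> median
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])  # categorical -> mode

# Simple interpolation is another option for ordered numeric data
interpolated_age = df["age"].interpolate()
```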

Feature Encoding

As mentioned before, the whole purpose of data preprocessing is to encode the data in order to bring it to a state that the machine can understand.

Feature encoding is basically performing transformations on the data such that it can be easily accepted as input for machine learning algorithms while still retaining its original meaning.

There are some general norms or rules that are followed when performing feature encoding. For Categorical variables:

  • Nominal : Any one-to-one mapping that retains the meaning can be used; for instance, one-hot encoding, which represents each category with its own binary indicator column.
  • Ordinal : An order-preserving change of values. The notion of small, medium and large can be represented equally well with the help of a new function, that is, <new_value = f(old_value)>; for example, {0, 1, 2} or maybe {1, 2, 3}.
[Figure: One-hot encoding of data]
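Here is a minimal sketch of both encodings using pandas (the 'day' and 'size' columns are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "day":  ["Sun", "Mon", "Sun", "Tue"],           # nominal feature
    "size": ["small", "large", "medium", "small"],  # ordinal feature
})

# Nominal -> one-hot encoding: one binary indicator column per category
one_hot = pd.get_dummies(df["day"], prefix="day")

# Ordinal -> an order-preserving mapping such as {0, 1, 2}
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)
```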

For Numeric variables:

  • Simple mathematical transformations can be used, such as the equation <new_value = a*old_value + b>, a and b being constants. For example, the Fahrenheit and Celsius scales, which differ in their zero values and the size of a unit, can be related in this manner.
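As a small sketch, the Celsius-to-Fahrenheit conversion is exactly this kind of linear encoding (a = 9/5, b = 32), and min-max scaling of a numeric feature has the same <a*old_value + b> form:

```python
import numpy as np

celsius = np.array([-10.0, 0.0, 25.0, 100.0])

# new_value = a * old_value + b, with a = 9/5 and b = 32
fahrenheit = (9 / 5) * celsius + 32   # -> [ 14.,  32.,  77., 212.]

# Min-max scaling to [0, 1] is also linear: a = 1/(max - min), b = -min/(max - min)
scaled = (celsius - celsius.min()) / (celsius.max() - celsius.min())
```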

Dealing with Text data

  • Tokenization : We need to convert a text document comprising sentences and paragraphs into a collection of words/ phrases/ tokens before performing any further text processing. Tokenization can be performed in different ways depending on the maximum and minimum number of words you want in each token. A tokenizer in which every token consists of a single word is known as a 1-gram tokenizer. Similarly, a tokenizer that splits the text so that each token consists of at most n words is known as an n-gram tokenizer.
  • Removing punctuation : Every piece of writing uses punctuation so that readers can understand and comprehend it. However, including punctuation in our machine learning model might be a bad idea, since punctuation holds little significance when it comes to figuring out topics or understanding the sentiment of texts. It can also trick the model into interpreting 'fun.' with a full stop and 'fun' without one as two different features, leading to multicollinearity and unnecessary expansion of the feature set. Hence it is usually a good idea to remove punctuation.
  • Removing stop words : Stop words are the most frequently occurring, largely redundant words in any text. 'the', 'is', 'have', 'shouldn't', etc. are some examples of stop words in English. Since these words carry little meaning on their own, it is often worthwhile to remove them, especially when the feature set is already very high dimensional.
  • Removing high and low frequency words : Even after removing stop words, the feature set might still be huge due to the vastness of the vocabulary. It can then be a good idea to remove very high frequency words from the documents, since such words may not contribute much to the classification or clustering task at hand, for the same reasons that apply to stop words. Similarly, words that appear only rarely across the documents might not convey much about the nature of the documents or the sentiment of the text.
  • Stemming and Lemmatisation : In most text, we see words that come from the same root but take different forms depending on tense, sentence structure and part of speech. For example, 'transforming' and 'transformation' come from the same root word 'transform', but take different forms based on how they are used. In these cases, it is essential that we reduce each word to its root so that the variants are not treated as different features by the model. There is a fine line between stemming and lemmatisation: stemming does not guarantee a meaningful word as output, because it applies rules without any knowledge of context, whereas lemmatisation does. In practice, stemming follows a set of rules defined by algorithms such as the Porter, Lancaster and Snowball stemmers. For example, the word 'family' gets transformed to 'famili' by the Porter stemmer, which has no meaning as such. Unlike stemming, lemmatisation attempts to select the correct lemma depending on the context.
  • Text vectorization : Finally, we need to make our textual data ready for training by converting it into vectors with the help of algorithms such as word2vec, or the count and tf-idf vectorizers. The basic idea behind the count-based approaches is to find the list of all unique words across all the documents combined and then, for each document, create a vector in which each component represents some function of the frequency of that particular word/ feature in that particular document (see the sketch after this list).
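Pulling these steps together, here is a minimal sketch using NLTK's Porter stemmer and scikit-learn's tf-idf vectorizer (the two example sentences and the simple regex tokenizer are made up for illustration; a word2vec pipeline would look different):

```python
import re
from nltk.stem import PorterStemmer                      # rule-based stemmer
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Transforming data is fun.",
    "The transformation of raw text isn't always fun!",
]

stemmer = PorterStemmer()

def preprocess(text):
    text = text.lower()
    tokens = re.findall(r"[a-z']+", text)    # simple 1-gram tokenization; punctuation is dropped
    return " ".join(stemmer.stem(tok) for tok in tokens)

cleaned = [preprocess(d) for d in docs]

# Stop-word removal, n-gram tokenization, frequency filtering and tf-idf weighting in one step
vectorizer = TfidfVectorizer(
    stop_words="english",   # drop common English stop words
    ngram_range=(1, 2),     # keep 1-gram and 2-gram tokens
    min_df=1,               # drop terms appearing in fewer than min_df documents
)
X = vectorizer.fit_transform(cleaned)

print(vectorizer.get_feature_names_out())  # the learnt vocabulary (features)
print(X.toarray())                         # one tf-idf vector per document
```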

And this is how we preprocess various data forms before feeding them into our final machine learning model.
