Six steps to hone your Data: Data Preprocessing, Part 1

Published in GoDataScience, Aug 9, 2020

Who is the target audience?

This tutorial is for anyone who wishes to start their Machine Learning journey or wants to brush up on some concepts. Data Preprocessing is an essential part of your ML journey. As you move ahead, you will realize the critical role it plays in building, training, and testing your models accurately.

Why would this tutorial prove useful?

You must know the right way to preprocess your data in order to train an ML model accurately. Data Preprocessing is the foundation you need to master to get a grip on the various branches of ML.

If you are a novice in the ML world, stay tuned to this series, as this is what you need to get started on your journey to becoming a Machine Learning Practitioner.

This journey is a super long one, but with consistent effort, you will find yourself enjoying and exploring more and more of ML.

Always remember, no skill was ever developed in a day or a week.

Practice and patience are the key!

Let us dive right in, folks! All the best.

This tutorial answers all the following questions:

What is Data Preprocessing?
Why is it necessary?
What happens when it is skipped?
What are Libraries?
How to import Libraries?

So before we move ahead, let us get clear on what data preprocessing means. I’m guessing from the preface above that you have already pictured it as an integral part of developing your ML models.

But why is it important?

Let’s explore a little more before we get started with Step 1.

What is Data Preprocessing?

Data Preprocessing is a step that improves the quality of your data so that meaningful insights can be extracted from it. It is the technique of preparing raw data (cleaning and structuring it) to make it suitable for building your ML models.

In simple words, Data Preprocessing is a technique for the transformation of raw data into an organized and readable format.

Why is Data Preprocessing necessary?

Typically, real-world data is unstructured, incomplete, inconsistent, and inaccurate, and it often lacks clear indications of specific trends.

When we talk about datasets, our mind immediately pictures a spreadsheet of rows and columns. While that is one legitimate scenario, data can come in many other forms: structured tables, images, audio files, videos, etc. Machines don’t understand free text, images, or video as they are; they work with numbers, ultimately 1s and 0s.

So it probably won’t be a good idea to just put on a presentation of all our images/videos and expect our machine learning model to get trained by that!

That is where data preprocessing comes into play, and why it is a crucial step!

What is the outcome of feeding unprocessed data to the ML model?

If we skip Data Preprocessing, the model we create is trained on incomplete, inconsistent, or inaccurate data, so it might not predict results as accurately as it should.
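As a small illustration of the point above, here is a minimal sketch (using NumPy, one of the libraries introduced in Step 1) of how a single missing value can derail even a simple calculation. The readings are made-up numbers for illustration only, and ignoring missing entries is just one of several possible ways to handle them.

```python
import numpy as np

# Hypothetical readings; np.nan marks a missing measurement.
readings = np.array([21.5, 22.0, np.nan, 23.1])

# A plain mean is affected by the missing value and comes out as nan.
raw_mean = np.mean(readings)

# One simple preprocessing choice: ignore the missing entries.
clean_mean = np.nanmean(readings)

print("Mean without preprocessing:", raw_mean)
print("Mean after ignoring missing values:", clean_mean)
```

A model fed the raw readings would run into the same kind of problem, only on a larger scale.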

Now that you have a clear intuition about what data preprocessing is, and its importance, let us get started with Step 1 of Data preprocessing!

Step 1] Importing the required Libraries.

The first step of Data preprocessing is importing all the necessary Libraries.

What are Libraries?

A library is essentially a collection of modules that can be imported and used.

A lot of things in the programming world do not need to be coded from scratch every time they are required.

There are functions for them that will do your task in a jiffy when invoked. These are the three Python libraries for data preprocessing that you will need to import:

NumPy: NumPy is a library for scientific computing in Python. It provides us with multidimensional arrays and lets us perform various operations on them.
Pandas: Pandas is the home for your datasets. The Pandas library helps in analyzing, cleaning, and transforming data.
Matplotlib: Matplotlib is a visualization tool. This library helps us plot our dataset as 2-D graphs, making it easy for us to identify useful patterns and trends.

Why are these Libraries important for Preprocessing?

You might wonder why we import only these libraries when there are many options available.

Let us understand the importance of these specific three libraries in data preprocessing.

Why NumPy?

NumPy is mainly used because it supports advanced mathematical computation and related operations on large datasets.

These operations run efficiently and need only a small amount of code.

NumPy uses less memory than Python lists to store large datasets. Thus space is saved, and access is faster. (*win-win situation*)
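To make the memory claim concrete, here is a rough sketch comparing a Python list with a NumPy array holding the same 1,000 integers. The size of 1,000 and the `int64` data type are arbitrary choices for illustration, and the exact byte counts for the list vary with your Python version and platform.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores references to separate int objects, so count both parts.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# A NumPy array stores the raw values in one contiguous buffer.
array_bytes = np_array.nbytes

print("List (approx.):", list_bytes, "bytes")
print("NumPy array data:", array_bytes, "bytes")

# Small amount of code: one expression doubles every element, no loop needed.
doubled = np_array * 2
```

Note that `nbytes` counts only the array's data buffer, so treat the comparison as an approximation rather than an exact measurement.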

Why Pandas?

Pandas excels at real-world data analysis.

Employing the Pandas library makes it easier and more intuitive for developers to work with labeled or relational data. It offers expressive, fast, and flexible data structures.

One of the most powerful features of pandas is that it can transform complex data using just one or two commands, which keeps data manipulation easy.

It supports aggregation, concatenation, iteration, re-indexing, and visualization operations.
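Here is a short sketch of three of the operations just mentioned: concatenation, aggregation, and re-indexing. The table contents and the column names `city` and `sales` are made up for illustration.

```python
import pandas as pd

# Made-up sales records, purely for illustration.
jan = pd.DataFrame({"city": ["Pune", "Delhi"], "sales": [100, 150]})
feb = pd.DataFrame({"city": ["Pune", "Delhi"], "sales": [120, 130]})

# Concatenation: stack the two months into one table.
sales = pd.concat([jan, feb], ignore_index=True)

# Aggregation: total sales per city.
totals = sales.groupby("city")["sales"].sum()

# Re-indexing: ask for a specific row order; labels not present get fill_value.
ordered = totals.reindex(["Pune", "Delhi", "Mumbai"], fill_value=0)

print(ordered)
```

Each step is a single command, which is what makes pandas so convenient for reshaping data.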

Why Matplotlib?

Visualizations are the easiest way to analyze and absorb information. Visuals help us understand complex problems quickly.

Matplotlib emulates MATLAB-like graphs and visualizations.

MATLAB is not free, it is difficult to scale, and as a programming language it can be tedious.

So, Matplotlib is used in Python, as it is a robust, free, and accessible library for data visualization.
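To give a feel for how little code a basic plot needs, here is a minimal sketch of a 2-D line plot. The data points and the output file name `scores.png` are made up, and the `Agg` backend is an assumption so that the script runs even without a display. In a notebook or desktop session you would typically drop that line and call `plt.show()` instead of saving.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Made-up data points for illustration.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 71, 80]

fig, ax = plt.subplots()
ax.plot(hours, scores, marker="o")  # line plot with a marker at each point
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")
ax.set_title("Score vs. hours studied")

fig.savefig("scores.png")  # hypothetical output file name
```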

How to import Python libraries?

Here is a snippet to import these libraries and assign a shortcut name for each.
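The original snippet is not reproduced in this text, so the following is a sketch using the aliases that are the common convention in the Python community (`np`, `pd`, and `plt`). Importing the `matplotlib.pyplot` submodule, rather than the top-level `matplotlib` package, is an assumption based on typical usage.

```python
import numpy as np               # arrays and numerical operations
import pandas as pd              # loading, cleaning, and transforming datasets
import matplotlib.pyplot as plt  # 2-D plotting
```

After this, you can write `np.array(...)`, `pd.DataFrame(...)`, or `plt.plot(...)` instead of typing the full library name each time.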

Congratulations on completing Part 1 of Data preprocessing!

You’ve officially initiated your ML journey!

Stay tuned.

Written by Anushkad