Six steps to hone your Data: Data Preprocessing, Part 2

Anushkad

Published in

GoDataScience

5 min read · Aug 12, 2020

Assuming you have a clear intuition about what data preprocessing is, we will now move on to the next step.

If you have not checked out the first step of Data preprocessing yet, I would suggest you read our previous article.

With that said, now that we have successfully imported all the required libraries, let us jump straight to Step 2.

Step 2] Importing Datasets

This tutorial answers all the following questions:

What are datasets?
What is a CSV file?
How to import a CSV file?
What are the independent variables/features?
What are the dependent variables/targets?
How to separate the Dataset into features and targets?

What are datasets?

You can think of a dataset as a spreadsheet containing rows and columns of data. A dataset is often composed of data gathered from multiple, disparate sources and then combined into one consistent format.

The contents of a dataset depend on its domain. For instance, a banking dataset will look completely different from a medical dataset: the banking dataset may contain finance-related data, while the medical dataset will contain health-related data.

You can also create your own dataset with the help of the various Python APIs available. Once your dataset is ready, you need to save it in a format such as CSV, HTML, or XLSX.

Many datasets come in CSV format because of its simplicity and ease of use.

What is a CSV file/CSV format?

A Comma Separated Values (CSV) file is a plain text file that contains a list of data.

These files may sometimes be called Character Separated Values or Comma Delimited files. They mostly use the comma character to separate (or delimit) data.

A CSV file has a simple structure: each line is one record, and the values within a record are separated by commas.

For example, let us say you have a few employee records in a register, and you export them as a CSV file. You would get a file containing text like this:
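As a hypothetical illustration (the employee names, departments, and salaries are invented for this sketch and are not the article's data), the Python snippet below prints such CSV text and then splits it back into fields with the standard csv module:

```python
import csv
import io

# Hypothetical employee records, invented for this illustration.
csv_text = (
    "Name,Department,Salary\n"
    "Asha,Engineering,72000\n"
    "Ben,Marketing,58000\n"
    "Chitra,Finance,64000\n"
)

# Printing the text shows exactly what the exported file would contain.
print(csv_text)

# csv.reader splits every line on commas and returns one list of fields per line.
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]

print(header)        # the column names from the first line
print(len(records))  # the number of data rows that follow
```

Note that every value comes back as a string; converting Salary to a number is a separate step.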

That is what a CSV file is, except it can be a lot bigger and can contain thousands of columns and rows.

(Note: To create such a file yourself, open an Excel sheet, type each field of the data into a separate column, save the sheet in CSV UTF-8 format, and you are ready to go!)

How to import a dataset present in CSV format?

To import a dataset, the simplest approach is to keep your CSV file in the same directory as your Python program, so that you can refer to the file by its name alone.

If the CSV file is not in the same location as your program, you need to copy the full path of your CSV file and use that instead.

Once the location is sorted out, we can read the CSV file using a function called read_csv, which can be found in the pandas library.

The code snippet below will help you in importing the CSV dataset.
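A minimal sketch of the import, under a few assumptions: the file name Data.csv, its contents, and the variable name dataset are all hypothetical, and the file is written to a temporary folder first only so that the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Create a small stand-in CSV so the sketch is self-contained.
# In practice you would already have your own file ("Data.csv" is a hypothetical name).
csv_text = "Name,Department,Salary\nAsha,Engineering,72000\nBen,Marketing,58000\n"
csv_path = os.path.join(tempfile.mkdtemp(), "Data.csv")
with open(csv_path, "w", encoding="utf-8") as f:
    f.write(csv_text)

# read_csv accepts a bare file name (file in the same directory as the program)
# or a full path (file stored anywhere else), and returns a DataFrame.
dataset = pd.read_csv(csv_path)

print(dataset.head())  # the first rows of the DataFrame
print(dataset.shape)   # (number of rows, number of columns)
```

If your file sits next to your program, the call reduces to pd.read_csv("Data.csv").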

Now that we have successfully imported our Dataset, we need to separate the dependent and independent variables.

What are the independent and dependent variables? How do we identify them?

To understand this, let us consider another dataset, used to find out whether a person purchases a product XYZ based on features like the individual's Age, Education, Income, and Marital Status.
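Such a dataset could look like the sketch below. The column names come from the description above, but the four rows of values are made up purely for illustration:

```python
import pandas as pd

# Hypothetical observations, invented for illustration only.
dataset = pd.DataFrame(
    {
        "Age": [25, 42, 35, 51],
        "Education": ["Bachelor", "Master", "High School", "Bachelor"],
        "Income": [40000, 85000, 52000, 67000],
        "Marital Status": ["Single", "Married", "Married", "Single"],
        "Purchased": ["No", "Yes", "No", "Yes"],
    }
)

print(dataset)                   # one row per person, one column per attribute
print(dataset.columns.tolist())  # the column names, in order
```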

Now, based on the above example, let us understand what features and target variables are.

What are Independent variables or features?

Independent variables (also referred to as features) are the inputs to the model being analyzed.

The independent variable is the variable the experimenter changes or controls and is assumed to have a direct effect on the dependent variable.

In the above example, Age, Education, Income, and Marital status are independent variables or features, which means that whether or not an individual will make a purchase is dependent on these factors.

What are Dependent variables or targets?

Dependent variables (also referred to as targets) are the output of the process.

The dependent variable is the variable being tested and measured in an experiment and is ‘dependent’ on the independent variable.

In the above example dataset, the value of Purchased depends on the features. Hence it is the dependent variable, or target.

Now that we are clear with the terms, we need to separate the feature variables and the target variables.

How to separate features and target variables?

After inspecting our dataset carefully, we are going to create a matrix of features (X) and a dependent variable vector (Y), each holding the corresponding observations.

To do so, we will be using the iloc indexer from the pandas library, which takes two parameters: [row selection, column selection].

Here’s the code snippet to do so along with the respective output.
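A minimal sketch of the separation, assuming the target is the last column of the dataset. The sample rows are invented, and the trailing .values (which converts the selection to a NumPy array) is a common convention assumed here rather than something the text prescribes:

```python
import pandas as pd

# Hypothetical purchase dataset; the target ("Purchased") is the last column.
dataset = pd.DataFrame(
    {
        "Age": [25, 42, 35, 51],
        "Education": ["Bachelor", "Master", "High School", "Bachelor"],
        "Income": [40000, 85000, 52000, 67000],
        "Marital Status": ["Single", "Married", "Married", "Single"],
        "Purchased": ["No", "Yes", "No", "Yes"],
    }
)

# All rows, every column except the last one: the matrix of features.
X = dataset.iloc[:, :-1].values
# All rows, only the last column: the dependent variable vector.
Y = dataset.iloc[:, -1].values

print(X.shape)  # (4, 4): four observations, four features
print(Y)        # the Purchased value of each observation
```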

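A bare “:” as a parameter selects everything.

So when “:” is used as the row selection, all the rows are selected.

For columns, we have “:-1”, which means all the columns except the last one. These columns form the matrix of features. Assuming the target sits in the last column, the index “-1” on its own selects just that column.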
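The same slice notation works on ordinary Python lists, which makes it easy to check what each selector keeps. A small sketch using the column names of the purchase example (the list is written out by hand here as an assumption):

```python
# Column names of the hypothetical purchase dataset.
columns = ["Age", "Education", "Income", "Marital Status", "Purchased"]

all_columns = columns[:]  # ":" keeps everything
features = columns[:-1]   # ":-1" keeps everything except the last item
target = columns[-1]      # "-1" picks only the last item

print(all_columns)
print(features)
print(target)
```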

With that, we now have a good idea of how to import the libraries, import the dataset, and separate it into features and the target.

Congratulations on completing the second step of the series!

It might seem unusual at the beginning, but I am sure that with some practice, you can get a good grip on preprocessing data and structuring it for your ML model.

Stay tuned for the next steps in Data Preprocessing!

