Independent and Dependent Variables (#1)

Published in

Machine Learner

4 min read · Sep 19, 2018

Data preprocessing is a crucial step before building a machine learning model: a model trained on unprepared data won’t work properly. Some find it a bit boring, but it is a necessary step.

To learn about data preprocessing and several other steps involved in Machine Learning, check out this insightful book.

You can get the dataset from www.superdatascience.com/machine-learning.

The dataset has four columns:

Country
Age
Salary
Purchased

The variables here can be classified as independent and dependent variables. The independent variables are used to predict the dependent variable. In our dataset, the first three columns (Country, Age and Salary) are the independent variables, and the fourth column (Purchased) is the dependent variable.

Before getting started, make sure you have Anaconda installed. If you don’t have it, follow the tutorial here.

Installing Python and Anaconda on Windows

This tutorial will show you how to install Python (via Anaconda) on your machine.

hackernoon.com

Importing Libraries

A Python library is a collection of functions and methods that lets you perform many tasks without writing the code yourself. Importing these libraries lets us write our code a lot faster.
The wheel is a good example. It had already been invented, so the person who invented the car didn’t waste time reinventing it. Here, the car is an invention that has imported the wheel, and the wheel is a module that can be used as is in other inventions.
The libraries that we’ll be using here are numpy, matplotlib.pyplot (which will be used in later chapters) and pandas. The pandas library is used to import and manage datasets.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing the dataset

First, we need to set the appropriate working directory using the file explorer. You’ll find this on the top right of the Spyder window. The working directory is the directory in which your dataset is stored.
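If you’d rather confirm the working directory from code than from the Spyder toolbar, a minimal sketch using Python’s standard os module could look like this. It assumes the file is named Data.csv, as in the rest of this tutorial, and only reports whether the file is present.

```python
import os

# Print the directory Python is currently working in.
cwd = os.getcwd()
print("Working directory:", cwd)

# Check whether Data.csv sits in that directory (file name taken from this tutorial).
csv_found = os.path.isfile(os.path.join(cwd, "Data.csv"))
print("Data.csv found:", csv_found)
```

If the second line prints False, change the working directory (in Spyder, or with os.chdir) before reading the file.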

Here, we’ll be using the pandas library for importing the dataset.

dataset = pd.read_csv('Data.csv')

Execute the code by selecting the line of code and pressing Ctrl and Enter together.

In the variable explorer, the dataset will be visible. It can be accessed by double-clicking it.
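Outside Spyder there is no variable explorer, so here is a rough sketch of inspecting the dataset from code instead. The three rows below are made up for illustration and only mimic the four-column layout of Data.csv; with the real file you would simply call pd.read_csv('Data.csv').

```python
import io

import pandas as pd

# Made-up rows standing in for Data.csv; the real file has different values.
sample_csv = io.StringIO(
    "Country,Age,Salary,Purchased\n"
    "France,35,60000,Yes\n"
    "Spain,41,52000,No\n"
    "Germany,29,47000,Yes\n"
)
dataset = pd.read_csv(sample_csv)

# Number of rows and columns, the column names, and the first few rows.
print(dataset.shape)
print(dataset.columns.tolist())
print(dataset.head())
```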

Now, we need to separate the matrix of features, which contains the independent variables, from the dependent variable ‘Purchased’.

Creating the matrix of features

The matrix of features will contain the variables ‘Country’, ‘Age’ and ‘Salary’.
The code to declare the matrix of features will be as follows:

X = dataset.iloc[:, :-1].values

In the code above, the part before the comma selects the rows and the part after it selects the columns. A ‘:’ (colon) on its own means that all the rows or columns are to be included. For our dataset, we need all the rows (:) and all the columns but the last one (:-1). The .values attribute then returns the selection as a NumPy array. This completes the matrix of features X. Execute the line, and the variable explorer will show the variable X. It can be opened by double-clicking ‘X’ in the variable explorer.
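To see the slicing on something small, here is a sketch that applies the same iloc expression to a tiny hand-made DataFrame. The rows are invented for illustration; only the column layout follows the tutorial’s dataset.

```python
import pandas as pd

# Invented rows with the same four columns as the tutorial's dataset.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [35, 41, 29],
    "Salary": [60000, 52000, 47000],
    "Purchased": ["Yes", "No", "Yes"],
})

# All rows (:) and every column except the last one (:-1).
X = dataset.iloc[:, :-1].values

print(X.shape)  # three rows, three feature columns
print(X)
```

Because the selected columns mix text and numbers, pandas hands back an array with the generic object dtype.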

Creating the dependent variable vector

We’ll follow the same procedure to create the dependent variable vector ‘y’. The only change is the columns we want in y. As with the matrix of features, we include all the rows, but from the columns we need only the fourth, which has index 3 because Python indexing starts at 0. The code for this looks as follows:

y = dataset.iloc[:, 3].values

After execution, the variable ‘y’ will be shown in the variable explorer, where it can be opened by double-clicking it.
This completes the tutorial on splitting the dataset into the features (the independent variables) and the dependent variable.
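As a quick check of the split described above, the sketch below builds y by position and compares it with two alternatives that are not used in this tutorial: selecting the last column with -1, and selecting the column by its name. The rows are again invented for illustration.

```python
import pandas as pd

# Invented rows with the same four columns as the tutorial's dataset.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [35, 41, 29],
    "Salary": [60000, 52000, 47000],
    "Purchased": ["Yes", "No", "Yes"],
})

# The tutorial's approach: all rows, column at index 3 (the fourth column).
y = dataset.iloc[:, 3].values

# Alternatives for comparison: the last column by position, and by name.
y_last = dataset.iloc[:, -1].values
y_named = dataset["Purchased"].values

print(list(y))
print(list(y) == list(y_last) == list(y_named))
```

Selecting by name can be easier to read and keeps working if the column order changes, while the positional form matches the code used here.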

Do let me know how you liked this tutorial!!

The next tutorial will cover how to handle missing data. The link will be added here when it is published.

I personally found the Python Data Science Handbook: Essential Tools for Working with Data book very useful for my data science journey, and I hope you’ll enjoy reading through it as well!

Please subscribe to updates for this series to get notified when the next article is out :)

Happy learning!