Data Preprocessing with Numpy and Pandas

Arpit Pathak
4 min read · May 28, 2020


This blog explains data preprocessing using two Python libraries: NumPy and Pandas. Prerequisite: a basic knowledge of the Python programming language is needed to follow the practical implementation explained here.

To train a machine learning model on data, the first and foremost requirement is that all the data be in numerical format. In addition, the data should be complete and uniform throughout the dataset. Preprocessing is what brings the data into this form.

NumPy and Pandas

NumPy and Pandas are two basic Python libraries required for machine learning. A machine learning model expects its input data to be numerical and in the form of an array. NumPy and Pandas provide the tools to perform basic preprocessing over a dataset so that it is suitable as input to a machine learning model for training.

NumPy is a Python library for scientific computing over n-dimensional arrays. It supports a wide range of mathematical operations over these arrays. Its core data structure is the ndarray, commonly created through the alias array.

Pandas is a Python library dedicated to data analysis. It is built on top of NumPy and contains many high-level data manipulation tools. Its key data structures are Series (1-dimensional), DataFrames (2-dimensional) and Panels (3-dimensional, size-mutable; deprecated in recent Pandas versions).

Dataset

A dataset is the collection of data on which the machine is to be trained. It can be of any type, such as text, numerical values, images, audio or video. For simplicity, let us start with a dataset in the form of a file containing text and numerical values, as shown in the diagram below.

Dataset

The image above shows a dataset in the form of a .csv (spreadsheet) file with some rows and columns. Each row is known as a record and each column as a variable. The dataset shown above is a 2-dimensional array with 10 rows and 4 columns. It records whether a commodity was purchased or not by people whose age, salary and country are given. This dataset can be used to train a model to predict whether a person will purchase the commodity, based on factors like salary, country and age.

But before building a model and training it on this data, some preprocessing is required. We can see that there are some empty fields, i.e. the data is not complete. We can also observe that the data is not fully numerical: the variables “Country” and “Purchased” hold string values, which need to be converted into numerical values.

Let us now move to a practical example of data preprocessing. For this, we will use a Jupyter notebook, which can be installed as part of the Anaconda distribution for Python. Installers for 64-bit Windows and other operating systems are available on the Anaconda website.

Data preprocessing

1. Importing the NumPy and Pandas libraries and loading the dataset —
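The original post shows this step as a screenshot, which is not included here. A minimal sketch of the same step follows; since the actual CSV file is not provided, a small inline sample with the same columns (and illustrative, assumed values) stands in for it.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Stand-in for the CSV file shown in the post; in practice you would
# call pd.read_csv("Data.csv") on the real file (filename assumed).
csv_data = StringIO(
    "Country,Age,Salary,Purchased\n"
    "France,44,72000,No\n"
    "Spain,27,48000,Yes\n"
    "Germany,30,54000,No\n"
    "Spain,38,61000,No\n"
    "Germany,40,,Yes\n"       # missing Salary
    "France,35,58000,Yes\n"
    "Spain,,52000,No\n"       # missing Age
    "France,48,79000,Yes\n"
    "Germany,50,83000,No\n"
    "France,37,67000,Yes\n"
)

dataset = pd.read_csv(csv_data)
print(dataset.head())
```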

2. Analyzing the dataset for non-numerical and null values —
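Again, the post's screenshot is missing; a hedged sketch of the analysis step is below, using a small stand-in DataFrame with assumed values. Columns of dtype `object` hold strings, and `isnull().sum()` counts the missing entries per column.

```python
import pandas as pd

# Small stand-in for the dataset described above (illustrative values).
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, None, 38.0],
    "Salary": [72000.0, None, 54000.0, 61000.0],
    "Purchased": ["No", "Yes", "No", "No"],
})

print(dataset.dtypes)          # "object" columns are non-numerical
print(dataset.isnull().sum())  # count of null values in each column
```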

3. Changing non-numerical values to numerical values —
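The conversion step can be sketched as follows (the post's own code is not shown here, so this uses plain Pandas label encoding as one common approach; scikit-learn's `LabelEncoder` would achieve the same effect):

```python
import pandas as pd

dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Purchased": ["No", "Yes", "No", "No"],
})

# Map each distinct string to an integer code (label encoding):
# categories are ordered alphabetically, so France=0, Germany=1, Spain=2.
dataset["Country"] = dataset["Country"].astype("category").cat.codes

# For a yes/no variable an explicit mapping keeps the meaning clear.
dataset["Purchased"] = dataset["Purchased"].map({"No": 0, "Yes": 1})

print(dataset)
```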

4. Removing null values —

There are several ways to handle null values. We can either remove the records that contain null values or replace each null with the mean, median or mode of the values present in that column. Removing the whole record loses data to some extent, so replacing the null values is the better option.
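Both options above can be sketched in a few lines of Pandas (the post's own screenshot is not included; the values below are illustrative). The mean is used here for the replacement, as in the text:

```python
import pandas as pd

dataset = pd.DataFrame({
    "Age": [44.0, 27.0, None, 38.0],
    "Salary": [72000.0, None, 54000.0, 61000.0],
})

# Option 1: drop every record that contains a null (loses data).
dropped = dataset.dropna()

# Option 2 (preferred here): replace nulls with the column mean.
dataset["Age"] = dataset["Age"].fillna(dataset["Age"].mean())
dataset["Salary"] = dataset["Salary"].fillna(dataset["Salary"].mean())

print(dataset)
```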

With that, all the simple preprocessing of the dataset is completed before building the model and training it on this data. The dataset is now ready to be given as input to the model.

This brings us to the end of this blog. Upcoming blogs will cover splitting the dataset into dependent and independent variables, and scaling the data for model building. Thank you for reading. Have a good day..!!!
