Importing & Cleaning Data with Python

Data scientists spend a large amount of their time importing and cleaning datasets and getting them down to a form with which they can work.

Shahzaib Khan
14 min read · Jul 27, 2020

Understanding how to prepare your data is an essential skill for working as a data scientist. It’s what you have to do before you can reveal the insights that matter.

In this tutorial, you’ll learn how to import your data from a variety of sources, including .csv, .xls, and more. After importing your data, it’s time to prepare it for analysis. You will be learning the following topics:

Installation
Importing data from CSV (comma-separated values)
Importing data from Text File
Importing data from Excel File
Importing data from JSON File
Data Cleansing

Note: For this post we will be using Windows as our operating system, PyCharm as our IDE, and Pandas as our library.

You can skip the next part if you already have Python, PyCharm, and Pandas installed.

Installation

To start with, we will be using PyCharm, a Python IDE (Integrated Development Environment).

PyCharm is an IDE for professional developers. It is created by JetBrains, a company known for creating great software development tools.

There are two versions of PyCharm:

  • Community — free open-source version, lightweight, good for Python and scientific development
  • Professional — paid version, full-featured IDE with support for Web development as well

PyCharm provides all major features that a good IDE should provide: code completion, code inspections, error-highlighting and fixes, debugging, version control system and code refactoring. All these features come out of the box.

Personally speaking, PyCharm is my favorite IDE for Python development. The only major complaint I have heard about PyCharm is that it’s resource-intensive. If you have a computer with a small amount of RAM (usually less than 4 GB), your computer may lag.

You can download the free PyCharm Community edition from the official JetBrains website and follow the installer’s steps:

Once you have set up Python and PyCharm, let’s install Pandas.

To install Pandas in PyCharm, click File and go to Settings. Under Settings, choose your Python project and select Python Interpreter.

You will see a + button. Click it and search for pandas in the search field. You will see the pandas package on the left side, with its description and version on the right.

Select pandas and click Install Package at the bottom left. This installs the package.

How to test whether pandas is installed

After installation, you can easily check whether pandas is available. In PyCharm, write the following code and run it:

import pandas as pd
print(pd.__version__)

This tutorial uses some sample files; you can download them via: https://bit.ly/issamplelinks

So we are all set to start learning.

Importing data from CSV (comma-separated values)

We can create this file using Windows Notepad by copying and pasting the data below. Save the file as datainput.csv using Notepad’s Save As option with the All Files (*.*) type:

id,emp_name,emp_salary,start_date,dept
1,Vishavjeet,999.3,2018-12-04,IT
2,Nimesh,878.4,2017-09-24,IT
3,Sara,611,2014-11-15,HR
4,Gulfam,729,2014-05-11,IT
5,Jack,845.25,2016-07-27,Finance
6,Jessica,878,2014-05-21,Marketing
7,Saurav,632.8,2018-07-30,Operations
8,Sachin,722.5,2017-06-17,SEO

Note — I have saved the file in D Drive. You can save it to any location you want.

Reading a CSV File

The read_csv function of the pandas library is used to read the contents of a CSV file into the Python environment as a pandas DataFrame. The function can read files from the OS given a proper path to the file.

import pandas as pd
data = pd.read_csv(r'D:\datainput.csv')
print(data)

Note — Change the path to wherever you saved the file. The r prefix makes the path a raw string, so backslashes in Windows paths are not treated as escape characters.

Output:

   id    emp_name  emp_salary  start_date        dept
0   1  Vishavjeet      999.30  2018-12-04          IT
1   2      Nimesh      878.40  2017-09-24          IT
2   3        Sara      611.00  2014-11-15          HR
3   4      Gulfam      729.00  2014-05-11          IT
4   5        Jack      845.25  2016-07-27     Finance
5   6     Jessica      878.00  2014-05-21   Marketing
6   7      Saurav      632.80  2018-07-30  Operations
7   8      Sachin      722.50  2017-06-17         SEO

Reading Specific Rows

The read_csv function of the pandas library can also be used to read specific rows. We slice the result from the read_csv function using the code shown below for the first two rows.

import pandas as pd
data = pd.read_csv(r'D:\datainput.csv')

# Slice the result for first 2 rows
print (data[0:2])

Output:

   id    emp_name  emp_salary  start_date dept
0   1  Vishavjeet       999.3  2018-12-04   IT
1   2      Nimesh       878.4  2017-09-24   IT

DataFrames with Pandas

We could accomplish the same thing using the pandas DataFrame.iloc method. The iloc method allows us to retrieve rows and columns by position. In order to do that, we’ll need to specify the positions of the rows that we want, and the positions of the columns that we want as well.

Let’s take a few examples.

First 2 rows:

import pandas as pd
data = pd.read_csv(r'D:\datainput.csv')

print(data.iloc[:2,:])

Output:

   id    emp_name  emp_salary  start_date dept
0   1  Vishavjeet       999.3  2018-12-04   IT
1   2      Nimesh       878.4  2017-09-24   IT

The entire data frame:

import pandas as pd
data = pd.read_csv(r'D:\datainput.csv')
print(data.iloc[:])

Output:

   id    emp_name  emp_salary  start_date        dept
0   1  Vishavjeet      999.30  2018-12-04          IT
1   2      Nimesh      878.40  2017-09-24          IT
2   3        Sara      611.00  2014-11-15          HR
3   4      Gulfam      729.00  2014-05-11          IT
4   5        Jack      845.25  2016-07-27     Finance
5   6     Jessica      878.00  2014-05-21   Marketing
6   7      Saurav      632.80  2018-07-30  Operations
7   8      Sachin      722.50  2017-06-17         SEO

Rows from position 5 onwards, and columns from position 3 onwards:

import pandas as pd
data = pd.read_csv(r'D:\datainput.csv')
print(data.iloc[5:,3:])

Output:

   start_date        dept
5  2014-05-21   Marketing
6  2018-07-30  Operations
7  2017-06-17         SEO

We will learn more about working with the data in the cleaning phase. For now, let’s continue to our next import method.

Importing data from Text File

The read_csv function of the pandas library can also be used to read a text file, as long as its values are comma-separated.

import pandas as pd
data = pd.read_csv(r'D:\datainput.txt')
print (data)

Output:

   id    emp_name  emp_salary  start_date        dept
0   1  Vishavjeet      999.30  2018-12-04          IT
1   2      Nimesh      878.40  2017-09-24          IT
2   3        Sara      611.00  2014-11-15          HR
3   4      Gulfam      729.00  2014-05-11          IT
4   5        Jack      845.25  2016-07-27     Finance
5   6     Jessica      878.00  2014-05-21   Marketing
6   7      Saurav      632.80  2018-07-30  Operations
7   8      Sachin      722.50  2017-06-17         SEO
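read_csv assumes comma-separated values by default. If your text file uses another delimiter, such as tabs, you can pass the sep argument. Here is a small sketch using an in-memory file in place of a real path on D:

```python
import io
import pandas as pd

# A tab-separated version of the same kind of data, kept in memory
# (io.StringIO stands in for a real file path such as D:\datainput.txt)
text = "id\temp_name\temp_salary\n1\tVishavjeet\t999.3\n2\tNimesh\t878.4\n"
data = pd.read_csv(io.StringIO(text), sep="\t")
print(data)
```

The same sep argument works with a file path; pandas simply splits each line on the given delimiter instead of on commas.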

Importing data from Excel File

The pandas library provides features with which we can read an Excel file in full, or in part for only a selected group of data.

We can also read an Excel file with multiple sheets in it. We use the read_excel function to read the data from it.

Create an Excel file containing the same employee data using the Excel program in Windows, and save it as dataexcel.xlsx.

import pandas as pd
data = pd.read_excel(r'D:\dataexcel.xlsx')
print (data)

Note — You may receive an error if the required Excel engine is not installed. Install it just as we installed pandas: recent versions of pandas read .xlsx files with the openpyxl package, while older versions used xlrd. Once installed, the error will be gone.

Output:

   id    emp_name  emp_salary  start_date        dept
0   1  Vishavjeet      999.30  2018-12-04          IT
1   2      Nimesh      878.40  2017-09-24          IT
2   3        Sara      611.00  2014-11-15          HR
3   4      Gulfam      729.00  2014-05-11          IT
4   5        Jack      845.25  2016-07-27     Finance
5   6     Jessica      878.00  2014-05-21   Marketing
6   7      Saurav      632.80  2018-07-30  Operations
7   8      Sachin      722.50  2017-06-17         SEO

Reading Specific Rows

The read_excel function of the pandas library can also be used to read specific rows. We slice the result from the read_excel function using the code shown below for the first 3 rows.

import pandas as pd
data = pd.read_excel(r'D:\dataexcel.xlsx')
print (data[0:3])

Output:

   id    emp_name  emp_salary  start_date dept
0   1  Vishavjeet       999.3  2018-12-04   IT
1   2      Nimesh       878.4  2017-09-24   IT
2   3        Sara       611.0  2014-11-15   HR
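By default, read_excel returns only the first sheet of a workbook. To pick a sheet by name or position, or to load all sheets at once, pass the sheet_name argument. A sketch that builds a small two-sheet workbook in memory (the sheet names 'emp' and 'dept' are made up for illustration, and this requires the openpyxl package):

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory
# (io.BytesIO stands in for a real path such as D:\dataexcel.xlsx)
buf = io.BytesIO()
with pd.ExcelWriter(buf, engine="openpyxl") as writer:
    pd.DataFrame({"id": [1, 2]}).to_excel(writer, sheet_name="emp", index=False)
    pd.DataFrame({"id": [3]}).to_excel(writer, sheet_name="dept", index=False)

# sheet_name=0 (the default) gives the first sheet; a string selects by name;
# None returns every sheet as a dict of DataFrames keyed by sheet name
sheets = pd.read_excel(io.BytesIO(buf.getvalue()), sheet_name=None)
print(list(sheets))
```

With a real file you would pass the path instead of the BytesIO buffer; the sheet_name argument works the same way.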

Importing data from JSON File

JSON is a syntax for storing and exchanging data. JSON file stores data as text in human-readable format. JSON stands for JavaScript Object Notation. Pandas can read JSON files using the read_json function.

Let’s create a JSON file: open Notepad, copy and paste the following text, and save it as datainput.json.

{
  "ID": ["1", "2", "3", "4", "5", "6", "7", "8"],
  "Name": ["John", "Doe", "Smith", "Carlo", "Mathew", "Cathey", "Paul", "Bob"],
  "Salary": ["623.3", "515.2", "611", "729", "843.25", "578", "632.8", "722.5"],
  "StartDate": ["1/1/2013", "9/23/2014", "11/15/2013", "5/11/2014", "3/27/2015", "5/21/2013", "7/30/2013", "6/17/2014"],
  "Dept": ["IT", "Operations", "IT", "HR", "Finance", "IT", "Operations", "Finance"]
}

Now let’s read.

import pandas as pd
data = pd.read_json(r'D:\datainput.json')
print (data)

Output:

  ID Name Salary StartDate Dept
0 1 John 623.30 1/1/2013 IT
1 2 Doe 515.20 9/23/2014 Operations
2 3 Smith 611.00 11/15/2013 IT
3 4 Carlo 729.00 5/11/2014 HR
4 5 Mathew 843.25 3/27/2015 Finance
5 6 Cathey 578.00 5/21/2013 IT
6 7 Paul 632.80 7/30/2013 Operations
7 8 Bob 722.50 6/17/2014 Finance

Reading Specific Columns and Rows

import pandas as pd
data = pd.read_json(r'D:\datainput.json')
#Reading Specific Columns and Rows
print (data.iloc[3:6,1:3])

Output:

  Name Salary
3 Carlo 729.00
4 Mathew 843.25
5 Cathey 578.00
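One thing to notice: the JSON above stores every value, including Salary, as a string, so that column comes in with object dtype. A small sketch of converting such a column to a numeric type before doing arithmetic (using an inline frame with the same kind of string values):

```python
import pandas as pd

# The same style of string-valued Salary column as in datainput.json
data = pd.DataFrame({"Salary": ["623.3", "515.2", "611", "729.0"]})

# Convert the text column to a numeric dtype so aggregation works correctly
data["Salary"] = pd.to_numeric(data["Salary"])
print(data["Salary"].mean())
```

Without the conversion, operations such as mean() either fail or behave unexpectedly on string data.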

Data Cleansing

Data scientists spend a large amount of their time cleaning datasets and getting them down to a form with which they can work. In fact, according to IBM Data Analytics, you can expect to spend up to 80% of your time cleaning data.

Before you start cleaning a data set, it’s a good idea to get a general feel for your data. After that, you can put together a plan for cleaning it up.

Look for:

  1. What are the features?
  2. What are the expected types (int, float, string, boolean)?
  3. Is there obvious missing data?
  4. Are there other types of missing data that are not so obvious?

The data we’re going to work with is a very small real estate dataset.

PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
100001000,104,PUTNAM,Y,3,1,1000
100002000,197,LEXINGTON,N,3,1.5,--
100003000,,LEXINGTON,N,n/a,1,850
100004000,201,BERKELEY,12,1,NaN,700
,203,BERKELEY,Y,3,2,1600
100006000,207,BERKELEY,Y,NA,1,800
100007000,NA,WASHINGTON,,2,HURLEY,950
100008000,213,TREMONT,Y,1,1,
100009000,215,TREMONT,Y,na,2,1800

Let’s start by importing this data:

# Importing libraries
import pandas as pd

# Read csv file into a pandas dataframe
df = pd.read_csv(r"D:\data.csv")

# Take a look at the first few rows
print (df.head())

Now I can answer my original question: what are my features? It’s pretty easy to infer the following from the column names:

  • ST_NUM: Street number
  • ST_NAME: Street name
  • OWN_OCCUPIED: Is the residence owner occupied
  • NUM_BEDROOMS: Number of bedrooms

We can also answer, what are the expected types?

  • ST_NUM: float or int… some sort of numeric type
  • ST_NAME: string
  • OWN_OCCUPIED: string… Y (“Yes”) or N (“No”)
  • NUM_BEDROOMS: float or int, a numeric type
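You can check the types pandas actually inferred against these expectations with the dtypes attribute. A small sketch on a stand-in frame (the real file lives at D:\data.csv):

```python
import pandas as pd

# A stand-in with the same kinds of columns as the real estate data
df = pd.DataFrame({
    "ST_NUM": [104.0, 197.0, None],
    "ST_NAME": ["PUTNAM", "LEXINGTON", "LEXINGTON"],
    "OWN_OCCUPIED": ["Y", "N", "N"],
})

# dtypes shows the type pandas inferred for each column
print(df.dtypes)
```

A numeric column that unexpectedly shows up as object dtype is often the first hint that non-standard missing values are hiding in it.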

To answer the next two questions, we’ll need to dig deeper into pandas. Let’s start by looking at examples of how to detect missing values.

Standard Missing Values

So what do I mean by “standard missing values”? These are missing values that pandas can detect on its own. Going back to our original dataset, take a look at the ST_NUM (street number) column: the third row has an empty cell, and the seventh row contains “NA”.

Clearly these are both missing values. Let’s see how pandas deals with them.

# Importing libraries
import pandas as pd

# Read csv file into a pandas dataframe
df = pd.read_csv(r"D:\data.csv")

print (df['ST_NUM'])

So it can be clearly seen that the missing values are detected as NaN. Using the isnull() method, we can confirm that both the empty cell and “NA” were recognized as missing values: both boolean responses are True.

# Importing libraries
import pandas as pd

# Read csv file into a pandas dataframe
df = pd.read_csv(r"D:\data.csv")

print (df['ST_NUM'].isnull())

Pandas will recognize both empty cells and “NA” types as missing values. In the next section, we’ll take a look at some types that Pandas won’t recognize.

Non-Standard Missing Values

Sometimes missing values come in formats that pandas does not recognize by default.

Let’s take a look at the “Number of Bedrooms” column to see what I mean.

In this column, there are three missing values, each in a different format:

  • n/a
  • NA
  • na

From the previous section, we know that Pandas will recognize “NA” as a missing value, but what about the others? Let’s take a look.

# Importing libraries
import pandas as pd

# Read csv file into a pandas dataframe
df = pd.read_csv(r"D:\data.csv")

print (df['NUM_BEDROOMS'])
print (df['NUM_BEDROOMS'].isnull())

Output:

0 3
1 3
2 NaN
3 1
4 3
5 NaN
6 2
7 1
8 na
Name: NUM_BEDROOMS, dtype: object
0 False
1 False
2 True
3 False
4 False
5 True
6 False
7 False
8 False
Name: NUM_BEDROOMS, dtype: bool

Just like before, pandas recognized the “NA” as a missing value. Unfortunately, the other formats weren’t recognized; note the “na” in row 8 of the output above.

An easy way to detect these various formats is to put them in a list. Then when we import the data, Pandas will recognize them right away. Here’s an example of how we would do that.

# Importing libraries
import pandas as pd

# Making a list of missing value types
missing_values = ["n/a", "na", "--"]
df = pd.read_csv(r"D:\data.csv", na_values=missing_values)

# Looking at the NUM_BEDROOMS column
print (df['NUM_BEDROOMS'])
print (df['NUM_BEDROOMS'].isnull())

Output:

0 3.0
1 3.0
2 NaN
3 1.0
4 3.0
5 NaN
6 2.0
7 1.0
8 NaN
Name: NUM_BEDROOMS, dtype: float64
0 False
1 False
2 True
3 False
4 False
5 True
6 False
7 False
8 True
Name: NUM_BEDROOMS, dtype: bool

Note — This time, all of the different formats were recognized as missing values.

Unexpected Missing Values

So far we’ve seen standard missing values, and non-standard missing values. What if we have an unexpected type?

For example, if our feature is expected to be a string, but there’s a numeric type, then technically this is also a missing value.

Let’s take a look at the “Owner Occupied” column to see what I’m talking about.

From our previous examples, we know that Pandas will detect the empty cell in row seven as a missing value. Let’s confirm with some code.

# Importing libraries
import pandas as pd

# Making a list of missing value types
missing_values = ["n/a", "na", "--"]
df = pd.read_csv(r"D:\data.csv", na_values=missing_values)

# Looking at the OWN_OCCUPIED column
print (df['OWN_OCCUPIED'])
print (df['OWN_OCCUPIED'].isnull())

Output:

0 Y
1 N
2 N
3 12
4 Y
5 Y
6 NaN
7 Y
8 Y
Name: OWN_OCCUPIED, dtype: object
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
Name: OWN_OCCUPIED, dtype: bool

In the fourth row, there’s the number 12. The response for Owner Occupied should clearly be a string (Y or N), so this numeric type should be a missing value.

There’s a number of different approaches to resolve this issue, but here’s the way that I’m going to work through this one.

  1. Loop through the OWN_OCCUPIED column
  2. Try and turn the entry into an integer
  3. If the entry can be changed into an integer, enter a missing value
  4. If the entry can’t be converted to an integer, we know it’s a string, so keep going

Let’s take a look at the code and then we’ll go through it in detail.

# Importing libraries
import pandas as pd
import numpy as np

# Making a list of missing value types
missing_values = ["n/a", "na", "--"]
df = pd.read_csv(r"D:\data.csv", na_values=missing_values)

# Detecting numbers
cnt = 0
for row in df['OWN_OCCUPIED']:
    try:
        int(row)
        df.loc[cnt, 'OWN_OCCUPIED'] = np.nan
    except ValueError:
        pass
    cnt = cnt + 1

print(df['OWN_OCCUPIED'])

Output:

0 Y
1 N
2 N
3 NaN
4 Y
5 Y
6 NaN
7 Y
8 Y

You’ll notice that I used try and except ValueError. This is called exception handling, and we use this to handle errors.

If we were to try to change an entry into an integer and it couldn’t be changed, a ValueError would be raised and the code would stop. To deal with this, we use exception handling to catch these errors and keep going.

Another important bit of the code is the .loc method. This is the preferred Pandas method for modifying entries in place.
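As an aside, the same cleanup can be done without an explicit Python loop. This is a sketch of one vectorized alternative, not the method used above: pd.to_numeric with errors='coerce' turns every entry that parses as a number into a float (and everything else into NaN), and mask blanks out exactly those numeric entries.

```python
import numpy as np
import pandas as pd

# Stand-in for df['OWN_OCCUPIED'] after import
s = pd.Series(['Y', 'N', 'N', '12', 'Y', 'Y', np.nan, 'Y', 'Y'])

# Entries that parse as numbers are invalid for a Y/N column
as_number = pd.to_numeric(s, errors='coerce')
cleaned = s.mask(as_number.notna())
print(cleaned)
```

The '12' entry becomes NaN while the Y/N values (and the existing NaN) pass through unchanged, matching the loop-based result.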

Summarizing Missing Values

After we’ve detected the missing values, we will probably want to summarize them. For instance, we might want to look at the total number of missing values for each feature.

# Total missing values for each feature
print (df.isnull().sum())

Output:

PID 1
ST_NUM 2
ST_NAME 0
OWN_OCCUPIED 2
NUM_BEDROOMS 3
NUM_BATH 1
SQ_FT 2
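Beyond the per-column counts, two other quick checks are often useful. A small sketch on a stand-in frame:

```python
import numpy as np
import pandas as pd

# A tiny stand-in frame with one missing cell
df = pd.DataFrame({"a": [1, np.nan], "b": [3, 4]})

# Are there any missing values at all?
print(df.isnull().values.any())

# Total number of missing cells across the whole frame
print(df.isnull().sum().sum())
```

The first check is a fast yes/no answer; the second gives the grand total by summing the per-column counts.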

Replacing

Oftentimes you’ll have to figure out how you want to handle missing values. Sometimes you’ll simply want to delete those rows; other times you’ll replace them.

# Replace missing values with a number
df['ST_NUM'].fillna(125, inplace=True)

You might also want to do a location-based imputation, replacing a single cell by its row label and column name:

# Location-based replacement: set row 2 of the ST_NUM column
df.loc[2, 'ST_NUM'] = 125

More often, though, you’ll replace all missing values in a column at once:

# Replace missing values with N
df['OWN_OCCUPIED'].fillna('N', inplace=True)

# Replace missing values with 0
df['NUM_BATH'].fillna(0, inplace=True)

# Replace missing values with 0
df['SQ_FT'].fillna(0, inplace=True)

# Replace using median
median = df['NUM_BEDROOMS'].median()
df['NUM_BEDROOMS'].fillna(median, inplace=True)

# Drop any remaining rows with missing values
df = df.dropna()
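A plain dropna() drops every row that still contains any missing value. It also accepts arguments that control what gets dropped; a brief sketch on a small stand-in frame:

```python
import numpy as np
import pandas as pd

# Stand-in frame: row 0 is partly missing, row 1 is fully missing, row 2 is complete
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

only_complete = df.dropna()           # drop rows with any missing value
all_missing = df.dropna(how='all')    # drop only rows where every value is missing
by_column = df.dropna(subset=['a'])   # only consider column 'a' when deciding
print(len(only_complete), len(all_missing), len(by_column))
```

Choosing between these depends on how much data you can afford to lose: how='all' is the most conservative, while the default drops any row with even one gap.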

Dealing with messy data is inevitable. Data cleaning is just part of the process in a data science project.

In this tutorial we went over some ways to detect, summarize, and replace missing values.

Closing Remarks:

I hope it was helpful for you all. Feel free to share your ideas.



Shahzaib Khan

Developer / Data Scientist / Computer Science Enthusiast. Founder @ Interns.pk You can connect with me @ https://linkedin.com/in/shahzaibkhan/