Photo by Jeremy C on Unsplash

Python basic data cleaning using Pandas

Bernard Kurka

--

Hello everyone! In this post, I’ll show some basic coding examples for you to get started on your data cleaning. Using Pandas, which is a Python library built to simplify data cleaning and transformation. With Pandas, we can import an excel table and assign it a data frame, which is the pandas equivalent for an excel named table.

In this post you will learn:

  1. How to import data from Excel.
  2. Basics on Pandas data frame structure
  3. Select index and column names.
  4. Change a column name
  5. Find null values in the data frame
  6. Replace null values.
  7. Fix wrong data entry

To follow this example you can use an online Jupter notebook available on https://jupyter.org/try.

The Data:

Image from Pixabay

I used a dataset with information on United States presidents that are currently alive. I gathered the data from Wikipedia and then saved it in an excel file in the same folder as the Jupiter notebook. If you don’t know what is Jupyter notebook you can try it online for free here. The data is excel looks like this:

Data in the excel spreadsheet.

Let’s get Started with Python!

  1. How to import data from an excel into a Jupyter notebook.

The Excel File is available in my GitHub.

import pandas as pd
df = pd.read_excel("./presidents_names.xlsx")
df

If you don’t have the excel file you can use this to import the data:

import pandas as pd
data_dic = {'Name': {0: 'Jimmy C@rter', 1: 'Donald Trump', 2: 'George W. Bush', 3: 'Bill Clinton', 4: 'Barack Obama'}, 'born': {0: Timestamp('2024-01-10 00:00:00'), 1: Timestamp('1946-06-14 00:00:00'), 2: Timestamp('1946-07-06 00:00:00'), 3: Timestamp('1946-08-19 00:00:00'), 4: Timestamp('1961-08-04 00:00:00')}, 'in office': {0: 1977, 1: 2017, 2: 2001, 3: 1993, 4: 2009}, 'in office.1': {0: 1981.0, 1: nan, 2: 2009.0, 3: 2001.0, 4: 2017.0}}
df = pd.DataFrame(data_dic)
df
Jupyter notebook imported Data Frame

In the first line, I’ve imported Pandas and assigned a symbolic name also know as “Alias”. The variable pd will refer to pandas objects. Second line imports from excel file and stores the data in a variable called df. Finally, the third line code will make Jupiter display df once the code is executed. Another way to make the display would be to use display(df).

Note that "./presidents_names.xlsx" is the relative path. The other option would be using an absolute path (for example: "C:\Users\berku\Google Drive\presidents_names.xlsx") it’s best to use the relative path.

2. Basics on pandas data frame structure

Pandas data frame can be broken down into four parts: Index, Column Values, Column names, and Row.

Pandas Data Frame Structure

Index: The Index numbers were created automatically by the pd.read_excel, you can change Index values if you wish, in this post I will stick to the default index throughout the post. In this case, the index name is empty.

Column Values: The values contained in a single column is also known as Pandas Series. A Panda Series is an object that contains a series of indexes and values.

Column Names: Names used for each column data, it’s a best practice to remove special characters and spaces from column names.

Row: The row is also a Panda Series.

3. Selecting index and column names.

Selecting Index labels:

df.index.values
results from Jupyter notebook

This code creates a list of index labels. If Index labels are not numerical, you can a select them using only the code df.index .

Selecting Column names:

df.columns().values
results from Jupyter notebook

This creates a list of column names.

4. Change a column name

In this example I will replace:

"in office" by "office_start_year"

"in office.1" by office_end_year"

col_dic = {'in office': 'office_start_year', 'in office.1':            'office_end_year'}
df.rename(col_dic,axis=1, inplace=True)
df
results from Jupyter notebook

I created a dictionary called col_dic with the old name and the new one, and then I used the data frame rename attribute to rename the axis=1 (columns) labels. Also used the code inplace=True to make the changes overwrite the existing labels. An alternative solution would be:

df.columns = ['born','office_start_year','office_end_year']

5. Find null values in the data frame

df.isnull().sum()
results from Jupyter notebook

This will point exactly which columns have the null values.

6. Replace null values

df['office_end_year'].fillna(0, inplace=True)
df
results from Jupyter notebook

This code will replace only null values in the column named office_end_year . If I used, df.fillna(0,inplace=True) I would replace all null values in the data frame.

7. Fix wrong values

There is an incorrect data entry in Jimmy Carter’s name (row label = 0, column label = Name). There are two ways to fix it.

df['Name'] = df['Name'].str.replace("@","a")df.loc[0,"Name"] = 'Jimmy Carter'

The first line will replace the character “@with the letter “afor all values in the Name column. The second one will set the value of the data frame position equal to Jimmy Carter. I will explain in detail the .loc attribute in the next post you can read more about it here.

You can download the code and the excel data here.

Just the beginning

There is much more to cover on how to wrangle data in pandas.

Thanks for reading.

--

--

Bernard Kurka

Passionate about science, technology, and business. I love to use technology to solve problems and help people.