Removing Unwanted Spaces from Data Frame Records

Brent Smart
3 min readJul 19, 2022

--

Photo by Kenny Eliason on Unsplash

We have been working student data in effort to clean it up and make it ready for building predictive models.

In a previous post, we discussed different ways to format column names. In another past post, we also talked about identifying missing data. While these initials steps work well in cleaning up data, spaces in records can present somewhat of a challenge. Since a space is counted as a character they are included in the count of non-null values with the .info() method and excluded from the .isna().sum(). Let’s preview our student academic records data frame using the .info() method.

Columns 1–4 of data frame before removing spaces from records.

According to the summary, there is only 1 record missing from that entire BOY Indep Level column. However, when the spaces are removed, notice the change in missing records.

Columns 1–4 of data frame after removing spaces from records.

After removing spaces from our data frame, we found 13 additional missing records. How did we achieve this? Essentially, we want to select the record and replace that space with something. In this case, because I have created loops to address null values, I will replace the space with null values using Numpy.nan.

Let’s first print all Beginning Of Year Independent Reading Levels (BOY IP Reading Level) and Middle Of Year Independent Reading Levels (MOY IP Reading Level).

Print out of BOY and MOY IP Reading levels starting at Index 24.

We can replace the spaces in the BOY IP Reading Level for Index 24.

Replacing BOY IP Reading Level at Index 24.
Replacing space in BOY IP Reading Level in Index 24 with null value.

To replace records with a space in that specific column, we could type the following:

df[‘BOY IP Reading Level’].str.replace(“ ”, “”)

The problem here is this will not work for records in that column with more than one space. Additionally, it can be extremely tedious and time consuming to check each record and replace each space with null values, especially if we need to select multiple columns. That’s where we can use a For Loop, select specific columns containing strings, and remove those spaces. If you don’t specify the columns containing strings (these will be your columns of type ‘object’), and simply apply it to the entire data frame, you will get this error message:

(Incorrectly) replacing all spaces in all columns of all types.

If we specify string columns like below using select_dtypes(include=[object]), in addition to regex, we can remove all spaces from these columns. (I included comments explaining what each part does.)

Removing all spaces from string (object) columns.

I prefer this method because it accounts for any number of spaces from all string columns, which simplifies the process. I could have also made a list of columns and loop through that list. You may want to do this if there’s a column that should contain spaces (like a column with names).

Hope this helps! Happy data cleaning!

Reach out with comments and questions: chatconsmartbre@gmail.com

--

--