Removing characters before, after, and in the middle of strings

This Time Is Different
3 min read · Oct 30, 2017

When working with real-world datasets in Python and pandas, you will need to remove characters from your strings *a lot*. I find these three methods can solve many of those problems:

.split() # splits the string into a list of substrings at every occurrence of the separator it was given, and drops that separator. If left blank, splits on whitespace.
.lstrip() # strips characters from the start of the string for as long as they belong to the set of characters you give. If left blank, deletes whitespace at the start.
.rstrip() # strips characters from the end of the string for as long as they belong to the set of characters you give. If left blank, deletes whitespace at the end.

All three are covered under string methods in the Python documentation.
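
To make those descriptions concrete, here is a minimal sketch of the three methods on small strings. The sample strings are made up for illustration. The last example shows something that is easy to miss: the argument to .lstrip() and .rstrip() is treated as a set of characters, not as a prefix or suffix.

```python
# Sample strings are made up for illustration.
parts = "a,b,c".split(",")
print(parts)  # ['a', 'b', 'c']

padded = "   hello   "
print(repr(padded.lstrip()))  # 'hello   '
print(repr(padded.rstrip()))  # '   hello'

# The argument is a set of characters, not a prefix.
stripped = "xxyhello".lstrip("xy")
print(stripped)  # hello

# Pitfall: every leading character found in the set is removed,
# including the 'M' of 'Martin'.
surprise = "Mr. Martin".lstrip("Mr. ")
print(surprise)  # artin
```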

Let’s walk through how we might use these commands. I’ll be using the Titanic dataset on Kaggle, and I want to get the first name from every passenger.

The first name on the list is:

'Braund, Mr. Owen Harris'

We only want the name “Owen” because we are only interested in first names. Every male name on the passenger list follows the pattern below (the women’s names are different, we’ll get to that in a bit):

Last_Name, Salutation. First_Name Other_Names

Below is the solution in one line of code. After that, I’ll break it down step by step.

name = 'Braund, Mr. Owen Harris'
first_name = name.split('.')[1].lstrip().split(' ')[0]

That returns:

'Owen'

That’s a lot of code! Let’s walk through those commands step by step.

Since every name contains a salutation, and thus a period, before the first name, we can get rid of everything up to and including the period. So first, we split the name at the period, which gives us a list of two strings.

splitting = name.split('.')
print(splitting)
# ['Braund, Mr', ' Owen Harris']
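
One caveat about this step: .split('.') splits at every period, not just the first one. As a hedged sketch, the optional second argument (maxsplit) limits how many splits are made. The second name below is hypothetical, made up to show what happens when a name contains an extra period.

```python
name = 'Braund, Mr. Owen Harris'
print(name.split('.', 1))  # ['Braund, Mr', ' Owen Harris']

# Hypothetical name with a trailing initial, made up for illustration.
tricky = 'Doe, Mr. John A.'
every_period = tricky.split('.')
first_period_only = tricky.split('.', 1)
print(every_period)       # ['Doe, Mr', ' John A', '']
print(first_period_only)  # ['Doe, Mr', ' John A.']
```

Either way, index position 1 still holds the text right after the salutation.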

We only want ‘Owen Harris’; that first part is useless. So I just select the second string within the list, which is at index position 1.

First_Last = splitting[1]
print(repr(First_Last))  # repr() keeps the quotes, so the leading space is visible
# ' Owen Harris'

But we don’t want that ugly blank space at the beginning. That’s why we use .lstrip(). In this case, we just want to remove blank space, so no need to pass it any arguments.

First_Last = First_Last.lstrip()
print(repr(First_Last))  # repr() keeps the quotes, so we can see the space is gone
# 'Owen Harris'

Now we want to get rid of everything after the first name. We do that using .split() again. This time is easier because there is always a space right after the first name. Because .split() drops the character it splits on, we’ll be left with a list of the separate names, with no spaces. From there we just want the first item, at index position 0.

First_Last_Split = First_Last.split(' ')
print(First_Last_Split)
# ['Owen', 'Harris'], so let's just grab 'Owen'
First = First_Last_Split[0]
print(First)
# Owen
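
A side note on this step, as a sketch rather than a change to the approach: calling .split() with no argument splits on any run of whitespace and ignores whitespace at the start and end, whereas .split(' ') splits at every single space. The double-spaced string is made up for illustration.

```python
First_Last = ' Owen Harris'  # the value before .lstrip() was applied

with_space_arg = First_Last.split(' ')
no_arg = First_Last.split()
print(with_space_arg)  # ['', 'Owen', 'Harris']
print(no_arg)          # ['Owen', 'Harris']

# Made-up input with two spaces between the names.
double_spaced = 'Owen  Harris'
print(double_spaced.split(' '))  # ['Owen', '', 'Harris']
print(double_spaced.split())     # ['Owen', 'Harris']
```

So .split(' ') is fine for clean data like this name, but the no-argument form is more forgiving when the spacing is irregular.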

So one more time, here is the whole thing put together in one line.

first_name = name.split('.')[1].lstrip().split(' ')[0]
print(first_name)
# Owen
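
Since the goal is the first name of every passenger, the one-liner can be wrapped in a function and applied to many names. This is only a minimal sketch: get_first_name is a hypothetical helper name, and the second entry in the list is a made-up name in the same pattern, not a row from the dataset.

```python
def get_first_name(name):
    """Return the first word after the salutation's period."""
    return name.split('.')[1].lstrip().split(' ')[0]

names = [
    'Braund, Mr. Owen Harris',
    'Doe, Mr. John Adam',  # made-up name in the same pattern
]
first_names = [get_first_name(n) for n in names]
print(first_names)  # ['Owen', 'John']
```

With pandas, the same function could be passed to a column's .map() method, for example df['Name'].map(get_first_name), assuming the names live in a column called Name of a DataFrame called df.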

Bonus: .rstrip()

So that solves the problem for the men in the dataset, but now let’s look at the names of married women, which are listed under the husband’s name. They look like this:

female_name = 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'

The name we want here is Florence. That is the passenger’s first name, but she’s listed under her husband’s name. Here’s how we get just her name:

female_name = female_name.split('.')[1].lstrip().split('(')[1]

That gets us to ‘Florence Briggs Thayer)’. Now we strip the closing parenthesis from the end:

almost_there = female_name.rstrip(')')

That gets us to ‘Florence Briggs Thayer’ because .rstrip() works like .lstrip() but at the end of the string. From there, we split on the space and take the first item, just like before.

female_first_name = almost_there.split(' ')[0]
print(female_first_name)
# Florence
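
The two cases can also be combined into one helper. This is a sketch under the assumption that a name given in parentheses, when present, is the passenger's own name; extract_first_name is a hypothetical function name, not something from the dataset or from pandas.

```python
def extract_first_name(name):
    """Return the passenger's first name for either pattern above."""
    # Everything after the salutation's period, minus the leading space.
    after_salutation = name.split('.', 1)[1].lstrip()
    if '(' in after_salutation:
        # Assumption: the parentheses hold the passenger's own name.
        after_salutation = after_salutation.split('(', 1)[1].rstrip(')')
    return after_salutation.split(' ')[0]

print(extract_first_name('Braund, Mr. Owen Harris'))
# Owen
print(extract_first_name('Cumings, Mrs. John Bradley (Florence Briggs Thayer)'))
# Florence
```

Names that do not fit either pattern (for example, one with no period at all) would raise an IndexError here, so real data may need an extra check.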

Journal of Alexander Holt. Data science, economics, everything else.