R4DS Week 7: Star Wars, regex, and string manipulation


Up until this past week, my entire strategy for using regular expressions (shortened to regex or regexp, but not reprex) has literally been to Google something along the lines of “replace ‘xyz’ with ‘abc’ string R”, which almost always returns a regex solution to my problem.

After typing in and running what looks like absolute nonsense, I end up with a solution that works, but I never really know why.

We’re doing this!

What is a regular expression?

A regular expression is a concise bit of language that describes a pattern within strings. Another way of saying this is that a regular expression is a specific arrangement of text that represents a pattern we’re trying to match within our dataset.

We’ll use a metaphor to help clarify things:

Let’s say that your friend has asked you to go to the grocery store and pick up a package of Oreos.

  • The grocery store is your dataset
  • The package of Oreos is the pattern you need to match

Now you or I as humans would walk into the grocery store and head to the cookie aisle, which seems like a reasonable place to find cookies.

But if the grocery store is our dataset, and we were to behave as computers, we’d walk into the grocery store and look at the first item we encounter in order to see if it matches our pattern, a package of Oreos. If it doesn’t, we look at the next item. In fact, we’d look at every single item in the grocery store until we found a pattern that matched.

A package of Oreos is a pretty broad category, considering that you can get Oreos in different flavors, shapes, and sizes. So we can use regexes to be as broad or as specific as we want in our pattern matching.

  • Broad pattern match: find me a package of Oreos
  • More specific pattern match: find me a package of round, regular-sized Double Stuff Oreos
  • Highly specific pattern match: find me all of the packages of round, regular-sized Double Stuff Oreos that are immediately to the right of Pepperidge Farm Milano cookies. When you find packages of Oreos that fit this description, take them off of the shelf and replace each package with a one pound box of Land O’Lakes unsalted butter.
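
To make the metaphor a little more concrete, here is a minimal sketch of the three levels of specificity using stringr. The shelf vector and its contents are made up for illustration, and the lookbehind in the last step goes beyond what this post covers, so treat it as a preview rather than something you need to understand yet.

```r
library(stringr)

# A hypothetical shelf, read left to right (made up for illustration)
shelf <- c(
  "Pepperidge Farm Milano",
  "Double Stuff Oreos",
  "Golden Oreos",
  "Chips Ahoy"
)

# Broad: does the item mention Oreos at all?
broad <- str_detect(shelf, "Oreos")
broad

# More specific: is the item exactly "Double Stuff Oreos"?
# ^ anchors the start of the string and $ anchors the end
specific <- str_detect(shelf, "^Double Stuff Oreos$")
specific

# Highly specific: find Double Stuff Oreos immediately to the right of
# Milano cookies and replace them with butter.
# (?<=...) is a lookbehind: it checks what comes before the match
# without including it in the replacement.
aisle <- str_c(shelf, collapse = " | ")
restocked <- str_replace_all(
  aisle,
  "(?<=Milano \\| )Double Stuff Oreos",
  "Land O'Lakes unsalted butter"
)
restocked
```

The broad pattern matches both kinds of Oreos, the specific pattern matches only one item, and the last pattern only matches when the neighbor to the left is a package of Milanos.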

OK, but WHY do we need to learn and use regexes?

There are a lot of reasons to use regular expressions when working with text-based data, and this is one of my favorite summary statements:

[Image: the quoted summary statement, in which “REs” stands for regular expressions]

To be more functional in our definition, we could say that regexes allow us to find something in our data and then sometimes do something with what we’ve found.

A non-exhaustive list of examples of what “sometimes do something with what we’ve found” may look like:

  • Changing strings to all caps, all lower case, or all title case, like when we want “APPLES”, “apples”, and “ApPlEs” to all be converted to “apples”.
  • Removing pieces of strings in a particular location, like when our column headers have information we don’t need or want.
  • Editing pieces of strings, like when we want to find and replace information.
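
As a rough preview, each item in the list above might look something like the following in R. This is only a sketch: the vectors are made up for illustration, and the stringr functions used here are introduced properly later in the post.

```r
library(stringr)

# Changing case: three spellings of the same word (made-up data)
fruit_names <- c("APPLES", "apples", "ApPlEs")
fruit_lower <- str_to_lower(fruit_names)
fruit_lower

# Removing a piece of a string: drop a unit suffix from hypothetical column headers
headers <- c("height_cm", "mass_kg")
headers_clean <- str_remove(headers, "_[a-z]+$")
headers_clean

# Editing a piece of a string: find "grey" and replace it with "gray"
eye_notes <- c("grey eyes", "blue-grey eyes")
eye_notes_fixed <- str_replace(eye_notes, "grey", "gray")
eye_notes_fixed
```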

If your data originates in Excel, one way to know if regular expressions would be helpful is if you find yourself doing a lot of “find and replace” within your Excel spreadsheet before bringing your data into R.

If you’d like a more exhaustive list of examples for the how’s and why’s of using regular expressions, these are two resources that may be helpful:

Crowdsourcing the how’s and why’s of regexes in R:

When in doubt, ask the Twitterverse:

Read a book on string manipulation and regexes in R:

Even if you don’t feel up to reading the entire “Handling and Processing Strings in R” text (but if you do, let’s talk about it on Twitter!), check out the Preface on page 7, which contains insights like the following:

Yes, it is true that you won’t get a Michelin star for processing character data. But you would hardly become a good data cook if you don’t get your hands dirty with string manipulation. And to be honest, it’s not always that boring. Whether you like it or not, no one should ever claim to be a data analyst until he or she has done string manipulation. — Gaston Sanchez

Quick refresher: what is a string?

A string is a sequence of characters that we can work with as a single piece of data in R. Within R, strings belong to the class called character.

Strings can consist of numbers, letters, spaces, and punctuation. When we create a string in R, we wrap everything in quotation marks (double quotes by convention, although single quotes work too), like this: "these words are considered a string".

Note: we’re not going to cover escapes — read through section 14.2 of “R for Data Science” for more information on how to do this.

# run the following in R:
# R recognizes the number 2 as a number when the number is not wrapped in quotes
> 2
# Double check this by running:
> class(2)
# Now wrap the number 2 in double quotation marks:
> "2"
# Double check that this is a string by running:
> class("2")
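
Why does the difference between 2 and "2" matter? Here is a small sketch of what happens when we try to do math with each one. The variable names are just for illustration.

```r
# Numbers can be added together
number_sum <- 2 + 2
number_sum

# Strings cannot: "2" + "2" throws an error, which we catch here
# so that the rest of the script keeps running
string_sum <- tryCatch(
  "2" + "2",
  error = function(e) conditionMessage(e)
)
string_sum

# Converting the string back to a number makes the math work again
converted_sum <- as.numeric("2") + 2
converted_sum

# Single and double quotes create the same string
same_string <- identical("2", '2')
same_string
```

This is a common surprise when numbers are read in from a spreadsheet as text: they look like numbers, but R treats them as strings until we convert them.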

How do you know if you have strings in your dataset?

# run the following in R:
# the starwars dataset comes with the dplyr package, so load dplyr first
> library(dplyr)
> head(starwars)

Your output should look something like this:

[Image: output from head(starwars) from within RStudio]

Underneath each column name is an abbreviation enclosed within the less than and greater than symbols, like this: <abbreviation>. When we see <chr> we know we’re working with character data, or strings!
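
If you’d rather not scan the column headers by eye, one possible way to list the character columns is sketched below. It assumes the dplyr package is installed (that is where the starwars data lives), and chr_cols is just a name chosen for illustration.

```r
library(dplyr)

# For each column, ask whether it holds character data
is_chr <- sapply(starwars, is.character)

# Keep only the names of the columns where the answer is TRUE
chr_cols <- names(starwars)[is_chr]
chr_cols
```

The exact list depends on your version of dplyr, but name and eye_color should be in it, while numeric columns like height should not.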

So are you ready to dive into stringr? Of course you’re ready.

Let’s go!

The many functions of stringr

What is stringr?

stringr is an R package designed to make working with strings easier by providing a consistent set of rules and conventions. stringr is part of the tidyverse; recent versions of the tidyverse (1.2.0 and later) attach it with library(tidyverse), but we can also load it on its own, like this:

> library(stringr)
# if you get an error from running the above code, run the following:
> install.packages("stringr")
> library(stringr)

When (and how) do I use stringr?

We use the stringr package to manipulate strings within our dataset, often through the use of regular expressions. We’ll walk through several of the common use cases below in order to help you get your bearings with regexes. Our goal is to get familiar with the basics of string manipulation with only a light use of regexes so that you have a broad understanding of what’s possible.

When you want to change the case of your text

# LET'S CHANGE EVERYTHING TO UPPERCASE
# template: str_to_upper(string, locale = "en")
# examples:
> str_to_upper(starwars$name, locale = "en")
> str_to_upper(starwars$homeworld, locale = "en")

# let's change everything to lowercase
# template: str_to_lower(string, locale = "en")
# examples:
> str_to_lower(starwars$name, locale = "en")
> str_to_lower(starwars$homeworld, locale = "en")

# Let's Change Everything To Title Case
# template: str_to_title(string, locale = "en")
# examples:
> str_to_title(starwars$name, locale = "en")
> str_to_title(starwars$homeworld, locale = "en")

The above code can be helpful, but type in the following:

> str_to_lower(starwars$name, locale = "en")

Notice that the output has converted all of our names to lower case — fantastic!

Now run head(starwars).

Hmm. All of our names are back to their original case. This is because we didn’t save the result, so our data was never permanently changed. To work through the next set of examples, we’re going to create a modified version of the starwars dataset. To do so, run the following:

# load dplyr (if it isn't loaded already), which provides starwars, %>%, and mutate()
> library(dplyr)
# create two new columns that have the case changed
> star_wars <- starwars %>%
+   mutate(
+     name_lower = str_to_lower(name),
+     eye_color_upper = str_to_upper(eye_color)
+   )
# we dropped locale = "en" in str_to_lower and str_to_upper,
# as locale = "en" is the default
# check to see that we've created new columns
> names(star_wars)
# condense our data to contain only the columns we're interested in
> star_wars <- star_wars[c("name", "eye_color", "name_lower", "eye_color_upper")]
# check the first few rows of our data to see if everything looks as we expect it to
> head(star_wars)
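
The same idea, that a result only sticks around when we assign it to a name, can be seen on a much smaller scale. This is a minimal sketch with a made-up vector:

```r
library(stringr)

# A small made-up vector, used only for illustration
jedi <- c("Yoda", "Mace Windu")

# This prints a lowercase version, but nothing is saved
str_to_lower(jedi)

# The original vector is unchanged
jedi

# Assigning the result with <- is what keeps it
jedi_lower <- str_to_lower(jedi)
jedi_lower
```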

For the rest of these exercises we’ll be using our newly created star_wars data!

When you want to count the number of times a pattern is matched within a string

# Let's start with a simple pattern, and search for the letter "e"
# template: str_count(string, pattern)
> str_count(star_wars$name, pattern = "e")
> str_count(star_wars$name_lower, pattern = "e")
> str_count(star_wars$eye_color, pattern = "e")
> str_count(star_wars$eye_color_upper, pattern = "e")
# we can wrap our str_count in other functions to get the following:
# how many times does the letter "e" appear in total?
# template: sum(str_count(string, pattern))
# notice that we can drop "pattern = " without consequence
> sum(str_count(star_wars$name, "e"))
> sum(str_count(star_wars$name_lower, "e"))
> sum(str_count(star_wars$eye_color, "e"))
> sum(str_count(star_wars$eye_color_upper, "e"))
# what is the average number of times that the letter "e" appears per word?
# template: mean(str_count(string, pattern))
> mean(str_count(star_wars$name, "e"))
> mean(str_count(star_wars$name_lower, "e"))
> mean(str_count(star_wars$eye_color, "e"))
> mean(str_count(star_wars$eye_color_upper, "e"))
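
You may have noticed that searching the upper case column for "e" finds nothing. Patterns are case sensitive by default. The sketch below shows this on a tiny made-up vector, along with one way to ignore case using stringr's regex() helper, which we haven't otherwise covered in this post.

```r
library(stringr)

# A small made-up vector, used only for illustration
names_mixed <- c("Leia Organa", "LEIA ORGANA")

# By default "e" matches only the lowercase letter
count_sensitive <- str_count(names_mixed, "e")
count_sensitive

# regex(ignore_case = TRUE) counts "e" and "E" alike
count_insensitive <- str_count(names_mixed, regex("e", ignore_case = TRUE))
count_insensitive
```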

Stretch exercises:

  1. Run the same 12 lines of code above, but replace “e” with “E”
  2. Run the same 12 lines of code above, but replace “e” with “aeiou”
  3. Run the same 12 lines of code above, but replace “e” with “[aeiou]”
  4. Run the same 12 lines of code above, but replace “e” with “[^aeiou]”
  5. Use the “R for Data Science” text to determine what each replacement did

When you want to remove the pattern, or replace a pattern with something else

# Let's start with a simple pattern, and replace the letter "e" with "K2So"
# template: str_replace(string, pattern, replacement)
> str_replace(star_wars$name, "e", replacement = "K2So")
> str_replace(star_wars$name_lower, "e", replacement = "K2So")
# We can drop "replacement = " without consequence
> str_replace(star_wars$eye_color, "e", "K2So")
> str_replace(star_wars$eye_color_upper, "e", "K2So")

Let’s examine the results of str_replace a little more closely:

Hmmm… not every “e” was replaced, just the first “e” in every name

To get every lowercase “e” replaced, we’ll need to swap str_replace for str_replace_all.
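
Here is a small side-by-side sketch of the difference, using a made-up vector of two names rather than the full dataset:

```r
library(stringr)

# A small made-up vector, used only for illustration
characters <- c("Luke Skywalker", "Greedo")

# str_replace swaps only the first "e" in each string
first_only <- str_replace(characters, "e", "K2So")
first_only

# str_replace_all swaps every "e" in each string
every_match <- str_replace_all(characters, "e", "K2So")
every_match
```

The function signatures are the same, so switching from one to the other only requires changing the function name.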

Stretch exercises:

  1. Run the same four lines of code above, but replace “e” with “E”
  2. Run the same four lines of code above, but replace “e” with “aeiou”
  3. Run the same four lines of code above, but replace “e” with “[aeiou]”
  4. Run the same four lines of code above, but replace “e” with “[^aeiou]”
  5. Use the “R for Data Science” text to determine what each replacement did
  6. How would you remove a letter or pattern from a string? You have all the skills you need to do this! (Hint: you’re finding a pattern and then replacing it with nothing.)
  7. How would you remove the space between the first and last names of the characters?

Want more practice with strings?

Jenny Bryan has a phenomenal walk-through up on her STAT 545 course, which picks up nicely from the set of exercises we just completed.

Closing thoughts on strings, stringr, and regular expressions

Strings are one of the first things novice programmers encounter, often in some iteration of a "Hello, world!" tutorial. String manipulation usually follows shortly thereafter, but often without any context as to why you’re manipulating strings, or how this could be a useful skill.

There are any number of ways to introduce string manipulation with regular expressions, and my first drafts were essentially a set of increasingly complex coding exercises that dove deep into every possible iteration of regexes.

But that approach didn’t serve my ultimate goal of providing a better understanding of when manipulating strings with regular expressions was called for, and then creating the scaffolding for you to get started by using some relatively straightforward examples.

I would love to hear from you on whether or not this approach was helpful, what made it helpful, and what improvements would make it more helpful to you.

Don’t be scared of regular expressions — YOU’VE GOT THIS!