Introduction to Regular Expressions in R

Mehmet Ali Erkan
Jun 25, 2024

Regular expressions are a concise and powerful language for describing patterns within strings. In this story, I’ll start with the basics of regular expressions and the most useful stringr functions for data analysis.

In this article, I’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse.

#import library
library(tidyverse)

Throughout this article, I’ll use very simple inline examples so everyone can get the basic idea. Readers can then apply the same functions to strings they create for themselves.

Let’s create some strings:

#create strings
name <- "mehmet ali"
surname <- "erkan"
favorite_song <- "burning of the midnight lamp"
favorite_film <- "stalker"
favorite_team <- "fenerbahce"
favorite_city <- "antalya"

#combine them
favorite <- c(name, surname, favorite_song, favorite_film, favorite_team,
favorite_city)

First, I’ll use str_view() to show how regex patterns work. When a pattern is supplied, str_view() shows only the elements of the string vector that match, surrounding each match with <> and, where possible, highlighting the match in blue.

The simplest patterns consist of letters and numbers, which match those characters exactly. For example, let’s find the strings in favorite that contain “er”.

#includes "er"
str_view(favorite, "er")
## <er>kan
## stalk<er>
## fen<er>bahce

As you can see, the function finds “er” in three words and surrounds each match with <>.
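
str_view() is meant for looking at matches. For actual data work, the same literal pattern can be passed to other stringr functions. Here is a minimal sketch, assuming you want a logical vector or a filtered vector rather than a printout; the variable names has_er and with_er are only illustrative.

```r
library(stringr)

# Same vector as in the article, repeated so the snippet runs on its own
favorite <- c("mehmet ali", "erkan", "burning of the midnight lamp",
              "stalker", "fenerbahce", "antalya")

# TRUE/FALSE for each element: does it contain the literal "er"?
has_er <- str_detect(favorite, "er")

# Keep only the elements that contain "er"
with_er <- str_subset(favorite, "er")
with_er
# "erkan" "stalker" "fenerbahce"
```

str_detect() is handy inside dplyr::filter(), while str_subset() is a shortcut for indexing a character vector by that logical result.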

Letters and numbers match exactly and are called literal characters. Most punctuation characters, like ., +, *, ? and [], have special meanings and are called metacharacters.

  • . will match any character (except a newline, by default), so “a.” will match any string that contains an “a” followed by another character. Let’s look at an example:
# "a" followed by any single character
str_view(favorite, "a.")
## mehmet <al>i
## erk<an>
## burning of the midnight l<am>p
## st<al>ker
## fenerb<ah>ce
## <an>t<al>ya

We can also find all the strings that contain an “i”, followed by any single character, followed by “g”.

# "i", then any single character, then "g"
str_view(favorite, "i.g")
## burn<ing> of the midnight lamp
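
To pull the matched text out instead of just viewing it, str_extract() and str_extract_all() accept the same patterns. This is a small sketch of the two “.” examples above; the variable names are illustrative only.

```r
library(stringr)

favorite <- c("mehmet ali", "erkan", "burning of the midnight lamp",
              "stalker", "fenerbahce", "antalya")

# First match of "a" + any character in each element
first_a <- str_extract(favorite, "a.")
first_a
# "al" "an" "am" "al" "ah" "an"

# All matches inside a single string; the final "a" in "antalya"
# has no following character, so it is not matched
all_a <- str_extract_all("antalya", "a.")[[1]]
all_a
# "an" "al"

# Elements without a match come back as NA
ing <- str_extract(favorite, "i.g")
```
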
  • + matches one or more occurrences of the preceding character or group. For example, a+ will match one or more occurrences of the letter 'a'. So, it would match "a", "aa", "aaa", and so on, but not an empty string.

For instance, al+ matches the character ‘a’ followed by one or more occurrences of the character ‘l’. This means that it will match “al”, “all”, “alll”, etc.

str_view(favorite, "al+")
## mehmet <al>i
## st<al>ker
## ant<al>ya
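
The “al”, “all”, “alll” cases described above can be checked directly. The vector samples below is a made-up test set, not part of the article’s data.

```r
library(stringr)

# Hypothetical strings chosen to exercise the + quantifier
samples <- c("a", "al", "all", "alll", "ball")

# "a" followed by one or more "l"; + takes as many "l"s as it can
plus_match <- str_extract(samples, "al+")
plus_match
# NA "al" "all" "alll" "all"
```

The lone “a” gives NA because + requires at least one “l” after it.
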
  • * matches zero or more occurrences of the preceding character or group. For example, a* will match zero or more occurrences of the letter 'a'. So, it would match "", "a", "aa", "aaa", and so on.

So, al* matches:

  • ‘a’ followed by zero or more ‘l’s.
  • It will match “a”, “al”, “all”, “alll”, etc.
str_view(favorite, "al*")
## mehmet <al>i
## erk<a>n
## burning of the midnight l<a>mp
## st<al>ker
## fenerb<a>hce
## <a>nt<al>y<a>
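
One way to see the difference between * and + is to count matches per string with str_count(). A quick sketch on the same favorite vector:

```r
library(stringr)

favorite <- c("mehmet ali", "erkan", "burning of the midnight lamp",
              "stalker", "fenerbahce", "antalya")

# al* matches every "a", with or without trailing "l"s
star_counts <- str_count(favorite, "al*")
star_counts
# 1 1 1 1 1 3

# al+ only matches an "a" that is followed by at least one "l"
plus_counts <- str_count(favorite, "al+")
plus_counts
# 1 0 0 1 0 1
```
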
  • ? matches zero or one occurrence of the preceding character or group. For example, colou?r would match both "color" and "colour". The u preceding the ? is optional.

In the example below, the pattern m?m matches:

  • “m” (where the first ‘m’ matches zero times)
  • “mm” (where the first ‘m’ matches one time)
str_view(favorite, "m?m")
## <m>eh<m>et ali
## burning of the <m>idnight la<m>p
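
The colou?r example from the bullet above can be tried out as follows. The input words, including the deliberately misspelled “colouur” and the word “hammer”, are made up for illustration.

```r
library(stringr)

# The "u" is optional, but at most one "u" is allowed
spellings <- str_detect(c("color", "colour", "colouur"), "colou?r")
spellings
# TRUE TRUE FALSE

# With a double "m", the optional first "m" is used, so the match is "mm"
double_m <- str_extract("hammer", "m?m")
double_m
# "mm"
```
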
  • [ ] Square brackets are used to define a character class. They match any single character within the brackets. For example, [aeiou] matches any vowel, [0-9] matches any digit, and [A-Za-z] matches any alphabetic character.

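This idea can be used to find the words containing an “n” surrounded by the vowels “a”, “e”, or “i”, or an “l” surrounded by characters other than “m”, “e”, or “i”. A ^ at the start of a character class negates it, so [^mei] matches any single character except “m”, “e”, or “i”: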

str_view(favorite, "[aei]n[aei]")
## f<ene>rbahce
# [^mei]l[^mei] matches a sequence where 'l' is surrounded by characters that are not 'm', 'e', or 'i'.
str_view(favorite, "[^mei]l[^mei]")
## burning of the midnight< la>mp
## st<alk>er
## ant<aly>a
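
The character classes mentioned above also work with the other stringr functions. A short sketch; the string "room 101, floor 3" is a made-up input.

```r
library(stringr)

favorite <- c("mehmet ali", "erkan", "burning of the midnight lamp",
              "stalker", "fenerbahce", "antalya")

# Count the vowels in one word
vowels <- str_count("fenerbahce", "[aeiou]")
vowels
# 4

# Extract every single digit from a (hypothetical) string
digits <- str_extract_all("room 101, floor 3", "[0-9]")[[1]]
digits
# "1" "0" "1" "3"

# Negated class: keep elements containing anything other than a
# lowercase letter (here, a space)
non_letters <- str_subset(favorite, "[^a-z]")
non_letters
# "mehmet ali" "burning of the midnight lamp"
```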

Finally, let’s talk about alternation. Alternation, written with |, can be used to pick between one or more alternative patterns. For example, the following pattern looks for some meaningful words in our set.

str_view(favorite, "ali|burn|night")
## mehmet <ali>
## <burn>ing of the mid<night> lamp
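
The same alternation works for filtering and extracting. The second pattern below uses parentheses to limit how far the | reaches; this grouping is an addition for illustration and is not covered in the article.

```r
library(stringr)

favorite <- c("mehmet ali", "erkan", "burning of the midnight lamp",
              "stalker", "fenerbahce", "antalya")

# Elements that contain any of the three alternatives
any_word <- str_subset(favorite, "ali|burn|night")
any_word
# "mehmet ali" "burning of the midnight lamp"

# Every alternative found inside the song title
in_song <- str_extract_all(favorite[3], "burn|night")[[1]]
in_song
# "burn" "night"

# Parentheses restrict the alternation: "er" followed by "kan" or "bahce"
grouped <- str_subset(favorite, "er(kan|bahce)")
grouped
# "erkan" "fenerbahce"
```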

In this article, I’ve introduced the fundamentals of regular expressions in R using stringr, a core tool in the tidyverse. Regular expressions are powerful for pattern matching in text data, enabling efficient extraction and manipulation based on specific criteria. By mastering these techniques, analysts can enhance their data processing and analysis capabilities significantly.

Thank you!!

Note: For practice, you can use the Kaggle salary dataset.

Example: Find the names of countries that contain a “d” between any of the letters “a”, “e”, “i”, “f”, and “g”.

salary <- read.csv("salary.csv")

str_view(unique(salary$native.country), "[aeifg]d[aeifg]")
## Can<ada>
## Trin<ada>d&Tobago
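
If you do not have salary.csv at hand, the same pattern can be tried on a small hand-typed vector. The values below are only a sample for illustration, not the real column, and the spelling “Trinadad&Tobago” follows the output shown above.

```r
library(stringr)

# Hypothetical stand-in for unique(salary$native.country)
countries <- c("Canada", "Trinadad&Tobago", "India", "Ecuador", "Poland")

# "d" with one of a, e, i, f, g on each side
matched <- str_subset(countries, "[aeifg]d[aeifg]")
matched
# "Canada" "Trinadad&Tobago"
```

“Ecuador” is not returned because the “d” is followed by “o”, which is not in the character class.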

References:

Ayessa. (2022). Salary Prediction & Classification. Kaggle. https://www.kaggle.com/datasets/ayessa/salary-prediction-classification

Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media.
