stringr + regex = amazing

I had a data request come through that would have me break down submitted forms by the topic of concern (i.e., why was the form submitted in the first place?). I remembered from my previous work with this database that the form fields were stored as key-value pairs in a single column. That was OK, but then I also remembered that each form was different, because the application only stored the fields that were turned on (checked).

After searching the web, I came across a Stack Overflow question where Hadley Wickham mentioned using his stringr package. This was great, except I had no real clue how to use regex. I had used it once, but that was in a DataCamp tutorial where I had to copy and paste the expression in and hit submit. This is when I ran across a site that allows you to paste your string into a text box and then build an expression while seeing what it matches.

Screenshot of website

I was able to create an expression that, paired with the str_match() function from stringr, extracted the topic of concern selection from this gigantic string and added it to a new column. In addition to the code below, I created 6 other for loops to extract the main subtopics from this extracted string. Once that was complete, it was off to Excel to recode the extracted values to their actual values.

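library(dplyr)    # filter() and %>%
library(stringr)  # str_match()
data2 <- data %>% filter(type == "NOC")

# Extract the topic-of-concern portion of each row's form details
for (i in seq_len(nrow(data2))) {
  data2$dataExtReason[i] <- str_match(data2$formdetails[i], '"toc".+\\}\\]\\}')
}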
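To see what that pattern actually grabs, here is a small sketch on made-up input. The two strings in formdetails are hypothetical; the real column's layout isn't shown here, so they only imitate key-value pairs that end in "}]}". It also shows that str_match() is vectorized, so the same extraction can probably be done without a loop, and how str_match_all() differs in what it returns.

```r
library(stringr)

# Hypothetical form-details strings (assumed layout, not the real data)
formdetails <- c(
  '{"name":"A","fields":[{"toc":"billing","sub":"refund"}]}',
  '{"name":"B","fields":[{"other":"x"}]}'
)

# Same pattern as in the loop: start at "toc" and run greedily
# to the last "}]}" in the string
pattern <- '"toc".+\\}\\]\\}'

# str_match() accepts a whole vector; column 1 of the returned matrix
# is the full match, and rows without a match come back as NA
extracted <- str_match(formdetails, pattern)[, 1]

# str_match_all() returns a list with one matrix per input string
all_matches <- str_match_all(formdetails, pattern)
```

Here the first element of extracted should be everything from "toc" through the closing "}]}", and the second should be NA because that string has no "toc" key. On a data frame, the same idea would look something like data2$dataExtReason <- str_match(data2$formdetails, pattern)[, 1].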

Originally published at what do the data say?.