(Nice) Ways to Break Up

Using Regular Expressions in .split


Splitting up is hard.

So. Hard.

But splitting up in Ruby is easy!

WRONG

Let’s say you are working with a simple string.

"I love dogs, I love pets, but like... from afar."

So that string would be simple to split, right?

> "I love dogs, I love pets, but like... from afar.".split
=> ["I", "love", "dogs,", "I", "love", "pets,", "but", "like...", "from", "afar."]

You are probably already familiar with splitting at one delimiter. (From now on, the points where you want to split will be referred to as “delimiters.”)

> "I love dogs, I love pets, but like... from afar.".split(",")
=> ["I love dogs", " I love pets", " but like... from afar."]

So what if you want to split the string at multiple delimiters?

> "I love dogs, I love pets, but like... from afar.".split(",", ".")
TypeError: no implicit conversion of String into Integer

Welp, that didn’t work…
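Why a TypeError? The second argument to split is a limit on how many pieces to return, not a second delimiter, so Ruby tries to turn "." into an Integer. A minimal sketch of both behaviors (the variable names are just for illustration):

```ruby
sentence = "I love dogs, I love pets, but like... from afar."

# The second argument is a limit: return at most 2 pieces.
first_two = sentence.split(",", 2)
p first_two
# => ["I love dogs", " I love pets, but like... from afar."]

# Passing a String where the limit belongs raises TypeError.
error_message = nil
begin
  sentence.split(",", ".")
rescue TypeError => e
  error_message = e.message
end
puts error_message
```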

You could just find and replace the second delimiter (in this case a period) with the first using .gsub and then split. But where is the fun in that? Programmers are supposed to be lazy. So instead we just google for hours to learn something new so we can type less.
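For the record, here is roughly what that find-and-replace workaround could look like. This is only a sketch of the idea; note that it still leaves empty strings and leading spaces behind:

```ruby
sentence = "I love dogs, I love pets, but like... from afar."

# Turn every period into a comma, then split on the one remaining delimiter.
pieces = sentence.gsub(".", ",").split(",")
p pieces
# => ["I love dogs", " I love pets", " but like", "", "", " from afar"]
```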

That’s where regular expressions come in.

What is a regular expression?

“A regular expression (regex or regexp for short) is a special text string for describing a search pattern.”

What do they mean by “special”?

Regular Expression:
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
An “easy” expression that checks whether a string looks like a valid email address.

Regular expressions are kinda terrifying… But they are a powerful tool for working with text, making splitting up data and formatting easy.

A regular expression is a way of specifying a pattern of characters to be found in a string. In Ruby, a regular expression is written in the form /pattern/modifiers, where “pattern” is the regular expression itself and “modifiers” are characters indicating various options. The “modifiers” part is optional.

You create a regular expression by writing a pattern between slash characters (/pattern/). Ruby supports regular expressions as a built-in feature. In Ruby, regular expressions are actually objects. // is a regular expression and an instance of the Regexp class.

> //.class
=> Regexp
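To make the “pattern plus optional modifiers” idea concrete, here is a small sketch. The word "dog" and the variable names are made up for the example; the i modifier means “ignore case.”

```ruby
pattern = /dog/i                                  # literal syntax with the i modifier
built   = Regexp.new("dog", Regexp::IGNORECASE)   # same idea, built from a String

p pattern.class                  # => Regexp
p pattern.source                 # => "dog"
p pattern.casefold?              # => true
p built.source                   # => "dog"
p pattern.match?("I love DOGS")  # => true
p /dog/.match?("I love DOGS")    # => false
```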

Using the .split method with a regex, e.g. mystring.split(/regex/), is simple. All we need to do is specify a regular expression.

The simplest case is the empty regular expression (//). It matches the empty spot between any two characters, so splitting on it breaks the string into individual characters. Whatever we put between the slashes is the pattern that gets passed to the split method:

> "I love dogs, I love pets, but like... from afar.".split(//)
=> ["I", " ", "l", "o", "v", "e", " ", "d", "o", "g", "s", ",", " ", "I", " ", "l", "o", "v", "e", " ", "p", "e", "t", "s", ",", " ", "b", "u", "t", " ", "l", "i", "k", "e", ".", ".", ".", " ", "f", "r", "o", "m", " ", "a", "f", "a", "r", "."]
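As a side note, if individual characters are all you want, Ruby's String#chars gives the same result without a regex. A tiny sketch on a shorter, made-up string:

```ruby
word = "dogs"

by_regex = word.split(//)
by_chars = word.chars

p by_regex  # => ["d", "o", "g", "s"]
p by_chars  # => ["d", "o", "g", "s"]
```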

We can also split using the space (/ /) as the break point:

> "I love dogs, I love pets, but like... from afar.".split(/ /)
=> ["I", "love", "dogs,", "I", "love", "pets,", "but", "like...", "from", "afar."]

Or split a comma-separated string at the commas:

> "I love dogs, I love pets, but like... from afar.".split(/,/)
=> ["I love dogs", " I love pets", " but like... from afar."]

You can also keep the delimiters you are looking for by wrapping them in parentheses:

> "I love dogs, I love pets, but like... from afar.".split(/(,)/)
=> ["I love dogs", ",", " I love pets", ",", " but like... from afar."]
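The parentheses trick works with any pattern, because split includes whatever a capture group matched in its result. A quick sketch with a made-up list and two possible delimiters:

```ruby
list = "dogs,cats;birds"

without_capture = list.split(/[,;]/)    # delimiters are thrown away
with_capture    = list.split(/([,;])/)  # delimiters are kept as their own elements

p without_capture  # => ["dogs", "cats", "birds"]
p with_capture     # => ["dogs", ",", "cats", ";", "birds"]
```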

We can split by two delimiters separated by a pipe character (|). The pipe means “either the thing on the right or the thing on the left.” (I added a “-” to the sentence for this example.)

> "I love dogs, I love pets, but like... from-afar.".split(/,|-/)
=> ["I love dogs", " I love pets", " but like... from", "afar."]

Our string is “I love dogs, I love pets, but like... from afar.”, which contains periods as punctuation. The period is a special character in regex (it matches any single character), so it has to be treated differently. When you want one of these special characters to stand for itself, you have to escape it with a backslash (\) or, for most of them, put it between brackets.

This works:

> "I love dogs, I love pets, but like... from afar.".split(/[.]/)
=> ["I love dogs, I love pets, but like", "", "", " from afar"]

Or this:

> "I love dogs, I love pets, but like... from afar.".split(/\./)
=> ["I love dogs, I love pets, but like", "", "", " from afar"]

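But not this, because an unescaped period matches every character, so the whole string is treated as delimiters:

> "I love dogs, I love pets, but like... from afar.".split(/./)
=> []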
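That empty array is there because split drops trailing empty strings by default, and here every piece is empty. Passing a negative limit tells split to keep them, which makes it easier to see what happened. A small sketch on a short, made-up string:

```ruby
word = "abc"

default_split = word.split(/./)      # every character is a delimiter
keep_empties  = word.split(/./, -1)  # negative limit: keep trailing empty strings

p default_split  # => []
p keep_empties   # => ["", "", "", ""]
```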

The backslash means “don’t treat the next character as special; treat it as itself.”

The special characters include ^, $, ?, ., |, \, [, ], {, }, (, ), +, and *. A / also needs escaping when the pattern is written between slashes.
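If you don't want to remember which characters are special, Ruby can do the escaping for you with Regexp.escape. And when the delimiter is a plain String rather than a regex, split treats it literally anyway. A sketch using a made-up version string:

```ruby
version = "1.2.3"

escaped   = Regexp.escape(".")                    # a backslash followed by a period
by_regex  = version.split(Regexp.new(escaped))    # same as split(/\./)
by_string = version.split(".")                    # String delimiters are literal

p escaped    # => "\\."
p by_regex   # => ["1", "2", "3"]
p by_string  # => ["1", "2", "3"]
```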

So to split at multiple delimiters you would simply do this:

> "I love dogs, I love pets, but like... from afar.".split(/[.,]/)
=> ["I love dogs", " I love pets", " but like", "", "", " from afar"]

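Or spell out the “or” with a pipe. The pipe goes outside the brackets: inside [ ] a | is just a literal pipe character, so /[.|,]/ would also split at any | in the string.

> "I love dogs, I love pets, but like... from afar.".split(/\.|,/)
=> ["I love dogs", " I love pets", " but like", "", "", " from afar"]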
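One thing to watch out for: a pipe inside square brackets does not mean “or.” Inside brackets it is an ordinary character, so it only seems harmless when the string contains no pipes. A sketch with a made-up string that does contain one:

```ruby
text = "a|b,c.d"

pipe_in_brackets = text.split(/[.|,]/)  # splits at ".", "|" and ","
brackets_only    = text.split(/[.,]/)   # splits at "." and "," only
alternation      = text.split(/\.|,/)   # pipe outside brackets means "or"

p pipe_in_brackets  # => ["a", "b", "c", "d"]
p brackets_only     # => ["a|b", "c", "d"]
p alternation       # => ["a|b", "c", "d"]
```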

The .split(/regex/) method discards all regex matches, returning the text between the matches.

But that leaves you with empty strings and leading spaces, and we don’t want that.

It’s ok, there is regex for that too.

Adding a + quantifier after the bracketed set gets rid of the empty strings. It means “one or more of the preceding thing,” so a run of delimiters like “...” counts as a single split point.

> "I love dogs, I love pets, but like... from afar.".split(/[.,]+/)
=> ["I love dogs", " I love pets", " but like", " from afar"]

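Adding \s after the set swallows the whitespace that follows a delimiter, so the pieces no longer start with a space. But without the + we aren’t getting all the delimiters: only the last period of the “...” (the one followed by a space) is matched, and the final period is left alone because nothing comes after it.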

> "I love dogs, I love pets, but like... from afar.".split(/[.,]\s/)
=> ["I love dogs", "I love pets", "but like..", "from afar."]

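Put it all together

> "I love dogs, I love pets, but like... from afar.".split(/[.,]+\s*/)
=> ["I love dogs", "I love pets", "but like", "from afar"]

So to break this down

[.,] : Split at either “.” or “,”
+ : This may happen more than once in a row
\s* : Also swallow any whitespace right after the delimiter (the * means “zero or more,” so the final period, which has nothing after it, still counts)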
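If you need this in more than one place, the pieces could be wrapped in a small helper. This is only a sketch: split_clauses is a made-up name, and the pattern uses \s* (zero or more whitespace characters) so that the final period, which has nothing after it, is consumed too. The second version reaches the same result without the quantifiers by cleaning up after the split.

```ruby
# Hypothetical helper: split a sentence at runs of "." or ",",
# swallowing any whitespace that follows each run.
def split_clauses(text)
  text.split(/[.,]+\s*/)
end

sentence = "I love dogs, I love pets, but like... from afar."

clauses = split_clauses(sentence)
p clauses
# => ["I love dogs", "I love pets", "but like", "from afar"]

# Alternative: split on single delimiters, then strip spaces and drop empties.
tidied = sentence.split(/[.,]/).map(&:strip).reject(&:empty?)
p tidied
# => ["I love dogs", "I love pets", "but like", "from afar"]
```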

Omg! You broke up! It was so easy!

Sources: