Regular Expressions: Making Your Life Easier and Your Code Harder to Read!
It’s an unfortunate truth that I came across a sizable part of my CS and coding knowledge through a less-than-traditional route. During my undergraduate years at Rutgers University, I took a semester of intro to CS as an elective, where I learned Java as my first language, then followed that up with a small foray into Data Structures before dropping it to concentrate on finishing my major thesis. By that time, however, I had made several friends who were CS majors, and it was at this point that I was introduced to my primary source for computer terminology and self-learning. My non-traditional source? Inside jokes, and comics…
Naturally, since I was a complete amateur at the time, most coding jokes went right over my head (and still do), but surprisingly they turned out to be an excellent resource for finding terms and concepts to look up and research on my own. Talking with people in the field of computer science and making an effort to educate myself enough to contribute to the conversation is one of the major reasons that I continued to delve into the field and eventually found my love for its logic and languages.
Being able to read about a concept or language (humorous or otherwise) and not understand a word of it, yet know that these unknown concepts were no longer out of reach, was an unexpected source of drive and education for me in my early days of coding.
Now, on the note of encountering intimidating unknown material and back to the point of this article…
Regular Expressions!
If you have never encountered or heard of a regular expression before (regex for short), I’ve provided here a quick snippet of a regular expression example to help clarify their purpose.
Easy to follow, right? No, it is not. At first glance, regular expressions are some of the most syntactically unfriendly-looking code snippets I’ve ever encountered. They look like they belong in the realm of machine and assembly languages, but almost worse. I first encountered regular expressions years ago in a similar fashion to my prior story: through a comic.
What I gathered from this bit the first time I read it was that regular expressions were probably a powerful tool for sorting through otherwise unwieldy data, and that they might be hard to learn, considering the self-described heroism of being able to swoop in and parse through 200MB of emails with a few clicks.
How valorous of our wondrous regex experts to grace us with such a powerfully niche toolset; we are humbled, truly! Such was my thought process at the time. I looked up an example of a regex online, quickly closed the window, and never went back to it until two years later, when I began my term at Flatiron School. There I finally knuckled down and decided to learn them. Having learned them now, I can say with confidence:
Regular expressions look much harder than they are
Even once you learn them, they will almost never be strictly required over other tools
But if you know them, you will constantly find situations where they apply, and they will make many problems much, much easier
So, how do they work?
What They Do
Regular expressions are a language/tool that excels at finding patterns in strings, and they are often used to search and parse through them. That is their primary purpose, and really the only one we are likely to use them for, but they are very, very good at it. Here’s how they work:
Regex syntax varies slightly between languages and engines, but it is remarkably consistent across most of the coding languages it appears in (which is most of them).
In many languages (JavaScript, Ruby, and Perl, for example), a regex is flanked by /s. This is one of the most common ways a coding language marks a series of characters to be interpreted as regex; other languages, such as Python, take the pattern as an ordinary string instead. For example:
/insert your regex code here/
Now, inside a regex, you define the details of the chunk of string you are looking for (i.e., your specs for what the regex should find). So if I were to type:
/a/
This regex would look for any ‘a’ in the string I am searching. Regexes are also read left to right, so:
/ab/
Would look for any ‘ab’ in my string. Easy peasy, but not very useful. So why are they so powerful? Regex has special characters that can generalize and change our search for us. Say, for example, I want to find exactly ten consecutive a’s. Well, I can do:
/aaaaaaaaaa/
Or, more likely, I’d do this:
/a{10}/
Now we’re getting somewhere. So how do these interact with each other? Well, at the end of the day a regex is still always read left to right. So say I wanted ten a’s followed by three w’s, and if a digit comes right after, I want to include that digit too, optionally:
/a{10}w{3}\d?/
Boom. Now who knows why we would search for this, but it means “find me substrings where there are exactly ten a’s, immediately followed by exactly three w’s, then maybe a digit (\d matches any single digit), but that’s optional (the ? means one or none of whatever comes before it).”
Lastly, and bear with me on this, what if I wanted to find all instances where a number was followed by a space, one or more words, then a comma and space, then more words, then a comma and space, then some combination of two capital letters, a space, then a number of exactly five digits? That was a mouthful, but deep breaths…
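To see that pattern in action, here is a minimal sketch in Python (my choice for illustration; nothing in this article is tied to one language). Python has no /.../ literal, so the pattern is written as a raw string and handed to the standard re module. The sample strings are made up.

```python
import re

# Same pattern as /a{10}w{3}\d?/, written as a Python raw string.
pattern = re.compile(r"a{10}w{3}\d?")

# Made-up sample strings for illustration.
with_digit = "x" + "a" * 10 + "www7y"
without_digit = "a" * 10 + "www"
too_few = "a" * 9 + "www"

match_with_digit = pattern.search(with_digit)
match_without_digit = pattern.search(without_digit)
match_too_few = pattern.search(too_few)

print(match_with_digit.group())     # aaaaaaaaaawww7
print(match_without_digit.group())  # aaaaaaaaaawww
print(match_too_few)                # None
```

One thing to keep in mind: without anchors, a search like this will also find those ten a’s at the tail end of a longer run of a’s, so “exactly ten” really means “ten in a row right before the w’s.”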
/\d+ (([a-zA-Z]+ )*[a-zA-Z]+, ){2}[A-Z]{2} \d{5}/
27 New Court Drive, Manchester, PA 90625
Holy moly it’s gross, but what we’ve just done with our regular expression tools is create a line that will match with every United States address formatted like the one above. And that is incredibly powerful.
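As a quick sanity check, here is a rough Python sketch that runs the address pattern against the example address. The surrounding sentence and the variable names are made up for illustration.

```python
import re

# The address pattern from above, written as a Python raw string.
address_pattern = re.compile(r"\d+ (([a-zA-Z]+ )*[a-zA-Z]+, ){2}[A-Z]{2} \d{5}")

# Made-up surrounding text; the address itself is the example from this article.
text = "Ship it to 27 New Court Drive, Manchester, PA 90625 by Friday."

found = address_pattern.search(text)
print(found.group())  # 27 New Court Drive, Manchester, PA 90625

# A string missing the five-digit code at the end does not match.
no_zip = address_pattern.search("27 New Court Drive, Manchester, PA")
print(no_zip)  # None
```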
The secret to regex’s strength, and the intimidating side of it, comes from the wide array of special characters and expressions that you can chuck into a query to edit your search. However, for the most part they’re very intuitive once you start learning them, and they’re always available online to look up.
Some of the most frequent special characters that you’ll use include:
\d => any digit
\w => word characters (letters, digits, underscore)
\s => whitespace
+ => one or more
* => zero or more
? => zero or one
{} => exact number or range of occurrences
. => any character except a line break
[] => set or range of characters
() => group (use | inside for alternatives)
(?=something) => positive lookahead
(?<=something) => positive lookbehind
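A few of the characters above can be tried out with a short Python sketch like the one below. The sample text and variable names are made up for illustration, and re.findall is just one convenient way to list every match.

```python
import re

# Made-up sample text.
sample = "Order 66 shipped: 3 boxes, $250 total"

digits = re.findall(r"\d+", sample)                  # \d and +: runs of digits
letters = re.findall(r"[A-Za-z]+", sample)           # []: runs of letters
dollars = re.findall(r"(?<=\$)\d+", sample)          # lookbehind: digits right after "$"
box_count = re.findall(r"\d+(?= boxes)", sample)     # lookahead: digits right before " boxes"
word_chars = re.findall(r"\w+", "snake_case 42")     # \w covers letters, digits, underscore

print(digits)      # ['66', '3', '250']
print(letters)     # ['Order', 'shipped', 'boxes', 'total']
print(dollars)     # ['250']
print(box_count)   # ['3']
print(word_chars)  # ['snake_case', '42']
```

Notice that the lookahead and lookbehind check what sits next to the match without including it in the result.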
So maybe the power of regular expressions is starting to become clear, but some real-world examples might help too. Below I’ve written out some regexes with real-world applications:
/^1?-?\d{3}-\d{3}-\d{4}$/ ~Phone number
/^M(rs|[rs])\.?$/ ~English name prefixes (Mr, Ms, Mrs, with an optional period)
/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{6,}$/ ~Password validation
/(?<=[.?!]) {2,}(?=[A-Z])/ ~Two-space sentence break
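If you want to poke at these yourself, here is a rough Python sketch that runs each one against a couple of made-up inputs. The variable names and test strings are my own, and in the name-prefix pattern the period is escaped (\.) so that it only matches a literal dot.

```python
import re

phone = re.compile(r"^1?-?\d{3}-\d{3}-\d{4}$")
# Period escaped so it matches only a literal "." after the prefix.
prefix = re.compile(r"^M(rs|[rs])\.?$")
password = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{6,}$")
sentence_break = re.compile(r"(?<=[.?!]) {2,}(?=[A-Z])")

phone_ok = bool(phone.match("1-555-867-5309"))
phone_bad = bool(phone.match("555-8675-309"))
prefix_ok = bool(prefix.match("Mrs."))
prefix_bad = bool(prefix.match("Dr."))
password_ok = bool(password.match("Passw0rd"))
password_bad = bool(password.match("password"))  # no uppercase letter, no digit

# Collapse double (or wider) sentence breaks down to a single space.
tidied = sentence_break.sub(" ", "Done.  Next one!   Go.")

print(phone_ok, phone_bad)        # True False
print(prefix_ok, prefix_bad)      # True False
print(password_ok, password_bad)  # True False
print(tidied)                     # Done. Next one! Go.
```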
I won’t break down the logic of all of these, but the list above is a handful of the examples I practiced within my first week of learning regex: some from videos, some from articles, mostly from Codewars challenges and course assignments. Regular expressions can be used to check for existing matches, they can be passed into a split method to add flexibility to large string splits, and they can be used to scare Rubyists. Their power is substantial and real.
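As a small illustration of the “check for a match” and “split” uses, here is a sketch using Python’s re.search and re.split. The messy input string is made up, and other languages expose the same ideas under their own method names.

```python
import re

# Made-up input with inconsistent delimiters: comma, semicolon, and extra spaces.
raw = "apples, oranges;bananas   grapes"

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r"[,;\s]+", raw)
print(parts)  # ['apples', 'oranges', 'bananas', 'grapes']

# Check whether a pattern appears anywhere in the string.
has_digit = re.search(r"\d", raw) is not None
has_semicolon = re.search(r";", raw) is not None
print(has_digit)      # False
print(has_semicolon)  # True
```

A plain string split would need one pass per delimiter here; the character class handles all of them at once.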
And the big secret is that you don’t have to be an expert to start using them quickly and easily; there are many online tutorials and references that can be accessed to learn their basic syntax and refer back to some ambiguous special character that you can’t remember.
Regular expressions look gross, and they are gross, because they’re so useful :D. But seriously, I regret not jumping in and learning them sooner, because logically they are no more difficult than any coding language, and syntactically they end up being very intuitive and easy to look up. I highly recommend this YouTube video by Corey Schafer if you want the quickest, easiest route to picking them up.
Don’t be afraid of regex and its intimidating syntax; be interested in its power and potential as another tool on your coding belt. Besides, despite the (sometimes questionable) origin of any of our coding knowledge base, we all learned to write code and develop because we weren’t afraid to dive into something ugly and new that we didn’t understand. And with hard work and Google, maybe we can not understand difficult things a little bit less.