Regex for Dummies

Hey there!

Regex is something that scares a lot of novice programmers. With this post, I attempt to clear the air around Regex and make it accessible to everyone in a more hands-on way.

Regex? What is this Behemoth?

Regular Expressions, or Regex for short, are a compact way to describe a pattern: which strings the pattern can generate, and therefore which strings the Regex accepts as a match. For example, dog is itself a regex, and it matches the string ‘dog’.
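If you would rather poke at this from code than from a website, here is a minimal sketch using Python's built-in re module (the choice of Python is mine, and the test strings ‘dig’ and ‘hotdog stand’ are made up for illustration):

```python
import re

# The pattern "dog" is just three literal characters.
pattern = re.compile(r"dog")

# fullmatch succeeds only when the whole string fits the pattern.
is_dog = pattern.fullmatch("dog") is not None
is_dig = pattern.fullmatch("dig") is not None

# search looks for the pattern anywhere inside a longer string.
found_in_sentence = pattern.search("hotdog stand") is not None

print(is_dog, is_dig, found_in_sentence)  # True False True
```

The rest of this post talks about whole strings that a pattern generates, which corresponds to fullmatch rather than search.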

Regexes can become scary real fast. Hence, it’s always better to learn with examples. I have always preferred regexr.com to experiment with anything related to regexes.

Characters and Groups

Everything that’s on your keyboard (ASCII characters) is usually used as the alphabet for your Regex. (Practically, your keyboard icons won’t work that well, but hey, we can try that out as well okay?)

An alphabet is defined as the set of characters the regex would use to define its rules. For example, the English language’s alphabet consists of the letters from A to Z.


Let’s have a look at the regex b[ae]d first. Your first guess as a rookie would be that it matches the string ‘baed’, but that would be incorrect.
Well, you see, the characters a and e are enclosed by []. This holds special meaning in regex land: while building a string from the pattern, you pick exactly one of the characters enclosed by []. Hence, the regex matches both ‘bad’ and ‘bed’, and nothing else.
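A quick way to convince yourself is to run a handful of candidate strings through the pattern. This is a small sketch with Python's re module; the candidate list is my own, picked only to show what does and does not get through:

```python
import re

# [ae] means "exactly one character, either a or e".
pattern = re.compile(r"b[ae]d")

candidates = ["bad", "bed", "baed", "bd", "bid"]

# Keep only the candidates that the pattern matches in full.
matched = [s for s in candidates if pattern.fullmatch(s)]

print(matched)  # ['bad', 'bed']
```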

But how does one write a regex to select one character from a long list of characters, say from ijklmnopqrst? Wouldn’t it translate to [ijklmnopqrst]? Yes, that is correct, but really cumbersome. An easier way to go about it is to represent it as a range like [i-t], and you are done.
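To check that the long form and the range really mean the same thing, one option is to test every lowercase letter against both. A rough sketch in Python (the variable names are just for illustration):

```python
import re
import string

long_form = re.compile(r"[ijklmnopqrst]")
range_form = re.compile(r"[i-t]")

letters = string.ascii_lowercase

# Both patterns should agree on every single lowercase letter.
same = all(
    bool(long_form.fullmatch(c)) == bool(range_form.fullmatch(c))
    for c in letters
)

# Collect the letters that the range accepts.
accepted = "".join(c for c in letters if range_form.fullmatch(c))

print(same)      # True
print(accepted)  # ijklmnopqrst
```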


Let’s now shift over to the regex m(ar|oc)k. Notice carefully that there are () symbols enclosing some characters. Also, notice that there's a | in between ar and oc.

The () means that these are miniature regexes and | means that you need to choose between the left miniature regex or the right miniature regex.

Now, my question to you is, what all strings would this pattern generate? Check the answer at the end of the post!


. is a wildcard in regex land. It stands for any one character out of the alphabet (in most regex engines, any character except a line break, by default). To understand it in more depth, try it out on the regexr platform.
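Here is a small sketch of the wildcard in Python. The pattern b.d is not from the examples above, I made it up to keep things close to b[ae]d. Note that in Python's re module the dot skips the newline character unless you ask for it with the re.DOTALL flag:

```python
import re

# The dot stands for exactly one character, whatever it is.
pattern = re.compile(r"b.d")

candidates = ["bad", "b9d", "b d", "bd", "baad", "b\nd"]
matched = [s for s in candidates if pattern.fullmatch(s)]

print(matched)  # ['bad', 'b9d', 'b d']

# With re.DOTALL the dot also accepts a newline.
dotall_pattern = re.compile(r"b.d", re.DOTALL)
newline_ok = dotall_pattern.fullmatch("b\nd") is not None

print(newline_ok)  # True
```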

Quantifiers

Quantifiers are characters, or sets of characters, that indicate how many times the element preceding them is allowed to repeat in the pattern.

A few common quantifiers are *, +, ? and {a,b}.


The * Quantifier

The preceding element is repeated zero or more times

Example 1: a* matches ‘’(Empty String), ‘a’, ‘aa’, ‘aaa’, ...

Example 2: (mark|lol)* matches ‘’, ‘mark’, ‘markmark’, ‘markmarkmark’, ... and ‘lol’, ‘lollol’, ‘lollollol’, ... and ‘marklol’, ‘marklollol’, ‘marklolmark’, ... and ‘lolmark’, ‘lolmarklol’, ‘lolmarkmark’, ... (you get the point, right?)
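Both examples can be tried out with a few lines of Python. This is only a sketch; the strings ‘ab’ and ‘markl’ are extras I added to show what gets rejected:

```python
import re

a_star = re.compile(r"a*")
group_star = re.compile(r"(mark|lol)*")

# Zero repetitions are allowed, so the empty string matches too.
a_results = [s for s in ["", "a", "aaa", "ab"] if a_star.fullmatch(s)]

# Each repetition is free to pick either mark or lol.
group_samples = ["", "mark", "lolmark", "marklollol", "markl"]
group_results = [s for s in group_samples if group_star.fullmatch(s)]

print(a_results)      # ['', 'a', 'aaa']
print(group_results)  # ['', 'mark', 'lolmark', 'marklollol']
```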


The + Quantifier

The preceding element is repeated 1 or more times, or as I learnt it, (element)(element)*

Example 1: a+ matches ‘a’, ‘aa’, ‘aaa’, ...

Example 2: [ab]+ matches ‘a’, ‘b’, ‘aa’, ‘bb’ ... and ‘ab’, ‘ba’, ‘aab’, ‘aba’, ‘baa’, ‘bba’, ...
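The (element)(element)* way of reading + can be checked directly: a+ and aa* should accept exactly the same strings. A minimal Python sketch (the sample lists are my own):

```python
import re

a_plus = re.compile(r"a+")
a_expanded = re.compile(r"aa*")  # (element)(element)* written out by hand
ab_plus = re.compile(r"[ab]+")

samples = ["", "a", "aa", "aaa", "b"]
plus_results = [s for s in samples if a_plus.fullmatch(s)]
expanded_results = [s for s in samples if a_expanded.fullmatch(s)]

# Any non-empty mix of a and b gets through [ab]+.
ab_results = [s for s in ["", "a", "bba", "abc"] if ab_plus.fullmatch(s)]

print(plus_results)                      # ['a', 'aa', 'aaa']
print(plus_results == expanded_results)  # True
print(ab_results)                        # ['a', 'bba']
```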


The ? Quantifier

The preceding element appears either zero times or exactly once

Example 1: a?b matches ‘b’ and ‘ab’ only

Example 2: (mark|lol)?_zuckerberg matches ‘mark_zuckerberg’, ‘lol_zuckerberg’ and ‘_zuckerberg’ only
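The "only" in both examples is easy to verify in code. A short Python sketch, with a few extra strings of my own thrown in as negatives:

```python
import re

optional_a = re.compile(r"a?b")
optional_name = re.compile(r"(mark|lol)?_zuckerberg")

# The a may be present once or not at all, never twice.
a_results = [s for s in ["b", "ab", "aab", "a"] if optional_a.fullmatch(s)]

name_samples = [
    "mark_zuckerberg",
    "lol_zuckerberg",
    "_zuckerberg",
    "marklol_zuckerberg",  # two repetitions of the group, so no match
]
name_results = [s for s in name_samples if optional_name.fullmatch(s)]

print(a_results)     # ['b', 'ab']
print(name_results)  # ['mark_zuckerberg', 'lol_zuckerberg', '_zuckerberg']
```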


The {a,b} Quantifier

The preceding element repeated at least a times and at most b times

Example 1: hm{2,4} matches ‘hmm’, ‘hmmm’ and ‘hmmmm’ only

Example 2: l(ol){1,3} matches ‘lol’, ‘lolol’ and ‘lololol’ only
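Here is a rough Python check of both examples. The strings just below and just above the allowed range are my own additions, to show where the cut-offs are:

```python
import re

hm = re.compile(r"hm{2,4}")      # the quantifier applies to m only
lol = re.compile(r"l(ol){1,3}")  # the quantifier applies to the group (ol)

hm_samples = ["hm", "hmm", "hmmm", "hmmmm", "hmmmmm"]
hm_results = [s for s in hm_samples if hm.fullmatch(s)]

lol_samples = ["l", "lol", "lolol", "lololol", "lolololol"]
lol_results = [s for s in lol_samples if lol.fullmatch(s)]

print(hm_results)   # ['hmm', 'hmmm', 'hmmmm']
print(lol_results)  # ['lol', 'lolol', 'lololol']
```

Notice that in hm{2,4} the h is not repeated, since a quantifier only looks at the single element right before it.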


The {a} Quantifier

It's actually just an alternate representation of the {a,a} quantifier
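As a quick sanity check, the sketch below compares the two spellings in Python. The pattern hm{3} is a made-up example, chosen to stay close to hm{2,4} from before:

```python
import re

exact = re.compile(r"hm{3}")       # exactly three m's
explicit = re.compile(r"hm{3,3}")  # at least three and at most three m's

samples = ["hmm", "hmmm", "hmmmm"]
exact_results = [s for s in samples if exact.fullmatch(s)]
explicit_results = [s for s in samples if explicit.fullmatch(s)]

print(exact_results)                      # ['hmmm']
print(exact_results == explicit_results)  # True
```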

Leaving Note

Well, the above writeup should get you started with the humongous and scary world of Regular Expressions.

The regex pattern m(ar|oc)k generates the strings ‘mark’ and ‘mock’ only. Kudos to you if you got that right! If not, try and give it a go again?
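If you want to double-check the answer yourself, here is a minimal Python sketch. The wrong guesses in the list are my own, picked because they are easy mistakes to make:

```python
import re

# Pick either ar or oc as a whole, then wrap it between m and k.
pattern = re.compile(r"m(ar|oc)k")

guesses = ["mark", "mock", "mak", "marock", "mk", "marok"]
matched = [s for s in guesses if pattern.fullmatch(s)]

print(matched)  # ['mark', 'mock']
```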

Drop me a message here and let me know if there are any technical errors, what you found incomprehensible, or whatever is on top of your mind.
