Regular Expressions For Beginners: Lesson 1

Umar Ashfaq
Published in Eastros, Aug 7, 2012

A regular expression defines a search pattern for strings. This pattern may match one or several times or not at all for a given string. The abbreviation for regular expression is regex.

Regular expressions can help with

  • Searching and pattern matching
  • Text replacement
  • Collecting information (word count, number of spaces, number of lines)

They are also widely supported:

  • Available in almost all languages (Perl, PHP, Java, .NET, JavaScript)
  • Many standard operating system programs support regex (grep, Windows search). Other programs, such as syntax highlighters and Eclipse Search, also rely on regex.
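As a rough illustration of the first three uses, here is a minimal JavaScript sketch. The sample text and variable names are invented for this example, and the syntax it uses (\w, +, the g flag) is explained in the sections that follow.

```javascript
// Sample text; the string and all variable names are made up for this sketch.
const text = "The quick brown fox\njumps over the lazy dog";

// Searching: does the text contain "fox"?
const hasFox = /fox/.test(text);

// Text replacement: swap "lazy" for "sleepy".
const replaced = text.replace(/lazy/, "sleepy");

// Collecting information: count words, spaces and lines.
// match() returns null when nothing matches, hence the "|| []" fallback.
const wordCount = (text.match(/\w+/g) || []).length;
const spaceCount = (text.match(/ /g) || []).length;
const lineCount = text.split(/\n/).length;

console.log(hasFox, wordCount, spaceCount, lineCount);
console.log(replaced);
```

Here a "word" is simply a run of \w characters, which is a simplification; real text with apostrophes or hyphens would need a more careful pattern.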

I highly recommend going through all the syntax in the table below and getting familiar with it. This is, and should be, your first step in learning regex. If you don’t understand what the basic syntax means, you probably won’t be able to write efficient and correct regexes. One more thing: a bad or wrong regex can be catastrophic.

Let’s dive into the deep sea of regex by learning its basic syntax first.

Basic Syntax

Each entry lists the syntax character first, then its meaning.

^

Matches beginning of the string.
For example, ^A does not match the ‘A’ in “an A”, but does match the first ‘A’ in “An A.”

$

Matches end of the string.
For example, t$ does not match the ‘t’ in “eater”, but does match it in “eat”.

.

(The decimal point) matches any single character except the newline characters: \n \r \u2028 or \u2029.
For example, .n matches ‘an’ and ‘on’ in “nay, an apple is on the tree”, but not ‘nay’.

\

For characters that are usually treated specially, indicates that the next character is not special and should be interpreted literally.
For example, * is a special character that means 0 or more occurrences of the preceding character should be matched (* is explained below), so a* means match 0 or more “a”s. To match * literally, precede it with a backslash; for example, a\* matches ‘a*’.

*

Matches the preceding item 0 or more times.
For example, bo* matches ‘boooo’ in “A ghost booooed” and ‘b’ in “A bird warbled”, but nothing in “A goat grunted”.

+

Matches the preceding item 1 or more times.
For example, a+ matches the ‘a’ in “candy” and all the a’s in “caaaaaaandy”.

?

Matches the preceding item 0 or 1 time.
For example, e?le? matches the ‘el’ in “angel” and the ‘le’ in “angle.”
If used immediately after any of the quantifiers *, +, ?, or {}, makes the quantifier non-greedy (matching the minimum number of times), as opposed to the default, which is greedy (matching the maximum number of times).

\d

Matches any digit. It’s equivalent to [0-9].
For example, \d or [0-9] matches ‘2’ in “A2”.

\D

Matches any non-digit character. It’s equivalent to [^0-9].
For example, \D matches ‘A’ in “A2”.

\w

Matches any alphanumeric character, including the underscore. It’s equivalent to [a-zA-Z0-9_].
For example, \w matches ‘a’ in “apple,” ‘5’ in “$5.28,” and ‘3’ in “3D.”

\W

Opposite to \w. Matches any character other than an alphanumeric character or underscore. It’s equivalent to [^a-zA-Z0-9_].

\s

Matches a single white space character, including space, tab, form feed, line feed and other Unicode spaces.
For example, \s\w* matches ‘ bar’ in “foo bar.”

\S

Matches a single character other than white space.
For example, \S\w* matches ‘foo’ in “foo bar.”

\t

Matches a tab.

(x)

Matches x and remembers the match. These are called capturing parentheses.
For example, (foo) matches and remembers ‘foo’ in “foo bar.” The matched substring can be recalled from the resulting array’s elements [1], …, [n]. Almost all languages have built-in support for accessing the captured groups.

(?:x)

Opposite to (x). Matches x but does not remember the match.

x(?=y)

Matches x if and only if followed by y.
For example, Jack(?=Sprat) matches ‘Jack’ only if it is followed by ‘Sprat’. Jack(?=Sprat|Frost) matches ‘Jack’ only if it is followed by ‘Sprat’ or ‘Frost’. However, neither ‘Sprat’ nor ‘Frost’ is part of the match results.

x(?!y)

Opposite to x(?=y). Matches x only if not followed by y.
For example, \d+(?!\.) matches a number only if it is not followed by a decimal point.

x|y

Matches x or y.
For example, green|red matches ‘green’ in “green apple” and ‘red’ in “red apple.”

{n}

Where n is a positive integer. Matches exactly n occurrences of the preceding item.
For example, a{2} doesn’t match the ‘a’ in “candy,” but it matches all of the a’s in “caandy,” and the first two a’s in “caaandy.”

{n,}

Where n is a positive integer. Matches at least n occurrences of the preceding item.
For example, a{2,} doesn’t match the ‘a’ in “candy”, but matches all of the a’s in “caandy” and in “caaaaaaandy.”

{n,m}

Where n and m are positive integers. Matches at least n and at most m occurrences of the preceding item.
For example, a{1,3} matches nothing in “cndy”, the ‘a’ in “candy,” the first two a’s in “caandy,” and the first three a’s in “caaaaaaandy”. Notice that when matching “caaaaaaandy”, the match is “aaa”, even though the original string had more a’s in it.

[xyz]

A character set. Matches any one of the enclosed characters x, y or z. A range of characters can also be specified within square brackets.
For example, [abcd] is the same as [a-d]. They match the ‘b’ in “brisket” and the ‘c’ in “chop”.

[^xyz]

Opposite to [xyz]. It’s called a negated or complemented character set. Matches anything that is not in the brackets.
For example, [^abc] is the same as [^a-c]. They initially match ‘r’ in “brisket” and ‘h’ in “chop.”
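A few of the entries above can be checked directly in JavaScript. The sketch below is only an illustration: firstMatch is a hypothetical helper written for this example, not a built-in function, and the inputs are the sample strings from the table.

```javascript
// Hypothetical helper: return the first substring matched by the pattern, or null.
function firstMatch(pattern, input) {
  const result = input.match(pattern);
  return result ? result[0] : null;
}

// Anchors
const startAnchor = firstMatch(/^A/, "an A");   // null, the string does not start with "A"
const endAnchor = firstMatch(/t$/, "eat");      // "t"

// Quantifiers: greedy by default, non-greedy when followed by "?"
const zeroOrMore = firstMatch(/bo*/, "A ghost booooed");  // "boooo"
const greedy = firstMatch(/a+/, "caaaaaaandy");           // every "a" in the run
const lazy = firstMatch(/a+?/, "caaaaaaandy");            // "a"
const bounded = firstMatch(/a{1,3}/, "caaaaaaandy");      // "aaa"

// Character classes and sets
const digit = firstMatch(/\d/, "A2");               // "2"
const nonDigit = firstMatch(/\D/, "A2");            // "A"
const negatedSet = firstMatch(/[^abc]/, "brisket"); // "r"

// Capturing group: element [1] of the result holds the remembered text
const captured = "foo bar".match(/(foo)/)[1];       // "foo"

// Lookahead: the text inside (?=...) or (?!...) is not part of the match
const lookahead = firstMatch(/Jack(?=Sprat)/, "JackSprat"); // "Jack"
const negativeLookahead = firstMatch(/\d+(?!\.)/, "3.141"); // "141"

console.log(startAnchor, endAnchor, zeroOrMore, greedy, lazy, bounded);
console.log(digit, nonDigit, negatedSet, captured, lookahead, negativeLookahead);
```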

Modifiers

Most languages that support regex provide flags (also called modifiers) that change how a pattern is applied. These flags can be used separately or in combination. The most common ones are as follows.

  • g : Perform global matching (find all matches instead of stopping after the first)
  • i : Ignore case
  • m : Perform multiline matching (^ and $ also match at the start and end of each line)
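The effect of each flag can be seen in a short JavaScript sketch. The sample strings and variable names are invented for this illustration; other languages expose the same ideas through different syntax.

```javascript
// Sample strings made up for this sketch.
const colors = "red apple, red cherry, Red tomato";

// No flags: match() stops at the first match.
const firstOnly = colors.match(/red/)[0];

// g: return every match, still case sensitive.
const allMatches = colors.match(/red/g);

// g and i combined: every match, ignoring case.
const allIgnoringCase = colors.match(/red/gi);

// m: ^ matches at the start of every line, not only at the start of the string.
const lines = "first line\nsecond line";
const withoutMultiline = lines.match(/^\w+/g);
const withMultiline = lines.match(/^\w+/gm);

console.log(firstOnly);         // red
console.log(allMatches);        // [ 'red', 'red' ]
console.log(allIgnoringCase);   // [ 'red', 'red', 'Red' ]
console.log(withoutMultiline);  // [ 'first' ]
console.log(withMultiline);     // [ 'first', 'second' ]
```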

Examples in Java and JavaScript

An example in Java and JavaScript shows how these two languages handle regex.

  • Find the string John in the sentence “Mike and John were good friends”
    JavaScript
    [sourcecode language="java"]
    var str = 'Mike and John were good friends';
    var regex = /John/;
    alert(str.match(regex)); // shows John
    [/sourcecode]
  • But the above pattern is case sensitive, so the pattern /john/ will show null. To make it case insensitive, add the case-insensitive modifier “i” at the end.
  • [sourcecode language="java"]
    regex = /john/i;
    alert(str.match(regex)); // shows John
    [/sourcecode]
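  • In Java
  • [sourcecode language="java"]
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    public class FindJohn {
        public static void main(String[] args) {
            String str = "John and Mike were good friends";
            Pattern p = Pattern.compile("John"); // compile the pattern
            Matcher m = p.matcher(str);
            if (m.find()) { // find() searches anywhere in the string
                System.out.print(m.group()); // will output John
            }
        }
    }
    [/sourcecode]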
  • Now that you are familiar with the syntax, I’ll write down only the patterns, not the whole code, for the rest of the examples.
  • Do a global search for “is”:
    [sourcecode language="java"]
    var str = 'Is this all there is?';
    var patt1 = /is/g; // matches the "is" in "this" and then the last "is"
    [/sourcecode]
  • The bracketed text below shows where the expression gets a match
    Is th[is] all there [is]?
    To make the search case insensitive as well, add the i flag
  • [sourcecode language="java"]
    var pat1 = /is/ig;
    [/sourcecode]
  • then the result would be
    [Is] th[is] all there [is]?
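To see for yourself where a global pattern matches, one option is to wrap every match in square brackets. This is a minimal sketch: markMatches is a hypothetical helper written for this illustration, and it relies on the $& token, which JavaScript's replace() substitutes with the matched text.

```javascript
// Hypothetical helper: wrap every match of the pattern in square brackets.
// In the replacement string, "$&" stands for the matched substring.
function markMatches(input, pattern) {
  return input.replace(pattern, "[$&]");
}

const question = "Is this all there is?";

const caseSensitive = markMatches(question, /is/g);
const caseInsensitive = markMatches(question, /is/gi);

console.log(caseSensitive);   // Is th[is] all there [is]?
console.log(caseInsensitive); // [Is] th[is] all there [is]?
```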
  • Basic email address validation
    Our valid email address will contain one or more alphanumeric characters, underscores and dots at the start, then an @ sign, then one or more alphanumeric characters or underscores representing the domain name, then a dot (.), and finally two or three letters representing the top-level domain (TLD), such as .com or .net.
    [sourcecode language="java"]
    \b[\w.]+@[\w]+\.[a-zA-Z]{2,3}\b
    [/sourcecode]
  • Alternatively, it can also be written like this
  • [sourcecode language="java"]
    \b[a-zA-Z0-9_.]+@[a-zA-Z0-9_]+\.[a-zA-Z]{2,3}\b
    [/sourcecode]
  • Explanation
  • [sourcecode language="java"]
    \b # A word boundary
    [\w.] or [a-zA-Z0-9_.] # \w and [a-zA-Z0-9_] are the same. Matches any one alphanumeric character, underscore or dot (.). A dot inside square brackets is taken literally.
    + # One or more times
    @ # The @ sign
    [\w] or [a-zA-Z0-9_] # Any alphanumeric character or underscore
    + # One or more times
    \. # A dot (.). Since it is a special character, it must be escaped with \
    [a-zA-Z] # A through Z, lowercase or uppercase
    {2,3} # At least 2 and at most 3 of the preceding item
    \b # A word boundary
    [/sourcecode]
  • This regex does not validate every email address, only a specific kind. It accepts the following addresses
    ali.hammad@yahoo.com
    ali_h_1984@my_domain.net
  • But it does not accept these addresses as a whole
  • ali@domain.co.uk # only one dot-separated part is allowed after the domain name, so the pattern stops at ali@domain.co
    ali.hammad # must contain an @ symbol
    ali.h8@yahoo.c # there must be at least 2 letters after the dot
  • I will soon write a complete email validator regex and I will share that with you. But this basic regex will serve your purpose of understanding regex.
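The basic email pattern can be tried out in JavaScript as follows. This is a sketch under one assumption that is not in the original pattern: the \b boundaries are swapped for the ^ and $ anchors so that the whole string has to match, which is what validation usually needs. The names emailPattern and isBasicEmail are made up for this example.

```javascript
// Anchored variant of the basic pattern (^ and $ are an assumption of this sketch).
const emailPattern = /^[\w.]+@[\w]+\.[a-zA-Z]{2,3}$/;

// Hypothetical helper: true when the whole string fits the basic pattern.
function isBasicEmail(address) {
  return emailPattern.test(address);
}

const accepted = ["ali.hammad@yahoo.com", "ali_h_1984@my_domain.net"].map(isBasicEmail);
const rejected = ["ali@domain.co.uk", "ali.hammad", "ali.h8@yahoo.c"].map(isBasicEmail);

// The original \b version only searches inside the string, so it still finds
// a partial match within an address that the anchored version rejects.
const unanchored = /\b[\w.]+@[\w]+\.[a-zA-Z]{2,3}\b/;
const partial = "ali@domain.co.uk".match(unanchored)[0];

console.log(accepted); // [ true, true ]
console.log(rejected); // [ false, false, false ]
console.log(partial);  // ali@domain.co
```

Whether to anchor the pattern depends on the job: searching for addresses inside a longer text calls for the \b version, while checking a single input field calls for the anchored one.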
  • Another example
    Our string is:
    [sourcecode language="java"]Don’t ever write another regex / regular expression / RegExp blindly[/sourcecode]
  • and we want to find out if the string contains any of these words
    1- regex
    2- regular expression
    3- RegExp
  • The pattern would be something like this
  • [sourcecode language="java"]\b[Rr]eg([Ee]xp?|ular expression)\b[/sourcecode]
  • [sourcecode language="java"]
    \b # A word boundary
    [Rr] # Either R or r
    eg # The characters e and g, with no space
    ( # Start of capturing group
    [Ee]x # Either E or e, followed by x
    p? # The character p, zero or one time
    | # OR
    ular expression # "ular", then a space, then the word "expression"
    ) # End of capturing group
    \b # A word boundary
    [/sourcecode]
  • This pattern matches “regex” or “Regex”, and “regular expression” or “Regular expression”, but not “Regular Expression” with a capital E. It also matches “RegExp” or “regexp”.
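The pattern can be run against the sample string in JavaScript to confirm what it does and does not match. In this sketch the variable names are invented for illustration, and the g flag is added so that match() returns every occurrence.

```javascript
const sample = "Don't ever write another regex / regular expression / RegExp blindly";

// g flag added so that all occurrences are returned, not only the first.
const found = sample.match(/\b[Rr]eg([Ee]xp?|ular expression)\b/g);

// A capital E in "Expression" is not covered by the pattern.
const matchesCapitalE = /\b[Rr]eg([Ee]xp?|ular expression)\b/.test("Regular Expression");

console.log(found);           // [ 'regex', 'regular expression', 'RegExp' ]
console.log(matchesCapitalE); // false
```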
