Regex Fundamentals: From Theory to Practical Usage

om pramod
8 min read · Dec 19, 2023


Part 1: Introduction

“Did you know a simple pattern can tell you whether someone’s university ID is valid?” “Ever wondered how websites check the format of the details you type into a login form?” It’s all thanks to something called a Regular Expression! As a data analytics enthusiast, wrangling messy spreadsheets used to be my nemesis, until I discovered the power of Regular Expressions. One simple pattern, and suddenly organizing chaos became a breeze! In this blog, I’ll demystify Regex for you, using real-life examples to showcase its potential. So, buckle up and join me on this exciting journey into the world of Regular Expressions!

Imagine you’re handed a massive jigsaw puzzle with pieces scattered everywhere. Each piece represents a fragment of information within a sea of text. Your task is to find and organize these pieces to reveal the bigger picture. This seemingly daunting challenge mirrors the first impression many people have of regular expressions: it feels like deciphering a separate language designed for manipulating text. However, mastering regular expressions can save you a great deal of time if you work with text or need to parse large amounts of data. With regular expressions, you can automate tasks that would otherwise demand tedious manual effort.

Regular expressions, often abbreviated as regex or regexp or re, are powerful tools for pattern matching and text manipulation. With regex, you can define patterns to validate, search, replace, or extract specific strings from a larger body of text. At its core, a regular expression is a sequence of characters that forms a search pattern. This pattern is used to find matches within a given text. The syntax of regular expressions is based on a set of rules and special characters that define the search pattern. These special characters represent different types of characters or character classes, such as digits, letters, whitespace, or special symbols.

The terms “match” and “extract” have distinct meanings. “Matching” refers to identifying a specific pattern in the text, while “extraction” involves pulling out a particular part of the matched text. Let’s consider an example where we want to identify PhD scholars from a list of individuals in an organization. We might come across names prefixed with “Dr.”, indicating a doctorate degree. However, we’re interested in extracting only the name, not the “Dr.” prefix. In this case, we match the whole “Dr. XYZ” string and extract only the name, i.e. “XYZ”, leaving out the “Dr.” prefix. Let’s look at some simple examples to see regex in action.
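To make the match-versus-extract distinction concrete, here is a minimal Python sketch. The sample names, the function name extract_phd_names, and the exact pattern are assumptions made for illustration; the pattern matches the whole “Dr. XYZ” string but uses a capture group so that only the name is extracted.

```python
import re

# Match "Dr" or "Dr." at the start of the entry, then capture the name.
# The parentheses form a capture group: the part we want to extract.
DR_PATTERN = re.compile(r"^Dr\.?\s+(.+)$")

def extract_phd_names(entries):
    """Return the names (without the prefix) of entries that start with Dr/Dr."""
    names = []
    for entry in entries:
        match = DR_PATTERN.match(entry)   # matching: does the pattern fit?
        if match:
            names.append(match.group(1))  # extraction: keep only the name
    return names

# Hypothetical sample data
people = ["Dr. Asha Rao", "Vikram Mehta", "Dr Lena Park"]
print(extract_phd_names(people))
```

Entries without the prefix, such as “Vikram Mehta”, are simply skipped because the pattern does not match them.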

Imagine we’re trying to understand the language of sheep. Their vocalizations follow a simple pattern: “baa!” with varying numbers of “a”s.

For instance, they express themselves as:

  • “baa!”
  • “baaaaa!”
  • “baaaaaaaa!”
  • and so on…

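The challenge is figuring out a pattern that covers all these different sheep sounds. To crack this puzzle, we come up with a compact pattern called a “regular expression.” Setting aside the trailing “!”, the entirety of Sheeptalk can be encapsulated using the regular expression baa+. This pattern is like a key that unlocks the mystery of Sheeptalk. Let’s break down this expression:

  • b matches the literal letter “b”.
  • a matches the literal letter “a”.
  • a+ matches one or more further “a”s, because + means “one or more of the preceding character”.

Taken together, baa+ matches a “b” followed by at least two “a”s: “baa”, “baaa”, “baaaaa”, and so on.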
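As a quick check of the Sheeptalk pattern, here is a small Python sketch. The function name is_sheep_sound is hypothetical, and appending “!” to the pattern (baa+!) is an assumption made here so that the sample sounds, which all end in “!”, match in full.

```python
import re

# baa+ followed by a literal "!" so the whole utterance is covered
SHEEP_TALK = re.compile(r"baa+!")

def is_sheep_sound(text):
    """True if the entire string is a 'b', at least two 'a's, then '!'."""
    return SHEEP_TALK.fullmatch(text) is not None

for sound in ["baa!", "baaaaa!", "ba!", "moo!"]:
    print(sound, is_sheep_sound(sound))
```

fullmatch is used rather than search so that extra text before or after the sound makes the check fail.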

This is a simple yet powerful example of how regex can be used to match patterns in text. Here’s another example to consider: a regex that matches any string that starts with two letters (either uppercase or lowercase) followed by three digits. It’s a common pattern for course codes, part numbers, and other alphanumeric identifiers.
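One pattern that fits this description can be sketched as follows. The exact regex is an assumption written to match the wording above (two letters, then three digits, at the start of the string), and the function name starts_with_code is hypothetical.

```python
import re

# ^            anchor at the start of the string
# [A-Za-z]{2}  exactly two letters, uppercase or lowercase
# [0-9]{3}     exactly three digits
CODE_PATTERN = re.compile(r"^[A-Za-z]{2}[0-9]{3}")

def starts_with_code(text):
    """True if the string begins with two letters followed by three digits."""
    return CODE_PATTERN.match(text) is not None

for sample in ["CS101", "cs205 Data Structures", "C1011", "CSE101"]:
    print(sample, starts_with_code(sample))
```

Because the pattern only anchors the start, anything may follow the code; adding $ at the end would require the string to be exactly five characters long.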

Regular expressions are commonly used for tasks such as data validation, text parsing, and search and replace operations. Let’s take the example of matching date strings that follow the format “month day, year”, a common structure found in various documents, articles, or datasets. Consider the task of identifying and extracting dates like “January 1, 2023” or “Jan 1, 2023” from a body of text. With regular expressions, this seemingly complex task becomes remarkably straightforward. To create a regular expression that captures these date strings, we can define a pattern with three parts:

  • “month” is one or more letters.
  • “day” is one or two digits.
  • “year” is exactly four digits.

Putting it into Action:

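The three rules listed above can be turned into a pattern roughly like this. It is a sketch under stated assumptions: the month is treated as letters only (with an optional trailing period for abbreviations), and the function name find_dates and the sample sentence are made up for illustration.

```python
import re

# ([A-Za-z]+)\.?  month: one or more letters, optional period ("Jan.")
# ([0-9]{1,2})    day: one or two digits
# ([0-9]{4})      year: exactly four digits
DATE_PATTERN = re.compile(r"\b([A-Za-z]+)\.? ([0-9]{1,2}), ([0-9]{4})\b")

def find_dates(text):
    """Return a list of (month, day, year) tuples found in the text."""
    return DATE_PATTERN.findall(text)

sample = "The course opened on January 1, 2023 and the exam was on Jan 15, 2023."
print(find_dates(sample))
# [('January', '1', '2023'), ('Jan', '15', '2023')]
```

Note that this pattern only checks the shape of the string: it would also accept something like “Foo 99, 0000”, so real validation of month names and day ranges needs an extra step.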

As another example, a regular expression can be used to match and validate an email address format, extract specific data from a text document, or find and replace all occurrences of a word or phrase within a large dataset. A typical case is a regex used to find email addresses. This is useful to know about, because email addresses left in plain text on your website can be scraped automatically, which is a security and privacy risk.

Figure: Regular Expression e-mail matching example.

Such a regex matches email addresses in the format “localpart@domain.tld”. For example, it would match “example@example.com” or “user.name+tag@example.co.uk”. It checks that the string entered as an email address has the general shape of one; it does not prove that the address exists. The criteria include the presence of an ‘@’ symbol separating the local and domain parts, permitted characters before and after the ‘@’ symbol (alphanumeric characters, hyphens, underscores, and periods, plus the ‘+’ sign in the local part), and a domain part containing at least one period followed by a top-level domain (like .com or .net).
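A commonly used simplified email pattern can be sketched like this. The pattern is an assumption chosen to fit the criteria described above, not the full email address grammar, and the helper names find_emails and looks_like_email are hypothetical.

```python
import re

# [A-Za-z0-9._%+-]+   local part: letters, digits and a few symbols
# @                   the separator
# [A-Za-z0-9.-]+      domain labels
# \.[A-Za-z]{2,}      a period followed by a top-level domain of 2+ letters
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(text):
    """Extract every email-shaped substring from a larger text."""
    return EMAIL_PATTERN.findall(text)

def looks_like_email(value):
    """Validate that the whole string has the shape localpart@domain.tld."""
    return EMAIL_PATTERN.fullmatch(value) is not None

page = "Contact example@example.com or user.name+tag@example.co.uk today."
print(find_emails(page))
print(looks_like_email("user@localhost"))  # no period in the domain
```

The same pattern serves both purposes: findall extracts addresses from free text, while fullmatch validates a single form field.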

The example above might seem overwhelming at first. However, once you grasp the fundamental syntax and structure of regular expressions, interpreting it becomes as natural as reading a simple sentence. Let’s now explore the practical applications of regular expressions in the real world, to demonstrate their importance.

Form Validation: One of the most common uses of regular expressions is form validation.

  • Email Validation: Regex is used to ensure that the email address entered by the user follows the standard format, i.e., username@domain.tld.
  • Password Validation: Regex can be used to enforce password policies like minimum length, presence of uppercase and lowercase letters, numbers, and special characters.
  • Phone Number Validation: Regex can be used to validate phone numbers and ensure they follow a specific pattern.
  • URL Validation: Regex can be used to validate URLs to ensure they follow the standard URL format.
  • Username Validation: Regex can be used to restrict the characters allowed in a username, such as only allowing alphanumeric characters.
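Two of the checks listed above, username and password validation, might be sketched as follows. The specific rules (3 to 16 alphanumeric characters for usernames; at least 8 characters with a lowercase letter, an uppercase letter, a digit, and a special character for passwords) are assumptions chosen for illustration, and the function names are hypothetical.

```python
import re

# Username: only letters and digits, 3 to 16 characters long (assumed policy)
USERNAME_PATTERN = re.compile(r"[A-Za-z0-9]{3,16}")

# Password: each (?=...) lookahead asserts one requirement without
# consuming characters; .{8,} then enforces the minimum length.
PASSWORD_PATTERN = re.compile(
    r"(?=.*[a-z])"         # at least one lowercase letter
    r"(?=.*[A-Z])"         # at least one uppercase letter
    r"(?=.*[0-9])"         # at least one digit
    r"(?=.*[^A-Za-z0-9])"  # at least one special character
    r".{8,}"               # at least 8 characters in total
)

def is_valid_username(value):
    return USERNAME_PATTERN.fullmatch(value) is not None

def is_valid_password(value):
    return PASSWORD_PATTERN.fullmatch(value) is not None

print(is_valid_username("ompramod23"), is_valid_username("om_pramod"))
print(is_valid_password("Passw0rd!"), is_valid_password("password"))
```

Real password policies and phone or URL formats vary widely, so patterns like these are usually adjusted to the rules of the specific application.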

Bank Account details: You may have noticed that every bank branch in India has an IFSC code, which starts with a four-letter code identifying the bank. A credit card number typically consists of 16 digits, and the first few digits indicate whether the card is Mastercard, Visa, or RuPay. In all these cases, regex can be used to check the format.
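A rough sketch of these checks in Python is shown below. The IFSC pattern follows the published 11-character layout (four letters, a zero, then six alphanumeric characters). The card prefixes are deliberately simplified assumptions: only the best-known Visa and Mastercard ranges are included and RuPay is left out, so the function names and the “Unknown” result are illustrative rather than a complete rule set.

```python
import re

# IFSC: 4 letters (bank), a literal 0, then 6 alphanumeric characters (branch)
IFSC_PATTERN = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")
SIXTEEN_DIGITS = re.compile(r"[0-9]{16}")

def is_valid_ifsc(code):
    return IFSC_PATTERN.fullmatch(code) is not None

def card_network(number):
    """Guess the network of a 16-digit card number from its leading digits."""
    digits = re.sub(r"[ -]", "", number)   # drop spaces and hyphens
    if not SIXTEEN_DIGITS.fullmatch(digits):
        return None                        # not a 16-digit number
    if re.match(r"4", digits):             # Visa numbers start with 4
        return "Visa"
    if re.match(r"5[1-5]", digits):        # classic Mastercard range 51-55
        return "Mastercard"
    return "Unknown"                       # other networks not modelled here

print(is_valid_ifsc("SBIN0001234"))
print(card_network("4111 1111 1111 1111"))
```

A format check like this says nothing about whether the account or card exists; it only filters out strings that cannot be valid.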

Search and Social Media Platforms: Regular expressions (regex) are a fundamental tool for the kind of text processing that search engines and social media platforms like Google, Facebook, and Twitter perform. Here’s how:

  • Google: Search queries can contain operators, such as a specific file type (like filetype:pdf) or a site-specific search (like site:example.com). A regex can recognize the filetype: pattern and the file extension that follows, so the search can be limited to PDF files. Similarly, a regex can identify the site: pattern and the domain that follows, so results come only from that website.

  • Facebook: When you type something into the search bar, your search terms are matched against user names, post content, group names, etc., and the matched terms are highlighted in the search results. Matching and highlighting of this kind can be done with regex.

  • Twitter: When you compose a tweet and include a hashtag (like #example), the # symbol followed by a sequence of characters can be identified as a hashtag with regex. This hashtag then becomes a clickable link that leads to a search results page with other tweets that contain the same hashtag. You can extract @mentions, URLs, etc. in the same way.
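The hashtag, mention, and search-operator patterns described above can be sketched in a few lines of Python. This is an illustration of the idea only, not how any of these platforms actually implement it; the patterns, function names, and sample strings are assumptions.

```python
import re

HASHTAG = re.compile(r"#(\w+)")          # "#" followed by word characters
MENTION = re.compile(r"@(\w+)")          # "@" followed by word characters
URL = re.compile(r"https?://\S+")        # http(s) link up to the next space
OPERATOR = re.compile(r"\b(filetype|site):(\S+)")  # search operators

def parse_tweet(text):
    """Extract hashtags, mentions and URLs from a tweet-like string."""
    return {
        "hashtags": HASHTAG.findall(text),
        "mentions": MENTION.findall(text),
        "urls": URL.findall(text),
    }

def parse_operators(query):
    """Return (operator, value) pairs found in a search query."""
    return OPERATOR.findall(query)

tweet = "Learning #regex with @om_pramod today! #Python https://example.com/post"
print(parse_tweet(tweet))
print(parse_operators("regex tutorial filetype:pdf site:example.com"))
```

One limitation of this simple mention pattern is that it would also pick up the part after “@” in an email address, so production code usually adds extra context checks.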

Natural Language Processing (NLP):

  • Data Cleaning: In NLP, data cleaning is a vital step that involves removing unnecessary or irrelevant data from the text. Regex can be used to remove stop words (commonly used words such as ‘is’, ‘an’, ‘the’, etc. that do not carry much meaningful information for the desired analysis) from the text.

  • Sentence Boundary Detection: Regex can be used to identify the start and end of sentences within a larger body of text. This is often done by looking for punctuation marks like periods, question marks, and exclamation marks that typically signify the end of a sentence.

  • Part-of-Speech Tagging: While regex is not typically the primary method used for part-of-speech tagging, it can be used to create rules that help guide the tagging process. For example, a rough rule might state that any word ending in “ing” should be tagged as a verb, even though such a rule will mislabel nouns like “morning”.
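The three NLP uses above can each be sketched with a short regex in Python. The stop-word list, function names, and sample sentences are assumptions for illustration; real pipelines typically use larger word lists and dedicated NLP libraries.

```python
import re

STOP_WORDS = ["is", "an", "the", "a"]  # tiny assumed list
# \b...\b ensures whole words only, so "an" does not match inside "and"
STOP_RE = re.compile(r"\b(?:" + "|".join(STOP_WORDS) + r")\b", re.IGNORECASE)

def remove_stop_words(text):
    """Delete stop words, then collapse the leftover extra spaces."""
    without = STOP_RE.sub("", text)
    return re.sub(r"\s+", " ", without).strip()

def split_sentences(text):
    """Split on whitespace that follows '.', '?' or '!'."""
    return re.split(r"(?<=[.?!])\s+", text.strip())

def tag_ing_words(text):
    """Rule-based guess: tag every word ending in 'ing' as a verb."""
    return [(word, "VERB") for word in re.findall(r"\b\w+ing\b", text)]

print(remove_stop_words("The cat is on the mat"))
print(split_sentences("Regex is fun. Is it hard? Not really!"))
print(tag_ing_words("She is running and singing"))
```

The sentence splitter will break on abbreviations such as “Dr.”, and the tagging rule will mislabel words like “thing”, which is why these rules usually support rather than replace statistical methods.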

Closing Note: Thank you for joining me on this exploration of regex basics. I hope this introduction has illuminated the fundamental concepts of regex and serves as a stepping stone for your deeper dive into its functionality.

The next part delves into the syntax, functions, and real-world applications of regex. Don’t miss out on Part 2: Understanding Regex Components. Your curiosity and enthusiasm have brought you this far, and I’m excited for what lies ahead. Happy coding and see you in part two!
