Regular Expressions from [a-zA-Z]

Siddharth
Aug 28, 2023

What are Regular Expressions (Regex): -

A regular expression (regex) is a sequence of characters that defines a search pattern, used for pattern matching and string manipulation. Regexes provide a concise way to describe complex text patterns, which makes searching efficient, but their special characters can make them hard to read. They also support advanced features such as grouping, backreferences and lookaheads/lookbehinds. In this blog post we’ll explore regex using Python 3.

Python RegEx: -

Python has a built-in module called ‘re’, which you must import to work with regex.

Meta Characters: -

Meta characters are characters with a special meaning that are used to construct regex patterns. Here’s a list of different meta characters:

Image by Author
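
As a quick sketch of how a few common meta characters behave (the sample strings and variable names are made up for illustration):

```python
import re

# '.' matches any single character except a newline
any_char = re.findall(r"c.t", "cat cot cut ct")
print(any_char)  # ['cat', 'cot', 'cut']

# '^' anchors the pattern at the start, '$' at the end of the string
starts = bool(re.search(r"^Hello", "Hello world"))
ends = bool(re.search(r"world$", "Hello world"))
print(starts, ends)  # True True

# '|' means "either side"; the parentheses group the alternatives
either = re.findall(r"gr(?:a|e)y", "gray grey groy")
print(either)  # ['gray', 'grey']

# '\' escapes a meta character so that it is matched literally
literal_dots = re.findall(r"\.", "a.b.c")
print(literal_dots)  # ['.', '.']
```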

Special Sequences: -

These are short code representations of frequent patterns. They simplify some complex patterns. Here’s a list:

Image by Author

NOTE: I’ve not included all special sequences. There are others, such as \B and \W.
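
Here is a small sketch of four frequently used special sequences (\d, \w, \s and \b) in action. The sample sentence is invented for illustration.

```python
import re

sample = "Order 66 shipped on 2023-08-28 to room_42."

digits = re.findall(r"\d+", sample)         # runs of digits
words = re.findall(r"\w+", sample)          # runs of letters, digits and underscores
spaces = re.findall(r"\s", sample)          # individual whitespace characters
whole_word = re.findall(r"\bon\b", sample)  # 'on' only as a standalone word

print(digits)       # ['66', '2023', '08', '28', '42']
print(words)        # ['Order', '66', 'shipped', 'on', '2023', '08', '28', 'to', 'room_42']
print(len(spaces))  # 6
print(whole_word)   # ['on']
```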

Example:

import re

text = "x8989072"
x = r'\A[a-zA-Z][0-9]+'

result = bool(re.match(x, text))
print(result) # Output: True

This returns ‘True’ if the string starts with a letter (lowercase OR uppercase) followed by one OR more digits between ‘0–9’. Note that re.match() only ever checks at the beginning of the string, so the \A anchor is optional here.
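
To see how match() differs from its siblings, here is a rough comparison of re.match(), re.search() and re.fullmatch() using the same pattern without the \A anchor. The extra test strings are made up for illustration.

```python
import re

pattern = r"[a-zA-Z][0-9]+"

# match() only looks at the beginning of the string
at_start = bool(re.match(pattern, "x8989072"))
not_at_start = bool(re.match(pattern, "!!x8989072"))

# search() finds the pattern anywhere in the string
anywhere = bool(re.search(pattern, "!!x8989072"))

# match() ignores trailing text, fullmatch() requires the whole string to fit
trailing_ok = bool(re.match(pattern, "x8989072abc"))
entire = bool(re.fullmatch(pattern, "x8989072abc"))

print(at_start, not_at_start, anywhere, trailing_ok, entire)
# True False True True False
```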

The search() function in Regex: -

This scans the string for the first location where the pattern matches and returns a match object for it (or None if nothing matches). Only the first match is returned; later matches are not reported.

import re

text = "Stop Scrolling Instagram, Start Learning Python"
x = re.search(r"S", text)
print(x)
# Output: <re.Match object; span=(0, 1), match='S'>

As you can see, only the ‘S’ at the start of the word “Stop” was matched (span=(0, 1)). This is because the search function only returns the first match in the string.
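
A match object is more useful once you pull values out of it. The sketch below uses a slightly extended pattern (S\w+, my own variation for illustration) and also shows what happens when nothing matches.

```python
import re

text = "Stop Scrolling Instagram, Start Learning Python"

# 'S' followed by one or more word characters
match = re.search(r"S\w+", text)
if match:
    print(match.group())  # Stop
    print(match.span())   # (0, 4)

# search() returns None when the pattern is not found,
# so always check the result before calling .group()
missing = re.search(r"Java", text)
print(missing)  # None
```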

The findall() function in Regex: -

This function returns a list containing all the non-overlapping matches of our regex, in the order they were found.

import re

text = "AI's AI coded AI for AI to decipher AI's complex AI algorithms."
x = re.findall(r"AI", text)
print(x)
# Output: ['AI', 'AI', 'AI', 'AI', 'AI', 'AI']

Here, we can observe that the findall() function has returned a list containing every instance of ‘AI’ it found in our string.
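
One detail worth knowing: when the pattern contains capture groups, findall() returns the groups instead of the whole match. A minimal sketch with an invented key=value string:

```python
import re

text = "alice=30, bob=25, carol=41"

# No groups: each item is the full match
whole = re.findall(r"\w+=\d+", text)
print(whole)  # ['alice=30', 'bob=25', 'carol=41']

# Two groups: each item is a tuple with one entry per group
pairs = re.findall(r"(\w+)=(\d+)", text)
print(pairs)  # [('alice', '30'), ('bob', '25'), ('carol', '41')]
```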

The sub() function in Regex: -

This function will replace the matches with the character/text of your choice.

import re

text = "Anxious alligators awkwardly assembled, eagerly awaiting their afternoon appetizer."
x = re.sub("a", "@", text)
print(x)
# Output: Anxious @llig@tors @wkw@rdly @ssembled, e@gerly @w@iting their @fternoon @ppetizer.

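Here, we can see that all the occurrences of the lowercase letter ‘a’ were replaced with the character ‘@’. The capital ‘A’ in “Anxious” is untouched because regex matching is case-sensitive by default.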
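
sub() also accepts a few optional extras. The sketch below (on a shortened version of the sentence, chosen for illustration) shows the flags and count parameters and a backreference in the replacement.

```python
import re

text = "Anxious alligators awkwardly assembled"

# flags=re.IGNORECASE makes 'a' match 'A' as well
both_cases = re.sub(r"a", "@", text, flags=re.IGNORECASE)
print(both_cases)  # @nxious @llig@tors @wkw@rdly @ssembled

# count limits how many replacements are made
first_two = re.sub(r"a", "@", text, count=2)
print(first_two)  # Anxious @llig@tors awkwardly assembled

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")
print(swapped)  # world hello
```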

The split() function in Regex: -

This returns a list in which the string has been split at every match of our pattern.

import re

txt = "AI wrote a poem; roses are #FF0000, violets are #0000FF."
x = re.split(r"\s", txt)  # raw string so that \s reaches the regex engine unchanged
print(x)
# Output: ['AI', 'wrote', 'a', 'poem;', 'roses', 'are', '#FF0000,', 'violets', 'are', '#0000FF.']

In this example, I split the string at any instance of whitespace character (\s).
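
Splitting on a single \s can leave empty strings behind when the text has repeated whitespace. A short sketch with invented sample strings shows the difference, along with the maxsplit parameter and a character class as delimiter.

```python
import re

messy = "AI  wrote\ta poem"  # two spaces, then a tab

single = re.split(r"\s", messy)
print(single)   # ['AI', '', 'wrote', 'a', 'poem']

# \s+ treats a run of whitespace as one separator
runs = re.split(r"\s+", messy)
print(runs)     # ['AI', 'wrote', 'a', 'poem']

# maxsplit stops after the given number of splits
limited = re.split(r"\s+", messy, maxsplit=1)
print(limited)  # ['AI', 'wrote\ta poem']

# Split on ';' or ',' followed by optional whitespace
fields = re.split(r"[;,]\s*", "roses; violets,tulips")
print(fields)   # ['roses', 'violets', 'tulips']
```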

Character Classes: -

Character classes in regex allow you to define a group of characters that you want to match. They are enclosed in “[]” (square brackets). Example: The regex ‘[aeiou]’ matches any vowel in your string. Here’s a list to better understand them:

Image by Author

Example:

import re

text = "Python has different libraries such as pandas, numpy, pytorch, scikit-learn, etc"
x = re.findall(r"[^aeiou\s]", text)
print(x)

This code returns a list of every character that is not a lowercase vowel or a whitespace character, so in effect it filters those characters out. Character classes are also referred to as sets.
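
Here is a brief sketch of ranges and negation inside character classes, plus a sub() call that actually deletes vowels and whitespace from a string. The sample strings are made up for illustration.

```python
import re

text = "Room B12, Floor 3, code x9-Z"

upper = re.findall(r"[A-Z]", text)            # a range of uppercase letters
digits = re.findall(r"[0-9]", text)           # a range of digits
other = re.findall(r"[^a-zA-Z0-9\s]", text)   # '^' inside [] negates the set

print(upper)   # ['R', 'B', 'F', 'Z']
print(digits)  # ['1', '2', '3', '9']
print(other)   # [',', ',', '-']

# Replacing the set with an empty string removes those characters
stripped = re.sub(r"[aeiou\s]", "", "pandas numpy")
print(stripped)  # pndsnmpy
```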

Quantifiers: -

Quantifiers in regex control the number of occurrences of a preceding element. They let you specify how many times a character, group, or character class should appear in the input string. Here’s a list to better understand them:

Image by Author

Example:

import re

text = "My email address is regexxy8008@python.com and my phone number is 45"
x = re.findall(r"[0-9]{3,4}", text)
print(x)
# Output: ['8008']

This returns only runs of 3 to 4 consecutive digits. ‘8008’ qualifies, while ‘45’ is too short to match.
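
The other common quantifiers can be sketched the same way. The strings below are invented for illustration. The last two lines contrast a greedy quantifier with its lazy version (the same quantifier followed by ?).

```python
import re

optional = re.findall(r"colou?r", "color colour colouur")  # '?' = zero or one
one_or_more = re.findall(r"go+", "g go goo gooo")          # '+' = one or more
zero_or_more = re.findall(r"go*", "g go goo")              # '*' = zero or more
exactly_four = re.findall(r"\b\d{4}\b", "12 1234 123456")  # '{4}' = exactly four

print(optional)      # ['color', 'colour']
print(one_or_more)   # ['go', 'goo', 'gooo']
print(zero_or_more)  # ['g', 'go', 'goo']
print(exactly_four)  # ['1234']

# Greedy quantifiers grab as much as possible, lazy ones as little as possible
greedy = re.findall(r"<.+>", "<b>bold</b>")
lazy = re.findall(r"<.+?>", "<b>bold</b>")
print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']
```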

Lookaheads: -

Lookaheads are advanced features in regex that allow you to check for patterns ahead of the current position in the string, without actually including them in the match.

A positive lookahead ‘(?=content)’ asserts that the specified pattern must be ahead of the current position in the string.

A negative lookahead ‘(?!content)’ asserts that the specified pattern must not be ahead of the current position.

Example: Suppose you want to find all occurrences of the word “pineapple” only if it is followed by the word “juice”.

import re

text = "I love pineapple juice, but I do not like pineapple ice-cream."
result = re.findall(r'pineapple(?= juice)', text)

if result:
    print("Found 'pineapple' followed by 'juice'")
    print(result)
else:
    print("No match found")

This is an example of a positive lookahead.
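
For comparison, a negative lookahead on the same sentence might look like the sketch below. The trailing [\w-]+ part and the currency string are my own additions for illustration, so that the output shows which occurrence was picked and that the lookahead text is not part of the match.

```python
import re

text = "I love pineapple juice, but I do not like pineapple ice-cream."

# 'pineapple' NOT followed by ' juice', plus the word that comes after it
not_juice = re.findall(r"pineapple(?! juice) [\w-]+", text)
print(not_juice)  # ['pineapple ice-cream']

# The lookahead is only a check: ' USD' is required but not returned
usd_amounts = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")
print(usd_amounts)  # ['10', '30']
```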

Lookbehinds: -

Lookbehinds are advanced features which allow you to check for patterns behind the current position in the string.

A positive lookbehind ‘(?<=content)’ asserts that the specified pattern must be behind the current position.

A negative lookbehind ‘(?<!content)’ asserts that the specified pattern must not be behind the current position.

Example: Suppose you want to find all occurrences of the word “flower” only if it is not preceded by the word “blue”.

import re

text = "A red flower is my favourite, but blue flower is my father's favourite."
result = re.findall(r'(?<!blue )flower', text)

if result:
    print("Found 'flower' not preceded by 'blue'")
    print(result)
else:
    print("No match found")

This is an example of a negative lookbehind.
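
A positive lookbehind can be sketched in the same way. The price string here is invented for illustration. The second half demonstrates a limitation of Python's re module: the pattern inside a lookbehind must have a fixed width.

```python
import re

text = "Cost: $45, weight: 45kg, fee: $7"

# Digits only when a '$' sits immediately before them
prices = re.findall(r"(?<=\$)\d+", text)
print(prices)  # ['45', '7']

# A quantifier such as '+' makes the lookbehind variable-width,
# which the re module rejects when compiling the pattern
try:
    re.compile(r"(?<=\$+)\d+")
    fixed_width_required = False
except re.error:
    fixed_width_required = True
print(fixed_width_required)  # True
```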

Let’s Practice: -

Here are some complex ways to use Regex:

E-mail validation: -

import re

def validate_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

print(validate_email("user@example.com")) # True
print(validate_email("invalid. Email")) # False

HTML Tag remover: -

import re

def remove_html_tags(text):
    # '.*?' is lazy, so each match stops at the first closing '>'
    pattern = r'<.*?>'
    return re.sub(pattern, '', text)

html_text = "<p>This is <b>bold</b> text.</p>"
print(remove_html_tags(html_text))
# Output: This is bold text.

Extracting data from given text: -

import re

def extract_dates(text):
    pattern = r'\d{2}/\d{2}/\d{4}'
    return re.findall(pattern, text)

unstructured_text = "Meeting on 12/25/2023 and 01/15/2024"
print(extract_dates(unstructured_text))
# Output: ['12/25/2023', '01/15/2024']
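
If you need the individual parts of each date, named groups are one option. This is only a sketch: the helper name extract_date_parts is hypothetical, and it assumes the dates are written as MM/DD/YYYY, which the sample text suggests but does not guarantee.

```python
import re

# Assumption: dates are in MM/DD/YYYY order
DATE_PATTERN = re.compile(r"\b(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})\b")

def extract_date_parts(text):
    # groupdict() maps each group name to the text it captured
    return [m.groupdict() for m in DATE_PATTERN.finditer(text)]

parts = extract_date_parts("Meeting on 12/25/2023 and 01/15/2024")
print(parts)
# [{'month': '12', 'day': '25', 'year': '2023'}, {'month': '01', 'day': '15', 'year': '2024'}]
```

Note that this only checks the shape of the date, so something like 99/99/9999 would still be extracted.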

Password strength checker: -

import re

def check_password_strength(password):
    # Each (?=...) lookahead checks one rule without consuming characters
    if re.match(r'^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[@#$%&_])[A-Za-z\d@#$%&_]{8,}$', password):
        return "Strong"
    else:
        return "Weak"

print(check_password_strength("P@ssw0rd")) # Strong
print(check_password_strength("password")) # Weak

This regex enforces the following constraints: -

  1. the password must be at least 8 characters long
  2. it must have an uppercase letter and a lowercase letter
  3. it must have at least 1 digit
  4. it must have at least 1 special character (@#$%&_)
  5. it may contain only letters, digits and the special characters listed above
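
A single big regex only tells you "Strong" or "Weak". If you want to know which rule failed, one possible approach is to test each rule separately. This is a rough sketch: the RULES dictionary and the missing_rules helper are hypothetical names, and unlike the regex above it does not restrict which characters are allowed.

```python
import re

# One small pattern per rule (rule descriptions are illustrative)
RULES = {
    "at least 8 characters": r".{8,}",
    "an uppercase letter": r"[A-Z]",
    "a lowercase letter": r"[a-z]",
    "a digit": r"\d",
    "a special character (@#$%&_)": r"[@#$%&_]",
}

def missing_rules(password):
    # Collect the description of every rule the password does not satisfy
    return [name for name, pattern in RULES.items()
            if not re.search(pattern, password)]

print(missing_rules("P@ssw0rd"))  # []
print(missing_rules("password"))
# ['an uppercase letter', 'a digit', 'a special character (@#$%&_)']
```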

IP address validation: -

import re

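def validate_ip(ip):
    # Each octet must be a number from 0 to 255
    pattern = r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
    parts = ip.split('.')
    # An IPv4 address has exactly four dot-separated octets
    return len(parts) == 4 and all(re.fullmatch(pattern, part) for part in parts)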

print(validate_ip("192.168.1.1")) # True
print(validate_ip("256.256.256.256")) # False

Tokenization: -

import re

text = "He didn't want to go, but he had to. She said, 'It's time.'"
tokens = re.findall(r"\b\w+(?:[-']\w+)*\b|[.,!?';]", text)
print(tokens)

Tokenization can become quite complex when dealing with various text structures, languages, and special cases. The regex above keeps words with internal hyphens or apostrophes (e.g., “didn’t”, “It’s”) as single tokens and returns punctuation marks as separate tokens.

Resources: -

Here are some great resources to practice Regex:

Pythex: -

Pythex is a Python-based regular expression editor where you can test your regex examples.

Rexegg: -

On this website you can find comprehensive guides on various regex sub-topics (e.g., back-references, capture grouping, PCRE callouts, etc.).

ExtendsClass Regex Tester: -

On this website you can test your regex patterns. It also supports different programming languages such as Python3, Ruby, Java, PHP, etc.

Regex101: -

This website allows you to test your regex patterns. It supports programming languages such as Python, Golang, Java, Rust, C#, etc.

Python Regex Cheatsheet: -

A regex reference document for Python.

Thank you for reading till the end. You did a great job decoding my regex rambling.
