Regular Expressions in ChatGPT

Published in Eventual Consistency · 4 min read · May 2, 2023

LLMs such as ChatGPT can be powerful translators between natural languages and domain-specific, formal languages. Regular expressions are a prime illustration of this capability.

By leveraging the sophisticated natural language understanding capabilities of these models, users can easily translate their textual requirements into precise, well-structured regular expressions for various applications.

This translation process not only saves time and effort for developers and non-technical users alike but also democratizes access to the powerful features offered by regular expressions. From searching and filtering data to validating input and manipulating text, regular expressions can be seamlessly integrated into numerous tasks across diverse domains.

Furthermore, the ability of LLMs to generate regular expressions based on natural language descriptions can help reduce errors and increase the overall efficiency of development processes. By minimizing the need for manual intervention and trial-and-error, these models can play a crucial role in accelerating the development of robust and effective software solutions.

Example 1: Match any IP address in a given subnet

^120\.11\.20\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$

Explanation of the regex pattern:

^: Asserts the beginning of the string.
120\.11\.20\.: Matches the first three octets of the IP address (120.11.20).
(?:: Begins a non-capturing group.
25[0-5]: Matches numbers from 250 to 255.
|: Alternation, acts as an "OR" operator.
2[0-4]\d: Matches numbers from 200 to 249.
|: Alternation, acts as an "OR" operator.
[01]?\d\d?: Matches numbers from 0 to 199, allowing for optional leading zeros.
): Closes the non-capturing group.
$: Asserts the end of the string.

This regular expression will match any IP address in the specified subnet, from 120.11.20.0 to 120.11.20.255.
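As a quick sanity check, the pattern can be exercised with Python's `re` module and compared against the standard-library `ipaddress` module. This is only a sketch: the helper names `in_subnet_regex` and `in_subnet_stdlib` are made up for illustration, and the `/24` network notation is an assumption derived from the address range stated above.

```python
import ipaddress
import re

# Pattern from Example 1, unchanged.
SUBNET_RE = re.compile(r"^120\.11\.20\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$")

# Assumption: the range 120.11.20.0 to 120.11.20.255 written as a /24 network.
SUBNET = ipaddress.ip_network("120.11.20.0/24")


def in_subnet_regex(address: str) -> bool:
    """Return True if the regex accepts the address."""
    return SUBNET_RE.match(address) is not None


def in_subnet_stdlib(address: str) -> bool:
    """Return True if the address parses and belongs to the network."""
    try:
        return ipaddress.ip_address(address) in SUBNET
    except ValueError:
        return False


for candidate in ["120.11.20.0", "120.11.20.255", "120.11.20.256", "120.11.21.7"]:
    print(candidate, in_subnet_regex(candidate), in_subnet_stdlib(candidate))
```

For anything beyond a fixed, hand-written range, the `ipaddress` route is usually easier to maintain, since the subnet is then data rather than a pattern.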

Example 2: Match an email address

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Explanation of the regex pattern:

^: Asserts the beginning of the string.
[a-zA-Z0-9._%+-]+: Matches one or more alphanumeric characters, periods, underscores, percent signs, plus signs, or hyphens in the local part of the email address.
@: Matches the "@" symbol.
[a-zA-Z0-9.-]+: Matches one or more alphanumeric characters, periods, or hyphens in the domain name.
\.: Matches a period.
[a-zA-Z]{2,}: Matches two or more alphabetical characters in the top-level domain (TLD).
$: Asserts the end of the string.

Caveat: This regular expression covers many basic email address formats. However, email addresses can be more complex than this pattern, and the pattern may not cover all valid email addresses. To validate email addresses according to the specifications of RFC 5322, you would need a more complex regex pattern or a dedicated email validation library.
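To make the caveat concrete, here is a small sketch that runs the pattern over a few hand-picked strings. The sample addresses and the helper name `looks_like_email` are illustrative assumptions. The last sample shows that the pattern can also be too permissive: it accepts consecutive dots in the local part, which the unquoted form in RFC 5322 does not allow.

```python
import re

# Pattern from Example 2, unchanged.
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def looks_like_email(value: str) -> bool:
    """Return True if the string has the basic shape the pattern describes."""
    return EMAIL_RE.match(value) is not None


samples = [
    "jane.doe+news@example.co.uk",  # accepted
    "no-at-sign.example.com",       # rejected: no "@"
    "user@localhost",               # rejected: no dot followed by a TLD
    "a..b@example.com",             # accepted, although the local part is malformed
]
for sample in samples:
    print(sample, looks_like_email(sample))
```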

Example 3: Match an Italian fiscal code

An Italian fiscal code, also known as “Codice Fiscale,” is a unique identifier assigned to individuals and companies for tax purposes. The code for individuals consists of 16 characters: 3 letters for the surname, 3 letters for the first name, 2 digits for the birth year, 1 letter for the birth month, 2 digits for the birth day (which also encodes gender), 4 characters (1 letter and 3 digits) for the place of birth, and 1 final check letter.

^[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]$

Caveat: This regular expression will match the basic structure of an Italian fiscal code but does not validate the specific rules for generating the letters and digits based on the individual’s information. To fully validate an Italian fiscal code, you would need a more complex regex pattern or a dedicated validation library.
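The gap between "right shape" and "valid code" can be sketched as follows. The sample strings are placeholders chosen for their structure, not verified fiscal codes, and `CF_MONTH_RE` is a hypothetical tightening that is not part of the original answer: it restricts the month position to the twelve letters the scheme uses (A, B, C, D, E, H, L, M, P, R, S, T).

```python
import re

# Pattern from Example 3, unchanged.
CF_RE = re.compile(r"^[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]$")

# Hypothetical stricter variant: only the twelve valid month letters.
CF_MONTH_RE = re.compile(r"^[A-Z]{6}\d{2}[ABCDEHLMPRST]\d{2}[A-Z]\d{3}[A-Z]$")

well_formed = "RSSMRA85T10A562S"   # placeholder with a valid month letter (T)
bad_month = "RSSMRA85Z10A562S"     # same shape, but Z is not a month letter

print(bool(CF_RE.match(well_formed)), bool(CF_MONTH_RE.match(well_formed)))
print(bool(CF_RE.match(bad_month)), bool(CF_MONTH_RE.match(bad_month)))
```

Even the stricter variant checks structure only; it does not verify the day range or the final check letter.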

Example 4: Match any HTML tag

<\s*\/?\s*[\w-]+(\s+[\w-]+\s*(=\s*("[^"]*"|'[^']*'|[^>\s]+))?\s*)*\/?\s*>

Caveat: This regular expression will match most HTML tags, including self-closing tags and those with attributes. However, regular expressions are generally not recommended for parsing HTML, because real-world HTML is nested and often irregular. Instead, consider using an HTML parsing library or a DOM parser to work with HTML documents.
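A short sketch of why the caveat matters: the regex has no notion of comments, so it reports a tag that sits inside one, while Python's built-in `html.parser` does not. The snippet and the names `StartTagCollector`, `regex_tags`, and `parser_start_tags` are assumptions made up for this illustration.

```python
import re
from html.parser import HTMLParser

# Pattern from Example 4, unchanged.
TAG_RE = re.compile(
    r"""<\s*\/?\s*[\w-]+(\s+[\w-]+\s*(=\s*("[^"]*"|'[^']*'|[^>\s]+))?\s*)*\/?\s*>"""
)


class StartTagCollector(HTMLParser):
    """Collects the names of start tags seen by the parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def regex_tags(html: str) -> list:
    """Return every substring the regex considers a tag."""
    return [m.group(0) for m in TAG_RE.finditer(html)]


def parser_start_tags(html: str) -> list:
    """Return the start tags reported by html.parser."""
    collector = StartTagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


snippet = '<!-- <b> --><p class="lead">Hi<br/></p>'
print(regex_tags(snippet))         # includes the commented-out <b>
print(parser_start_tags(snippet))  # the comment is skipped
```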

Example 5: Match basic SQL queries

(?i)(SELECT\s[\w\*\)\(\,\s]+\sFROM\s[\w]+)|(UPDATE\s[\w]+\sSET\s[\w\,\'\=]+)|(INSERT\sINTO\s[\d\w]+[\s\w\d\)\(\,]*\sVALUES\s\([\d\w\'\,\)]+)|(DELETE\sFROM\s[\d\w\'\=]+)

Caveat: This regular expression covers basic SQL queries, but it may not handle more complex or nested queries. It is also not a substitute for proper SQL parsing or validation libraries.
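Because the pattern is a four-way alternation of capturing groups, one way to use it is to classify a statement by which group matched. The sketch below assumes Python's `re` module; the `KINDS` mapping and the `statement_kind` helper are hypothetical names introduced for illustration.

```python
import re

# Pattern from Example 5, unchanged, split over several lines for readability.
SQL_RE = re.compile(
    r"(?i)(SELECT\s[\w\*\)\(\,\s]+\sFROM\s[\w]+)"
    r"|(UPDATE\s[\w]+\sSET\s[\w\,\'\=]+)"
    r"|(INSERT\sINTO\s[\d\w]+[\s\w\d\)\(\,]*\sVALUES\s\([\d\w\'\,\)]+)"
    r"|(DELETE\sFROM\s[\d\w\'\=]+)"
)

# Group number -> statement kind, following the order of the alternation.
KINDS = {1: "SELECT", 2: "UPDATE", 3: "INSERT", 4: "DELETE"}


def statement_kind(sql: str):
    """Return the kind of the first basic statement found, or None."""
    match = SQL_RE.search(sql)
    return KINDS[match.lastindex] if match else None


for statement in [
    "select id, name from users",
    "UPDATE users SET name='x'",
    "INSERT INTO users (id, name) VALUES (1,'x')",
    "DELETE FROM users",
    "DROP TABLE users",
]:
    print(statement_kind(statement), "<-", statement)
```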

Example 6: Match any JSON string

^\s*(\{(?:\s*"(?:\\.|[^"\\])*"\s*:\s*(?:"(?:\\.|[^"\\])*"|null|true|false|\d+(\.\d+)?(?:[eE][+-]?\d+)?)(?:\s*,\s*"(?:\\.|[^"\\])*"\s*:\s*(?:"(?:\\.|[^"\\])*"|null|true|false|\d+(\.\d+)?(?:[eE][+-]?\d+)?))*\s*)?\})\s*$

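Caveat: Matching any JSON string with a single regular expression is difficult and not recommended due to the complexity and nested structure of JSON. The pattern above only accepts a flat object whose values are strings, unsigned numbers, true, false, or null; nested objects and arrays are rejected. The best approach for validating or parsing JSON is to use a dedicated JSON library in your programming language.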
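The difference between the pattern and a real parser shows up quickly when both are run over the same inputs. In this sketch the helper names `regex_accepts` and `parser_accepts` and the sample documents are illustrative assumptions; `json.loads` is Python's standard JSON parser.

```python
import json
import re

# Pattern from Example 6, unchanged.
JSON_RE = re.compile(
    r'^\s*(\{(?:\s*"(?:\\.|[^"\\])*"\s*:\s*(?:"(?:\\.|[^"\\])*"|null|true|false'
    r'|\d+(\.\d+)?(?:[eE][+-]?\d+)?)(?:\s*,\s*"(?:\\.|[^"\\])*"\s*:\s*'
    r'(?:"(?:\\.|[^"\\])*"|null|true|false|\d+(\.\d+)?(?:[eE][+-]?\d+)?))*\s*)?\})\s*$'
)


def regex_accepts(text: str) -> bool:
    """Return True if the regex matches the whole text."""
    return JSON_RE.match(text) is not None


def parser_accepts(text: str) -> bool:
    """Return True if the standard JSON parser can decode the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


for document in ['{"a": 1, "b": "x"}', '{"a": {"b": 1}}', '[1, 2]', '{"a": -1}', '{"a": 1,}']:
    print(document, regex_accepts(document), parser_accepts(document))
```

Only the first document is accepted by both; the nested object, the array, and the negative number are valid JSON that the regex rejects.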

Bonus example: Generate random data for a given RegEx

Generate random data for this expression:

^([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}$

ChatGPT’s answer: This regular expression appears to match domain names. Here’s a Python script that generates random domain names that match this regular expression:

import random
import string

def random_domain_name():
    domain_length = random.randint(3, 7)
    tld_length = random.randint(2, 6)
    domain = ''.join(random.choices(string.ascii_letters + string.digits, k=domain_length))
    tld = ''.join(random.choices(string.ascii_lowercase, k=tld_length))
    return f"{domain}.{tld}"

# Generate 10 random domain names
for _ in range(10):
    print(random_domain_name())

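What is astounding here is the model’s capability to grasp at a semantic level what the given regular expression is about and to implement this knowledge in more advanced data processing tasks. Note that the generated script covers only part of what the pattern allows: a single label of 3 to 7 alphanumeric characters without hyphens, followed by a lowercase TLD.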
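Generated data is easy to check against the pattern it is supposed to satisfy. The following sketch is not ChatGPT's output: it is a hypothetical variant that also produces hyphens and multiple labels, with a seeded generator for reproducibility and an arbitrary label cap of 12 characters (the pattern allows up to 63). Every sample is then validated against the original expression.

```python
import random
import re
import string

# Pattern from the bonus example, unchanged.
DOMAIN_RE = re.compile(r"^([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}$")

ALNUM = string.ascii_letters + string.digits


def random_label(rng: random.Random, max_len: int = 12) -> str:
    """Build one label: alphanumeric at both ends, hyphens allowed inside."""
    length = rng.randint(1, max_len)
    if length == 1:
        return rng.choice(ALNUM)
    middle = "".join(rng.choices(ALNUM + "-", k=length - 2))
    return rng.choice(ALNUM) + middle + rng.choice(ALNUM)


def random_domain(rng: random.Random) -> str:
    """Build one to three labels followed by a TLD of 2 to 6 letters."""
    labels = [random_label(rng) for _ in range(rng.randint(1, 3))]
    tld = "".join(rng.choices(string.ascii_letters, k=rng.randint(2, 6)))
    return ".".join(labels + [tld])


rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [random_domain(rng) for _ in range(1000)]
mismatches = [s for s in samples if not DOMAIN_RE.match(s)]
print(f"{len(samples)} samples generated, {len(mismatches)} rejected by the pattern")
```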

Bonus example #2: Data redaction

Write a regular expression to match phone numbers
and use it to redact all but the 4 last digits

ChatGPT’s answer: Here’s a regular expression that matches various phone number formats:

(?:\+?\d{1,3}[-\s.]?)?\(?(?:\d{1,4}\)?[-\s.]?)?\d{1,4}[-\s.]?\d{1,4}([-.\s]\d{1,4})?

This regex pattern is quite general and may match different phone number formats, such as those with or without country codes, area codes, or extensions. It is important to note that this regex may not cover all phone number formats.

You can use this regular expression to redact phone numbers, leaving only the last four digits visible. Here’s an example in Python:

import re

def redact_phone_numbers(text):
    pattern = r'(?:\+?\d{1,3}[-\s.]?)?\(?(?:\d{1,4}\)?[-\s.]?)?\d{1,4}[-\s.]?\d{1,4}([-.\s]\d{1,4})?'
    return re.sub(pattern, lambda match: f"REDACTED-{match.group(0)[-4:]}", text)

text = "Call me at +1 (555) 123-4567 or at 555-6789 ext. 1234."
redacted_text = redact_phone_numbers(text)
print(redacted_text)
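One detail worth checking in the answer above: `match.group(0)[-4:]` keeps the last four characters of the match, which are not always the last four digits (a number ending in "45 67" would keep "5 67"). The pattern also matches any run of two or more digits, so the extension "1234" is rewritten too. The sketch below is a hypothetical variant, not ChatGPT's output: it strips non-digits before slicing, and it assumes that matches with fewer than 7 digits are not phone numbers and leaves them untouched.

```python
import re

# Pattern from the answer above, unchanged.
PHONE_RE = re.compile(
    r"(?:\+?\d{1,3}[-\s.]?)?\(?(?:\d{1,4}\)?[-\s.]?)?\d{1,4}[-\s.]?\d{1,4}([-.\s]\d{1,4})?"
)

MIN_DIGITS = 7  # assumption: shorter digit runs are left as they are


def redact_keep_last_four_digits(text: str) -> str:
    """Redact phone-like matches, keeping only their last four digits."""

    def replace(match):
        digits = re.sub(r"\D", "", match.group(0))  # drop spaces, dashes, brackets
        if len(digits) < MIN_DIGITS:
            return match.group(0)
        return f"REDACTED-{digits[-4:]}"

    return PHONE_RE.sub(replace, text)


print(redact_keep_last_four_digits("Call me at +1 (555) 123-4567 or at 555-6789 ext. 1234."))
print(redact_keep_last_four_digits("Reach us on 555 123 45 67"))
```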