Regular expressions (regex) in Stata
(Last updated: Feb 2024)
This guide covers one of the most under-documented features of Stata: regular expressions, or regex for short. In this guide we will learn how to implement the regex features shown in the Stata cheat sheet below. This includes learning about quantifiers, building bottom-up specific and generic expressions, using word boundaries, and choosing between greedy versus possessive matching.
Regex is a pattern-matching language that underlies most text searches. Implementations of regex are ubiquitous on the internet, where they are invoked for tasks such as autofill and password checks. For example, a regex can test whether your password has enough characters, or whether it is strong enough, without the password itself being stored. Regex is extremely powerful and can be incorporated into tasks such as text mining, natural language processing (NLP), sentiment analysis, machine learning (ML), automated journalism, text autocompletion, and web crawling. Companies like Google probably already use some version of regex to sift through emails and find keywords for targeted advertising.
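To make the password example concrete, here is a minimal sketch in Stata. The sample passwords and the two rules (at least 8 characters, at least one digit) are made up for illustration and are not taken from any real password policy. The sketch assumes Stata 14 or later, where the Unicode regex function `ustrregexm()` is available.

```stata
* Hypothetical sample data: three made-up passwords
clear
input str20 password
"abc123"
"Tr0ub4dor&3"
"longpassword"
end

* Rule 1 (illustrative): the whole string is at least 8 characters long
generate byte long_enough = ustrregexm(password, "^.{8,}$")

* Rule 2 (illustrative): the string contains at least one digit
generate byte has_digit = ustrregexm(password, "[0-9]")

* A password passes only if both rules hold
generate byte passes = long_enough & has_digit

list password long_enough has_digit passes
```

`ustrregexm()` returns 1 when the pattern matches and 0 otherwise, so each rule becomes a simple indicator variable. With these sample values, only the second password satisfies both rules.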