Regular Expressions in Python and PySpark, Explained

Britt
The Startup

Regular expressions, commonly referred to as regex, regexp, or re, are sequences of characters that define a search pattern.

Regular expressions have a reputation for being problematic and incomprehensible, but they save lines of code and time. They are useful when working with text data, and can be used in a terminal, a text editor, and most programming languages. pandas’ string methods such as .str.replace() and .str.findall() accept regex patterns, and Python ships with a built-in library for them, re.
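As a quick illustration of the re library mentioned above, here is a minimal sketch using only the standard library. The sample sentence and variable names are made up for this example; re.findall and re.sub are standard calls, and pandas offers similar behavior through Series.str.findall and Series.str.replace.

```python
import re

# Hypothetical sample text, invented for illustration
text = "Order 66 shipped on 2021-03-04; order 67 ships 2021-03-09."

# findall returns every non-overlapping match of the pattern as a list
iso_dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# sub replaces every match; here each run of digits becomes "#"
masked = re.sub(r"\d+", "#", text)

print(iso_dates)  # ['2021-03-04', '2021-03-09']
print(masked)     # Order # shipped on #-#-#; order # ships #-#-#.
```

Two short patterns replace what would otherwise be a loop of character-by-character checks, which is the main appeal.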

Below I’ve mocked up two examples that demonstrate the power of regular expressions, written in Python and PySpark and each followed by an explainer:

Extracting dates from text

Suppose you’ve parsed text from articles that include their publication dates, but the formatting is inconsistent: the case is mixed, the dates don’t always follow a phrase containing “published”, some months are abbreviated while others are spelled out, and commas are used only some of the time. The following is the regex you can use to get around these issues, an explainer, and Python and PySpark code snippets…
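To make the problem concrete, here is one possible pattern for dates like these. It is a rough sketch and not necessarily the regex this article uses. It assumes a month-day-year order and a four-digit year, and the names MONTHS, DATE_PATTERN, and extract_dates, along with the sample strings, are hypothetical.

```python
import re

# Full or abbreviated month names; the optional groups cover the long forms
MONTHS = (
    r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|"
    r"jul(?:y)?|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|"
    r"nov(?:ember)?|dec(?:ember)?)"
)

# month, optional period, day, optional comma, four-digit year
# IGNORECASE handles mixed case such as "PUBLISHED: Feb 14 2019"
DATE_PATTERN = re.compile(
    rf"\b{MONTHS}\.?\s+\d{{1,2}},?\s+\d{{4}}\b",
    flags=re.IGNORECASE,
)


def extract_dates(text):
    """Return every date-like substring found in text."""
    return DATE_PATTERN.findall(text)


# Invented samples covering the inconsistencies described above
samples = [
    "Published on January 5, 2020 by the editors.",
    "PUBLISHED: Feb 14 2019",
    "Last updated sept. 3, 2018 at noon",
    "No date in this line.",
]
results = [extract_dates(s) for s in samples]
for sample, found in zip(samples, results):
    print(found, "<-", sample)
```

Because the pattern does not anchor on the word “published”, it picks up dates wherever they appear. In PySpark, a similar pattern could likely be passed to pyspark.sql.functions.regexp_extract, with the case-insensitive flag written inline as (?i) since Spark uses Java regex syntax.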
