Pattern Matching

Using Regexes For Pattern Matching In Strings

Aman Jamshed
Apr 12 · 4 min read

Regular expressions, or regexes, are written in a condensed formatting language. In general, you can think of a regular expression as a pattern that you give to a regex processor along with some source data. The processor parses the source data using that pattern and returns chunks of text for further manipulation.

There are three main reasons you would want to do this:

  • To check whether a pattern exists within some source data
  • To get all the instances of a complex pattern from some source data
  • To clean your source data using patterns, generally through string splitting

Regexes are a foundational technique for data cleaning in data science, and a solid understanding of regexes will help you manipulate text data quickly and efficiently for further data science applications.

Let’s see how regex works

First we’ll import the ‘re’ module, which is where Python keeps its regular expression support. There are several main processing functions in ‘re’. match() checks for a match at the beginning of the string, while search() checks for a match anywhere in the string. Both return a match object when the pattern is found and None otherwise, so the result can be used directly in a boolean test.
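As a minimal sketch of the difference between the two functions, the snippet below runs both against the same string. The sample sentence and the variable names are made up for illustration.

```python
import re

# Illustrative sample text (an assumption, not taken from the article).
text = "This is a good day."

# search() scans the whole string and returns a Match object or None.
found_anywhere = re.search("good", text)

# match() only succeeds when the pattern matches at the very start.
found_at_start = re.match("good", text)

# A Match object is truthy and None is falsy, so both work in an if test.
if found_anywhere:
    print("'good' appears somewhere in the text")
if not found_at_start:
    print("the text does not start with 'good'")
```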

The split() function uses a pattern to split the given string and returns a list of substrings, while findall() looks for a pattern and pulls out all the occurrences as a list.
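Here is a short sketch of both functions on one string. The sentence about a student named Amy is a made-up example.

```python
import re

# Illustrative sample text (an assumption, not taken from the article).
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# split() cuts the string at every occurrence of the pattern.
# Because the text starts with "Amy", the first piece is an empty string.
parts = re.split("Amy", text)
print(parts)

# findall() returns every non-overlapping occurrence of the pattern.
names = re.findall("Amy", text)
print(len(names))
```

Counting the result of findall() is a quick way to see how often a pattern occurs.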

Now let's see some more complex examples. Regex syntax defines a small markup language for describing patterns in text. The caret character ‘^’ means start and the dollar sign ‘$’ means end. If we put ‘^’ before a string, it means that the text the regex processor retrieves must start with the string we specify. Similarly, when we put ‘$’ after a string, it means that the text the regex processor retrieves must end with the string we specify.
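The two anchors can be sketched as follows, again using a made-up sentence. Note that the final full stop is escaped as "\." because an unescaped dot matches any character.

```python
import re

# Illustrative sample text (an assumption, not taken from the article).
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# '^' anchors the pattern to the start of the string.
starts_with_amy = re.search("^Amy", text)

# '$' anchors the pattern to the end of the string.
ends_with_successful = re.search(r"successful\.$", text)

# "Our" occurs in the text, but not at the start, so this is None.
starts_with_our = re.search("^Our", text)

print(bool(starts_with_amy), bool(ends_with_successful), bool(starts_with_our))
```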

re.search() returns a re.Match object. It evaluates to True in a boolean context, and its printed representation also tells us what text was matched and where the match is located, as the span.
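To make this concrete, here is a small sketch that inspects a match object. The sample sentence is made up; span() and group() are standard methods of re.Match.

```python
import re

# Illustrative sample text (an assumption, not taken from the article).
text = "Amy works diligently."

m = re.search("works", text)

# Printing the object shows the span and the matched text.
print(m)

# The same details are available programmatically.
print(bool(m))    # truthiness of the match
print(m.span())   # (start, end) indexes, end is exclusive
print(m.group())  # the text that was matched
```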

Next, let's look at character classes. Take a string of a single learner's grades over a semester, i.e. grades = “ACAAAABCBCBAA”. If we want to count the number of A’s and B’s in the string, we can use the set operator “[]”, which matches any one of the characters listed inside it.
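A minimal sketch of the set operator on the grades string from above. The variable names are chosen for illustration.

```python
import re

grades = "ACAAAABCBCBAA"

# A plain character matches only itself.
only_b = re.findall("B", grades)

# "[AB]" matches any single character that is either A or B.
a_or_b = re.findall("[AB]", grades)

print(len(only_b))  # number of B grades
print(len(a_or_b))  # number of A and B grades combined
```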

Suppose we want all instances where this student receives an A followed by a B or a C. We can write this using the set operator “[]” or the pipe operator “|”, which means OR.
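Both spellings can be sketched side by side. With this grades string they should return the same pairs.

```python
import re

grades = "ACAAAABCBCBAA"

# An A followed by one character from the range B to C.
with_set = re.findall("[A][B-C]", grades)

# The same idea written as two alternatives joined with '|'.
with_pipe = re.findall("AB|AC", grades)

print(with_set)
print(with_pipe)
```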

Now let's move on to quantifiers. A quantifier specifies the number of times a pattern must repeat in order to match. The most basic quantifier is written e{m,n}, where ‘e’ is the expression or character we want to match, ‘m’ is the minimum number of times it must be matched and ‘n’ is the maximum number of times it may be matched. There must be no space inside the braces: in Python, ‘e{m, n}’ is treated as literal text rather than as a quantifier.
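The brace syntax can be sketched on the grades string as follows. The last line illustrates how Python's ‘re’ module treats a space inside the braces.

```python
import re

grades = "ACAAAABCBCBAA"

# Between one and two A's in a row; the engine takes as many as allowed.
one_or_two = re.findall("A{1,2}", grades)

# A single number means "exactly this many times".
exactly_two = re.findall("A{2}", grades)

# With a space after the comma this is no longer a quantifier, so the
# pattern looks for the literal text "A{1, 2}" and finds nothing.
with_space = re.findall("A{1, 2}", grades)

print(one_or_two)
print(exactly_two)
print(with_space)
```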

Let’s use the grades above as an example. We can ask how many times this student has been on a streak of back-to-back A’s, or look for a decreasing trend in the student’s grades.
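One possible way to answer both questions is sketched below. The upper bound of 10 is an arbitrary assumption, chosen only because it is longer than any run in this string, and "decreasing trend" is read here as one or more A's, then B's, then C's.

```python
import re

grades = "ACAAAABCBCBAA"

# A streak is two or more A's in a row (upper bound 10 is arbitrary).
streaks = re.findall("A{2,10}", grades)
print(len(streaks), streaks)

# A decreasing trend: a run of A's, then a run of B's, then a run of C's.
declines = re.findall("A{1,10}B{1,10}C{1,10}", grades)
print(declines)
```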

There are other quantifiers that serve as shorthands: an asterisk ‘*’ matches zero or more times, a plus sign ‘+’ matches one or more times, and a question mark ‘?’ matches zero or one time.
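The shorthand quantifiers can be tried on the same grades string. In Python's ‘re’ module, ‘*’ means zero or more, ‘+’ means one or more, and ‘?’ means zero or one; the patterns below are illustrative choices.

```python
import re

grades = "ACAAAABCBCBAA"

# '+' : every run of one or more A's.
a_runs = re.findall("A+", grades)

# '*' : an A, then any number of B's (possibly none), then a C.
a_to_c = re.findall("AB*C", grades)

# '?' : an A optionally followed by a single B.
a_maybe_b = re.findall("AB?", grades)

print(a_runs)
print(a_to_c)
print(a_maybe_b)
```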

This is just an overview of regular expressions, and really we’ve just scratched the surface of what we can do with regexes. They’re incredibly powerful. If you want to learn more about them, you can refer to the Python documentation for the ‘re’ module.

Thank you for reading!

Don’t forget to follow The Lean Programmer Publication for more such articles, and subscribe to our newsletter tinyletter.com/TheLeanProgrammer
