Become a Python Expert: Learn PATTERN MATCHING WITH REGULAR EXPRESSIONS using Python (part-1)
Are you ready to embark on an exhilarating coding adventure? Python is the key that unlocks a world of endless possibilities, where your imagination knows no bounds. Whether you’re a coding wizard or just starting your journey, Python welcomes you with open arms and empowers you to bring your wildest ideas to life. With Python, coding becomes a thrilling experience like no other. Its intuitive syntax and readability make it a breeze to learn and master. Say goodbye to tedious syntax errors and hello to smooth, error-free code that flows effortlessly from your fingertips. Python’s simplicity is a true game-changer, allowing you to focus on unleashing your creativity and turning your vision into reality. But Python isn’t just about simplicity; it’s also a powerful language that flexes its muscles across a multitude of domains. From web development to data science, machine learning to artificial intelligence, Python’s versatility shines through. Today you are going to learn pattern matching with me.
You may be familiar with searching for text by pressing CTRL-F and entering the words you’re looking for. Regular expressions go one step further: they allow you to specify a pattern of text to search for. You may not know a business’s exact phone number, but if you live in the United States or Canada, you know it will be three digits, followed by a hyphen, and then four more digits (and optionally, a three-digit area code at the start). This is how you, as a human, know a phone number when you see it: 415-555-1234 is a phone number, but 4,155,551,234 is not.
We also recognize all sorts of other text patterns every day: email addresses have @ symbols in the middle, US social security numbers have nine digits and two hyphens, website URLs often have periods and forward slashes, news headlines use title case, social media hashtags begin with # and contain no spaces, and more.
Regular expressions are helpful, but few non-programmers know about them even though most modern text editors and word processors, such as Microsoft Word or OpenOffice, have find and find-and-replace features that can search based on regular expressions. Regular expressions are huge time-savers, not just for software users but also for programmers. In fact, tech writer Cory Doctorow argues that we should be teaching regular expressions even before programming:
Knowing [regular expressions] can mean the difference between solving a problem in 3 steps rather than solving it in 3,000 steps.
Finding Patterns of Text with Regular Expressions:
Let’s say we want to find a phone number in a text. You know the pattern if you’re American (which I am not 😁): three numbers, a hyphen, three numbers, a hyphen, and four numbers. Here’s an example: 415-555-4242.
If we do not use regular expressions and try to solve this with ordinary Python string operations, our code would look like this.
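A minimal sketch of such a function, reconstructed from the checks described in this section. The circled numbers in the comments mark the steps discussed in the text; the use of isdecimal() and the exact loop bounds are assumptions and may differ from the original listing.

```python
def isPhoneNumber(text):
    if len(text) != 12:                 # ➊ must be exactly 12 characters
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():     # ➋ area code: three digits
            return False
    if text[3] != '-':                  # ➌ first hyphen
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():     # ➍ three more digits
            return False
    if text[7] != '-':                  # ➎ second hyphen
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():     # ➏ last four digits
            return False
    return True                         # ➐ every check passed


print('Is 415-555-4242 a phone number?')
print(isPhoneNumber('415-555-4242'))   # True
print('Is Moshi moshi a phone number?')
print(isPhoneNumber('Moshi moshi'))    # False
```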
Running the code on the two test strings discussed below prints True for '415-555-4242' and False for 'Moshi moshi'.
The isPhoneNumber() function has code that does several checks to see whether the string in the text is a valid phone number. If any of these checks fail, the function returns False. First, the code checks that the string is exactly 12 characters ➊. Then it checks that the area code (that is, the first three characters in the text) consists of only numeric characters ➋. The rest of the function checks that the string follows the pattern of a phone number: the number must have the first hyphen after the area code ➌, three more numeric characters ➍, then another hyphen ➎, and finally four more numbers ➏. If the program execution manages to get past all the checks, it returns True ➐.
Calling isPhoneNumber() with the argument '415-555-4242' will return True. Calling isPhoneNumber() with 'Moshi moshi' will return False; the first test fails because 'Moshi moshi' is not 12 characters long.
The previous phone number–finding program works, but it uses a lot of code to do something limited: the isPhoneNumber() function is 17 lines but can find only one pattern of phone numbers. What about a phone number formatted like 415.555.4242 or (415) 555-4242? What if the phone number had an extension, like 415-555-4242 x99? The isPhoneNumber() function would fail to validate them. You could add yet more code for these additional patterns, but there is an easier way.
Regular expressions, called regexes for short, are descriptions of a pattern of text. For example, a \d in a regex stands for a digit character — that is, any single numeral from 0 to 9. The regex \d\d\d-\d\d\d-\d\d\d\d is used by Python to match the same text pattern the previous isPhoneNumber() function did: a string of three numbers, a hyphen, three more numbers, another hyphen, and four numbers. Any other string would not match the \d\d\d-\d\d\d-\d\d\d\d regex.
But regular expressions can be much more sophisticated. For example, adding a 3 in braces ({3}) after a pattern is like saying, “Match this pattern three times.” So the slightly shorter regex \d{3}-\d{3}-\d{4} also matches the correct phone number format.
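To see that the two spellings describe the same pattern, here is a small sketch that tries both on a few strings. The variable names and sample strings are made up for illustration, and fullmatch() is used (my choice, not something covered above) so that the whole string has to fit the pattern.

```python
import re

long_form = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_form = re.compile(r'\d{3}-\d{3}-\d{4}')

results = []
for candidate in ['415-555-4242', '415.555.4242', 'Moshi moshi']:
    # fullmatch() returns a Match object only if the entire string fits.
    long_hit = long_form.fullmatch(candidate) is not None
    short_hit = short_form.fullmatch(candidate) is not None
    results.append((candidate, long_hit, short_hit))
    print(candidate, long_hit, short_hit)
```

Both patterns should agree on every string: only the hyphenated number is accepted.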
Creating Regex Objects:
All the regex functions in Python are in the “re” module. The complete code is given below:
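A sketch of what that code might look like, using the MN variable name mentioned in this section. The phoneNumRegex name and the sample sentence are assumptions chosen for illustration.

```python
import re

# Compile the pattern. The r prefix makes this a raw string, so the
# backslashes reach the regex engine unchanged.
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# search() returns a Match object here because the text contains a number.
MN = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + MN.group())
```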
Passing a string value representing your regular expression to re.compile() returns a Regex pattern object (or simply, a Regex object). A Regex object’s search() method searches the string it is passed for any matches to the regex. The search() method will return None if the regex pattern is not found in the string. If the pattern is found, the search() method returns a Match object, which has a group() method that will return the actual matched text from the searched string.
The MN variable name is just a generic name to use for Match objects. This example might seem complicated at first, but it is much shorter than the earlier isPhoneNumber.py program and does the same thing.
Review of Regular Expression Matching:
While there are several steps to using regular expressions in Python, each step is fairly simple.
- Import the regex module with import re.
- Create a Regex object with the re.compile() function. (Remember to use a raw string.)
- Pass the string you want to search into the Regex object’s search() method. This returns a Match object, or None if the pattern is not found.
- Call the Match object’s group() method to return a string of the actual matched text.
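The steps above can be sketched as one small helper. The function name find_phone_number and the sample sentences are hypothetical; the None check covers the case where search() finds nothing, since calling group() on None would raise an error.

```python
import re  # import the regex module


def find_phone_number(text):
    """Return the first phone number found in text, or None."""
    # Create a Regex object from a raw string.
    phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')
    # Search the text; the result is a Match object or None.
    match = phone_regex.search(text)
    if match is None:
        return None
    # Return the actual matched text.
    return match.group()


print(find_phone_number('Call 415-555-4242 tomorrow.'))  # 415-555-4242
print(find_phone_number('No number here.'))              # None
```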
If you find it interesting and amazing then let's learn more.
Grouping with Parentheses:
Say you want to separate the area code from the rest of the phone number. Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). Then you can use the group() match object method to grab the matching text from just one group.
The first set of parentheses in a regex string will be group 1. The second set will be group 2. By passing the integer 1 or 2 to the group() match object method, you can grab different parts of the matched text. Passing 0 or nothing to the group() method will return the entire matched text. Enter the following into the interactive shell:
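A short sketch of the group() calls described here, written as a script with print() so it also runs outside the interactive shell. The phoneNumRegex name and the sample sentence are assumptions.

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
MN = phoneNumRegex.search('My number is 415-555-4242.')

print(MN.group(1))  # 415 (first set of parentheses)
print(MN.group(2))  # 555-4242 (second set of parentheses)
print(MN.group(0))  # 415-555-4242 (the entire match)
print(MN.group())   # 415-555-4242 (same as group(0))
```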
If you would like to retrieve all the groups at once, use the groups() method — note the plural form for the name.
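For example, a sketch reusing the same two-group pattern (the phoneNumRegex name and sample sentence are again assumptions):

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
MN = phoneNumRegex.search('My number is 415-555-4242.')

print(MN.groups())  # ('415', '555-4242')

# Unpack the tuple into one variable per group.
areaCode, mainNumber = MN.groups()
print(areaCode)     # 415
print(mainNumber)   # 555-4242
```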
Since MN.groups() returns a tuple of multiple values, you can use the multiple-assignment trick to assign each value to a separate variable, as in the previous areaCode, mainNumber = MN.groups() line.
Parentheses have a special meaning in regular expressions, but what do you do if you need to match a parenthesis in your text? For instance, maybe the phone numbers you are trying to match have the area code set in parentheses. In this case, you need to escape the ( and ) characters with a backslash. Enter the following into the interactive shell:
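A sketch of such a pattern, where the escaped \( and \) match literal parentheses and the unescaped ones still create groups. The variable name and sample sentence are assumptions.

```python
import re

# Group 1 is the area code including its literal parentheses;
# group 2 is the rest of the number.
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
MN = phoneNumRegex.search('My phone number is (415) 555-4242.')

print(MN.group(1))  # (415)
print(MN.group(2))  # 555-4242
```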
The \( and \) escape characters in the raw string passed to re.compile() will match actual parenthesis characters. In regular expressions, the following characters have special meanings: . ^ $ * + ? { } [ ] \ | ( )
If you want to detect these characters as part of your text pattern, you need to escape them with a backslash: \. \^ \$ \* \+ \? \{ \} \[ \] \\ \| \( \)
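A small sketch of the difference escaping makes, using the period as the example and then checking every special character in turn. The SPECIAL_CHARS and escaped_ok names are made up for this illustration.

```python
import re

SPECIAL_CHARS = '.^$*+?{}[]\\|()'

# Unescaped, a period matches any character (except a newline).
print(re.search(r'.', 'abc').group())   # a
# Escaped, it matches only a literal period.
print(re.search(r'\.', 'abc'))          # None
print(re.search(r'\.', 'a.c').group())  # .

# Prefixing each special character with a backslash makes it literal.
escaped_ok = all(
    re.search('\\' + ch, 'x' + ch + 'y').group() == ch
    for ch in SPECIAL_CHARS
)
print(escaped_ok)  # True
```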
Make sure to double-check that you haven’t mistaken escaped parentheses \( and \) for parentheses ( and ) in a regular expression. If you receive an error message about “missing )” or “unbalanced parenthesis,” you may have forgotten to include the closing unescaped parenthesis for a group, like in this example:
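A sketch that reproduces the problem. The pattern has one unescaped opening parenthesis but no unescaped closing one, so re.compile() raises re.error; catching it here is only so the script keeps running and can print the details.

```python
import re

error_message = None
error_position = None

try:
    # The first ( opens a group; \( and \) are literal characters,
    # so the group is never closed.
    re.compile(r'(\(Parentheses\)')
except re.error as err:
    error_message = err.msg    # description of the problem
    error_position = err.pos   # index in the pattern where it was detected
    print(error_message, 'at position', error_position)
```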
The error message tells you that there is an opening parenthesis at index 0 of the r'(\(Parentheses\)' string that is missing its corresponding closing parenthesis.
I think that is enough for today's learning. I hope you enjoyed the blog. See you soon in my next blog for further learning about pattern matching in Python. Most of the content is collected from the following book. I recommend this book for a detailed explanation 😊.