Python RegEx (Regular Expression) — A Comprehensive Guide With Examples
Regular Expressions can be used to search, edit and manipulate text, which opens up a wide range of applications across the sub-domains of Python. They are widely used in industry, which makes Regular Expressions a valuable asset for the modern-day programmer.
In this Python RegEx article, we will be checking out the following concepts:
- Why do we use Regular Expressions?
- What are Regular Expressions?
- Basic Regular Expressions operations
- Email verification using Regular Expressions
- Phone number verification using Regular Expressions
- Web Scraping using Regular Expressions
Let’s begin this Python RegEx article by checking out why we need to make use of Regular Expressions.
Why Use Regular Expression?
To answer this question, we will look at several common problems that can be solved by using Regular Expressions.
Consider the following scenario:
You have a log file which contains a large amount of data, and from this log file you wish to fetch only the date and time. The readability of such a log file is low at first glance.
Regular Expressions can be used in this case to recognize the patterns and extract the required information easily.
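The log scenario can be sketched as follows. This is only an illustration: the log lines and their layout (a YYYY-MM-DD date followed by an HH:MM:SS time) are assumptions made up for the example, not the format of any particular log file.

```python
import re

# Hypothetical log lines; the layout is an assumption for illustration.
log_text = """2019-03-08 10:15:32 INFO Server started on port 8080
2019-03-08 10:17:05 ERROR Connection refused by host
2019-03-08 10:21:47 WARN Disk usage above 90 percent
"""

# Group 1 is the date (4-2-2 digits), group 2 is the time (2:2:2 digits).
timestamp_pattern = re.compile(r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})")

# With two groups in the pattern, findall returns one (date, time) tuple per match.
timestamps = timestamp_pattern.findall(log_text)

for date, clock in timestamps:
    print(date, clock)
```

Everything else on each line (the level and the message) is ignored, which is the point of matching only the pattern you need.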
Consider the next scenario: you are a salesperson with a long list of email addresses, and many of those addresses are fake or invalid.
With Regular Expressions, you can verify the format of the email addresses and filter out the fake IDs from the genuine ones.
The next scenario is pretty similar to the salesperson example, this time with a list of phone numbers.
How do we verify the phone number and then classify it based on the country of origin?
Every correct number will have a particular pattern which can be traced and followed through by using Regular Expressions.
Next up is another simple scenario:
We have a Student Database containing details such as name, age, and address. Consider the case where the pin code was originally 59006 but has now been changed to 59076. Manually updating this for each student would be a lengthy and time-consuming process.
To solve this using Regular Expressions, we first find every string in the student data that contains the old pin code and then replace each occurrence with the new one.
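A minimal sketch of the pin code update, assuming made-up student records (the names, addresses and field layout are hypothetical):

```python
import re

# Made-up student records; the field layout is an assumption.
students = """Name: Asha, Age: 21, Address: 12 Lake Road, 59006
Name: Brian, Age: 22, Address: 590 Hill Street, 59006
Name: Chen, Age: 20, Address: 7 Park Lane, 59010
"""

# \b restricts the match to the whole number, so "590" in the
# street address and the different pin code 59010 are left alone.
updated = re.sub(r"\b59006\b", "59076", students)
print(updated)
```

The word boundaries are a design choice: a plain search for 59006 would also rewrite longer numbers that merely contain those digits.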
Regular expressions are supported in many languages, such as:
- Java
- Python
- Ruby
- Swift
- Scala
- Groovy
- C#
- PHP
- JavaScript
There are many other scenarios in which Regular Expressions help us. I will be walking you through some of them in the upcoming sections of this article.
So, next up in this article, let us look at what Regular Expressions actually are.
What Are Regular Expressions?
A Regular Expression is used for identifying a search pattern in a text string. It also helps in checking the correctness of data, and operations such as finding, replacing and formatting data are possible using Regular Expressions.
Consider the following example:
Among all of the data in the given string, let us say we require only the city. This can be converted into a dictionary with just the name and the city in a formatted way. The question now is: can we identify a pattern to pick out the name and the city? We can find out the age too. With age, it is easy, right? It is just an integer.
How do we go about the name? If you take a look at the pattern, all of the names start with an uppercase letter. With the help of Regular Expressions, we can identify both the name and the age using this method.
Consider the following code:
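import re

Nameage = '''
Janice is 22 and Theon is 33
Gabriel is 44 and Joey is 21
'''
# One to three consecutive digits: the ages
ages = re.findall(r'\d{1,3}', Nameage)
# An uppercase letter followed by lowercase letters: the names
names = re.findall(r'[A-Z][a-z]*', Nameage)
ageDict = {}
x = 0
# Pair each name with the age found at the same position
for eachname in names:
    ageDict[eachname] = ages[x]
    x += 1
print(ageDict)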
There is no need to worry about the syntax at this point, but since Python is very readable, you can probably guess what is happening in the Regular Expression part of the code.
Output:
{'Janice': '22', 'Theon': '33', 'Gabriel': '44', 'Joey': '21'}
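As a variation on the same idea, the names and ages can be captured together with a single pattern instead of two separate lists. This is a sketch that assumes every entry has the form "Name is Age", as in the sample string; the variable names are illustrative.

```python
import re

name_age = '''
Janice is 22 and Theon is 33
Gabriel is 44 and Joey is 21
'''

# Two groups in the pattern, so each match is a (name, age) tuple.
pairs = re.findall(r"([A-Z][a-z]*) is (\d{1,3})", name_age)

# Convert the ages to integers while building the dictionary.
age_by_name = {name: int(age) for name, age in pairs}
print(age_by_name)
# {'Janice': 22, 'Theon': 33, 'Gabriel': 44, 'Joey': 21}
```

Capturing both values in one match avoids relying on the two lists staying in the same order.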
Next, in this article, let us check out all the operations we can perform using Regular Expressions.
Operations You Can Perform With Regular Expressions — RegEx Examples:
There are many operations you can perform by making use of Regular Expressions. Here, I have listed a few which are very vital in helping you understand the usage of Regular Expressions better.
Let us begin this section by first checking out how we can find a particular word in a string.
Finding a word in the string:
Consider the following piece of code:
import re
if re.search("inform", "we need to inform him with the latest information"):
    print("There is inform")
All we are doing here is searching to see whether the word inform exists in our search string. If it does, we get an output saying There is inform.
We can take this a little further by using findall, which returns every match instead of only reporting whether one exists.
import re
allinform = re.findall("inform", "We need to inform him with the latest information!")
for i in allinform:
    print(i)
Here, in this particular case, inform will be found twice: once in the word inform and once inside the word information.
As shown above, it is as simple as this to find a word using a Regular Expression.
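If only the standalone word should count, a word boundary can be added to the pattern. A short sketch using the same sentence (the variable names are illustrative):

```python
import re

text = "We need to inform him with the latest information!"

# Without boundaries, the pattern also matches inside "information".
anywhere = re.findall(r"inform", text)

# \b marks a word boundary, so only the standalone word matches.
whole_word = re.findall(r"\binform\b", text)

print(len(anywhere))    # 2
print(len(whole_word))  # 1
```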
Next up on this Python RegEx blog, we will check out how we can generate an iterator using Regular Expressions.
Generating an iterator:
The finditer function returns an iterator that yields one match object per match, which lets us report the starting and ending index of each match in the string. Consider the following example:
import re
Str = "we need to inform him with the latest information"
for i in re.finditer("inform.", Str):
    locTuple = i.span()
    print(locTuple)
For every match found, the starting and the ending index are printed. Note that the pattern inform. matches inform followed by any one character. Can you take a guess at the output that we get when we execute the above program? Check it out below.
Output:
(11, 18)
(38, 45)
Pretty simple, right?
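Each item produced by finditer is a match object, so the matched text is available alongside the indices. A small sketch on the same string:

```python
import re

text = "we need to inform him with the latest information"

# Collect the matched text together with its start and end index.
results = [(m.group(), m.start(), m.end()) for m in re.finditer(r"inform.", text)]

for matched_text, start, end in results:
    print(repr(matched_text), start, end)
# 'inform ' 11 18
# 'informa' 38 45
```

The dot consumed the space after inform in the first match and the letter a in the second, which is why each span is seven characters long.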
Next up on this Python RegEx blog, we will be checking out how we can match words with patterns using Regular Expressions.
Matching words with patterns:
Consider an input string in which you have to find the words that follow a certain pattern. To elaborate, check out the following example code:
import re
Str = "Sat, hat, mat, pat"
allStr = re.findall("[shmp]at", Str)
for i in allStr:
    print(i)
What is common in the string? You can see that the letters ‘a’ and ‘t’ are common among all of the words. [shmp] in the code denotes the allowed starting letters of the words to be found. So any substring that starts with one of the letters s, h, m or p and is immediately followed by ‘at’ will be matched. Sat is not matched because it starts with an uppercase S.
Output:
hat
mat
pat
Do note that the matching is case sensitive, which is why Sat does not appear in the output. Once you get to know the basics of Regular Expressions, you can start working with them in full swing, and it is pretty easy, right?
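When case should not matter, the re.IGNORECASE flag can be passed to findall. A short sketch on the same input:

```python
import re

text = "Sat, hat, mat, pat"

# Default behaviour: lowercase s in the set does not match uppercase S.
case_sensitive = re.findall(r"[shmp]at", text)

# With re.IGNORECASE, letters match regardless of case.
case_insensitive = re.findall(r"[shmp]at", text, flags=re.IGNORECASE)

print(case_sensitive)    # ['hat', 'mat', 'pat']
print(case_insensitive)  # ['Sat', 'hat', 'mat', 'pat']
```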
Next up on this Python RegEx blog, we will be checking out how we can match a range of characters at once using Regular Expressions.
Matching a range of characters:
We wish to output all the words whose first letter lies between “h” and “m” and is immediately followed by at. Looking at the following example, the output we should get is hat and mat, correct?
import re
Str = "sat, hat, mat, pat"
someStr = re.findall("[h-m]at", Str)
for i in someStr:
    print(i)
Output:
hat
mat
Let us now change the above program very slightly to obtain a very different result. Check out the below code and try to catch the difference between the above one and the below one:
import re
Str = "sat, hat, mat, pat"
someStr = re.findall("[^h-m]at", Str)
for i in someStr:
    print(i)
Found the subtle difference? We have added a caret symbol (^) at the start of the character set. When the caret is the first character inside the square brackets, it negates the set. Instead of matching every character from “h” to “m”, the set now matches any character apart from those.
The output we can expect is the words which do NOT start with a letter between “h” and “m” but are still followed by at.
Output:
sat
pat
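One thing worth keeping in mind is that a negated set matches any character outside the range, not only letters. The sketch below extends the input with two made-up entries (9at and _at) to show this:

```python
import re

# "9at" and "_at" are made-up entries added to show non-letter matches.
text = "sat, hat, mat, pat, 9at, _at"

# Any single character except h through m, followed by "at".
any_char = re.findall(r"[^h-m]at", text)

# Listing the allowed letters explicitly keeps digits and symbols out.
letters_only = re.findall(r"[a-gn-z]at", text)

print(any_char)      # ['sat', 'pat', '9at', '_at']
print(letters_only)  # ['sat', 'pat']
```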
Next, in this article, I will explain how we can replace a string using Regular Expressions.
Replacing a string:
Next up, we can check out another operation using Regular Expressions where we replace an item of the string with something else. It is very simple and can be illustrated with the following piece of code:
import re
Food = "hat rat mat pat"
regex = re.compile("[r]at")
Food = regex.sub("food", Food)
print(Food)
In the above example, the word rat is replaced with the word food. The sub (substitute) method of the compiled Regular Expression is used in this case, and it has a wide variety of practical use cases as well. The final output will look like this.
Output:
hat food mat pat
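Two related options are sketched below: the count argument of re.sub, which limits how many replacements are made, and re.subn, which also reports how many were made. The wider pattern [hrmp]at is chosen here only so that there are several matches to work with.

```python
import re

food = "hat rat mat pat"

# [hrmp]at matches all four words; count=2 replaces only the first two.
first_two = re.sub(r"[hrmp]at", "food", food, count=2)

# subn returns the new string and the number of replacements made.
all_replaced, replacements = re.subn(r"[hrmp]at", "food", food)

print(first_two)                   # food food mat pat
print(all_replaced, replacements)  # food food food food 4
```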
Next up on this Python RegEx blog, we will check out a common source of confusion known as the backslash problem.
The Backslash Problem:
Consider an example code shown below:
import re
randstr = "Here is \\Edureka"
print(randstr)
Output:
Here is \Edureka
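This is the backslash problem. In a regular Python string literal the backslash is an escape character, so \\ stands for one literal backslash, which is why only one of them shows up in the output. Regular Expressions also use the backslash as an escape character, so a pattern needs \\ to match one literal backslash. Writing the pattern as a raw string (with the r prefix) saves us from having to double the backslashes yet again.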
import re
randstr = "Here is \\Edureka"
print(re.search(r"\\Edureka", randstr))
The output can be as follows:
<re.Match object; span=(8, 16), match='\\Edureka'>
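As you can see, the backslash has been matched. It is displayed as \\ in the match object only because the representation of a string escapes the backslash; the matched text contains a single one. And this is how simple it is to deal with the backslash problem by using raw strings for Regular Expressions.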
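To make the two levels of escaping concrete, the sketch below compares a regular string pattern, a raw string pattern and the pattern produced by re.escape. All three describe the same Regular Expression.

```python
import re

text = "Here is \\Edureka"   # the string holds ONE backslash

# Without a raw string, each backslash in the pattern must be doubled again.
regular_pattern = "\\\\Edureka"
raw_pattern = r"\\Edureka"

# re.escape builds the escaped pattern from the literal text.
escaped_pattern = re.escape("\\Edureka")

same_pattern = regular_pattern == raw_pattern == escaped_pattern
match = re.search(raw_pattern, text)

print(len("\\"))      # 1
print(same_pattern)   # True
print(match.span())   # (8, 16)
```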
Next, in this article, I will walk you through how we can match a single character using Regular Expressions.
Matching a single character:
A pattern such as \d matches a single character, and a quantifier states how many times it must repeat. Check out the following code snippet:
import re
randstr = "12345"
print("Matches:", len(re.findall(r"\d{5}", randstr)))
The pattern \d{5} matches a run of exactly five digits, so the whole input string counts as a single match.
Output:
Matches: 1
Next, in this article, I will walk you through how we can remove newline spaces using Regular Expressions.
Removing Newline Spaces:
We can remove the newline spaces using Regular Expressions easily in Python. Consider another snippet of code as shown here:
import re
randstr = '''
You Never
Walk Alone
Liverpool FC
'''
print(randstr)
regex = re.compile("\n")
randstr = regex.sub(" ", randstr)
print(randstr)
Output:
You Never
Walk Alone
Liverpool FC
You Never Walk Alone Liverpool FC
As you can see from the above output, each newline has been replaced with a space and the text is printed on a single line.
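Since the string starts and ends with a newline, the result above also starts and ends with a space. One possible refinement, sketched below, is to collapse every run of whitespace with \s+ and then strip the ends:

```python
import re

randstr = '''
You Never
Walk Alone
Liverpool FC
'''

# \s+ matches one or more whitespace characters (newlines, spaces, tabs);
# strip() then removes the leading and trailing space that is left over.
one_line = re.sub(r"\s+", " ", randstr).strip()
print(one_line)  # You Never Walk Alone Liverpool FC
```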
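Other special characters can be matched and replaced in the same way, depending on what you want to remove from the string. Some of the available escape sequences are listed as follows:
- \b: Backspace (only inside a character set such as [\b]; elsewhere in a pattern, \b means a word boundary)
- \f: Formfeed
- \r: Carriage Return
- \t: Tab
- \v: Vertical Tab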
Consider another example as shown below:
import re
randstr = "12345"
print("Matches:", len(re.findall(r"\d", randstr)))
Output:
Matches: 5
As you can see from the above output, \d matches the digits present in the string, one digit at a time. However, if we replace it with \D, it will match every character that is NOT a digit, the exact opposite of \d.
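The difference between \d, \D and \d+ can be seen side by side in the sketch below. The sample string is made up for illustration.

```python
import re

randstr = "Room 12, Floor 3"   # made-up sample text

digits = re.findall(r"\d", randstr)       # each digit separately
non_digits = re.findall(r"\D", randstr)   # every character that is not a digit
digit_runs = re.findall(r"\d+", randstr)  # consecutive digits grouped together

print(digits)           # ['1', '2', '3']
print(len(non_digits))  # 13
print(digit_runs)       # ['12', '3']
```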
Next, in this article, let us walk through some important practical use-cases of making use of Regular Expressions in Python.
Practical Use Cases Of Regular Expressions
We will be checking out 3 main use-cases which are widely used on a daily basis. Following are the concepts we will be checking out:
- Phone Number Verification
- E-mail Address Verification
- Web Scraping
Let us begin this section of Python RegEx tutorial by checking out the first case.
Phone Number Verification:
Problem Statement — The need to easily verify phone numbers in any relevant scenario.
Consider the following Phone numbers:
- 444-122-1234
- 123-122-78999
- 111-123-23
- 67-7890-2019
The general format of a phone number is as follows:
- Starts with 3 digits and a '-' sign
- 3 middle digits and a '-' sign
- 4 digits in the end
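We will be using \d in the example below. Note that \d matches a single digit, whereas \w = [a-zA-Z0-9_] would also accept letters and underscores.
import re
phn = "412-555-1212"
# fullmatch requires the whole string to fit the pattern, so extra digits are rejected
if re.fullmatch(r"\d{3}-\d{3}-\d{4}", phn):
    print("Valid phone number")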
Output:
Valid phone number
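As a quick check, the four sample numbers listed above can be run against the stated format. This sketch uses \d with fullmatch so that letters and extra digits are rejected; it also shows why an unanchored re.search is not enough for validation.

```python
import re

phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

candidates = ["444-122-1234", "123-122-78999", "111-123-23", "67-7890-2019"]

# fullmatch succeeds only if the entire string fits the pattern.
results = {number: bool(phone_pattern.fullmatch(number)) for number in candidates}

for number, is_valid in results.items():
    print(number, "valid" if is_valid else "invalid")

# search only needs the pattern to appear somewhere, so it accepts
# "123-122-78999" by matching its first twelve characters.
loose = bool(phone_pattern.search("123-122-78999"))
print(loose)  # True
```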
E-mail Verification:
Problem statement — To verify the validity of an E-mail address in any scenario.
Consider the following examples of email addresses:
- Anirudh@gmail.com
- Anirudh @ com
- AC .com
- 123 @.com
Manually, it takes just one good glance to tell the valid mail IDs from the invalid ones. But how do we have our program do this for us? It is pretty simple, provided the following guidelines are followed for this use-case.
Guidelines:
All E-mail addresses should include:
- 1 to 20 lowercase and/or uppercase letters, numbers, plus . _ % + -
- An @ symbol
- 2 to 20 lowercase and/or uppercase letters, numbers, plus . -
- A period symbol
- 2 to 3 lowercase and/or uppercase letters
Code:
import re
email = "ac@aol.com md@.com @seo.com dc@.com"
# The period before the final letters is escaped so that it matches a literal dot
print("Email Matches:", len(re.findall(r"[\w._%+-]{1,20}@[\w.-]{2,20}\.[A-Za-z]{2,3}", email)))
Output:
Email Matches: 1
As you can see from the above output, only one of the 4 input emails is valid.
This shows how simple and efficient it can be to work with Regular Expressions in practice. Keep in mind that these guidelines are a simplification, so the pattern checks the format only and not whether the address actually exists.
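The sample addresses listed earlier can also be checked one by one. The sketch below follows the simplified guidelines of this section, uses fullmatch so that the whole address must fit, and wraps the check in a helper; the name is_valid_email is hypothetical and chosen only for this example.

```python
import re

# Simplified format from the guidelines above, not a full email standard.
EMAIL_PATTERN = re.compile(r"[\w._%+-]{1,20}@[\w.-]{2,20}\.[A-Za-z]{2,3}")

def is_valid_email(address):
    """Return True if the whole address fits the simplified format."""
    return EMAIL_PATTERN.fullmatch(address) is not None

samples = ["Anirudh@gmail.com", "Anirudh @ com", "AC .com", "123 @.com"]
checked = {address: is_valid_email(address) for address in samples}

for address, ok in checked.items():
    print(repr(address), ok)
```

Only the first address passes; the other three contain spaces or lack a domain name between the @ symbol and the period.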
Web Scraping
Problem Statement: Scraping all of the phone numbers from a website for a requirement.
To understand web scraping, recall that a single website consists of multiple web pages.
Let us say we need to scrape some information from these pages.
Web scraping is basically used to extract information from a website. You can save the extracted information in the form of XML, CSV or even a MySQL database. Python Regular Expressions make the extraction step easy when the data follows a clear pattern.
import urllib.request
from re import findall

url = "http://www.summet.com/dmsi/html/codesamples/addresses.html"
# Download the page and decode the raw bytes into a string
response = urllib.request.urlopen(url)
html = response.read()
htmlStr = html.decode()
# Match phone numbers of the form (123) 456-7890
pdata = findall(r"\(\d{3}\) \d{3}-\d{4}", htmlStr)
for item in pdata:
    print(item)
Output:
(257) 563-7401
(372) 587-2335
(786) 713-8616
(793) 151-6230
(492) 709-6392
(654) 393-5734
(404) 960-3807
(314) 244-6306
(947) 278-5929
(684) 579-1879
(389) 737-2852
(660) 663-4518
(608) 265-2215
(959) 119-8364
(468) 353-2641
(248) 675-4007
(939) 353-1107
(570) 873-7090
(302) 259-2375
(717) 450-4729
(453) 391-4650
(559) 104-5475
(387) 142-9434
(516) 745-4496
(326) 677-3419
(746) 679-2470
(455) 430-0989
(490) 936-4694
(985) 834-8285
(662) 661-1446
(802) 668-8240
(477) 768-9247
(791) 239-9057
(832) 109-0213
(837) 196-3274
(268) 442-2428
(850) 676-5117
(861) 546-5032
(176) 805-4108
(715) 912-6931
(993) 554-0563
(357) 616-5411
(121) 347-0086
(304) 506-6314
(425) 288-2332
(145) 987-4962
(187) 582-9707
(750) 558-3965
(492) 467-3131
(774) 914-2510
(888) 106-8550
(539) 567-3573
(693) 337-2849
(545) 604-9386
(221) 156-5026
(414) 876-0865
(932) 726-8645
(726) 710-9826
(622) 594-1662
(948) 600-8503
(605) 900-7508
(716) 977-5775
(368) 239-8275
(725) 342-0650
(711) 993-5187
(882) 399-5084
(287) 755-9948
(659) 551-3389
(275) 730-6868
(725) 757-4047
(314) 882-1496
(639) 360-7590
(168) 222-1592
(896) 303-1164
(203) 982-6130
(906) 217-1470
(614) 514-1269
(763) 409-5446
(836) 292-5324
(926) 709-3295
(963) 356-9268
(736) 522-8584
(410) 483-0352
(252) 204-1434
(874) 886-4174
(581) 379-7573
(983) 632-8597
(295) 983-3476
(873) 392-8802
(360) 669-3923
(840) 987-9449
(422) 517-6053
(126) 940-2753
(427) 930-5255
(689) 721-5145
(676) 334-2174
(437) 994-5270
(564) 908-6970
(577) 333-6244
(655) 840-6139
We first begin by importing the packages which are needed to perform the web scraping. The page is then downloaded and decoded into a string, and the final result comprises the phone numbers extracted from it using Regular Expressions.
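The same extraction can be tried without a network connection, and the results can be saved as CSV as mentioned earlier. The sketch below is an illustration only: the HTML snippet and the phone numbers in it are made up, and the rows are written to an in-memory buffer rather than to a real file.

```python
import csv
import io
import re

# Stand-in for a downloaded page; the markup and numbers are made up.
html_str = """
<ul>
  <li>Alice Example, (555) 010-1234</li>
  <li>Bob Example, (555) 010-5678</li>
  <li>Fax: 555-010-9999</li>
</ul>
"""

# Same pattern as above: (123) 456-7890
phones = re.findall(r"\(\d{3}\) \d{3}-\d{4}", html_str)

# Write to an in-memory buffer; pass an open file object instead to save to disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["phone"])
for phone in phones:
    writer.writerow([phone])

print(buffer.getvalue())
```

The fax number is skipped because it does not have the parenthesised area code that the pattern requires.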
Conclusion
I hope this Python RegEx tutorial helps you in learning all the fundamentals needed to get started with using Regular Expressions in Python.
This will be very handy when you are trying to develop applications that require the usage of Regular Expressions and similar principles. Now, you should also be able to use these concepts to develop applications easily with the help of Regular Expressions and Web Scraping too.
Originally published at www.edureka.co on March 8, 2019.