Python RegEx (Regular Expression) — A Comprehensive Guide With Examples

Aayushi Johari
Published in Edureka
14 min read · Mar 8, 2019


Regular Expressions can be used to search, edit and manipulate text, which opens up a wide variety of applications across the sub-domains of Python. They are widely used in industry, which makes Regular Expressions a valuable asset for the modern-day programmer.

In this Python RegEx article, we will be checking out the following concepts:

  • Why we use Regular Expressions?
  • What are Regular Expressions?
  • Basic Regular Expressions operations
  • Email verification using Regular Expressions
  • Phone number verification using Regular Expressions
  • Web Scraping using Regular Expressions

Let’s begin this Python RegEx article by checking out why we need to make use of Regular Expressions.

Why Use Regular Expression?

To answer this question, we will look at various problems we commonly face, each of which can be solved by using Regular Expressions.

Consider the following scenario:

You have a log file which contains a large amount of data, and from this log file you wish to fetch only the date and time. At first glance, the readability of such a log file is low.

Regular Expressions can be used in this case to recognize the patterns and extract the required information easily.
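As a minimal sketch of the log-file scenario, assume every log line starts with a timestamp of the form YYYY-MM-DD HH:MM:SS. The sample lines and this format are hypothetical, since the article does not show the actual log.

```python
import re

# Hypothetical log lines; the real log format is not shown in the article.
log = """2019-03-08 10:15:32 INFO Server started
2019-03-08 10:17:05 ERROR Disk quota exceeded
2019-03-08 10:20:41 INFO Backup finished"""

# Two capture groups: one for the date, one for the time.
stamps = re.findall(r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})", log)

for date, clock in stamps:
    print(date, clock)
```

Because the pattern has two groups, findall returns a list of (date, time) tuples rather than whole lines.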

Consider the next scenario: you are a salesperson with a long list of email addresses, and many of those addresses are fake or invalid.

Using Regular Expressions, you can verify the format of the email addresses and filter out the fake IDs from the genuine ones.

The next scenario is pretty similar to the salesperson example, this time with a list of phone numbers:

How do we verify the phone number and then classify it based on the country of origin?

Every correct number will have a particular pattern which can be traced and followed through by using Regular Expressions.

Next up is another simple scenario:

We have a Student Database containing details such as name, age, and address. Consider the case where the area code was originally 59006 but has now been changed to 59076. Manually updating this for each student would be lengthy and time-consuming.

To solve this using Regular Expressions, we first find every string in the student data that contains the old area code and then replace each of them with the new one.
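The find-and-replace step can be sketched with re.sub. The student records below are made up for illustration; only the two area codes come from the scenario above.

```python
import re

# Hypothetical student records; only the area code 59006 should change.
students = """Asha, 21, 12 Lake Road, 59006
Ravi, 22, 4 Hill Street, 59006
Mona, 20, 9 Park Lane, 59010"""

# \b keeps the pattern from matching inside a longer number such as 159006.
updated = re.sub(r"\b59006\b", "59076", students)
print(updated)
```

The third record is left untouched because its code is 59010, not 59006.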

Regular expressions can be used with multiple languages, such as:

  • Java
  • Python
  • Ruby
  • Swift
  • Scala
  • Groovy
  • C#
  • PHP
  • JavaScript

There are many other scenarios in which Regular Expressions help us. I will be walking you through some of them in the upcoming sections of this article.

So, next up on this article, let us look at what Regular Expressions actually are.

What Are Regular Expressions?

A Regular Expression is used for identifying a search pattern in a text string. It also helps in checking the correctness of data, and operations such as finding, replacing and formatting data are possible using Regular Expressions.

Consider the following example:

Among all of the data in the given string, let us say we require only the city. The string can be converted into a dictionary holding just the name and the city in a formatted way. The question now is: can we identify a pattern to pick out the name and the city? We can find the age too. With age, it is easy, right? It is just an integer.

How do we go about the name? If you take a look at the pattern, all of the names start with an uppercase letter. With the help of Regular Expressions, we can identify both the name and the age using this method.

Consider the following code:

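import re

# Sample text containing names and ages
Nameage = '''
Janice is 22 and Theon is 33
Gabriel is 44 and Joey is 21
'''

# Ages: runs of one to three digits
ages = re.findall(r'\d{1,3}', Nameage)
# Names: an uppercase letter followed by lowercase letters
names = re.findall(r'[A-Z][a-z]*', Nameage)

# Pair each name with the age found at the same position
ageDict = {}
x = 0
for eachname in names:
    ageDict[eachname] = ages[x]
    x += 1
print(ageDict)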

There is no need to worry about the syntax at this point, but since Python has amazing readability, you can probably guess what is happening in the Regular Expression part of the code.

Output:

{'Janice': '22', 'Theon': '33', 'Gabriel': '44', 'Joey': '21'}
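The code above relies on the names and ages appearing in the same order in two separate lists. As an alternative sketch, assuming every entry follows the "Name is Age" wording of the sample text, both values can be captured in a single pattern with groups:

```python
import re

name_age = "Janice is 22 and Theon is 33\nGabriel is 44 and Joey is 21"

# Each match yields a (name, age) tuple because the pattern has two groups.
pairs = re.findall(r"([A-Z][a-z]*) is (\d{1,3})", name_age)

# Convert the age strings to integers while building the dictionary.
age_dict = {name: int(age) for name, age in pairs}
print(age_dict)
```

This keeps each name tied to its own age, so a stray number elsewhere in the text would not shift the pairing.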

Next, in this article, let us check out all the operations we can perform using Regular Expressions.

Operations You Can Perform With Regular Expressions — RegEx Examples:

There are many operations you can perform by making use of Regular Expressions. Here, I have listed a few which are very vital in helping you understand the usage of Regular Expressions better.

Let us begin by first checking out how we can find a particular word in a string.

Finding a word in the string:

Consider the following piece of code:

import re

if re.search("inform", "we need to inform him with the latest information"):
    print("There is inform")

All we are doing here is searching for whether the word inform exists in our search string. If it does, we get an output saying There is inform.

We can take this a little further by using findall, which returns every occurrence instead of just reporting whether one exists.

import re

allinform = re.findall("inform","We need to inform him with the latest information!")

for i in allinform:
    print(i)

Here, in this particular case, inform will be found twice: once as the word inform and once inside information.

And it is as simple as this to find a word in a string using a Regular Expression.
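If only the standalone word should count, the word-boundary anchor \b can be added around the pattern. This is a small sketch of that variation, not part of the original example:

```python
import re

text = "We need to inform him with the latest information!"

# Plain substring search: also matches the start of "information".
substring_hits = re.findall(r"inform", text)

# \b anchors the match to word boundaries on both sides.
whole_word_hits = re.findall(r"\binform\b", text)

print(len(substring_hits))   # 2
print(len(whole_word_hits))  # 1
```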

Next up on this Python RegEx blog, we will check out how we can generate an iterator using Regular Expressions.

Generating an iterator:

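Generating an iterator is the simple process of finding each match and reporting its starting and ending index in the string. The finditer function returns an iterator over match objects, and span() gives the start and end index of each match. Consider the following example: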

import re

Str = "we need to inform him with the latest information"

for i in re.finditer("inform.", Str):
    locTuple = i.span()
    print(locTuple)

For every match found, the starting and the ending index is printed. Can you take a guess of the output that we get when we execute the above program? Check it out below.

Output:

(11, 18) 
(38, 45)

Pretty simple, right?
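Each item produced by finditer is a match object, so the matched text is available alongside the indices. A short sketch using the same string and pattern:

```python
import re

text = "we need to inform him with the latest information"

# group() is the matched text; start() and end() are the two halves of span().
matches = [(m.group(), m.start(), m.end()) for m in re.finditer(r"inform.", text)]

for matched_text, start, end in matches:
    print(repr(matched_text), start, end)
```

The trailing dot in the pattern matches any one character, which is why the first match includes the space after inform and the second includes the letter a.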

Next up on this Python RegEx blog, we will be checking out how we can match words with patterns using Regular Expressions.

Matching words with patterns:

Consider an input string in which you have to find every word that fits a certain pattern. To elaborate, check out the following example code:

import re

Str = "Sat, hat, mat, pat"

allStr = re.findall("[shmp]at", Str)

for i in allStr:
    print(i)

What is common in the string? You can see that the letters ‘a’ and ‘t’ are common to all of the words. [shmp] in the code denotes the allowed starting letters of the words to be found. So any substring starting with s, h, m or p and immediately followed by ‘at’ will be matched.

Output:

hat 
mat
pat

Do note that the match is case sensitive, which is why Sat with an uppercase S is not in the output. Once you get to know the basics, Regular Expressions become quite readable and you can start working with them in full swing.
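If the uppercase Sat should be matched as well, one option is the re.IGNORECASE flag. A minimal sketch with the same input string:

```python
import re

text = "Sat, hat, mat, pat"

# Default behaviour: lowercase s in the class does not match uppercase S.
case_sensitive = re.findall(r"[shmp]at", text)

# With re.IGNORECASE the class matches both cases.
case_insensitive = re.findall(r"[shmp]at", text, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```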

Next up on this Python RegEx blog, we will be checking out how we can match a range of characters at once using Regular Expressions.

Matching series of range of characters:

We wish to output all the words whose first letter lies between “h” and “m” and is immediately followed by at. Looking at the following example, we should realize that the output should be hat and mat, correct?

import re

Str = "sat, hat, mat, pat"

someStr = re.findall("[h-m]at", Str)

for i in someStr:
    print(i)

Output:

hat 
mat

Let us now change the above program very slightly to obtain a very different result. Check out the below code and try to catch the difference between the above one and the below one:

import re

Str = "sat, hat, mat, pat"

someStr = re.findall("[^h-m]at", Str)

for i in someStr:
    print(i)

Found the subtle difference? We have added a caret symbol (^) at the start of the character class. This negates the class: instead of matching any character from “h” to “m”, it matches any character outside that range.

The output we can expect is words which do NOT start with a letter between “h” and “m” but still end with at.

Output:

sat 
pat
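One thing worth knowing is that a negated class accepts any character outside the range, not only letters. The input string below is a made-up variation that adds a digit to show this, together with a letters-only alternative:

```python
import re

text = "sat, 9at, hat, mat"

# [^h-m] matches ANY character that is not h..m, including digits.
negated = re.findall(r"[^h-m]at", text)

# Listing the allowed letters explicitly restricts the match to letters.
letters_only = re.findall(r"[a-gn-z]at", text)

print(negated)
print(letters_only)
```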

Next, in this article, I will explain how we can replace a string using Regular Expressions.

Replacing a string:

Next up, we can check out another operation using Regular Expressions where we replace an item of the string with something else. It is very simple and can be illustrated with the following piece of code:

import re

Food = "hat rat mat pat"

regex = re.compile("[r]at")

Food = regex.sub("food", Food)

print(Food)

In the above example, the word rat is replaced with the word food. The sub (substitute) method of the compiled Regular Expression is used in this case, and it has a wide variety of practical use cases. The final output will look like this.

Output:

hat food mat pat
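The substitution can also be limited to the first few matches with the count argument of re.sub. The wider character class [rmp] below is chosen only to make the difference visible:

```python
import re

food = "hat rat mat pat"

# count=1 replaces only the first match.
first_only = re.sub(r"[rmp]at", "food", food, count=1)

# Without count, every match is replaced.
all_hits = re.sub(r"[rmp]at", "food", food)

print(first_only)
print(all_hits)
```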

Next up on this Python RegEx blog, we will check out a common source of confusion called the backslash problem.

The Backslash Problem:

Consider an example code shown below:

import re

randstr = "Here is \\Edureka"

print(randstr)

Output:

Here is \Edureka

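This is the backslash problem. In a normal Python string literal the backslash is an escape character, so "\\" produces a single backslash, which is why only one appears in the output. Regular Expressions also use the backslash as an escape character, so matching one literal backslash needs \\ in the pattern, and that is easiest to write as a raw string (r"...").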

import re

randstr = "Here is \\Edureka"

print(re.search(r"\\Edureka", randstr))

The output can be as follows:

<re.Match object; span=(8, 16), match='\\Edureka'>

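As you can see, the backslash followed by Edureka has been found. The match is displayed as '\\Edureka' because the printed representation of a string escapes the backslash; the matched text itself contains a single backslash. Writing patterns as raw strings is the usual way to avoid the backslash problem.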
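The relationship between normal strings, raw strings and patterns can be checked directly. This sketch uses re.escape as one way to build a pattern that matches text literally:

```python
import re

plain = "Here is \\Edureka"   # the literal contains ONE backslash
raw = r"Here is \Edureka"     # the same text written as a raw string

print(plain == raw)           # True
print(len("\\"), len(r"\\"))  # 1 2

# re.escape adds the escaping a pattern needs to match the text literally.
match = re.search(re.escape("\\Edureka"), plain)
print(match.span())
```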

Next, in this article, I will walk you through how we can match a single character using Regular Expressions.

Matching a single character:

A character class such as \d matches a single character, and a quantifier such as {5} says how many of them must appear in a row. Check out the following code snippet:

import re

randstr = "12345"

print("Matches:", len(re.findall(r"\d{5}", randstr)))

The pattern \d{5} matches a run of exactly five consecutive digits, so the input string yields a single match.

Output:

Matches: 1

Next, in this article, I will walk you through how we can remove newline spaces using Regular Expressions.

Removing Newline Spaces:

We can easily remove newlines using Regular Expressions in Python. Consider another snippet of code as shown here:

import re

randstr = '''
You Never
Walk Alone
Liverpool FC
'''

print(randstr)

regex = re.compile("\n")

randstr = regex.sub(" ", randstr)

print(randstr)

Output:

You Never
Walk Alone
Liverpool FC

You Never Walk Alone Liverpool FC

As you can see from the above output, the newlines have been replaced with spaces and the output is printed on a single line.
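Replacing each newline with a space leaves a leading and trailing space in the result. As a small variation, \s+ collapses any run of whitespace into one space, and strip() trims the ends:

```python
import re

randstr = "\nYou Never\nWalk Alone\nLiverpool FC\n"

# \s+ matches one or more whitespace characters (newlines, tabs, spaces).
single_line = re.sub(r"\s+", " ", randstr).strip()

print(single_line)
```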

Other special characters can be matched and replaced in the same way. They are listed as follows:

  • \b: Backspace (only inside a character class such as [\b]; elsewhere in a pattern, \b means a word boundary)
  • \f: Formfeed
  • \r: Carriage Return
  • \t: Tab
  • \v: Vertical Tab

Consider another example as shown below:

import re

randstr = "12345"

print("Matches:", len(re.findall(r"\d", randstr)))

Output:

Matches: 5

As you can see from the above output, \d matches each digit present in the string. However, if we replace it with \D, it will match every character that is NOT a digit, the exact opposite of \d.
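The difference between \d, \D and \d+ is easier to see on a string that mixes digits with other characters. The sample string here is made up for illustration:

```python
import re

randstr = "Order 12345, batch 7"

digits = re.findall(r"\d", randstr)      # one match per digit
non_digits = re.findall(r"\D", randstr)  # one match per non-digit character
numbers = re.findall(r"\d+", randstr)    # one match per run of digits

print(digits)
print(numbers)
print(len(non_digits))
```

Every character is either a digit or a non-digit, so the first two lists together account for the whole string.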

Next, in this article, let us walk through some important practical use-cases of making use of Regular Expressions in Python.

Practical Use Cases Of Regular Expressions

We will be checking out 3 main use-cases which are widely used on a daily basis. Following are the concepts we will be checking out:

  • Phone Number Verification
  • E-mail Address Verification
  • Web Scraping

Let us begin this section of Python RegEx tutorial by checking out the first case.

Phone Number Verification:

Problem Statement — The need to easily verify phone numbers in any relevant scenario.

Consider the following Phone numbers:

  • 444-122-1234
  • 123-122-78999
  • 111-123-23
  • 67-7890-2019

The general format of a phone number is as follows:

  • Starts with 3 digits and a ‘-’ sign
  • 3 middle digits and a ‘-’ sign
  • 4 digits at the end

We will be using \w in the example below. Note that for ASCII text \w = [a-zA-Z0-9_], so it accepts letters and underscores as well as digits.

import re

phn = "412-555-1212"

if re.search(r"\w{3}-\w{3}-\w{4}", phn):
    print("Valid phone number")

Output:

Valid phone number
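Note that re.search looks for the pattern anywhere in the string, so a number with extra trailing digits still passes. A stricter sketch, assuming the 3-3-4 digit format listed above, uses \d with fullmatch and runs it against the four sample numbers:

```python
import re

numbers = ["444-122-1234", "123-122-78999", "111-123-23", "67-7890-2019"]

# fullmatch requires the WHOLE string to fit the pattern; \d allows digits only.
strict = re.compile(r"\d{3}-\d{3}-\d{4}")
valid = [n for n in numbers if strict.fullmatch(n)]

# For comparison: search with \w also accepts a number with an extra digit.
loose = [n for n in numbers if re.search(r"\w{3}-\w{3}-\w{4}", n)]

print(valid)
print(loose)
```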

E-mail Verification:

Problem statement — To verify the validity of an E-mail address in any scenario.

Consider the following examples of email addresses:

  • Anirudh@gmail.com
  • Anirudh @ com
  • AC .com
  • 123 @.com

Manually, it takes just one good glance to tell the valid mail IDs from the invalid ones. But what about having our program do this for us? It is pretty simple, provided the following guidelines are followed for this use-case.

Guidelines:

All E-mail addresses should include:

  • 1 to 20 letters, digits, underscores, or the characters . % + -
  • An @ symbol
  • 2 to 20 letters, digits, underscores, periods, or hyphens
  • A period symbol
  • 2 to 3 lowercase or uppercase letters

Code:

import re

email = "ac@aol.com md@.com @seo.com dc@.com"

# The period before the final letters is escaped as \. so it matches a literal dot
print("Email Matches:", len(re.findall(r"[\w._%+-]{1,20}@[\w.-]{2,20}\.[A-Za-z]{2,3}", email)))

Output:

Email Matches: 1

As you can see from the above output, we have one valid email address among the 4 inputs.

This shows how simple and efficient it is to work with Regular Expressions and make use of them practically.
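To check addresses one at a time, rather than counting matches inside one long string, the same guidelines can be applied with fullmatch. This is a sketch against the four sample addresses listed earlier; it checks the format only and is not a complete email validator:

```python
import re

candidates = ["Anirudh@gmail.com", "Anirudh @ com", "AC .com", "123 @.com"]

# Same guidelines as above, with the period escaped so it matches a literal dot.
email_pattern = re.compile(r"[\w.%+-]{1,20}@[\w.-]{2,20}\.[A-Za-z]{2,3}")

# fullmatch rejects any candidate with characters outside the pattern, such as spaces.
valid = [c for c in candidates if email_pattern.fullmatch(c)]

print(valid)
```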

Web Scraping

Problem Statement: Scraping all of the phone numbers from a website for a requirement.

To understand web scraping, consider how a website is organized.

We already know that a single website consists of multiple web pages. Let us say we need to scrape some information from these pages.

Web scraping is used to extract information from a website. You can save the extracted information as XML, CSV or even in a MySQL database. For simple, regular patterns such as phone numbers, this can be done with Python Regular Expressions.

import urllib.request
from re import findall

url = "http://www.summet.com/dmsi/html/codesamples/addresses.html"

response = urllib.request.urlopen(url)

html = response.read()

htmlStr = html.decode()

pdata = findall(r"\(\d{3}\) \d{3}-\d{4}", htmlStr)

for item in pdata:
    print(item)

Output:

(257) 563-7401
(372) 587-2335
(786) 713-8616
(793) 151-6230
(492) 709-6392
(654) 393-5734
(404) 960-3807
(314) 244-6306
(947) 278-5929
(684) 579-1879
(389) 737-2852
(660) 663-4518
(608) 265-2215
(959) 119-8364
(468) 353-2641
(248) 675-4007
(939) 353-1107
(570) 873-7090
(302) 259-2375
(717) 450-4729
(453) 391-4650
(559) 104-5475
(387) 142-9434
(516) 745-4496
(326) 677-3419
(746) 679-2470
(455) 430-0989
(490) 936-4694
(985) 834-8285
(662) 661-1446
(802) 668-8240
(477) 768-9247
(791) 239-9057
(832) 109-0213
(837) 196-3274
(268) 442-2428
(850) 676-5117
(861) 546-5032
(176) 805-4108
(715) 912-6931
(993) 554-0563
(357) 616-5411
(121) 347-0086
(304) 506-6314
(425) 288-2332
(145) 987-4962
(187) 582-9707
(750) 558-3965
(492) 467-3131
(774) 914-2510
(888) 106-8550
(539) 567-3573
(693) 337-2849
(545) 604-9386
(221) 156-5026
(414) 876-0865
(932) 726-8645
(726) 710-9826
(622) 594-1662
(948) 600-8503
(605) 900-7508
(716) 977-5775
(368) 239-8275
(725) 342-0650
(711) 993-5187
(882) 399-5084
(287) 755-9948
(659) 551-3389
(275) 730-6868
(725) 757-4047
(314) 882-1496
(639) 360-7590
(168) 222-1592
(896) 303-1164
(203) 982-6130
(906) 217-1470
(614) 514-1269
(763) 409-5446
(836) 292-5324
(926) 709-3295
(963) 356-9268
(736) 522-8584
(410) 483-0352
(252) 204-1434
(874) 886-4174
(581) 379-7573
(983) 632-8597
(295) 983-3476
(873) 392-8802
(360) 669-3923
(840) 987-9449
(422) 517-6053
(126) 940-2753
(427) 930-5255
(689) 721-5145
(676) 334-2174
(437) 994-5270
(564) 908-6970
(577) 333-6244
(655) 840-6139

We first begin by importing the packages needed to perform the web scraping. The final result comprises the phone numbers extracted from the page using Regular Expressions.
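The extraction step can also be tried without a network connection. In the sketch below, a hypothetical HTML fragment stands in for the downloaded page; the names and the fax line are invented, and only the pattern is the same as in the code above.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page.
html_str = """
<ul>
  <li>Ann Lee, (257) 563-7401</li>
  <li>Bo Chan, (372) 587-2335</li>
  <li>Fax: 555-0100 (not in the expected format)</li>
</ul>
"""

# Escaped parentheses match literal ( and ) around the three-digit area code.
phone_pattern = re.compile(r"\(\d{3}\) \d{3}-\d{4}")
phones = phone_pattern.findall(html_str)

for phone in phones:
    print(phone)
```

Compiling the pattern once is convenient when the same expression is applied to many pages.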

Conclusion

I hope this Python RegEx tutorial helps you in learning all the fundamentals needed to get started with using Regular Expressions in Python.

This will be very handy when you are trying to develop applications that require the usage of Regular Expressions and similar principles. Now, you should also be able to use these concepts to develop applications easily with the help of Regular Expressions and Web Scraping too.


Originally published at www.edureka.co on March 8, 2019.
