Photo Credits: Malware Bytes

Intro to Regexes & Strong Password Detection in Python

Build a Password Detector from Scratch

Akeel Ahamed
Sep 4 · 9 min read

Regular Expressions (or Regexes) are huge time-savers, not just for software users but also for Programmers and Data Scientists. Tech writer Cory Doctorow argues that even before learning to program, we should be learning regular expressions:

“Knowing [regular expressions] can mean the difference between solving a problem in three steps and solving it in 3,000 steps. When you’re a nerd, you forget that the problems you solve with a couple keystrokes can take other people days of tedious, error-prone work to slog through.”

Regular Expressions are a mini language for specifying text patterns. In this article I will give you a brief introduction to using regular expressions in Python and apply them to a real-world problem: creating a password detector.

To demonstrate how much time we can save by having basic knowledge of regexes, let me give you an example of the code used to extract phone numbers from a string of text with and without using regular expressions. For simplicity, let us assume that a phone number is valid if it is in the format ddd-ddd-dddd.

Code to print phone numbers in a string of text without using regular expressions.
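A minimal sketch of what such code could look like, assuming the ddd-ddd-dddd format described earlier. The function names and the sample message are hypothetical, chosen only for illustration:

```python
def is_phone_number(chunk):
    """Return True if chunk has the form ddd-ddd-dddd."""
    if len(chunk) != 12:
        return False
    for i, char in enumerate(chunk):
        if i in (3, 7):
            # Positions 3 and 7 must hold the hyphens.
            if char != "-":
                return False
        elif char not in "0123456789":
            # Every other position must hold a digit.
            return False
    return True


def find_phone_numbers(text):
    """Slide a 12-character window over text and collect every match."""
    found = []
    for start in range(len(text) - 11):
        chunk = text[start:start + 12]
        if is_phone_number(chunk):
            found.append(chunk)
    return found


# Hypothetical sample text.
message = "Call me at 415-555-1011 tomorrow. 415-555-9999 is my office."
print(find_phone_numbers(message))  # ['415-555-1011', '415-555-9999']
```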

The above code shows how much effort is needed to simply print all the phone numbers in a string. Now let us see how regexes can simplify the problem:

Code to print phone numbers in a string of text using regular expressions.
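A regex version might look like the sketch below. The sample message is hypothetical, and the pattern assumes the same ddd-ddd-dddd format:

```python
import re

# Hypothetical sample text.
message = "Call me at 415-555-1011 tomorrow. 415-555-9999 is my office."
phone_numbers = re.findall(r"\d{3}-\d{3}-\d{4}", message)
print(phone_numbers)  # ['415-555-1011', '415-555-9999']
```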

Fantastic! In just 3 lines of code, we managed to do the same task!

Now that I hopefully have your attention, let us explore how we can use regular expressions in Python.

Regular Expressions in Python

Before we move further, I would like to explain the difference between raw strings and normal strings.

Difference between raw strings & normal strings

A raw string is specified using ‘r’ before beginning the string in Python. It treats the backslash (\) as a literal character. Recall that escape characters in Python use the backslash (\). The string value ‘\n’ represents a single newline character, not a backslash followed by a lowercase n.

In order to specify a backslash followed by a lowercase n in a normal string, you need to enter the escape sequence ‘\\’, which produces a single backslash, followed by ‘n’. Thus, ‘\\n’ is the string that represents a backslash followed by a lowercase ‘n’. However, by inserting an ‘r’ before the first quote of the string, you can mark the string as a raw string, in which backslashes are not treated as escape characters.
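The difference is easy to check by comparing string lengths. A small sketch (the variable names are just for illustration):

```python
normal = "\n"      # one character: a newline
escaped = "\\n"    # two characters: a backslash followed by n
raw = r"\n"        # also two characters: a backslash followed by n

print(len(normal), len(escaped), len(raw))  # 1 2 2
print(escaped == raw)                       # True
```

This is why regex patterns are usually written as raw strings: a pattern such as r"\d" reaches the regex engine with its backslash intact.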

Python has a built-in module for working with regular expressions called re. Let us now explore some of the commonly used functions in the re module:

  • re.match(pattern, string): checks whether the pattern matches at the beginning of the string and returns a match object if it does. Otherwise, it returns None.

Since the output of re.match is a match object, we can use the group() method to return the matched expressions.
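As a small illustration (the sample strings are my own), re.match only succeeds when the pattern is found at the very start of the string:

```python
import re

pattern = r"\d{3}-\d{3}-\d{4}"

# The number is at the start of the string, so this matches.
match = re.match(pattern, "415-555-1011 is my number")
# The number is not at the start, so re.match returns None.
no_match = re.match(pattern, "My number is 415-555-1011")

matched_text = match.group() if match else None
print(matched_text)  # 415-555-1011
print(no_match)      # None
```

Checking for None before calling group() avoids an AttributeError when nothing matched.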

Optionally, you can also specify a pattern separately using the re.compile(pattern) function that takes the pattern as an argument.
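A short sketch of compiling a pattern once and reusing it; the name phone_regex and the sample strings are hypothetical:

```python
import re

# Compile the pattern once...
phone_regex = re.compile(r"\d{3}-\d{3}-\d{4}")

# ...then reuse the compiled object for several strings.
first = phone_regex.match("415-555-1011 is my number")
second = phone_regex.match("212-555-0000 is the office")

print(first.group())   # 415-555-1011
print(second.group())  # 212-555-0000
```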

  • re.search(pattern, string): scans the whole string and returns a match object for only the first occurrence of the pattern (or None if there is no match).
  • re.findall(pattern, string): returns all occurrences of the searched pattern within a string (in a list format).

re.findall is more powerful than the previous two functions and is the one that I prefer using.
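To make the contrast concrete, here is a small sketch using a made-up string that contains two phone numbers:

```python
import re

text = "Home: 415-555-1011, work: 415-555-9999"
pattern = r"\d{3}-\d{3}-\d{4}"

# re.search stops at the first occurrence.
first_only = re.search(pattern, text).group()
# re.findall collects every occurrence into a list of strings.
all_numbers = re.findall(pattern, text)

print(first_only)   # 415-555-1011
print(all_numbers)  # ['415-555-1011', '415-555-9999']
```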

  • re.sub(pattern, replacement, string): returns the string obtained by replacing the occurrences of pattern in the string, by replacement. If there is no pattern found, the original string is returned.
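For instance, re.sub could be used to mask phone numbers. The strings and the replacement text below are illustrative assumptions:

```python
import re

pattern = r"\d{3}-\d{3}-\d{4}"

redacted = re.sub(pattern, "XXX-XXX-XXXX", "Home: 415-555-1011, work: 415-555-9999")
# No match: the original string comes back unchanged.
unchanged = re.sub(pattern, "XXX-XXX-XXXX", "no numbers here")

print(redacted)   # Home: XXX-XXX-XXXX, work: XXX-XXX-XXXX
print(unchanged)  # no numbers here
```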

Metacharacters & Special Sequences

Regular expressions in general can be specified using a combination of metacharacters and special sequences.

Metacharacters are characters that are interpreted in a special way by a regex engine. The following is a list of commonly used metacharacters:

[] . ^ $ * + ? {} () \ |

Square brackets [abc]

Square brackets match any character between the brackets (such as a, b or c) in a string.
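Here is a sketch with three sample strings. The strings are hypothetical, picked so that they give two matches, four matches and no match respectively:

```python
import re

strings = ["ac", "ab de ca", "Hey John"]

# Collect every single character that is a, b or c.
matches = [re.findall(r"[abc]", s) for s in strings]

for s, m in zip(strings, matches):
    print(repr(s), "->", m)
# 'ac' -> ['a', 'c']
# 'ab de ca' -> ['a', 'b', 'c', 'a']
# 'Hey John' -> []
```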

In the three examples above, string 1 returns two matches, string 2 returns four matches and the last string has no match (as there are no letters a, b or c in string 3).

Note that:

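  • Most metacharacters lose their special meaning inside square brackets. For example, [(*+)] will match any instance of the literal characters ‘(’, ‘*’, ‘+’ or ‘)’. The exceptions are the backslash, a caret placed first (which negates the class), a hyphen between two characters (which defines a range) and the closing bracket itself.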
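A quick check of this behaviour, using a made-up string:

```python
import re

# Inside the brackets, ( * + ) are treated as ordinary characters.
found = re.findall(r"[(*+)]", "f(x) = a*b + c")
print(found)  # ['(', ')', '*', '+']
```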

Period (.)

The period matches any one character, except the newline (\n) character (similar to a wildcard character).
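A small sketch (the sample string is my own) showing that the period stands in for any single character but not for a newline:

```python
import re

# "h\nt" is skipped because . does not match the newline character.
found = re.findall(r"h.t", "hat hot h\nt hit")
print(found)  # ['hat', 'hot', 'hit']
```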

Caret (^)

The caret symbol anchors the pattern to the start of the string, so it checks whether a string begins with a certain pattern.
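For example, with two illustrative strings:

```python
import re

starts_with_hello = re.search(r"^Hello", "Hello world") is not None
not_at_start = re.search(r"^Hello", "Say Hello") is not None

print(starts_with_hello)  # True
print(not_at_start)       # False
```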

Dollar ($)

The dollar symbol anchors the pattern to the end of the string, so it checks whether a string ends with a certain pattern.
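The same idea, sketched for the end of a string (sample strings are hypothetical):

```python
import re

ends_with_world = re.search(r"world$", "Hello world") is not None
not_at_end = re.search(r"world$", "world peace") is not None

print(ends_with_world)  # True
print(not_at_end)       # False
```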

Over time, it is easy to forget which of the caret and dollar symbols comes first. A mnemonic I found to be useful is “Carrots cost dollars!”.

Star (*)

The star symbol matches zero or more occurrences of the pattern to the left of it.
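As an illustration with a made-up string, ab*c allows any number of b's between a and c, including none:

```python
import re

found = re.findall(r"ab*c", "ac abc abbbc adc")
print(found)  # ['ac', 'abc', 'abbbc']
```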

Plus (+)

The plus symbol matches one or more occurrences of the pattern to the left of it.
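Using the same kind of sample string, ab+c now requires at least one b, so "ac" is no longer matched:

```python
import re

found = re.findall(r"ab+c", "ac abc abbbc adc")
print(found)  # ['abc', 'abbbc']
```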

Braces {}

The pattern r{x,y}, where r stands for any pattern, matches at least x and at most y repetitions of r.
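A small sketch with a{2,3} on an illustrative string. Note that the run of four a's yields only its first three, because the single a left over is too short to match on its own:

```python
import re

found = re.findall(r"a{2,3}", "a aa aaa aaaa")
print(found)  # ['aa', 'aaa', 'aaa']
```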

Alternation |

The vertical bar is the alternation, or “or”, operator. If A and B are regular expressions, then A|B will match text that matches either the expression A or the expression B.
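For example, with a hypothetical sentence:

```python
import re

found = re.findall(r"cat|dog", "I have a cat, a dog and a bird")
print(found)  # ['cat', 'dog']
```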

Grouping ()

The parentheses () group together the expression contained inside them. For example, the expression (a|b|c)xy will match all instances containing the characters “a” or “b” or “c”, followed by “xy”.
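The sketch below tries (a|b|c)xy on a made-up string. One thing worth knowing: when a pattern contains a group, re.findall returns only the group's contents, so re.finditer is used here to get the full matches:

```python
import re

text = "axy bxy dxy cxy"

# Full text of each match.
full_matches = [m.group() for m in re.finditer(r"(a|b|c)xy", text)]
# With one group in the pattern, findall returns just that group.
group_only = re.findall(r"(a|b|c)xy", text)

print(full_matches)  # ['axy', 'bxy', 'cxy']
print(group_only)    # ['a', 'b', 'c']
```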

Question Mark ?

The question mark symbol matches zero or one occurrence of the pattern to the left of it.
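A classic illustration is an optional letter. The sample string here is my own:

```python
import re

# The u is optional, so both spellings match; "colouur" has one u too many.
found = re.findall(r"colou?r", "color colour colouur")
print(found)  # ['color', 'colour']
```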

Backslash \

A backslash is used to escape various characters, including all the metacharacters discussed. By ‘escape’, we mean that the character following the backslash is treated as a literal character. For example, \$a will match a “$” followed by “a”; the “$” is not interpreted by the regex engine in a special way.
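A short sketch of escaping the dollar sign, using hypothetical strings:

```python
import re

# \$ matches a literal dollar sign instead of "end of string".
literal_dollar_a = re.findall(r"\$a", "$a costs $5, not a$")
prices = re.findall(r"\$\d+", "Apples cost $5 and pears cost $12")

print(literal_dollar_a)  # ['$a']
print(prices)            # ['$5', '$12']
```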

However, there are some special sequences that the regex engine does interpret in a special way. Special Sequences make commonly used patterns much easier to write. Some of them are:

\d

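Matches any numeric digit; for ASCII text this is equivalent to the class [0-9]. (With Python 3 string patterns, \d also matches digits from other scripts unless the re.ASCII flag is set.)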

\D

Matches any character that is not a numeric digit from 0 to 9; this is equivalent to the class [^0-9].

\s

Matches any space, tab or newline character (i.e. whitespace characters); this is equivalent to the class [ \t\n\r\f\v].

\S

Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

\w

Matches any letter, numeric digit or underscore character (i.e. alphanumeric characters); this is equivalent to the class [a-zA-Z0-9_].

\W

Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

A more detailed list of special characters can be seen here.

Note:

  • Special Sequences are accepted inside the square brackets metacharacter. For example, [\w] will match any instance of a “word” character (letter, numeric digit or underscore character).
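The special sequences can be tried out on a small sample string. The string below is an assumption for illustration only:

```python
import re

text = "Order 66 shipped\ton 2024-01-05"

digits = re.findall(r"\d+", text)       # runs of digits
words = re.findall(r"\w+", text)        # runs of word characters
whitespace = re.findall(r"\s", text)    # single whitespace characters
in_class = re.findall(r"[\w-]+", text)  # \w used inside square brackets

print(digits)      # ['66', '2024', '01', '05']
print(words)       # ['Order', '66', 'shipped', 'on', '2024', '01', '05']
print(whitespace)  # [' ', ' ', '\t', ' ']
print(in_class)    # ['Order', '66', 'shipped', 'on', '2024-01-05']
```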

Review Exercise

Let us now use regular expressions to extract a grocery list from a string of text. Before looking at the method I have used, I would encourage you to try this out first using the string given:

Extracting Grocery list from string using regexes.
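One possible approach is sketched below. The sentence is a hypothetical stand-in, and the pattern assumes that every grocery item is written as a quantity followed by a single word:

```python
import re

# Hypothetical input string.
text = "I need to buy 12 eggs, 2 apples and 3 loaves of bread."

# One or more digits, a space, then one or more word characters.
grocery_list = re.findall(r"\d+ \w+", text)
print(grocery_list)  # ['12 eggs', '2 apples', '3 loaves']
```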

Password Detector Program

Using our knowledge of regexes, we will now build a small program to detect the strength of a password entered.

Password Detector Program

The password strength is based on 4 main checks (feel free to add more if you like):

  1. The password should have at least 8 characters.
  2. The password should contain at least one lowercase character.
  3. The password should contain at least one uppercase character.
  4. The password should contain at least one digit.
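The four checks above could be sketched as follows. This is a minimal version under my own assumptions: the function name is_strong_password is hypothetical, and "lowercase" and "uppercase" are taken to mean the ASCII letters a-z and A-Z:

```python
import re


def is_strong_password(password):
    """Return True only if the password passes all four checks."""
    checks = [
        len(password) >= 8,                         # 1. at least 8 characters
        re.search(r"[a-z]", password) is not None,  # 2. a lowercase letter
        re.search(r"[A-Z]", password) is not None,  # 3. an uppercase letter
        re.search(r"\d", password) is not None,     # 4. a digit
    ]
    return all(checks)


print(is_strong_password("Passw0rdX"))  # True
print(is_strong_password("password"))   # False
```

Keeping each rule as its own small regex makes it easy to add further checks, such as requiring a special character.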

Summary Notes

There has been a lot covered in this article so far. Here is a brief review of the steps to use regular expressions in Python:

  1. Import the regex module with import re.
  2. Create a Regex object with the re.compile() function. (Remember to use a raw string.)
  3. Pass the string you want to search into the Regex object’s search() method. This returns a Match object.
  4. Call the Match object’s group() method to return a string of the actual matched text.
Summary of steps 1–4
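Steps 1 to 4 can be sketched as follows, using a made-up string that contains two phone numbers:

```python
import re                                           # step 1

phone_regex = re.compile(r"\d{3}-\d{3}-\d{4}")      # step 2
match = phone_regex.search(
    "Home: 415-555-1011, work: 415-555-9999"
)                                                   # step 3
first_number = match.group()                        # step 4

print(first_number)  # 415-555-1011
```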

Notice that the group() method did not return all the phone numbers in the string, because search() stops at the first match. To return all the matched text corresponding to the pattern searched, we can use the findall() method:

The findall method works by passing in two arguments: the pattern to search for and the string in which the pattern is to be searched.

Note that you can also choose to skip steps 2, 3 and 4 and instead call re.findall() directly.
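Both routes are sketched below with a hypothetical string: calling findall() on a compiled Regex object, and calling re.findall() directly with the pattern and the string:

```python
import re

text = "Home: 415-555-1011, work: 415-555-9999"

# Route 1: compile first, then call findall() on the Regex object.
phone_regex = re.compile(r"\d{3}-\d{3}-\d{4}")
via_compiled = phone_regex.findall(text)

# Route 2: pass the pattern and the string straight to re.findall().
via_module = re.findall(r"\d{3}-\d{3}-\d{4}", text)

print(via_compiled)  # ['415-555-1011', '415-555-9999']
print(via_module)    # ['415-555-1011', '415-555-9999']
```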

The following is a great review of the symbols learned in this article extracted from here:

  • The ? matches zero or one of the preceding group.
  • The * matches zero or more of the preceding group.
  • The + matches one or more of the preceding group.
  • The {n} matches exactly n of the preceding group.
  • The {n,} matches n or more of the preceding group.
  • The {,m} matches 0 to m of the preceding group.
  • The {n,m} matches at least n and at most m of the preceding group.
  • {n,m}? or *? or +? performs a nongreedy match of the preceding group.
  • ^spam means the string must begin with spam.
  • spam$ means the string must end with spam.
  • The . matches any character, except newline characters.
  • \d, \w, and \s match a digit, word, or space character, respectively.
  • \D, \W, and \S match anything except a digit, word, or space character, respectively.
  • [abc] matches any character between the brackets (such as a, b, or c).
  • [^abc] matches any character that isn’t between the brackets.
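Nongreedy matching is the one item in this list not demonstrated earlier, so here is a brief sketch with made-up strings:

```python
import re

# Greedy: .+ grabs as much as it can, spanning both tags.
greedy = re.findall(r"<.+>", "<a><b>")
# Nongreedy: .+? stops at the first closing bracket it can.
nongreedy = re.findall(r"<.+?>", "<a><b>")

# The same idea applies to braces.
greedy_ha = re.search(r"(Ha){3,5}", "HaHaHaHaHa").group()
nongreedy_ha = re.search(r"(Ha){3,5}?", "HaHaHaHaHa").group()

print(greedy)        # ['<a><b>']
print(nongreedy)     # ['<a>', '<b>']
print(greedy_ha)     # HaHaHaHaHa
print(nongreedy_ha)  # HaHaHa
```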

Regex Practice

If you would like to practice the concepts learned in this article, some helpful resources are regex101 and regexone.

The Jupyter Notebook for this article can be found here.

Hope you enjoyed learning about regular expressions and have a fabulous day!

Resources:

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Akeel Ahamed

Written by

Data Scientist | Learning through data | Lifelong learner
