The Infamous Dictionary Attack in Words and Code

With Python Examples & Rudimentary Password Analysis

Ariane Horbach
6 min read · May 20, 2022
Image by Lacie Slezak at Unsplash

To know or not to know?
If you work in tech, even if you are not a security expert per se, it makes sense to familiarize yourself with at least some basic attack vectors. That way we can actively contribute to keeping our colleagues and data safe from hackers. In this article, I will provide you with some information and code snippets covering the “legendary” brute-force technique mentioned in the title: the dictionary attack. Hopefully you will find some value in this writing. If not, that is ok too. Feel free to skip to the bottom of the page for the link to my code on GitHub.

What is a dictionary attack?
During a dictionary attack, a hacker illegally tries to gain access to a system by attempting to log in with guessed passwords or passphrases. Rather than trying every possible character combination, the attacker tries to brute-force their way into the system using a list of likely candidates: common passwords, names, birthdays, etc. While such attacks require quite some effort on the hacker’s side, the rewards for success may be tremendous. For example, the criminal may steal personal data and valuables or hijack the user’s system for malicious activity. Strong password requirements can be a good protection. Here is an example of a strong password requirement:

  1. At least 12 characters
  2. A mix of lower and upper case letters
  3. Inclusion of at least one special character e.g., ! @ # ?
  4. Inclusion of at least one number

Of course such passwords are notoriously hard to remember. Hence such requirements are often not very popular amongst users.
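
The four requirements above can be checked with a few lines of plain Python. This is only a minimal sketch: the function name `failed_requirements` and the set of special characters are assumptions made for illustration, and a real system would define its own policy.

```python
# Special characters taken from the example in the list above; the exact
# set is an assumption and would differ from system to system.
SPECIAL_CHARACTERS = set("!@#?")


def failed_requirements(password: str) -> list[str]:
    """Return a description of every requirement the password violates."""
    failures = []
    if len(password) < 12:
        failures.append("fewer than 12 characters")
    has_lower = any(c.islower() for c in password)
    has_upper = any(c.isupper() for c in password)
    if not (has_lower and has_upper):
        failures.append("no mix of lower and upper case letters")
    if not any(c in SPECIAL_CHARACTERS for c in password):
        failures.append("no special character")
    if not any(c.isdigit() for c in password):
        failures.append("no number")
    return failures


print(failed_requirements("hello"))              # all four requirements fail
print(failed_requirements("Correct-Horse7!xy"))  # []
```

Returning the list of violated rules, instead of a plain True or False, makes it easy to tell the user what to change.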

How does the attack work in code?
In general, hackers will start out with a file that contains previously leaked passwords. One such infamous leaked password file is RockYou.txt. Chances are good that passwords which have been used before are still in use somewhere. Therefore, the hacker tries those RockYou.txt passwords against various user accounts until one of them matches and access is gained.

One got unlucky ©Ariane Horbach: Image created with ‘Diagram.io.’

Have a look at the Python function below (the full code can be found at the bottom of the article). Every word in RockYou.txt is read and then hashed with SHA-256. Systems typically store only a hash of each password instead of the plain text, which is why the attack compares hashes. The hashed word is then matched against the actual password’s hash to simulate the attack.

import hashlib


class DictionaryAttack:
    def __init__(self, password):
        # variable declaration in DictionaryAttack class
        self.password = password
        self.password_hash = hashlib.sha256(password.encode()).hexdigest()
        # Return values that indicate if the attempt was a success or not
        self.success = "The password was found: "
        self.fail = "The password could not be cracked "

    def crack_password(self):
        # open the file and parse the contents (one password per line)
        with open(r"path to rockyou.txt", "r", encoding="latin-1") as file:
            words = file.read().splitlines()
        # hash every word and check whether it matches the password's
        # hashed version. If it does match, return success and the password.
        for word in words:
            word_hash = hashlib.sha256(word.encode()).hexdigest()
            if word_hash == self.password_hash:
                return self.success + self.password
        # Only after every word has been tried: return the failure message.
        return self.fail
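
To try the idea without downloading RockYou.txt, the same loop can be run against a tiny in-memory list. The following is a simplified sketch, not the code from the repository; the helper names `sha256_hex` and `dictionary_attack` as well as the mini word list are made up for illustration.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hash a string with SHA-256 and return the hex digest."""
    return hashlib.sha256(text.encode()).hexdigest()


def dictionary_attack(target_hash: str, candidates):
    """Return the first candidate whose hash equals target_hash, else None."""
    for word in candidates:
        if sha256_hex(word) == target_hash:
            return word
    return None


# Hypothetical mini word list standing in for RockYou.txt
word_list = ["123456", "password", "hello", "iloveyou"]

print(dictionary_attack(sha256_hex("hello"), word_list))           # hello
print(dictionary_attack(sha256_hex("Tr0ub4dor&3xyz"), word_list))  # None
```

Note that the failure case is only reported after the loop has finished, that is, after every candidate has been tried.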

Well, that is terribly slow. How to make it faster?
Now, imagine that RockYou.txt has 8.4 billion password entries. Opening and looping through it as in the previous example will take a long time. One quick fix to make the code faster is to use Pandas DataFrames. The underlying logic remains the same: we hash the words and then compare them against the hashed password. The difference is that the hashes are pre-calculated once. After that, no explicit Python loop is required for each attempt and Pandas does the heavy lifting.
First, the data is read using the Pandas API.

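import pandas as pd

path = r"path to rockyou.txt"
# reading the data from the txt file
# (some newer pandas versions reject "\n" as a delimiter; in that case the
# lines can be read with plain Python and passed to pd.DataFrame instead)
df = pd.read_csv(
    path,
    delimiter="\n",
    header=None,
    names=["Words"],
    encoding="ISO-8859-1",
)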

This dataframe (df) now has exactly one column, Words. Next, we create the hashes as a second column.

# function to hash the words in the pandas dataframe to sha256
def sha_encode(word):
    return hashlib.sha256(word.encode()).hexdigest()


df["Hash"] = df["Words"].astype(str).apply(sha_encode)

Now the solution can simply be fetched with .loc as follows:

def crack_password(self, password):
    # simulating the hashed password
    password_hash = hashlib.sha256(password.encode()).hexdigest()
    # See if a hashed word matches the hashed password
    solution = self.df.loc[self.df["Hash"] == password_hash]
    # if solution is empty, the attack failed; otherwise return the password
    if solution.empty:
        return self.fail
    else:
        return self.success + password

The full code contains a timer with which you can compare both approaches. Indeed, you will find that the Pandas version is significantly faster. An attacker will likely try to optimize the code to the maximum to increase its speed, although one may argue that this simple code does not leave much room for further improvement.
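
The pre-calculation idea itself does not depend on Pandas. As an illustration only (this is a swapped-in technique, not the approach from the repository), the same lookup can be sketched with a plain dictionary that maps each hash to its word. The function names and the mini word list are hypothetical.

```python
import hashlib


def build_hash_table(words):
    """Map SHA-256 hex digest -> plain-text word, computed once up front."""
    return {hashlib.sha256(w.encode()).hexdigest(): w for w in words}


def lookup(hash_table, password_hash):
    """Return the matching word, or None if the hash is not in the table."""
    return hash_table.get(password_hash)


# Hypothetical mini word list standing in for RockYou.txt
table = build_hash_table(["123456", "password", "hello"])
target = hashlib.sha256("password".encode()).hexdigest()

print(lookup(table, target))    # password
print(lookup(table, "0" * 64))  # None
```

The price for the fast lookup is memory: every word and every hash has to be kept around, which matters for very large word lists.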

What if the initial attack has not been successful?
Well, coming up with new passwords to try is much more elaborate and tedious. Therefore, the dictionary attack is probably more of a first attempt at getting lucky. What the hacker can do, though, is analyze leaked passwords such as the RockYou file and look for patterns. Once a pattern has been found, a rule-based extension of the password list becomes possible. Consider this pattern matching example using a simple regular expression:

# how many passwords in total?
count_passwords = len(df)
# out of all the passwords, how many contain the exclamation mark (!)?
# str.contains returns a boolean mask, so summing it counts the matches
count_password_with_exclamation_mark = df["Words"].str.contains("!", na=False).sum()
# fraction of all passwords (multiply by 100 to get a percentage)
percentage = count_password_with_exclamation_mark / count_passwords

It turns out that out of 14,343,476 passwords, only 0.008% contain the ! sign. Interestingly though, most people put the ! at the end of their password, as they would when writing text. A common password, indeed, is hello.
Simply by adding the ! to the last position of the password, the chances of success are increased. Following that logic, hello! is an interesting new attempt. Of course, patterns can get much more complex than this one. Let’s have a look at how you would extend the Pandas dataframe with the newly generated passwords:

# Let's see first if there are any weak passwords
def identify_weak_passwords(self, df):
    # pattern to match a strong password (no minimum amount of
    # characters defined for simplicity)
    strong_password_regex = r"(?=.*\d)(?=.*[!@#$%^&*]+)(?![.\n])(?=.*[A-Z])(?=.*[a-z]).*$"
    # negate the pattern with the ~ sign to get a boolean expression
    df["is_Weak_Password"] = ~df["Words"].str.contains(strong_password_regex, na=False)
    return df


# now generate stronger passwords out of the weak passwords
def generate_additional_passwords(self, df):
    weak_passwords = df.loc[df["is_Weak_Password"], "Words"].astype(str)
    # rule one: append an exclamation mark
    expression_one = weak_passwords + "!"
    # rule two: append a year (one random year, shared by all passwords)
    expression_two = weak_passwords + str(random.randint(1940, 2022))
    # combine the original passwords with the newly generated ones
    result_df = pd.concat([df["Words"], expression_one, expression_two], axis=0)
    return result_df

Simple but effective. However, there is a catch with regular expression pattern matching. Regular expression (regex) evaluation can be notoriously slow, because every single string has to be matched against the pattern. Additionally, regex optimization requires a lot of expertise from the programmer.
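
The rule-based extension can also be sketched without Pandas and without regular expressions, using plain string concatenation. Note that this is a variation, not the code from the repository: instead of one random year, this hypothetical generator `extend_candidates` appends every year in the range.

```python
def extend_candidates(words, first_year=1940, last_year=2022):
    """Yield rule-based variants of each word: word + "!" and word + year."""
    for word in words:
        yield word + "!"
        for year in range(first_year, last_year + 1):
            yield f"{word}{year}"


variants = list(extend_candidates(["hello"]))
print(variants[:3])   # ['hello!', 'hello1940', 'hello1941']
print(len(variants))  # 84
```

Because it is a generator, the variants are produced one at a time and never have to fit into memory all at once.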

Powerful Passwords
If one considers the 94 printable ASCII characters (letters, digits and special characters), there are already 94^12, or roughly 4.8 × 10^23, possible combinations for a 12-character password. Therefore, if administrators enforce strong password requirements, it becomes very difficult for hackers to be successful with this method.
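
To get a feeling for these numbers, here is a small back-of-the-envelope calculation. It assumes 52 characters for letters only (lower and upper case) and 94 for all printable ASCII characters without the space. The guessing speed is a purely hypothetical value chosen for illustration.

```python
def keyspace(alphabet_size: int, length: int) -> int:
    """Number of distinct passwords of exactly `length` characters."""
    return alphabet_size ** length


# Hypothetical attacker speed, chosen only for illustration
GUESSES_PER_SECOND = 10_000_000_000
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

for label, size in [("letters only", 52), ("printable ASCII", 94)]:
    total = keyspace(size, 12)
    years = total / GUESSES_PER_SECOND / SECONDS_PER_YEAR
    print(f"{label}: {total:.2e} combinations, about {years:.1e} years to try all")
```

The point is the growth: every additional character multiplies the number of combinations by the size of the alphabet.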

Image by Simone Pellegrini at Unsplash

If you are interested in regular expressions, check out the following example to test whether your password satisfies the requirements of a strong password.

^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\S+$).{12,}$

^ # beginning of the string
(?=.*[0-9]) # a digit must occur at least once
(?=.*[a-z]) # a lower case letter must occur at least once
(?=.*[A-Z]) # an upper case letter must occur at least once
(?=.*[@#$%^&+=]) # a special character must occur at least once
(?=\S+$) # no whitespace allowed in the entire string
.{12,} # anything, at least twelve places
$ # end-of-string
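
In Python, the pattern could be applied roughly as follows. This is a minimal sketch that uses the one-line form of the pattern with the twelve-character minimum; the names `STRONG_PASSWORD` and `is_strong` are made up for illustration.

```python
import re

# One-line form of the pattern above, with the twelve-character minimum
STRONG_PASSWORD = re.compile(
    r"^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\S+$).{12,}$"
)


def is_strong(password: str) -> bool:
    """True if the password satisfies every lookahead and the length rule."""
    return STRONG_PASSWORD.match(password) is not None


print(is_strong("hello!"))           # False
print(is_strong("Tr0ub4dor&3xyz"))   # True
print(is_strong("Tr0ub4dor 3xyz&"))  # False (contains whitespace)
```

Keep in mind that the character class in this pattern only accepts @ # $ % ^ & + = as special characters, so a password whose only special character is ! or ? would be rejected.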

Conclusion
You are here! Cool! As promised, here is the link to the GitHub repository with the code. The data to play around with can be found here. Consider following me for more content :)

Ariane Horbach

Big Data Engineer @DHL IT Services. Loves data & programming, cycling and the outdoors. LinkedIn: https://www.linkedin.com/in/ariane-horbach-60b31b124/