Can five five-letter words cover 25 characters?

Lorenz Sparrenberg
7 min read · Jan 11, 2023


How to solve a Wordle-inspired question with integer linear programming

Word cloud showing five-letter words
Image by the author

Introduction

On August 3, 2022, the YouTuber Matt Parker posted a video on his channel Stand-Up Maths about the Wordle-inspired question: “Can you find: five five-letter words with twenty-five unique letters?”. In his video, he used a brute-force approach to find a total of 538 combinations of five five-letter words covering 25 letters of the alphabet. Written in Python, his code took around a month (!) on his laptop to find all matches in a given list of over 370,000 words. The ridiculously long calculation time motivated numerous followers of his channel to submit their own solutions, and a competition broke out to find the fastest algorithm. Matt Parker summarized the results in a follow-up video on October 17, 2022: “Someone improved my code by 40,832,277,770%”. The fastest solutions contain optimizations in many places and reduce the computing time to fractions of a second. In today’s post, I don’t want to go into the details of these optimizations. Instead, I want to present a completely different approach: formulating the task as an integer linear programming (ILP) problem. This approach is well suited to solving optimization problems or to quickly finding a possible solution. However, it is less appropriate for finding all possible solutions to a problem, as Matt Parker has done.

Problem formulation

Let us start with the mathematical description of the problem. Each word w of our word list W is assigned a binary variable v_w that can either take the value 1, meaning that the word is included in the solution, or 0, meaning that it is not included:

v_w ∈ {0, 1}  for all w ∈ W

Furthermore, only combinations of exactly five words are allowed as valid solutions:

∑_{w ∈ W} v_w = 5

and each letter x of the alphabet A may occur at most once in a valid word combination:

∑_{w ∈ W, x ∈ w} v_w ≤ 1  for all x ∈ A

With these boundary conditions, we can start coding our Python solution.
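
Before coding the model itself, the two conditions can be checked for any candidate combination in plain Python. The helper below is my own illustration, not part of the ILP formulation (the five words form a known valid combination):

```python
def is_valid_combination(words):
    """Check the two constraints directly: exactly five words,
    and no letter occurring in more than one word."""
    if len(words) != 5:
        return False
    # five words of five distinct letters with no overlap
    # must yield 25 distinct letters in total
    return len(set("".join(words))) == 25

print(is_valid_combination(["fjord", "gucks", "nymph", "vibex", "waltz"]))  # True
print(is_valid_combination(["fjord", "gucks", "nymph", "vibex", "water"]))  # False
```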

Implementation in Python

First of all, we need to read in the word list containing the allowed words. We will use the word list “words_alpha.txt”, which Matt Parker also used in his video. It can be found on GitHub.

import pandas as pd

file = 'words_alpha.txt'  # contains 370105 different words
df = pd.read_csv(file, names=['word'])

We are only interested in words that have a length of five letters. Therefore, we create a new column containing the length of each word and we keep only words of length five. Then, we filter out words that do not contain vowels to get rid of obvious abbreviations. Finally, we create a column holding the number of different letters in each word to filter for words with five unique characters.

# get length of each word
df['length'] = df['word'].apply(lambda word: len(str(word)))

# keep only words of length 5
dff = df[df.length == 5].copy()

# filter out words which do not contain vowels ('y' counts as a vowel here)
dff = dff[dff['word'].str.contains('a|e|i|o|u|y')]

# get number of unique characters of each word
dff['count unique'] = dff['word'].apply(lambda word: len(set(word)))

# keep only words with 5 unique characters
word_list = dff[dff['count unique'] == 5]['word'].values

We end up with a list of 10172 words, each consisting of five different letters.
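
The same filter can also be expressed in plain Python, which makes the three conditions explicit (illustrated on a small made-up sample rather than the full list):

```python
def keep_word(word):
    return (
        len(word) == 5                        # exactly five letters
        and any(v in word for v in "aeiouy")  # contains at least one vowel
        and len(set(word)) == 5               # all letters are distinct
    )

sample = ["frack", "abbey", "ab", "crwth", "jowly"]
print([w for w in sample if keep_word(w)])  # ['frack', 'jowly']
```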

To set up the ILP problem, there is the powerful library PuLP in Python. A detailed documentation on the library can be found here. We load the library and initialize our ILP problem:

import pulp as plp

# Using anaconda, you can simply install the package with the following command:
# conda install -c conda-forge pulp
# Under Linux, an additional solver may be required; it can be installed as follows:
# sudo apt-get install glpk-utils

# initialize problem
prob = plp.LpProblem('Wordle')

Now we can define our variables. For each word in our word list, we create a binary variable that indicates whether a word appears in our solution or not:

# set variables
variables = plp.LpVariable.dict('words', (word_list), lowBound=0, upBound=1, cat='Binary')

Then, we add our above boundary conditions to the problem.

# constraint 1: only five words
prob += plp.lpSum([variables[word] for word in word_list]) == 5

# constraint 2: each char is allowed to occur only once
for char in 'abcdefghijklmnopqrstuvwxyz':
    prob += plp.lpSum([variables[word] for word in word_list if char in word]) <= 1

We run the solver with the following command:

# solve problem
prob.solve()

The console will output whether a solution was found for our conditions. On my system, a solution was found in about 0.1 seconds. It can be read out as follows:

# retrieve results       
results = [word for word in word_list if variables[word].value() == 1.0]

And we get as solution:

['frack', 'jowly', 'pbxes', 'vingt', 'zhmud']

To be honest, these are rather unfamiliar words, but they are included in the given list and represent a valid solution. To obtain further solutions, we can exclude the found combination with an additional constraint and run the solver again.

# exclude solution
prob += plp.lpSum([variables[r] for r in results]) <= 4

Although we can find all possible solutions this way, I do not want to go further down this path. Instead, I would like to show how PuLP can be used to find an optimal solution according to a given objective function.

Finding optimal solution

To find an optimal solution, we want to assign frequencies to the given words. Then we can use PuLP to optimize the solution to find more commonly used words. For this task, we consult the Google Books Ngram data set from Kaggle which contains over 9.2 million words with their associated occurrence frequency from millions of book scans. Since this data set also contains spelling and reading errors, we will only use it to determine the occurrence frequencies of the words in our list. Processing of the data set is analogous to the processing of the original word list:

# load Google Books Ngram data set from Kaggle
file_freq = 'ngram_freq.csv'
# file_freq = 'unigram_freq.csv'
df_freq = pd.read_csv(file_freq)

# get length of words
df_freq['length'] = df_freq['word'].apply(lambda word: len(str(word)))
dff_freq = df_freq[df_freq['length'] == 5].copy()

# get number of unique characters of each word
dff_freq['count unique'] = dff_freq['word'].apply(lambda word: len(set(word)))

# keep only words with 5 unique characters
dfff_freq = dff_freq[dff_freq['count unique'] == 5].reset_index(drop=True)

# transfer data into dictionary for later look up
freq_dict = {dfff_freq.loc[i, 'word']: dfff_freq.loc[i, 'count'] for i in dfff_freq.index}

Now we can merge the words from our word list with the corresponding occurrence frequencies. If a word is not found in the Ngram data set, we include it with a count value of 1.

word_dict = {}
not_found_count = 0

for word in word_list:
    try:
        # try to get counts from the Ngram data set
        count = freq_dict[word]
        word_dict[word] = count
    except KeyError:
        # if the word is not in the Ngram data set, add it with a count of 1
        word_dict[word] = 1
        not_found_count += 1

word_list = list(word_dict.keys())

# length of our word list and number of keys not in the NGram data set
len(word_dict), not_found_count

# (10172, 119)
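
As a side note, the try/except lookup above is equivalent to a dict.get with a default value (a compact alternative, shown here on stand-in data):

```python
# stand-in values for the Ngram counts and the word list
freq_dict = {"frack": 120, "jowly": 45}
word_list = ["frack", "jowly", "pbxes"]

# dict.get returns the default 1 when a word is missing
word_dict = {w: freq_dict.get(w, 1) for w in word_list}
not_found_count = sum(1 for w in word_list if w not in freq_dict)

print(word_dict)        # {'frack': 120, 'jowly': 45, 'pbxes': 1}
print(not_found_count)  # 1
```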

Let us adapt the PuLP code to this new task. First, we define the maximization goal when initializing the problem. Then, we add an objective function in which each variable is weighted by the occurrence frequency of its word. Since the word frequencies span several orders of magnitude, we apply the logarithm to the weights so that a single extremely frequent word does not dominate the solution.
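
To see the effect of the logarithm, compare raw and logged weights for counts that span several orders of magnitude (illustrative numbers, not taken from the data set):

```python
import math

counts = {"dwarf": 1_000_000, "pbxes": 10, "zhmud": 1}
logged = {w: math.log(c) for w, c in counts.items()}

# on the raw scale, 'dwarf' outweighs 'pbxes' by five orders of magnitude
print(counts["dwarf"] / counts["pbxes"])            # 100000.0
# after the log transform, the ratio shrinks to a factor of six
print(round(logged["dwarf"] / logged["pbxes"], 2))  # 6.0
# words missing from the Ngram data get count 1, i.e. weight log(1) = 0
print(logged["zhmud"])                              # 0.0
```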

import numpy as np

# set variables
variables = plp.LpVariable.dict('words', (word_list), lowBound=0, upBound=1, cat='Binary')

# initialize problem
prob = plp.LpProblem('Find_five_words', plp.LpMaximize)

# objective function
prob += plp.lpSum([variables[word] * np.log(word_dict[word]) for word in word_list])

# constraint 1: only five words
prob += plp.lpSum([variables[word] for word in word_list]) == 5

# constraint 2: each char is allowed to occur only once
for char in 'abcdefghijklmnopqrstuvwxyz':
    prob += plp.lpSum([variables[word] for word in word_list if char in word]) <= 1

prob.solve()

results = [word for word in word_list if variables[word].value() == 1.0]

This leads to the following result on my computer after 0.4 seconds:

['dwarf', 'glyph', 'jocks', 'muntz', 'vibex']

This is the same solution Matt Parker presents at 10:19 in his video, after weighting his results by their occurrences in Google Web Search.

Wrap-up

We have seen how formulating a question as an ILP problem can provide an alternative approach to the problem of finding five five-letter words with 25 different characters. In addition, we learned how to optimize the search with an objective function by assigning frequencies to words. The interesting thing about ILP is that the code can be quickly adapted to completely different tasks. In fact, many questions can be formulated as ILP problems, e.g., creating conference schedules or timetables, supplying stores, or the knapsack problem that is often asked in coding interviews. So it’s worth familiarizing yourself with this technique.

The complete code can be found on GitHub. If you are interested in more of my projects, have a look at my webpage.
