How I Used KNIME to Find the Optimal Wordle Starting Word

Visual programming meets gaming

John Emery
Low Code for Data Science

As first published on LinkedIn Pulse

Every now and then some craze sweeps through the cultural landscape before fading off into memory. In 2015 we had The Dress (it’s black and blue, people); planking was popular in 2011 for some reason; and now, in early 2022, we have Wordle.

A (very bad) game of Wordle.

The game (which is a blatant copy of the game show Lingo, which debuted in 1987) challenges players to figure out a secret 5-letter word. After each guess, any letters that are in the correct location are shaded green while letters that are in the word but in the incorrect location are shaded yellow. Letters not in the secret word are gray. We are given 6 attempts to figure out the secret word.

After a few plays, many will begin to wonder “is there an optimal starting word? If so, what is it?”

And since I work with data sets all day, I figured I’d try to answer that question and ruin Wordle for everybody.

To do this, I turned to my new favorite tool: KNIME. KNIME is a free and open-source data science, data prep, and data visualization tool. Anybody can download it and get started working with their data immediately. I encourage anybody stuck in an “Excel Hell” to give it a look; it could very well save you hundreds of hours of labor per year.

Getting the Words

I don’t know what dictionary Wordle’s developer used. Maybe there is a way to figure that out from his source code, but I was lazy and am not good at that stuff. So I downloaded a tool called WordNet. Published by Princeton University, WordNet is described as “a lexical database for English.” There is a lot of cool information in the WordNet database, but all I really cared about was its list of words. I consider WordNet a good source for this as it contains lists of nouns, adverbs, adjectives, and verbs, but ignores prepositions, determiners, and other word forms that probably don’t appear in the Wordle dictionary.

With WordNet downloaded, I was able to track down the 4 index files that contain all those sweet, juicy words for the analysis to come. I ignored the index.sense file for this analysis.

These are just text files!

As these are simple text files, reading and combining them into one dictionary data set in KNIME is quite easy.
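I did this step with KNIME nodes, but for readers who think in code, here is a minimal Python sketch of the same idea. It assumes that each data line in an index file starts with the word itself and that the license header lines start with spaces. The folder argument `dict_dir` and the helper names are hypothetical, and the sample lines are made up for illustration.

```python
from pathlib import Path

# The four WordNet index files used here (index.sense is ignored).
INDEX_FILES = ["index.noun", "index.verb", "index.adj", "index.adv"]


def lemmas_from_lines(lines):
    """Yield the first whitespace-separated field of each data line."""
    for line in lines:
        # Assumption: blank lines and header lines (which begin with spaces)
        # carry no words, so skip them.
        if not line.strip() or line.startswith(" "):
            continue
        yield line.split()[0].lower()


def load_dictionary(dict_dir):
    """Combine all index files under dict_dir into one set of words."""
    words = set()  # a set drops words listed under several parts of speech
    for name in INDEX_FILES:
        with open(Path(dict_dir) / name, encoding="utf-8") as handle:
            words.update(lemmas_from_lines(handle))
    return words


# Made-up lines in the assumed layout, not real WordNet contents.
sample_lines = [
    "  1 license text goes here",
    "saint n 3 ...",
    "south_pole n 2 ...",
]
sample_words = sorted(lemmas_from_lines(sample_lines))
print(sample_words)
```

Using a set is a design choice on my part: a word that is both a noun and a verb should only count once in the dictionary.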

Most words are clearly not suitable for Wordle.

The full dictionary, containing a bit over 155,000 words, is nice but much more than we need. Remember, Wordle is a game about finding a 5-letter word, so we can filter the dictionary to only 5-letter words that don’t contain numbers, hyphens, or anything but our 26 letters.

That filtering complete, we have a working dictionary of 6,684 5-letter words.
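In code, this filter boils down to one regular expression. The sketch below is a Python equivalent of the idea, not the KNIME workflow itself; the function name and the sample words are made up for illustration.

```python
import re

# Exactly five characters, all of them from the 26 letters a-z.
FIVE_LETTERS = re.compile(r"[a-z]{5}")


def wordle_candidates(words):
    """Keep only 5-letter words made purely of letters, lowercased and sorted."""
    lowered = {w.lower() for w in words}
    return sorted(w for w in lowered if FIVE_LETTERS.fullmatch(w))


sample = ["saint", "south_pole", "a-one", "1920s", "Sound", "tree"]
kept = wordle_candidates(sample)
print(kept)
```

Note that "a-one" and "1920s" are five characters long but are still rejected, because `fullmatch` requires every character to be a letter.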

Determining the Optimal Starting Word

“Optimal” has become a buzzword in the data and business world over the last few years. Everything has to be optimized; we must have the fastest, cheapest, most efficient.

But without defining what actually constitutes “optimal,” it’s just an empty word.

For this analysis, I have defined “optimal” based on a few criteria:

  • The word must have the average number of vowels found in all 5-letter words.
  • The word’s two vowels must be in the most common two locations for vowels in all 5-letter words.
  • The word’s starting letter must be the most common starting letter in all 5-letter words.
  • No two letters may be the same.
  • Of the remaining words, I will select the word whose letters are the most common on average.

Note that I used [AEIOUY] as my set of vowels. Of course, Y isn’t always a vowel, but figuring out how to conditionally treat Y as a vowel or a consonant seemed like too much work. After some back-testing, I found that including or excluding Y doesn’t really make much of a difference.

The Criteria

As it turns out, the average number of vowels in the set of 5-letter words is almost exactly 2. Further, vowels in the first position are relatively rare, occurring only about 14% of the time, while nearly 60% of the time the second letter is a vowel.

Few words begin with a vowel, but most words have a vowel in the second position.
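Both statistics are simple counts. Here is a rough Python sketch of how they could be computed, assuming a list of lowercase 5-letter words and the [AEIOUY] vowel set. The five sample words are only there to make the example runnable; the real numbers come from the full dictionary.

```python
VOWELS = set("aeiouy")  # Y is always counted as a vowel, as in the article


def average_vowel_count(words):
    """Mean number of vowels per word."""
    return sum(sum(ch in VOWELS for ch in w) for w in words) / len(words)


def vowel_share_by_position(words):
    """Fraction of words with a vowel in each of the 5 positions."""
    return [sum(w[i] in VOWELS for w in words) / len(words) for i in range(5)]


sample = ["saint", "sound", "south", "crane", "about"]
avg = average_vowel_count(sample)
shares = vowel_share_by_position(sample)
print(avg, shares)
```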

With these two points established, we can begin filtering down our list of candidates:

  • The word must contain exactly 2 vowels
  • Those vowels must be in the 2nd and 3rd positions

These two filters narrow the list down to 209 words.
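Those two rules can be checked together by comparing each word's vowel pattern against the one we want. This is a small Python sketch with made-up sample words, not the actual list of 209.

```python
VOWELS = set("aeiouy")

# Vowels in the 2nd and 3rd positions only, consonants everywhere else.
TARGET_PATTERN = [False, True, True, False, False]


def has_vowel_pattern(word):
    """True if the word has exactly 2 vowels, sitting in positions 2 and 3."""
    return [ch in VOWELS for ch in word] == TARGET_PATTERN


sample = ["saint", "sound", "crane", "about", "sooth", "query"]
matches = [w for w in sample if has_vowel_pattern(w)]
print(matches)
```

Comparing against the full pattern enforces both rules at once. "query" fails because its trailing Y counts as a third vowel under the [AEIOUY] rule.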

Next, we want to determine which starting letter is most common among all 5-letter words. A quick analysis shows that S is, by far, the most common starting letter for our list of all 5-letter words. S is the first letter of about 14% of all words, with C landing in a distant second place at about 8%.

S is the most common first letter in our list of 5-letter words.
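Counting first letters is a one-liner with a counter. The Python sketch below shows the idea on a toy list; the 14% and 8% figures above come from the full dictionary, not from this sample.

```python
from collections import Counter


def first_letter_shares(words):
    """Return (letter, share of words) pairs, most common first."""
    counts = Counter(w[0] for w in words)
    total = len(words)
    return [(letter, n / total) for letter, n in counts.most_common()]


sample = ["saint", "sound", "south", "crane", "about"]
shares = first_letter_shares(sample)
top_letter, top_share = shares[0]
print(top_letter, top_share)
```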

We can now further reduce our list:

  • The word must begin with the letter S

Our previous list of 209 entries is now reduced to 7:

My fourth criterion, that there be no repeated letters, may be somewhat controversial. Although words in Wordle can certainly contain repeated letters, I felt that by forcing the word to have 5 distinct letters we would be more likely to hit on a matching letter. Relatively few words have repeats, so I believe this is a reasonable criterion.

Removing words with repeated letters yields this list:

SAINT, SOUGH, SOUND, SOUTH
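The repeated-letter check is easy to express in code: a word has no repeats if its set of letters is as large as the word itself. Here is a short Python sketch; the input list is illustrative and is not the list of 7 from the previous step.

```python
def has_distinct_letters(word):
    """True if no letter appears more than once in the word."""
    return len(set(word)) == len(word)


sample = ["saint", "sooth", "sough", "seeds"]
distinct = [w for w in sample if has_distinct_letters(w)]
print(distinct)
```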

Finally, I calculated the average rank of the letters for each of these words. By this, I mean how common the first letter is among all first letters, how common the second letter is among all second letters, and so on.

For example, in the first position S is the most common letter, whereas in the second position A is the most common letter.

With these rankings in hand, calculating each word’s average rank is trivial. For example, the word SAINT breaks down like so:

  • S — 1
  • A — 1
  • I — 2
  • N — 4
  • T — 3
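Averaging those five ranks gives SAINT a score of (1 + 1 + 2 + 4 + 3) / 5 = 2.2. As a rough Python sketch of the whole step, the code below ranks letters by how often they occur in each position and then averages a word's ranks. How ties are broken is my assumption (here, whichever letter is seen first wins), and the four-word list exists only to make the example runnable.

```python
from collections import Counter


def positional_ranks(words):
    """For each of the 5 positions, map letter -> rank (1 = most common)."""
    ranks = []
    for i in range(5):
        counts = Counter(w[i] for w in words)
        ordered = counts.most_common()  # ties keep first-seen order
        ranks.append({letter: r for r, (letter, _) in enumerate(ordered, start=1)})
    return ranks


def average_rank(word, ranks):
    """Mean of the word's per-position letter ranks (lower is better)."""
    return sum(ranks[i][ch] for i, ch in enumerate(word)) / len(word)


# The per-letter ranks quoted above for SAINT.
saint_ranks = [1, 1, 2, 4, 3]
saint_average = sum(saint_ranks) / len(saint_ranks)
print(saint_average)

# Toy ranking built from four words, just to exercise the functions.
toy_ranks = positional_ranks(["saint", "sound", "south", "crane"])
print(average_rank("sound", toy_ranks))
```

A lower average means the word's letters are, position by position, closer to the most common choice.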

The rankings of our final candidates are as follows:

And there you have it. According to my analysis, the word SAINT is the best word to begin your Wordle journey with.

SAINT

Final Words

Using KNIME for this project was very easy. As all the input data sources were standard text files, bringing them into the KNIME platform was simple. The data processing steps I had to perform, including splitting words into letters, arithmetic, grouping, pivoting, and ranking, are all in KNIME’s wheelhouse. Further, all of the visualizations seen here were made in KNIME.

I’d love to hear your thoughts on this. What changes, improvements, or other criteria might you have considered? Finally, to bolster my argument that this analysis is flawless, I used SAINT today in Wordle and got the correct word on my second guess. SAINT got me three matching letters with one in the correct location.
