How to Find Anagrams on Urban Dictionary
Recently, I stumbled upon this blog post about a quest to find the most interesting anagrams in Webster’s Dictionary. The next step was obvious: try to replicate the results with one of Web 2.0’s greatest success stories of user-generated content, Urban Dictionary.
For those unfamiliar, Urban Dictionary covers around two and a half million user-generated words, phrases or really anything that people have defined. Given the staggering amount of content, how could there not be some great anagrams there waiting to be discovered?
Finding the Anagrams
After some basic scripting to load all the words into JSON, the next step was simply to iterate over them and find the ones that were anagrams. Instead of trying to calculate every permutation of every word, I used the great takeaway from that blog post for matching words:
Reduce each word to a normal form so that two words have the same normal form if and only if they are anagrams of one another. In this case we do this by sorting the letters into alphabetical order, so that both megalodon and moonglade become adeglmnoo.
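The normal-form idea quoted above can be sketched in a few lines of Python. This is only an illustration, not the script used for the analysis: the function names `normal_form` and `group_anagrams` and the tiny sample list are made up for the example.

```python
from collections import defaultdict


def normal_form(word: str) -> str:
    """Sort the lowercased characters so that anagrams share one key."""
    return "".join(sorted(word.lower()))


def group_anagrams(words):
    """Return the groups of words that share a normal form (two or more members)."""
    buckets = defaultdict(list)
    for word in words:
        buckets[normal_form(word)].append(word)
    return [group for group in buckets.values() if len(group) > 1]


# Hypothetical sample; the real input would be the full word list.
sample = ["megalodon", "moonglade", "elbow", "below", "quiz"]
print(normal_form("megalodon"))  # adeglmnoo
print(group_anagrams(sample))    # [['megalodon', 'moonglade'], ['elbow', 'below']]
```

Grouping by a sorted key takes one pass over the word list, which is what makes this practical for millions of entries where generating permutations would not be.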
After running the program, I discovered 365,268 anagrams out of a total of around two and a half million words. That means around 14% of the words are anagrams, a number way higher than I expected.
Fun Quirks of Urban Dictionary
Urban Dictionary, unlike, say, Webster’s Dictionary, does not have any form of curation or standards control. Many entries seem pointless, and in many cases it’s obvious they were only written to entertain the writer. This makes it a trickier data set to deal with in many ways.
There were over 80 anagrams that matched with abcdefghijklmnopqrstuvwxyz. That is, over 80 different versions of the letters of the alphabet exist on Urban Dictionary. Most, like kjhgfdsazxcvbnmlpoiuytrewq, had definitions that referenced being extremely bored. Then there were the more esoteric, like cwm fjord-bank glyphs vext quiz, which apparently could be used in an actual conversation. I’m very skeptical.
One important note: during this analysis I chose to ignore symbols in the anagrams. Just as with palindromes, I figured ignoring punctuation and symbols allowed for much more interesting anagrams to emerge. Otherwise, we could never have palindromes like “a man, a plan, a canal, Panama”. However, I was unprepared for the sheer number of symbols in use on Urban Dictionary. There were twenty-five anagrams that matched with bd. That didn’t seem possible to me until I started examining some of them closely. A few examples:
I can only hope for a day when emoticon anagrams become a real field of study.
Every problem that occurs when finding anagrams in Webster’s Dictionary feels amplified to a comical degree on Urban Dictionary. If you naively assume that the longest anagrams must be the most interesting ones, you would be disappointed, but not too upset, to find that in Webster’s Dictionary it is “cholecystoduodenostomy” and “duodenocholecystostomy” that hold the title. Those words aren’t even really that long. In the Urban Dictionary data set, however, things are much different. With over two hundred characters each, these two are the longest pair of anagrams:
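For the curious, ignoring symbols can be folded into the normal form. The sketch below is a guess at one way to do it, not the code behind the numbers above: `letters_only` is a hypothetical name, the emoticon-style strings are invented for illustration, and it assumes that only the letters a to z count (digits and everything else are dropped).

```python
import string


def letters_only(entry: str) -> str:
    """Normal form that keeps only the letters a-z, lowercased and sorted."""
    lowered = entry.lower()
    return "".join(sorted(ch for ch in lowered if ch in string.ascii_lowercase))


# Invented entries: once the symbols are dropped, both reduce to "bd".
print(letters_only(":b :d"))  # bd
print(letters_only("D-b"))    # bd
```

This is how a pile of otherwise unrelated emoticons can end up in the same anagram group.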
Surprising to no one, the definitions for both of these words can be summarized as the state of being extremely bored.
Ranking the Anagrams
With over 300,000 anagrams, I needed a way to sort the most interesting anagrams from the rest. Again, this blog post did the heavy lifting and came up with a clever and intuitive idea for scoring anagrams. The approach is simple: the most interesting anagrams are the ones where each word is as far apart from the other as possible.
In other words, into how many chunks must the first word be cut so that the chunks can be rearranged into the second word? For example, the words elbow and below need only 3 chunks: el-b-ow -> b-el-ow, for a final score of 3.
Calculating the correct score for each pair of words requires a rather complex algorithm based on converting the words to a graph structure. Dealing with the constraints of a larger data set, I wrote a close approximation that found a score that was usually good enough, which can be found here. One important implication of this approach is that the maximum score for two words is the length of each word, since there can never be more chunks than there are letters. Thus, high-scoring words must meet a certain length requirement to even be considered. An unintended side effect, though, is that a high score can easily be achieved if words are written in reverse, which is surprisingly common on Urban Dictionary. A great example is abcdefghijklmnopqrstuvwxyz and zyxwvutsrqponmlkjihgfedcba, which combine for a chart-shattering score of 26. I removed them, and many others like them, to focus on real words and phrases.
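To make the chunk score concrete, here is a minimal greedy sketch. It is not the approximation linked above and not the graph-based exact algorithm: it repeatedly peels off the longest run of letters the two words still share, so the result is a valid chunking but may be higher than the true minimum. The names `chunk_score` and `is_reversal` are made up for this example, and both functions assume the inputs are already in a comparable form (for instance lowercased with symbols removed).

```python
import string


def chunk_score(a: str, b: str) -> int:
    """Greedy estimate of how many chunks of `a` rearrange into `b`."""
    if sorted(a) != sorted(b):
        raise ValueError("inputs must be anagrams of each other")
    used_a = [False] * len(a)
    used_b = [False] * len(b)
    remaining = len(a)
    chunks = 0
    while remaining:
        best_len, best_i, best_j = 0, 0, 0
        # Find the longest run of unused letters common to both words.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and not used_a[i + k] and not used_b[j + k]
                       and a[i + k] == b[j + k]):
                    k += 1
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        # Mark that run as one chunk in both words.
        for k in range(best_len):
            used_a[best_i + k] = True
            used_b[best_j + k] = True
        chunks += 1
        remaining -= best_len
    return chunks


def is_reversal(a: str, b: str) -> bool:
    """True when one word is simply the other written backwards."""
    return a == b[::-1]


alphabet = string.ascii_lowercase
print(chunk_score("elbow", "below"))          # 3
print(chunk_score(alphabet, alphabet[::-1]))  # 26
print(is_reversal(alphabet, alphabet[::-1]))  # True
```

The triple loop is slow for the two-hundred-character monsters, but it is enough to show why reversed words max out the score: no two adjacent letters stay adjacent, so every letter becomes its own chunk.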
The Top 15
After sorting all the anagrams by score, I landed on the following list of greatest anagrams, all of which are linked to their definition on Urban Dictionary:
- classy ignorance — ScaryCongalines (15)
- Snickers slapper — sparkle princess (14)
- shapesturbating — straight up beans (14)
- euphoria glasses — sausage polisher (14)
- choad negligent — Cleaning the Dog (14)
- Exclamationship — mexican hospital (14)
- circle the drain — Technical Rider (14)
- Alphonse Elric — Polish Cleaner (13)
- Death Something — Mendota Heights (13)
- american english — Inhale Screaming (13)
- Ballistic Therapy — Reality Bitch Slap (13)
- To Pull a Chrissie — Tropical Slushie (13)
- asstrampoline — spermsational (13)
- Ghandi’s Toenails — shit and gasoline (13)
- Masonic Temple — special moment (13)
Unlike with Webster’s Dictionary, the top scorers aren’t so obviously the clear winners. They’re funny, sure, but it’s clear that a great anagram needs more than just complexity. Great anagrams need to be recognizable and contain a certain amount of irony between the two words, so that it’s funny that the two could be so closely related. Still, none of the anagrams above are bad by any stretch.
Some of my favorites that didn’t score as well: