Fuzzy Wuzzy Searching

Morgan Murphy
4 min read · Apr 17, 2018

This post reviews how I modified the search functionality of the wine recommender I worked on for my capstone project.

Yes, it’s called “Fuzzy Wuzzy” searching, so let’s just get this out right here and then move on:

Fuzzy Wuzzy is a Python library created by SeatGeek, which sells tickets for various entertainment venues, to help with string matching. What is string matching? Let’s say you ask someone to type their job title into a text box. People may type “Data Scientist,” “data scientist,” or “Data Science” into that text box. Those three entries mean the same thing, but if you ask the computer to search for the job title “Data Scientist,” an exact-match search will only bring back the first one, since it treats the lowercase letters and the variation of “science” from “scientist” as something different. There may be ways to make the search algorithm more complicated, but then we would not get to use something called “fuzzy wuzzy” and where’s the fun in that? (Answer: Nowhere. The fun is nowhere.)
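To make the job-title example concrete, here is a minimal sketch that compares an exact-match search with an approximate one. It uses `difflib` from Python's standard library as a stand-in, not the Fuzzy Wuzzy library itself, and the `similarity` helper is a name made up for this illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score, ignoring case and surrounding spaces.

    Hypothetical helper for illustration; not part of the Fuzzy Wuzzy library.
    """
    a, b = a.strip().lower(), b.strip().lower()
    return round(100 * SequenceMatcher(None, a, b).ratio())


entries = ["Data Scientist", "data scientist", "Data Science"]
target = "Data Scientist"

# Exact matching only finds the entry typed exactly like the target.
exact_matches = [entry for entry in entries if entry == target]

# Approximate matching gives every entry a score instead of a yes/no answer.
scores = {entry: similarity(target, entry) for entry in entries}

print(exact_matches)
print(scores)
```

The exact search keeps one entry out of three, while the scored version rates the lowercase entry as identical and the “Data Science” entry as close, so a threshold can decide what counts as a match.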

For our wine recommender, we need to ask users to type in the title of a wine so that the recommender can go find that title in the database, find the closest cosine similarity scores, and bring back the wines associated with those scores. Wine titles can be pretty complex. For example, here is one of the recommendations from the last post:

If asked to enter one of these wine titles, we may not remember where the year goes, or what region is in parentheses. We might get close, though. That’s where fuzzy wuzzy searching helps.

Fuzzy Wuzzy searching helps find approximate matches to strings (sequences of characters), rather than exact matches. Fuzzy Wuzzy takes in a few parameters, shown in this code:

First, it needs a key — that’s the approximate term you are searching for.

Second, it needs the choices it is going to search through — that is the list of possible options.

Easy so far. Now it gets complicated. Next, it needs the type of search. There are a few different types of searches (nicely summarized here). The one I chose is “token_set_ratio”, which first tokenizes the strings, meaning it splits them into words (remember what that means from my earlier post?), finds the words the two strings have in common, and then figures out how to match the remainder, or the words left over. Those reading this post (okay, the one person reading this post) will remember what the “remainder” from long division means: it is the part that is left over. So, after the fuzzy wuzzy search finds the words that match, it then figures out what to do with the rest of the words it did not match to begin with.
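The idea of matching the shared words first and the remainder second can be sketched in a few lines. This is a simplified re-implementation built on the standard library's `difflib`, written under the assumption that it mirrors the general approach; it is not the library's own code, and its scores may differ from the real “token_set_ratio”. The function name `token_set_score` and the wine titles are made up for illustration.

```python
import re
from difflib import SequenceMatcher


def _tokens(text):
    """Lowercase the text, treat punctuation as spaces, and return its unique words."""
    return set(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def _ratio(a, b):
    """Character-level similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_score(query, choice):
    """Hypothetical stand-in for a token-set scorer (not the library's code)."""
    q, c = _tokens(query), _tokens(choice)
    if not q or not c:
        return 0
    # Words both strings share, sorted so that word order no longer matters.
    shared = " ".join(sorted(q & c))
    # Shared words followed by each side's remainder (the words left over).
    query_side = " ".join([shared, *sorted(q - c)]).strip()
    choice_side = " ".join([shared, *sorted(c - q)]).strip()
    # Keep the best of the three comparisons.
    return max(
        _ratio(shared, query_side),
        _ratio(shared, choice_side),
        _ratio(query_side, choice_side),
    )


# Made-up titles for illustration only.
full_title = "Geyser Peak 2014 Chardonnay (Sonoma County)"
other_title = "Some Other 2014 Merlot (Napa Valley)"

print(token_set_score("geyser peak chardonnay", full_title))  # 100
print(token_set_score("geyser peak chardonnay", other_title))
```

Every word of the query appears in the first title, so the shared words alone reproduce the query and the score is 100, even though the title also contains a year and a region. That is why this type of search is forgiving when the user leaves parts of a long wine title out.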

Here is an example. I want to search for “geyser peak chardonnay” but I don’t remember what year or what region and I don’t use capital letters.

I type it in and these are returned:

For this first input of the recommender, I had Fuzzy Wuzzy bring back the five closest matches (using “process.extract” and the limit parameter). Next, I ask the user to enter the wine they want to search for and do a second Fuzzy Wuzzy search (using the “process.extractOne” method) to bring back the top match for the recommender to use.
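The two-step flow described above, a shortlist first and then a single best match, might look roughly like the sketch below. It again uses only the standard library: `top_matches` and `best_match` are hypothetical stand-ins for the roles that “process.extract” (with a limit) and “process.extractOne” play in the recommender, and they are not the library's functions. The scorer is repeated in compact form so the snippet runs on its own, and the wine titles are invented.

```python
import re
from difflib import SequenceMatcher


def token_set_score(query, choice):
    """Compact stand-in for a token-set scorer (0-100), built on difflib."""
    q = set(re.sub(r"[^a-z0-9]+", " ", query.lower()).split())
    c = set(re.sub(r"[^a-z0-9]+", " ", choice.lower()).split())
    if not q or not c:
        return 0
    shared = " ".join(sorted(q & c))
    q_side = " ".join([shared, *sorted(q - c)]).strip()
    c_side = " ".join([shared, *sorted(c - q)]).strip()
    pairs = [(shared, q_side), (shared, c_side), (q_side, c_side)]
    return max(round(100 * SequenceMatcher(None, a, b).ratio()) for a, b in pairs)


def top_matches(query, choices, limit=5):
    """Return up to `limit` (title, score) pairs, best score first."""
    scored = [(title, token_set_score(query, title)) for title in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


def best_match(query, choices):
    """Return the single best (title, score) pair, or None if there are no choices."""
    ranked = top_matches(query, choices, limit=1)
    return ranked[0] if ranked else None


# Made-up wine titles for illustration only.
wine_titles = [
    "Geyser Peak 2014 Chardonnay (Sonoma County)",
    "Geyser Peak 2013 Chardonnay (Alexander Valley)",
    "Geyser Peak 2012 Sauvignon Blanc (California)",
    "Hypothetical Hills 2014 Chardonnay (California)",
    "Example Cellars 2014 Pinot Noir (Russian River Valley)",
    "Placeholder Estate 2013 Riesling (Columbia Valley)",
]

# Step 1: show the user a shortlist for a rough query.
shortlist = top_matches("geyser peak chardonnay", wine_titles, limit=5)
for title, score in shortlist:
    print(score, title)

# Step 2: the user types a closer (still inexact) title; keep only the top match.
chosen = best_match("geyser peak 2014 chardonnay", wine_titles)
print(chosen)
```

In this sketch the first query ties on both Chardonnay titles from the same winery, which is the reason for showing a shortlist at all: the user sees the candidates, adds the year, and the second search settles on a single title to hand to the recommender.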

I put in “geyser peak 2014 chardonnay”. Notice I am still entering an approximation of the wine title, not an exact match. And it brings back the top 10 matches we saw in the last post.

So, that’s how Fuzzy Wuzzy searching helped us make our wine recommender more user friendly. Thank you, SeatGeek!

Here is the code of the recommender so far. I’m still new at this (a “newb” they say, whoever “they” are), and I hope to keep improving it.

In my next post, I’ll share the metric I used to evaluate how accurate the recommendations from our recommender are. Since we are using a content-based recommender, and do not have actual user ratings, the accuracy can only be judged by the tasting notes, or the content, we have to work with.

See you next time.
