Fuzzy/percentage matching of strings in Python
I recently finished a project dealing with movies, and one of its main complications was connecting the different data sources together using movie titles. You would think that the in operator or pandas .str.contains() would solve the problem. Hard-coding the mismatched titles with .replace() would have been a quick fix if I were dealing with only 20 or so strings, but the dataset contained over 10,000 movies, so that wasn’t viable. In some cases the titles differed between sources only in whether a colon or semicolon was used, or in whether the number 7 was written as a digit, spelled out, or given in Roman numerals (e.g. "Star Wars: Episode VII - The Force Awakens" vs. "Star Wars : The Force Awakens"). I did what every decent programmer does: I googled to see whether anyone had already solved my problem. I ran into a couple of user-made libraries, but I found a function in the Python standard library that solved my problem: get_close_matches() in the difflib module.
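To make the problem concrete, here is a minimal sketch of why exact and substring comparisons fail on the two example titles. The variable names are only illustrative and are not taken from the project.

```python
# The two spellings of the same movie from the example above.
full_title = "Star Wars: Episode VII - The Force Awakens"
short_title = "Star Wars : The Force Awakens"

# Exact comparison fails because the strings differ.
print(short_title == full_title)  # False

# Substring checks fail in both directions: neither title
# appears verbatim inside the other.
print(short_title in full_title)  # False
print(full_title in short_title)  # False
```

A substring test such as pandas .str.contains() has the same limitation, since it also looks for the pattern verbatim.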
difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)
word: The string you want to find close matches for
possibilities: List of strings to match against word
n: Optional, default 3, must be greater than 0; the maximum number of close matches to return
cutoff: Optional, default 0.6, a float in the range [0, 1]; the minimum similarity score a candidate needs to be included. Possibilities that score below it are ignored.
get_close_matches() returns a list of the best matches that meet the cutoff, sorted by similarity score with the most similar first. The list is empty if nothing qualifies.
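As a rough illustration of the parameters described above, the sketch below runs get_close_matches() on the Star Wars example. The other titles in the list are made-up sample data, and best_match is a hypothetical helper written for this post, not part of difflib.

```python
from difflib import get_close_matches

# Hypothetical sample data, not the project's real dataset.
word = "Star Wars : The Force Awakens"
possibilities = [
    "Star Wars: Episode VII - The Force Awakens",
    "Star Trek Beyond",
    "The Martian",
    "Mad Max: Fury Road",
]

# Defaults: n=3, cutoff=0.6. Only the Star Wars title scores high enough.
default_matches = get_close_matches(word, possibilities)
print(default_matches)  # ['Star Wars: Episode VII - The Force Awakens']

# A stricter cutoff rejects the same title, so the list comes back empty.
strict_matches = get_close_matches(word, possibilities, n=1, cutoff=0.9)
print(strict_matches)  # []


def best_match(title, candidates, cutoff=0.6):
    """Hypothetical helper: return the single closest candidate, or None."""
    matches = get_close_matches(title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A misspelled title still finds its counterpart.
print(best_match("The Martain", possibilities))  # The Martian
```

Returning None when nothing clears the cutoff makes it easier to spot titles that still need manual attention. The right cutoff depends on the data, so it is worth trying a few values on a sample before matching the full set.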
To test the code with different parameters, use the browser IDE linked below.