Python Programming | difflib
This module in the Python standard library provides classes and functions for comparing sequences such as strings and lists. In this article we will look at the basics of SequenceMatcher, get_close_matches and Differ.
SequenceMatcher
SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980s by Ratcliff and Obershelp under the hyperbolic name “gestalt pattern matching.” The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; these “junk” elements are ones that are uninteresting in some sense, such as blank lines or whitespace. (Handling junk is an extension to the Ratcliff and Obershelp algorithm.) The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that look right to people.
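To make the idea of the longest junk-free match more concrete, here is a small sketch built on the strings used in the official difflib documentation. The variable names are only illustrative. It shows how treating blanks as junk changes the longest match, and how get_matching_blocks lists every matching piece found by the recursive search.

```python
from difflib import SequenceMatcher

# No junk function: the longest match is allowed to include the blank.
plain = SequenceMatcher(None, " abcd", "abcd abcd")
longest_plain = plain.find_longest_match(0, 5, 0, 9)
print(longest_plain)  # Match(a=0, b=4, size=5)

# Blanks are junk: a match can no longer start on the blank, so the
# matcher settles on the leftmost 'abcd' in the second string.
no_blanks = SequenceMatcher(lambda ch: ch == " ", " abcd", "abcd abcd")
longest_no_blanks = no_blanks.find_longest_match(0, 5, 0, 9)
print(longest_no_blanks)  # Match(a=1, b=0, size=4)

# All matching blocks between two strings; the last entry is always
# a zero-length sentinel.
blocks = SequenceMatcher(None, "abxcd", "abcd").get_matching_blocks()
print(blocks)
```

Each Match is a named tuple (a, b, size), meaning a[a:a+size] equals b[b:b+size].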
So, let’s see how to use it.
from difflib import SequenceMatcher
str1 = 'abcd'
str2 = 'abcde'
seq = SequenceMatcher(a=str1, b=str2)
print(seq.ratio())
The SequenceMatcher class takes the two sequences to compare as the parameters a and b, and its ratio() method returns a similarity score between 0 and 1. The code above prints 0.8888888888888888, which means str2 is about 89% similar to str1.
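The score can also be reproduced by hand. The difflib documentation defines ratio() as 2.0*M/T, where M is the number of matching elements and T is the total number of elements in both sequences. The sketch below recomputes it from the matching blocks; the names matched, total and manual_ratio are only illustrative.

```python
from difflib import SequenceMatcher

str1 = "abcd"
str2 = "abcde"
seq = SequenceMatcher(a=str1, b=str2)

# M: total length of all matching blocks ('abcd' here, so 4).
matched = sum(block.size for block in seq.get_matching_blocks())
# T: combined length of both sequences (4 + 5 = 9).
total = len(str1) + len(str2)

manual_ratio = 2.0 * matched / total
print(matched, total)               # 4 9
print(manual_ratio == seq.ratio())  # True
```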
The get_close_matches function returns the entries in a list of candidates that are most similar to a given string, with the best match first.
from difflib import get_close_matches
word_list = ['acdefgh', 'abcd','adef','cdea']
str1 = 'abcd'
matches = get_close_matches(str1, word_list, n=2, cutoff=0.3)
print(matches)
Here n is the maximum number of matches we want in the output (the default is 3) and cutoff is the minimum ratio a word must reach to count as similar (the default is 0.6). This snippet outputs ['abcd', 'acdefgh']. If we increase the cutoff to 0.7 it outputs only ['abcd'], as that is the only word in the list with a similarity ratio of at least 0.7. This function comes in very handy for a quick ‘typo detection’ feature: for example, if we write ‘appl’ it can suggest ‘did you mean apple?’.
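As a rough sketch of the typo-detection idea, the helper below returns the single best suggestion or None when nothing is close enough. The function name suggest and the small vocabulary are hypothetical, chosen only for this illustration; only get_close_matches itself comes from difflib.

```python
from difflib import get_close_matches

def suggest(word, vocabulary, cutoff=0.6):
    """Return the closest vocabulary entry to word, or None if none qualifies."""
    # n=1 keeps only the best match; an empty list means no candidate
    # reached the cutoff.
    matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical vocabulary, used only for this example.
vocabulary = ["apple", "banana", "cherry"]

print(suggest("appl", vocabulary))  # apple
print(suggest("xyz", vocabulary))   # None
```

Raising cutoff makes the suggestions stricter; lowering it makes them more forgiving but noisier.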
The Differ class compares two sequences of lines of text and produces human-readable deltas between them.
from difflib import Differ
from pprint import pprint
txt1 = '''
hello world.
we like python.'''.splitlines()
txt2 = '''
hello world.
we like python coding'''.splitlines()
dif = Differ()
df = list(dif.compare(txt1, txt2))
pprint(df)
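This prints a list of strings, one per output line. Each string begins with a two-character code: '- ' for a line found only in txt1, '+ ' for a line found only in txt2, '  ' for a line common to both, and '? ' for a hint line that appears in neither input and marks where the characters differ.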
Here we can see that it compares txt2 with txt1 and gives us a human-readable structure showing what changed in txt2 relative to txt1.
As we can see, ‘hello world.’ is the same in both sequences, but the second sentence has changed: the trailing ‘.’ in txt1 has been replaced by ‘ coding’ in txt2.
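Because every line of a Differ delta starts with a two-character code, the result is easy to post-process. The sketch below, which uses the same two sentences without the leading blank line, sorts the delta into removed, added and unchanged lines. The variable names are only illustrative.

```python
from difflib import Differ

txt1 = ["hello world.", "we like python."]
txt2 = ["hello world.", "we like python coding"]

delta = list(Differ().compare(txt1, txt2))

# '- ' marks lines only in txt1, '+ ' lines only in txt2 and
# '  ' lines common to both; '? ' hint lines are skipped here.
removed = [line[2:] for line in delta if line.startswith("- ")]
added = [line[2:] for line in delta if line.startswith("+ ")]
unchanged = [line[2:] for line in delta if line.startswith("  ")]

print(removed)    # ['we like python.']
print(added)      # ['we like python coding']
print(unchanged)  # ['hello world.']
```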
There are a lot more useful and more complex functions in the difflib module, so do check out the official Python documentation for it.