Python Programming | difflib

By rnab, published in Boring Tech, Jan 18, 2019

The difflib module in the Python standard library provides classes and functions for comparing sequences such as strings and lists. In this article we will look at the basics of SequenceMatcher, get_close_matches and Differ.

SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980s by Ratcliff and Obershelp under the hyperbolic name “gestalt pattern matching.” The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; these “junk” elements are ones that are uninteresting in some sense, such as blank lines or whitespace. (Handling junk is an extension to the Ratcliff and Obershelp algorithm.) The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that look right to people.
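The longest-match idea and the effect of junk can be seen directly on small strings. The sketch below reuses the short examples from the official difflib documentation; the variable names are only illustrative.

```python
from difflib import SequenceMatcher

# Without a junk function, the space is an ordinary character, so the
# longest contiguous match is " abcd" (5 characters, starting at b[4]).
plain = SequenceMatcher(None, " abcd", "abcd abcd")
longest_plain = plain.find_longest_match(0, 5, 0, 9)
print(longest_plain)    # Match(a=0, b=4, size=5)

# With spaces declared as junk, the match may not start on a space,
# so the leftmost "abcd" in the second string is chosen instead.
ignore_spaces = SequenceMatcher(lambda ch: ch == " ", " abcd", "abcd abcd")
longest_no_junk = ignore_spaces.find_longest_match(0, 5, 0, 9)
print(longest_no_junk)  # Match(a=1, b=0, size=4)

# Applying the idea recursively to the left and right of the longest match
# yields all matching blocks; the last one is always a zero-size sentinel.
blocks = SequenceMatcher(None, "abxcd", "abcd").get_matching_blocks()
print(blocks)
# [Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]
```

Each Match is a named tuple (a, b, size), meaning a[a:a+size] == b[b:b+size].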

So, let’s see how to use it.

from difflib import SequenceMatcher

str1 = 'abcd'
str2 = 'abcde'
# Compare the two strings; ratio() returns a similarity score between 0 and 1
seq = SequenceMatcher(a=str1, b=str2)
print(seq.ratio())

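The SequenceMatcher class accepts the two sequences as the parameters a and b, and its ratio() method gives us a similarity score between 0 and 1. The score is 2*M/T, where M is the number of matching elements and T is the total number of elements in both sequences. Here M = 4 and T = 9, so the above code outputs 0.8888..., which means str1 and str2 are about 89% similar.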
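To see where the score comes from, ratio() can be reproduced by hand as 2*M/T, where M is the number of matching elements and T is the combined length of both sequences. This is a small sketch; the names matched, total and manual_ratio are illustrative only.

```python
from difflib import SequenceMatcher

str1 = 'abcd'
str2 = 'abcde'
seq = SequenceMatcher(a=str1, b=str2)

# M: total size of all matching blocks (the zero-size sentinel adds nothing)
matched = sum(block.size for block in seq.get_matching_blocks())
# T: combined length of both sequences
total = len(str1) + len(str2)

manual_ratio = 2 * matched / total
print(matched, total)          # 4 9
print(round(manual_ratio, 4))  # 0.8889
print(round(seq.ratio(), 4))   # 0.8889
```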

The get_close_matches function returns the words from a list that are most similar to a given string, ordered from most to least similar.

from difflib import get_close_matches

word_list = ['acdefgh', 'abcd', 'adef', 'cdea']
str1 = 'abcd'
# Return at most 2 words whose similarity ratio to str1 is at least 0.3
matches = get_close_matches(str1, word_list, n=2, cutoff=0.3)
print(matches)

Here n is the maximum number of similar words we want in the output, and cutoff is the minimum ratio a word needs in order to count as similar (they default to 3 and 0.6). So this piece outputs ['abcd', 'acdefgh']. If we increase the cutoff to 0.7, it outputs only ['abcd'], as that is the only word in the list with a similarity ratio of at least 0.7 (the ratio for 'acdefgh' is 6/11, about 0.55). This function comes in very handy for quick typo detection: for example, if we write 'appl' it can suggest "did you mean 'apple'?".
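A typo-suggestion helper along those lines could be sketched as follows. The KNOWN_WORDS vocabulary and the suggest function are hypothetical names made up for this illustration; only get_close_matches comes from difflib.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; a real program would load its own word list.
KNOWN_WORDS = ['apple', 'banana', 'cherry', 'grape']


def suggest(word, vocabulary=KNOWN_WORDS):
    """Return the closest known word, or None if nothing is similar enough."""
    matches = get_close_matches(word, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None


print(suggest('appl'))  # apple
print(suggest('xyz'))   # None
```

A higher cutoff makes the helper stricter, so it suggests less often but with fewer false hits.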

The Differ class compares two sequences of lines and produces a human-readable listing of the differences between them.

from difflib import Differ
from pprint import pprint

txt1 = '''
hello world.
we like python.'''.splitlines()
txt2 = '''
hello world.
we like python coding'''.splitlines()

# Compare the two lists of lines and collect the annotated diff lines
dif = Differ()
df = list(dif.compare(txt1, txt2))
pprint(df)

This gives us output like this:

[Figure: output of Differ.compare()]

Here we can see that it compares txt2 with txt1 and gives us a human-readable structure showing what changed in txt2 relative to txt1.

The line 'hello world.' is the same in both sequences, so it is listed unchanged. The second sentence differs: Differ marks the version from txt1 with '- ' and the version from txt2 with '+ ', and the '? ' guide lines point at the characters that changed, which here is where 'coding' replaced the trailing full stop.
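Every line Differ produces starts with a two-character code: '- ' for a line only in the first sequence, '+ ' for a line only in the second, two spaces for a line common to both, and '? ' for a guide line that is in neither input. A minimal sketch that uses these prefixes to sort the lines into groups (old_lines, new_lines and the three result lists are illustrative names):

```python
from difflib import Differ

old_lines = ['hello world.', 'we like python.']
new_lines = ['hello world.', 'we like python coding']

diff = list(Differ().compare(old_lines, new_lines))

# Print the annotated diff; guide lines may carry their own newline
for line in diff:
    print(line.rstrip('\n'))

# Group the lines by prefix, stripping the two-character code
unchanged = [line[2:] for line in diff if line.startswith('  ')]
removed = [line[2:] for line in diff if line.startswith('- ')]
added = [line[2:] for line in diff if line.startswith('+ ')]

print(unchanged)  # ['hello world.']
print(removed)    # ['we like python.']
print(added)      # ['we like python coding']
```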

There are a lot more cool and complex functions in the difflib module, so do check out the official Python documentation for it.
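As one taste of what else is there, unified_diff produces the compact patch-style format familiar from version control tools. A short sketch on the same two sentences; the file names old.txt and new.txt are placeholders, not real files.

```python
from difflib import unified_diff

old_lines = ['hello world.', 'we like python.']
new_lines = ['hello world.', 'we like python coding']

# lineterm='' because the input lines carry no trailing newlines
patch = list(unified_diff(old_lines, new_lines,
                          fromfile='old.txt', tofile='new.txt',
                          lineterm=''))
print('\n'.join(patch))
# --- old.txt
# +++ new.txt
# @@ -1,2 +1,2 @@
#  hello world.
# -we like python.
# +we like python coding
```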
