A Tutorial on difflib: A Powerful Python Standard Library Module for Comparing Text Sequences

Jan 27, 2024

Suppose you and your friends spend a weekend working together on a manuscript. When you review it at the end, you find that everyone's copy is slightly different: words have been added, sentences have changed, and some paragraphs have been rewritten entirely. What should you do? Would you compare the copies by hand, page by page, or let a smart tool do the heavy lifting?

In this case, the difflib Python library might be exactly what you need. It has powerful text comparison functions that can help you quickly find differences and make the integration process easy and enjoyable.

What is difflib?

difflib is part of the Python standard library and can be used without additional installation.

The module provides classes and functions for comparing sequences and computing how similar they are. It can compare files, strings, and other sequences, and it can report the differences in several formats (for example unified diffs, context diffs, and HTML), so the changes are easy to see at a glance.

Since difflib ships with the standard Python distribution, it is available in every mainstream Python 3 version. It may not be as well known as standalone tools such as git diff, but it is very useful for comparing and merging text. Combined with the simplicity and flexibility of Python, it lets you complete a lot of text comparison work without leaving the Python environment.

Compare character sequences

SequenceMatcher is a class in difflib that measures how similar two sequences (such as strings) are. Its algorithm is closely related to Ratcliff/Obershelp pattern matching [1]: it finds the longest contiguous matching block, then repeats the search recursively on the pieces to the left and right of that block.

from difflib import SequenceMatcher

a = """The cat is sleeping on the red sofa."""
b = """The cat is sleeping on a blue sofa..."""

seq_match = SequenceMatcher(None, a, b)
ratio = seq_match.ratio()
print(ratio)  # Similarity score of the two strings

# ratio() returns a float between 0 and 1, where 1 means the sequences are identical.
# For this example it prints:
# 0.821917808219178
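
To see where that number comes from, it can help to inspect the matching blocks directly. The sketch below reuses the two example strings and recomputes the score by hand with the formula given in the Python documentation, 2.0 * M / T, where M is the number of matched characters and T is the combined length of both strings. The variable names (matcher, matched, total, manual_ratio) are only illustrative.

```python
from difflib import SequenceMatcher

a = "The cat is sleeping on the red sofa."
b = "The cat is sleeping on a blue sofa..."

matcher = SequenceMatcher(None, a, b)

# Each block (a, b, size) means a[a:a+size] == b[b:b+size].
# The last block is always a zero-length sentinel.
blocks = matcher.get_matching_blocks()
for block in blocks:
    if block.size:
        print(repr(a[block.a:block.a + block.size]))

matched = sum(block.size for block in blocks)  # M: matched characters
total = len(a) + len(b)                        # T: combined length
manual_ratio = 2.0 * matched / total

# Both values should agree.
print(manual_ratio, matcher.ratio())
```

Printing the blocks makes it clear that the score rewards long shared runs such as the common opening of the two sentences.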

Create a difference report

The unified_diff function compares two sequences of lines and produces a "unified diff" report, the same format that many version control systems use.

from difflib import unified_diff

# a and b are the strings defined in the previous example
diff = unified_diff(a.splitlines(), b.splitlines(), lineterm='')
print('\n'.join(diff))

This will print the difference between the two strings:

---
+++
@@ -1 +1 @@
-The cat is sleeping on the red sofa.
+The cat is sleeping on a blue sofa...
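
The empty "---" and "+++" header lines appear because no file names were passed. As a small sketch, the example below uses a made-up three-line text and the fromfile and tofile parameters to label both sides; the names old.txt and new.txt are placeholders, not real files.

```python
from difflib import unified_diff

# Hypothetical sample content, already split into lines.
old_lines = ["alpha", "beta", "gamma"]
new_lines = ["alpha", "BETA", "gamma"]

diff_lines = list(
    unified_diff(
        old_lines,
        new_lines,
        fromfile="old.txt",  # label for the "---" header line
        tofile="new.txt",    # label for the "+++" header line
        lineterm="",         # the input lines carry no trailing newline
    )
)
print("\n".join(diff_lines))
```

Unchanged lines are shown with a leading space, removed lines with "-", and added lines with "+".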

Find the best match

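When you have a string and a list, and want to find the items in the list that are most similar to the string, you can use the get_close_matches function. It returns a list of the closest matches, best first. By default it returns at most three candidates and ignores any whose similarity score is below 0.6.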

from difflib import get_close_matches

words = ["disagree", "discover", "display", "disrupt", "distance"]
best_match = get_close_matches('dist', words)
print(best_match)  # Output the list of most similar words

# This will return:
# ['disrupt', 'distance']
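
The two optional parameters, n (maximum number of results) and cutoff (minimum similarity score), control how strict the search is. The sketch below reuses the same word list to show their effect; the variable names are only illustrative.

```python
from difflib import get_close_matches

words = ["disagree", "discover", "display", "disrupt", "distance"]

# n=1 keeps only the single best candidate.
top_match = get_close_matches("dist", words, n=1)
print(top_match)

# A lower cutoff admits weaker matches; n=len(words) lifts the default cap of 3.
loose_matches = get_close_matches("dist", words, n=len(words), cutoff=0.4)
print(loose_matches)

# A very strict cutoff can leave nothing, so be ready for an empty list.
strict_matches = get_close_matches("dist", words, cutoff=0.9)
print(strict_matches or "no close match")
```

Because the result can be empty, code that relies on a best match should check the list before indexing into it.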

Generate HTML difference report

If you prefer a visual comparison report, difflib provides the HtmlDiff class, which can be used to generate an HTML document to display the difference between two sequences.

from difflib import HtmlDiff

d = HtmlDiff()
html_diff = d.make_file(a.splitlines(), b.splitlines()) # a,b were defined earlier
with open("diff.html", "w", encoding="utf-8") as f:
    f.write(html_diff)

Open diff.html in a browser to see the two versions side by side, with the changed text highlighted.
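
If you only need a fragment to embed in an existing page rather than a complete HTML document, HtmlDiff also offers make_table. The following is a minimal sketch under that assumption; the column titles "Original" and "Revised" and the wrap width of 60 are arbitrary choices for illustration.

```python
from difflib import HtmlDiff

a = "The cat is sleeping on the red sofa."
b = "The cat is sleeping on a blue sofa..."

# wrapcolumn breaks long lines so the table stays readable.
differ = HtmlDiff(wrapcolumn=60)

# make_table returns only the <table> markup, without the surrounding page.
table_html = differ.make_table(
    a.splitlines(),
    b.splitlines(),
    fromdesc="Original",  # column title for the first text
    todesc="Revised",     # column title for the second text
)
print(table_html[:200])
```

The table relies on CSS classes that make_file normally includes in the page, so a host page has to supply matching styles itself.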

Practical Exercise

To help you go a little deeper, I have prepared a simple exercise. In this section, you will try out several of the functions in difflib and see what they can do.

1. Open your Python environment and import difflib.
2. Create two short text files, text1.txt and text2.txt, whose contents differ only in part.
3. Use difflib to read the two files and print their unified diff.
4. Use the get_close_matches function to find the best match for a given word in a vocabulary list.
5. Finally, generate and view the HTML difference report for the two text files.

Through these exercises, you will become more familiar with the functions and use cases of the difflib library, and can better use it to solve real-world problems.
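
If you get stuck on the file-based parts of the exercise, here is one possible sketch, not the only way to do it. It creates the two files in a temporary directory with made-up sample sentences, so the contents and the workdir location are assumptions; replace them with your own files.

```python
import tempfile
from difflib import HtmlDiff, unified_diff
from pathlib import Path

# Work in a temporary directory so the sketch does not touch real files.
workdir = Path(tempfile.mkdtemp())
path1 = workdir / "text1.txt"
path2 = workdir / "text2.txt"
path1.write_text("The cat is sleeping.\nThe dog is barking.\n", encoding="utf-8")
path2.write_text("The cat is sleeping.\nThe dog is quiet.\n", encoding="utf-8")

# keepends=True preserves the newlines, so the default lineterm works as-is.
lines1 = path1.read_text(encoding="utf-8").splitlines(keepends=True)
lines2 = path2.read_text(encoding="utf-8").splitlines(keepends=True)

# Unified diff of the two files, labeled with their names.
report = "".join(
    unified_diff(lines1, lines2, fromfile="text1.txt", tofile="text2.txt")
)
print(report, end="")

# HTML report of the same comparison.
html_path = workdir / "diff.html"
html_page = HtmlDiff().make_file(lines1, lines2, "text1.txt", "text2.txt")
html_path.write_text(html_page, encoding="utf-8")
print("HTML report written to", html_path)
```

Reading the files with their line endings intact means the diff lines can be joined and printed directly, without passing lineterm=''.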

Summary

In this tutorial, we explored difflib, a module in the Python standard library for comparing text sequences. Whether you need to compare versions of a file or measure the similarity between strings, difflib offers a convenient and direct solution. With the examples and the exercise above, you should now be able to handle many of the text comparison tasks you may encounter. Remember, practice is the best way to learn, so open your editor, write a few scripts, and let difflib make your work more efficient.

I hope this tutorial has been helpful, and I wish you every success on your programming journey. If you have any questions about difflib, the official Python documentation [2] covers it in more detail.

References

[1] National Institute of Standards and Technology. Ratcliff/Obershelp pattern recognition.

[2] https://docs.python.org/3/library/difflib.html