A Tutorial of Difflib — A Powerful Python Standard Library to Compare Textual Sequences
Suppose you and your friends spend a weekend working together on a manuscript. When you review it at the end, you find that everyone's copy differs slightly: words have been added, sentences have changed, and some paragraphs are completely different. What should you do? Would you compare the copies by hand, page by page, or let a tool do the work for you?
In this case, Python's difflib library might be exactly what you need. Its text comparison functions help you find differences quickly, which makes merging the versions much easier.
What is difflib?
difflib is part of the Python standard library and can be used without additional installation.
This library is composed of multiple parts, mainly providing classes and functions for comparing differences between sequences and calculating similarity. It can be used to compare files, strings, etc., and can generate various reports of difference results, so we can intuitively see the differences.
Since difflib ships with the standard Python distribution, it is available in every mainstream Python 3 version. Although it may not be as well known as dedicated tools (such as git diff), difflib is a useful and capable library for comparing and merging text. You can complete a lot of text comparison work without leaving the Python environment.
Compare character sequences
SequenceMatcher is a class in difflib that measures the similarity between two sequences (such as strings). Its algorithm is closely related to the Ratcliff/Obershelp pattern matching algorithm [1]: it finds the longest contiguous matching block, then repeats the search on the pieces to the left and right of that block.
from difflib import SequenceMatcher
a = """The cat is sleeping on the red sofa."""
b = """The cat is sleeping on a blue sofa..."""
seq_match = SequenceMatcher(None, a, b)
ratio = seq_match.ratio()
print(ratio) # Check the similarity of the two strings
# The output similarity will be a decimal between 0 and 1, in our example it may output:
# 0.821917808219178
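The ratio can also be reproduced by hand. The difflib documentation defines it as 2.0*M/T, where M is the number of matched elements and T is the total number of elements in both sequences. The sketch below reuses the two sentences from above and sums the sizes of the matching blocks; the variable names are only illustrative.

```python
from difflib import SequenceMatcher

a = "The cat is sleeping on the red sofa."
b = "The cat is sleeping on a blue sofa..."

matcher = SequenceMatcher(None, a, b)

# Each block is an (a, b, size) triple describing a run of characters
# shared by both strings; the final block is a size-0 sentinel.
blocks = matcher.get_matching_blocks()
for block in blocks:
    if block.size:
        print(repr(a[block.a:block.a + block.size]))

# M = total matched characters, T = combined length of both strings.
matched = sum(block.size for block in blocks)
manual_ratio = 2.0 * matched / (len(a) + len(b))
print(manual_ratio == matcher.ratio())
```

Printing the blocks shows which parts of the sentences count as "the same", which helps explain why a particular score comes out higher or lower than expected.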
Create a difference report
The unified_diff function compares two lists of lines and produces a "unified diff" report, the same format used by many version control systems.
from difflib import unified_diff
diff = unified_diff(a.splitlines(), b.splitlines(), lineterm='')
print('\n'.join(list(diff)))
This will print the difference between the two strings:
---
+++
@@ -1 +1 @@
-The cat is sleeping on the red sofa.
+The cat is sleeping on a blue sofa...
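The header lines above are empty because no file names were passed. As a minimal sketch, the example below labels both sides with the fromfile and tofile parameters and uses n to limit the number of context lines around each change. The shopping-list text and the file names are made up for illustration.

```python
from difflib import unified_diff

old_text = """Milk
Eggs
Bread
Butter
Apples
"""
new_text = """Milk
Eggs
Rye bread
Butter
Apples
"""

diff_lines = list(unified_diff(
    old_text.splitlines(keepends=True),  # keep newlines so lines join cleanly
    new_text.splitlines(keepends=True),
    fromfile="list_old.txt",  # hypothetical label for the "before" side
    tofile="list_new.txt",    # hypothetical label for the "after" side
    n=1,                      # one line of context around each change
))
print("".join(diff_lines), end="")
```

With keepends=True the default lineterm can stay in place, because every line, including the header lines, already ends with a newline.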
Find the best match
When you have a string and a list, and want to find the items in the list that are most similar to the string, you can use the get_close_matches function. It returns a list of the best matches, ordered from most to least similar.
from difflib import get_close_matches
words = ["disagree", "discover", "display", "disrupt", "distance"]
best_match = get_close_matches('dist', words)
print(best_match) # Output the list of most similar words
# This will return:
# ['disrupt', 'distance']
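Two optional parameters control the result: n caps the number of matches returned (3 by default) and cutoff sets the minimum similarity score, between 0 and 1 (0.6 by default). The sketch below reuses the word list from above to show the effect of each.

```python
from difflib import get_close_matches

words = ["disagree", "discover", "display", "disrupt", "distance"]

# n=1 keeps only the single best candidate.
top_one = get_close_matches("dist", words, n=1)
print(top_one)

# Lowering cutoff admits weaker matches; at most n (default 3) are returned.
looser = get_close_matches("dist", words, cutoff=0.5)
print(looser)

# If no candidate reaches the cutoff, the result is an empty list.
none_found = get_close_matches("xyz", words)
print(none_found)
```

Because the function can return an empty list, code that picks the first element should check for that case first.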
Generate HTML difference report
If you prefer a visual comparison report, difflib provides the HtmlDiff class, which can be used to generate an HTML document to display the difference between two sequences.
from difflib import HtmlDiff
d = HtmlDiff()
html_diff = d.make_file(a.splitlines(), b.splitlines()) # a,b were defined earlier
with open("diff.html", "w", encoding="utf-8") as f:
    f.write(html_diff)
Open diff.html in a browser to see the two texts side by side in a table, with the changed parts highlighted.
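When the comparison should be embedded in an existing page rather than saved as a standalone file, HtmlDiff also offers make_table, which returns only the table markup. The sketch below is one way to use it, with column titles passed through fromdesc and todesc; the titles and the wrapcolumn value are arbitrary choices for illustration.

```python
from difflib import HtmlDiff

a = "The cat is sleeping on the red sofa."
b = "The cat is sleeping on a blue sofa..."

# wrapcolumn wraps long lines so both columns stay readable.
differ = HtmlDiff(wrapcolumn=40)

# make_table returns just the <table> element; make_file would wrap it
# in a complete HTML document that also contains the stylesheet.
html_table = differ.make_table(
    a.splitlines(),
    b.splitlines(),
    fromdesc="Original",  # heading for the left column
    todesc="Revised",     # heading for the right column
)
print(html_table[:200])
```

Note that the highlighting relies on the stylesheet that make_file generates, so a page embedding only the table needs to supply equivalent styles itself.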
Practical Exercise
To go a little deeper, I have prepared a simple exercise for you. In this section, you will try several of the functions in difflib yourself.
- Open your Python environment and import difflib.
- Create two short text files, text1.txt and text2.txt, whose contents differ only partially.
- Read these two files and use difflib to print their unified diff.
- Try the get_close_matches function to find the best match for a given word in a vocabulary list.
- Finally, generate and view the HTML difference report of these two text files.
Through these exercises, you will become more familiar with the functions and use cases of the difflib library, and can better use it to solve real-world problems.
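If you are unsure how to begin, the following is one possible starting point for the file comparison step, not the only solution. It creates the two files in a temporary directory so that it runs anywhere; the file contents are placeholders, and in your own attempt you would read the text1.txt and text2.txt you wrote yourself.

```python
import tempfile
from difflib import unified_diff
from pathlib import Path

# Work in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path1 = Path(tmp) / "text1.txt"
    path2 = Path(tmp) / "text2.txt"
    # Placeholder contents; replace with your own text.
    path1.write_text("alpha\nbeta\ngamma\n", encoding="utf-8")
    path2.write_text("alpha\nBETA\ngamma\ndelta\n", encoding="utf-8")

    # keepends=True preserves the newlines so the diff can be joined as-is.
    lines1 = path1.read_text(encoding="utf-8").splitlines(keepends=True)
    lines2 = path2.read_text(encoding="utf-8").splitlines(keepends=True)

report = "".join(
    unified_diff(lines1, lines2, fromfile="text1.txt", tofile="text2.txt")
)
print(report, end="")
```

The same two lists of lines can be passed to HtmlDiff().make_file to complete the last step of the exercise.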
Summary
In this tutorial, we learned and practiced the difflib module from the Python standard library and explored how it compares text sequences. Whether you need to compare versions of a file or measure the similarity between strings, difflib provides a convenient and direct solution. After the walkthrough and the exercises, you should be able to use this library for many of the text comparison tasks you encounter. Remember, practice is the best way to learn, so don't hesitate: grab your keyboard, start writing some scripts, and use difflib to make your work more efficient.
I hope this tutorial inspires you, and I wish you well on your programming journey. If you have any questions about difflib, the official Python documentation [2] covers it in more detail.
References
[1] National Institute of Standards and Technology. Ratcliff/Obershelp pattern recognition.