Building a python-word application, step by step
A friend of mine is an in-house counsel. One of his jobs is to maintain a repository of contract clauses. When he comes across a new clause, he adds it to the existing list. Sometimes he comes across a clause that sounds familiar, but he is unsure whether it is already in his list and just worded differently. He asked me if I knew of any application that would let him fuzzy search a list of paragraphs for a given paragraph. I have built such an application before, and in this article I will walk you through the key libraries I used. When I built it, I had to consult many different sources to get everything working. While we will go over how to build this exact application, my aim is to give you enough detail to build your own application without having to look up solutions to the technical challenges I ran into.
Road-map
I will first go through how to iterate through each paragraph in a Word document using Python. I will then explain how to fuzzy match one paragraph against another. Finally, we will create a dialog box for the user to interact with and wrap all the Python modules into a single executable file, so that the user only has to deal with the dialog box and never with the command prompt. Overall, I created three Python modules for this application:
- button (creates a dialog box for the user)
- main (receives the inputs from the button module and passes them on to the last module)
- find (finds the relevant paragraph and highlights it in the word document)
The libraries that I used in creating this application are:
- python-docx (this library will allow us to read and modify Word documents from Python code),
- diff-match-patch (this library will help us perform fuzzy search),
- tkinter (this will be used to create a graphical interface), and
- pyinstaller (this library will allow us to create an executable).
Any code mentioned in this article can be found on my GitHub: https://github.com/Allen8838/legal-technology. I have also provided additional links at the bottom of this article for further reading.
Let’s begin!
python-docx
As the name suggests, python-docx serves as an interface between Python and Word documents, allowing us to manipulate Word documents using Python code. Our overall strategy is relatively straightforward: iterate through each paragraph in the document and highlight any paragraph that fuzzy matches the given paragraph. The iteration and the highlighting are what we will be using python-docx for. In order to use python-docx, you will need to install it. Note that the package is published as python-docx, even though it is imported as docx. You can install it via
pip install python-docx
Then create a document object as shown in the image below by passing in the filename you are working with.
The other lines of code in the image above are receiving the inputs specified by the user. We will come back to this part later in the article.
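To make the iterate-and-highlight idea concrete, here is a minimal sketch. It is not the exact code from the repository: the function names `highlight_paragraphs` and `highlight_file` and the `is_match` callback are hypothetical, while `Document`, `paragraphs`, `runs`, `font.highlight_color` and `WD_COLOR_INDEX` are python-docx names.

```python
def highlight_paragraphs(document, is_match, color):
    """Highlight every paragraph whose text satisfies is_match; return the count."""
    highlighted = 0
    for paragraph in document.paragraphs:
        if not is_match(paragraph.text):
            continue
        # Highlighting is a character-level property in Word, so it is set
        # on each run of the paragraph rather than on the paragraph itself.
        for run in paragraph.runs:
            run.font.highlight_color = color
        highlighted += 1
    return highlighted


def highlight_file(input_path, output_path, is_match):
    """Open a .docx file, highlight matching paragraphs and save a copy."""
    # Imported here so that highlight_paragraphs can be tried out
    # without python-docx being installed.
    from docx import Document
    from docx.enum.text import WD_COLOR_INDEX

    document = Document(input_path)
    count = highlight_paragraphs(document, is_match, WD_COLOR_INDEX.YELLOW)
    document.save(output_path)
    return count
```

With hypothetical file names, a call could look like `highlight_file("clauses.docx", "clauses_marked.docx", lambda text: "indemnify" in text.lower())`. In the real application the `is_match` callback would be the fuzzy comparison rather than a plain substring test.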
diff-match-patch
The workhorse library that will tell us whether one paragraph fuzzy matches with another paragraph is called diff-match-patch. This library will also need to be installed with
pip install diff-match-patch
In my setup, importing the library was not enough on its own: I also had to provide the Python path where diff-match-patch was installed. If you installed it with pip into the environment you are running, a plain import normally works. Here is the main module followed by the find module.
To briefly go over the code in the find module, line 11 creates the diff_match_patch object. Lines 13 and 14 receive the match threshold and match distance specified by the user. The match threshold is a number from 0.0 to 1.0, where 0.0 means the two paragraphs must match exactly and 1.0 means anything will match anything. The match distance controls how far, in characters, a match may sit from the expected location before it is penalized. A value of 0 requires the match to be at the exact location, while 1000, the library default, allows a broad search.
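As a rough illustration of those two settings, here is a minimal sketch. The helper names `make_matcher` and `is_fuzzy_match` are hypothetical and not taken from the find module; `Match_Threshold`, `Match_Distance` and `match_main` are the library's own names, and 0.5 and 1000 are its defaults.

```python
def make_matcher(threshold=0.5, distance=1000):
    """Create a diff_match_patch object configured with the user's settings."""
    # Imported here so the comparison helper below does not depend on it.
    from diff_match_patch import diff_match_patch

    dmp = diff_match_patch()
    dmp.Match_Threshold = float(threshold)  # 0.0 = exact only, 1.0 = anything
    dmp.Match_Distance = int(distance)      # 0 = exact location, 1000 = broad
    return dmp


def is_fuzzy_match(dmp, paragraph, query):
    """Return True if query fuzzy matches paragraph, searching from index 0."""
    # match_main returns the index of the best match, or -1 if nothing
    # scores within the configured threshold.
    return dmp.match_main(paragraph, query, 0) != -1
```

A caller would build one matcher with `make_matcher(threshold, distance)` and then apply `is_fuzzy_match` to each paragraph in the document. Searching from index 0 assumes we are comparing whole paragraphs, so a good match should start at the beginning.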
tkinter
We will be using a library called tkinter to create a graphical user interface. One of the nice things about this library is that it is a standard GUI library in Python, so you won’t have to install it if you already have Python. I would like to create the following dialog box for the user.
Let’s start by talking about the input boxes, the boxes with the grey text. The input filename, fuzzy match threshold, match distance and output filename boxes were created with the Entry widget, while the large box in the middle was created with the Text widget in tkinter. The main difference between the two is that an Entry box holds a single line of text, whereas a Text box allows multiple lines.
The title line that says “Compare one para vs paras of Doc for duplicates. Fuzzy” was created using something called a LabelFrame. The LabelFrame object is used to demarcate and group objects together (notice the lines that extend to the left and right of the title line). This would be especially useful if you have say, a series of three radio buttons corresponding to one question that you want to group separately from another series of radio buttons corresponding to another question. To create the input boxes and the LabelFrame, we will need to first initialize them. The code to do so is found below.
Notice that I have passed “labelframe” into each Entry and Text box. This tells tkinter which widgets belong inside the LabelFrame. For the Entry boxes, the ‘0’ passed to the insert method is an index parameter that says where in the widget the text should be placed. This index matters more when a box already contains text and we want to control whether the new text lands before or after it. For our purposes it will not really matter, as we will add functionality that deletes any text in an input box when the box is clicked on.
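The initialization described above can be sketched roughly as follows. This is an illustration, not the repository code. The title string comes from the dialog box, but the default texts, the dictionary keys, the widget sizes and the function name `build_form` are all assumptions.

```python
TITLE = "Compare one para vs paras of Doc for duplicates. Fuzzy"

# Hypothetical default (grey) texts for the four single-line boxes.
ENTRY_DEFAULTS = {
    "input_file": "Input filename",
    "threshold": "Fuzzy match threshold",
    "distance": "Match distance",
    "output_file": "Output filename",
}
PARAGRAPH_DEFAULT = "Paragraph to search for"  # hypothetical


def build_form(root):
    """Create the LabelFrame with its Entry boxes and Text box."""
    # Imported lazily because some minimal Python builds ship without Tk.
    import tkinter as tk

    labelframe = tk.LabelFrame(root, text=TITLE)
    labelframe.pack(padx=10, pady=10)

    entries = {}
    for name, default_text in ENTRY_DEFAULTS.items():
        # Passing labelframe as the parent places the widget inside the group.
        entry = tk.Entry(labelframe, fg="grey", width=60)
        entry.insert(0, default_text)  # index 0 = start of the box
        entry.pack(padx=5, pady=2)
        entries[name] = entry

    paragraph_box = tk.Text(labelframe, fg="grey", height=8, width=60)
    # Text widgets use "line.column" indices, so "1.0" is the very start.
    paragraph_box.insert("1.0", PARAGRAPH_DEFAULT)
    paragraph_box.pack(padx=5, pady=2)
    return labelframe, entries, paragraph_box
```

Calling `build_form(tk.Tk())` returns the frame and the widgets, so later code can attach event handlers to them and read their contents.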
To further explain, I would like to create the following effect: if the user clicks on an input box, the default string disappears, and if the user then clicks on another input box without having typed anything into the first one, the default string reappears. This will help guide users on what to enter in case they end up clicking around the input boxes. To achieve this, we will take advantage of what tkinter calls “Events” and “Bindings”. Events are actions that a user takes, such as typing on the keyboard or clicking the mouse. A binding associates an event on a widget with a function that we write to handle it. We can use the first input box as an example. When the user clicks on an input box, the box gains keyboard focus and tkinter raises a “FocusIn” event; when the user clicks elsewhere, the box loses focus and tkinter raises a “FocusOut” event. We would have the following events and bindings.
The functions on_input_entry_click and on_input_focusout are ones that I wrote, and they are defined as follows.
The code should be self-explanatory, but to go over it quickly: the function on_input_entry_click clears out anything in the first input box when the user clicks on that box. If the user clicks away from the first input box without entering any text, the on_input_focusout function takes over and re-inserts the default text. The other Entry and Text boxes have similar events and bindings.
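A minimal sketch of such a pair of handlers is shown below, written as one reusable helper instead of one pair of functions per box. The name `bind_placeholder` is hypothetical; `bind`, `get`, `delete`, `insert` and `config` are regular tkinter Entry methods, and `"<FocusIn>"` and `"<FocusOut>"` are the event names tkinter expects. One deliberate difference from the behaviour described above: this version clears the box only while it still shows the default text, so anything the user has already typed survives a second click.

```python
def bind_placeholder(entry, default_text):
    """Hide default_text on focus and restore it if the box is left empty."""

    def on_focus_in(event):
        # Only wipe the box if it still shows the default text.
        if entry.get() == default_text:
            entry.delete(0, "end")
            entry.config(fg="black")

    def on_focus_out(event):
        # Nothing was typed, so put the grey default text back.
        if entry.get() == "":
            entry.insert(0, default_text)
            entry.config(fg="grey")

    entry.bind("<FocusIn>", on_focus_in)
    entry.bind("<FocusOut>", on_focus_out)
    return on_focus_in, on_focus_out
```

For an Entry widget named `input_entry` (a hypothetical name), `bind_placeholder(input_entry, "Input filename")` sets up both bindings. A Text widget needs the same idea with `"1.0"` style indices in place of `0`.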
We are almost done with creating the GUI! We will need a line to create a run button. Intuitively, the command to create a button is called… Button!
Finally, to complete the GUI, we will need to have a line called mainloop.
Calling mainloop starts the application's event loop, which waits for and responds to events until the user closes the application. Said differently, it is effectively a condensed while loop that keeps waiting for and responding to events, and breaks only when the application is closed.
We will build a function to collect the user inputs and pass these inputs to our main module which will run our find module. This will be the lines of code that we had briefly mentioned when we introduced python-docx.
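Here is one way the collecting function and the run button could fit together, as a sketch under assumptions, not the repository code. The names `collect_inputs` and `make_run_command`, the dictionary keys, and the `run_search` callback standing in for the main module are all hypothetical.

```python
def collect_inputs(entries, paragraph_box):
    """Read the widgets and convert the numeric fields."""
    return {
        "input_file": entries["input_file"].get().strip(),
        "output_file": entries["output_file"].get().strip(),
        # An Entry returns its whole content; a Text widget needs a range.
        # "end-1c" drops the trailing newline tkinter keeps in a Text widget.
        "paragraph": paragraph_box.get("1.0", "end-1c").strip(),
        "threshold": float(entries["threshold"].get()),
        "distance": int(entries["distance"].get()),
    }


def make_run_command(entries, paragraph_box, run_search):
    """Build the zero-argument callback that a tkinter Button expects."""

    def on_run():
        inputs = collect_inputs(entries, paragraph_box)
        return run_search(**inputs)

    return on_run


# Wiring, assuming root, labelframe, entries and paragraph_box already exist
# and main.run is your own entry point (a hypothetical name):
#
#   command = make_run_command(entries, paragraph_box, main.run)
#   tk.Button(labelframe, text="Run", command=command).pack()
#   root.mainloop()  # blocks here, handling events until the window closes
```

Keeping the parsing in its own function means a bad value, such as letters in the threshold box, raises a `ValueError` in one place, where it could be caught and reported to the user.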
pyinstaller
If we wanted to create an executable for our user so that they can simply double click on a file to access the GUI rather than running the script through the command line, we can use a library called pyinstaller. In our case, we will need to tell pyinstaller that our script depends on three modules: button, main and find. In the command line, we will go into the folder where our modules are stored and type the following
pyinstaller.exe --onefile --windowed button.py main.py find.py
The “onefile” argument tells pyinstaller to bundle everything into a single application file in the resulting dist folder. Without this argument, you will see other back-end files alongside it. The “windowed” argument prevents a command-line window from opening when you double-click your application. Your folder should now look something like this.
We are getting close to the finish line! The one last step you will need to get your application working is to modify the SPEC file. You will need to tell pyinstaller explicitly that your program relies on python-docx. To do so, open the SPEC file and input the following into the “datas” line
[(path.join(site_packages, "docx", "templates"), "docx/templates")]
Your SPEC file should look something like this
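The “datas” line uses the names path and site_packages, so the SPEC file has to define them before the line is reached. A SPEC file is executed as ordinary Python, so one possible way to set this up is sketched below. The helper name `docx_template_datas` and the way site_packages is located are assumptions, not necessarily what the original SPEC file does.

```python
import sys
from os import path


def docx_template_datas(site_packages):
    """Return the datas entry that bundles python-docx's template folder."""
    return [(path.join(site_packages, "docx", "templates"), "docx/templates")]


# One common way to locate site-packages. It assumes a standard install;
# on some systems no entry of sys.path contains "site-packages", in which
# case this is None and the folder has to be written out by hand.
site_packages = next((p for p in sys.path if "site-packages" in p), None)

# Inside the Analysis(...) call of the SPEC file:
#     datas=docx_template_datas(site_packages),
```

After editing the SPEC file, rebuild by pointing pyinstaller at the SPEC file instead of the scripts, so that your changes are used.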
And that’s it! You now know how to create your own python application for word documents.