Primal Data Advent Calendar #9: Using Google Sheets for labeling “text diffs”

Marek Šuppa
Slido developers blog
9 min read · Dec 23, 2020

This piece is a part of the Primal Data Advent Calendar, a series of miscellaneous articles written by the Slido Data Team. We release one on each Advent day that happens to be a prime (like 23 for the 23rd of December). Enjoy!

A real-life Data tale, in which the versatility of Google Sheets combined with the power of Python, and specifically Pandas, save the day. (Image adapted from iconarchive.com and pinclipart.com)

I don’t particularly like Google Sheets.

Not only that, I don’t really like Excel (or its open equivalents) either. Partially because I am not very good at using it, but mostly because of how it ends up being used on a day-to-day basis. In and of itself, the idea behind a spreadsheet processor as an abstraction over tables, rows and columns does have merit. But as the proverbial everything-smashing hammer, it ends up being used for anything even remotely data-related: from checklists through agendas to reports of all kinds. As the modern incarnation of the very same abstraction, Google Sheets is not much better in this regard. If anything, it’s probably even worse.

Here is why: Google Sheets has become the production environment. On some level, the reason is obvious: it’s readily available, relatively cheap and (most importantly) already approved by internal IT in pretty much any company that uses GSuite. On top of that, it tracks edit history and enables collaboration out of the box.

Still, when you read what craziness people do with it (like controlling Kubernetes clusters), you can’t help but wonder “why would anyone do that”?

Well, it turns out even we at Slido did something similar, for a pretty good reason (at least so we believed). Let me tell you how it came about.

Developing Slido’s Edit Question feature

As a Q&A platform, Slido receives a ton of (short) questions from multitudes of participants. These can then be upvoted by the audience, which allows the organizer to better assess which ones to focus on during the Q&A.

But even at 300 characters, there are times when you wish you could edit a question after it has been sent. Not because you changed your mind, but because you don’t want a distracting typo to prevent your question from picking up traction and being addressed. This problem is especially pronounced when we consider that most questions come from mobile devices, known for autocorrect fails.

Putting all of this together (along with a ton of customer feedback), it became obvious that there was a need for an “Edit question” feature, which would allow one to make small edits (like a typo fix) to a question that has already been sent.

How would we go about implementing it, though? If changing the whole question were allowed, all sorts of security-related issues would pop up. For instance, it is not difficult to imagine a question getting a ton of upvotes, only to be changed right before being addressed, leaving the organizer in utter confusion. Clearly, some boundaries would need to be set. One obvious solution would be to send the edited question through moderation and let the organizer approve the change. But that would also add quite a bit of work to the already stressful role of a Q&A organizer.

To strike a balance, we decided to model the task as a binary classification problem: given an original and an edited question, decide whether the “difference” between them is small enough to preserve most of the original meaning (small edit) or too large and potentially disruptive for the quality of the Q&A session (large edit).

Since our initial discussions on what actually constitutes a small change failed to lead to a conclusive answer, we further decided to take a data-informed approach: we would have our colleagues label some of these question edits as small or large and try to extract a classification rule from the labeled data.

But how do you go about labeling what are essentially “textual diffs”? This is where, unexpectedly, Google Sheets ended up saving the day.

Labeling “textual diffs”

Since we decided to model this “textual diff” task as a classification problem, it would be only natural to use data labeling tools for text (or document) classification.

There are more than a few tools that do so, such as

  • Doccano (a very popular open-source text annotator)
  • SMART (an open-source annotation platform we had a good experience with when labeling Question Sentiment)
  • Label Studio (an open-source data labeling tool; free for single-person use, with an enterprise offering)

The Doccano interface for Document Classification. If we managed to find a way of visualizing the “textual diffs”, this could probably work. (Image from the demo)

While they all seem quite interesting, they also suffer from a common set of problems:

  1. Some infrastructure work would be necessary to set them up (even if it only amounted to spinning up a Docker container somewhere).
  2. User accounts for labelers would need to be set up manually (or pseudo-manually).
  3. Custom work would be necessary to visualize the “textual diffs”.

Of the three above, the last one was a showstopper for us. Interpreting how big a change an edit introduced was at the core of our task, and our assumption was that the quality of the diff visualization would have a direct impact on the quality of the resulting labeled data.

Since it became obvious that an off-the-shelf solution wouldn’t do the trick, we turned to the platform we knew had at least the first two problems figured out by default: Google Sheets.

Visualizing “text diffs” in Google Sheets

Since we established that some custom work would be required, we tried to figure out the cheapest way of doing it. Fortunately, we found that the combination of Python’s built-in difflib library and xlsxwriter, which can create Excel .xlsx files directly from Python (and Pandas), would let us produce a fairly pleasing visualization of “text diffs” quite easily.

Let’s go through the whole process step-by-step.

Generating “text diffs” with difflib

In order to visualize the differences between the original and the edited questions, we first need to figure out what has actually changed. The difflib library allows us to do just that, thanks to its SequenceMatcher. The Gist below by Ines Montani shows how we can easily iterate over the changes that took place between the two strings being compared:
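The Gist boils down to iterating over the opcodes that SequenceMatcher produces. A minimal sketch of that approach (the original colorizes the console output with wasabi; here we simply print one line per operation, and the sample strings are our own):

```python
from difflib import SequenceMatcher

def diff_strings(a: str, b: str) -> None:
    """Print the edit operations that turn string a into string b."""
    matcher = SequenceMatcher(None, a, b)
    for opcode, a0, a1, b0, b1 in matcher.get_opcodes():
        if opcode == "equal":
            print(f"equal:   {a[a0:a1]!r}")
        elif opcode == "insert":
            print(f"insert:  {b[b0:b1]!r}")
        elif opcode == "delete":
            print(f"delete:  {a[a0:a1]!r}")
        elif opcode == "replace":
            print(f"replace: {a[a0:a1]!r} -> {b[b0:b1]!r}")

diff_strings("What abt the roadmap?", "What about the roadmap?")
```

Each opcode describes one contiguous chunk of the diff: `equal` chunks are shared by both strings, while `insert`, `delete` and `replace` describe the edit itself.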

Note that SequenceMatcher has quadratic time complexity in the worst case, but since we are dealing with quite short texts (300 characters at most), this should not have practical consequences.

As we can see in the Gist above, difflib does most of the heavy lifting — we just need to somehow visualize the insertions/deletions and replacements that take place. The code above does that by colorizing the console output with wasabi, but in our case this needs to be visible in the resulting Excel spreadsheet. Thankfully, xlsxwriter has us covered.

Writing “rich” strings to Excel sheets with xlsxwriter

Adding formatting to Excel spreadsheet cells must be one of Excel’s most often used features (if you happen to have hard data on this, please do let me know). It is therefore no wonder that xlsxwriter supports this “rich” formatting out of the box as well.

For instance, to produce an Excel spreadsheet with the text “This is bold and this is italic” in the A1 cell, we could do something like the following:
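A minimal version, following the rich-strings example from the xlsxwriter documentation (the filename rich_strings.xlsx is just illustrative):

```python
import xlsxwriter

workbook = xlsxwriter.Workbook("rich_strings.xlsx")
worksheet = workbook.add_worksheet()

# Define the formats used inside the rich string.
bold = workbook.add_format({"bold": True})
italic = workbook.add_format({"italic": True})

# Each format applies to the string fragment that immediately follows it.
worksheet.write_rich_string(
    "A1",
    "This is ", bold, "bold",
    " and this is ", italic, "italic",
)

workbook.close()
```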

If we then save the workbook to rich_strings.xlsx and open the file in Excel, we should see the following:

A sample Excel sheet with rich formatting in the A1 cell. (Image from the documentation)

That’s already quite close to what we’d like!

Note that the bold and italic formats get defined first and are then applied to the strings that immediately follow them in the list of parameters of the write_rich_string function.

To better denote differences between additions and deletions, we would probably like to include green and red colors as well. As this very nice xlsxwriter example shows, however, that ought to be easily doable by calling the add_format function with an appropriate set of parameters (namely color).

Putting it all together with Pandas

With both the “text diff” as well as “rich” formatting of Excel spreadsheets at the ready, all that is left is to put it all together.

The data we work with is normally processed with Pandas, which is what we’d like to use in this case as well. We are in luck, because Pandas has the so-called ExcelWriter class, which gives us access to the underlying xlsxwriter instance so that we can add rich formatting as described above.

(Note that what we are about to do is very experimental and not actually fully supported by Pandas, as the ExcelWriter documentation explicitly says that “None of the methods and properties are considered public.” For our purposes, however, we are willing to risk it.)

As a first step, we’ll import Pandas and create a sample DataFrame. In our case it will have only three hard-coded question pairs, but in principle the DataFrame can be easily read from any source that Pandas supports (such as a .csv, .xlsx or even .sqlite file).
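The exact question pairs are of course made up; a sketch with three illustrative pairs (a typo fix, a word replacement and a deleted sentence), using assumed column names original and edited:

```python
import pandas as pd

# Three hard-coded (original, edited) question pairs. In practice the
# DataFrame could just as well be read with pd.read_csv or pd.read_excel.
df = pd.DataFrame(
    {
        "original": [
            "What abt the company roadmap?",
            "What were our results in the last quarter?",
            "When is the next release? Asking for a friend.",
        ],
        "edited": [
            "What about the company roadmap?",
            "What were our results in the last year?",
            "When is the next release?",
        ],
    }
)
```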

In the next step, we’ll use ExcelWriter to get access to xlsxwriter’s workbook and worksheet objects, which will allow us to actually apply “rich” formatting to the cells in the Excel spreadsheet. At the same time, we’ll define the formatting options we would like to use in the resulting spreadsheet.
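In code, this could look roughly as follows (the filename, the sheet name and the concrete green/red/default format definitions are our assumptions; the DataFrame is the sample one from the previous step):

```python
import pandas as pd

# The sample DataFrame from the previous step.
df = pd.DataFrame(
    {
        "original": ["What abt the company roadmap?",
                     "What were our results in the last quarter?",
                     "When is the next release? Asking for a friend."],
        "edited": ["What about the company roadmap?",
                   "What were our results in the last year?",
                   "When is the next release?"],
    }
)

# "Flush" the DataFrame to a spreadsheet (only virtually, for now).
writer = pd.ExcelWriter("labeled_diffs.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name="Sheet1", index=False)

# Reach under the hood of Pandas for the raw xlsxwriter objects.
workbook = writer.book
worksheet = writer.sheets["Sheet1"]

# Formatting options for the diff: green for insertions, red
# (struck through) for deletions, and a default for unchanged text.
green = workbook.add_format({"color": "green"})
red = workbook.add_format({"color": "red", "font_strikeout": True})
default = workbook.add_format({"color": "black"})
```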

As we can see, we have actually already used the Pandas to_excel function, which “flushes” the data to an Excel spreadsheet. This happens only virtually at this point, however, and the writer then gives us access to the workbook and worksheet of the resulting spreadsheet. These will allow us to add a specifically formatted column in the next step.

Before we do that, however, we will need to update the diff_strings function from above to use the formatting options we have just defined. That’s relatively straightforward, as we can see below:
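A sketch of the updated function, now called visualize_diff and returning a list of alternating formats and text fragments (the default, red and green arguments are assumed to be the xlsxwriter formats defined in the previous step):

```python
from difflib import SequenceMatcher

def visualize_diff(a, b, default, red, green):
    """Return [format, text, format, text, ...] fragments describing the
    diff between a and b, ready to be unpacked into write_rich_string."""
    parts = []
    for opcode, a0, a1, b0, b1 in SequenceMatcher(None, a, b).get_opcodes():
        if opcode == "equal":
            parts.extend([default, a[a0:a1]])
        elif opcode == "insert":
            parts.extend([green, b[b0:b1]])
        elif opcode == "delete":
            parts.extend([red, a[a0:a1]])
        elif opcode == "replace":
            # Show the deleted text first, then the inserted replacement.
            parts.extend([red, a[a0:a1], green, b[b0:b1]])
    return parts
```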

As we can see, the function again generates a list consisting of specific formatting marks (such as green, red or default) and the parts of the text which were inserted, deleted or replaced. The function now returns a list on purpose: it will allow us to unpack it in a nice Pythonic way when calling write_rich_string.

That’s actually what the final part of our quick script would do. It could look as follows:
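For completeness, here is a self-contained sketch of how the whole quick script could fit together (the filename, sheet name, column names and sample data are all illustrative):

```python
from difflib import SequenceMatcher

import pandas as pd

def visualize_diff(a, b, default, red, green):
    """Return [format, text, ...] fragments for write_rich_string."""
    parts = []
    for opcode, a0, a1, b0, b1 in SequenceMatcher(None, a, b).get_opcodes():
        if opcode == "equal":
            parts.extend([default, a[a0:a1]])
        elif opcode == "insert":
            parts.extend([green, b[b0:b1]])
        elif opcode == "delete":
            parts.extend([red, a[a0:a1]])
        elif opcode == "replace":
            parts.extend([red, a[a0:a1], green, b[b0:b1]])
    return parts

df = pd.DataFrame(
    {
        "original": ["What abt the company roadmap?",
                     "What were our results in the last quarter?",
                     "When is the next release? Asking for a friend."],
        "edited": ["What about the company roadmap?",
                   "What were our results in the last year?",
                   "When is the next release?"],
    }
)

writer = pd.ExcelWriter("labeled_diffs.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name="Sheet1", index=False)
workbook = writer.book
worksheet = writer.sheets["Sheet1"]

green = workbook.add_format({"color": "green"})
red = workbook.add_format({"color": "red", "font_strikeout": True})
default = workbook.add_format({"color": "black"})

# Write the diff into the third column; row 0 holds the header,
# hence the + 1, and both row and column IDs are zero-indexed.
for i, row in df.iterrows():
    fragments = visualize_diff(row["original"], row["edited"],
                               default, red, green)
    worksheet.write_rich_string(i + 1, 2, *fragments)

writer.close()
```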

The code is again very straightforward: it just iterates over each row in the DataFrame and puts the visual representation of the diff, provided by the visualize_diff function, into the third column (the second argument of the write_rich_string function; column IDs are zero-indexed). Since the first row is reserved for the header (i.e. the column names), the row ID is incremented by one (the first argument of the write_rich_string function). Finally, note that since visualize_diff outputs a list, we use Python’s asterisk (*) to unpack it for the function call.

Once all of this is done, all that’s left is to save the result to a .xlsx file. When we then open it in Google Sheets, we’ll see something like the following:

The final Excel spreadsheet opened in Google Spreadsheets. Feel free to check it out live here.

Not bad at all!

As we can see in the third column, we are dealing with an addition in the second row (abt was fixed to about), there is a replacement going on in the third one (quarter changed to year), and the second sentence of the question got removed in the last one.

The sky is the limit from here on. We would probably add at least a few more columns and possibly automate the generation of multiple sheets at once. But we can already count on having figured out the most important part: how to provide a readily available interface for visualizing “text diffs”.

Conclusion

There is no question that spreadsheets get overused, often to the point where they are the biggest competitors to SaaS businesses. At some point, it almost always makes sense to invest in a specialized application. But if you are faced with a one-off problem, like deciding on an implementation detail of a product feature, a quick-and-dirty solution involving Google Sheets (which will give you many of the features of a specialized application essentially for free) may be the best tool for the job. And if your task happens to involve comparing “diffs” of short texts, we hope this article has shown that doing so with Google Sheets, Pandas and Python’s varied ecosystem of libraries is quite doable.

PS: Despite modeling the task as a classification problem, we ultimately did not end up using a machine learning model but a method from the 1980s called weighted Damerau–Levenshtein distance. More on that in a future article.
