Levenshtein distance as a remedy for sequential data

Filip Trocinski
Datasparq Technology
9 min read · Jan 20, 2022
Photo by Pop & Zebra on Unsplash

Imagine a large web portal reaching out to you to build an AI solution that recommends the most relevant articles to users. That is exactly the task we have been tackling at Datasparq recently. We needed to determine, from the titles alone, which articles had multiple parts and what order they should go in, so that the next article in a sequence can always be recommended.

But articles are not the only place you come across sequential data. Consider movies and TV series, for instance: they would not have produced Avatar 2, 3, 4 and 5 if the first film had not been a success. In this article, we are going to replicate the solution we devised and create a BigQuery table of movie sequels so that we know which movie follows which, e.g. Avatar 2 comes after Avatar, and Star Wars: The Rise of Skywalker is the continuation of Star Wars: The Last Jedi.

We use the “IMDb movies extensive dataset” available on Kaggle: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset, which contains over 85,000 movie titles. This massive dataset comprises four smaller files, namely IMDB movies.csv, IMDB names.csv, IMDB ratings.csv, and IMDB title_principals.csv. We are interested in the first of these, IMDB movies.csv, because it is the one that contains the titles we will parse and from which we will extract sequences. Having said that, let’s get started!

How would you know these two movies are related if there was not any text on the posters?

Setting things up

To analyse this dataset, we will be using BigQuery, one of the key GCP tools. Although this is a paid service, every new user is eligible for $300 of free credit to explore this massive toolset (far more than enough for the purposes of this project).

To load the data, we first need to create a dataset to hold the table. When specifying the location, choose the one closest to you, so that latency is lowest and queries execute faster. We will create the table from the .csv containing all the data. For convenience, it is best to first upload this file to a dedicated bucket on Google Cloud Storage (GCS).

Letting BigQuery figure out the schema automatically will probably raise an error, because some of the columns do not contain consistent data. The (not ideal) workaround we are going to apply is to set those columns to the STRING type. To save time defining everything manually, just copy and paste the following into “Schema as text”.

imdb_title_id:STRING,
title:STRING,
original_title:STRING,
year:STRING,
date_published:STRING,
genre:STRING,
duration:INTEGER,
country:STRING,
language:STRING,
director:STRING,
writer:STRING,
production_company:STRING,
actors:STRING,
description:STRING,
avg_vote:FLOAT,
votes:INTEGER,
budget:STRING,
usa_gross_income:STRING,
worlwide_gross_income:STRING,
metascore:STRING,
reviews_from_users:FLOAT,
reviews_from_critics:FLOAT
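If you prefer the command line to the console, the same upload and load can be sketched with the gsutil and bq tools. The bucket, dataset, table, and schema-file names below are placeholders, not fixed values:

```shell
# Placeholder names throughout; schema.json holds the column list above
gsutil cp IMDb_movies.csv gs://your-bucket/
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  your_dataset.imdb_movies \
  gs://your-bucket/IMDb_movies.csv \
  ./schema.json
```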

Levenshtein distance

Now we are moving to the cornerstone of this article: the Levenshtein distance. Put simply, this metric tells us how different one text string is from another, i.e. how many single-character edits (insertions, deletions, or substitutions) we need to make in one string to turn it into the other.

Unfortunately, there is no ready-made Levenshtein distance function in BigQuery, so we need to create one ourselves. BigQuery supports user-defined functions (UDFs) written either as SQL expressions or as JavaScript code. Luckily for us, Andrei wrote a JavaScript Levenshtein distance function a couple of years back and published it on GitHub under the MIT License: https://gist.github.com/andrei-m/982927. We are going to use his code. However, we will implement a few changes.
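As a reminder of the mechanism, a JavaScript UDF in BigQuery is declared like this. The function name, signature, and placeholder body below are illustrative choices of ours, not part of BigQuery:

```sql
-- Sketch: declaring a JavaScript UDF in BigQuery.
-- Name, signature, and body are illustrative.
CREATE TEMP FUNCTION levenshtein(a STRING, b STRING, max_distance INT64)
RETURNS INT64
LANGUAGE js AS r"""
  // the JavaScript implementation goes here
  // and must end with: return distance;
""";
```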

Currently the code computes the exact Levenshtein distance, no matter how large it is. Given the huge number of records in the dataset, this would be a very computationally heavy process, and the whole query would take a long time to run. It would also consume far more resources, leading to higher GCP bills (or exhausting the free $300 credit much quicker).

The thing is, we do not really care about the exact Levenshtein distance: as soon as it crosses a chosen boundary, the pair is beyond our limit anyway. To address that, and significantly speed up query execution, we are going to return the set boundary for a given pair of titles as soon as it is crossed. Implementing this change allows the code to run approximately 100x faster!
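Here is a minimal sketch, in plain JavaScript, of a distance function with the early-exit behaviour described above. This is our own illustration of the idea, not Andrei’s exact code; in BigQuery it would sit inside the LANGUAGE js UDF body:

```javascript
// Bounded Levenshtein distance: returns the true distance if it is
// below maxDist, otherwise returns maxDist as soon as that is certain.
function boundedLevenshtein(a, b, maxDist) {
  // If the lengths alone differ by more than the cap, the distance must too.
  if (Math.abs(a.length - b.length) > maxDist) return maxDist;
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    let rowMin = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution (or match when cost is 0)
      );
      rowMin = Math.min(rowMin, curr[j]);
    }
    // Row minima never decrease, so once a whole row exceeds the cap
    // the final distance must exceed it too: stop early.
    if (rowMin > maxDist) return maxDist;
    prev = curr;
  }
  return Math.min(prev[b.length], maxDist);
}
```

The early exit is safe because the minimum value in each dynamic-programming row can only grow from row to row, so a row whose minimum already exceeds the cap guarantees the final distance does as well.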

Data cleansing

To speed up performance and increase readability, we do not want to select all the columns from the dataset, but rather focus on those that will be useful in later stages. The most appropriate ones in our case are imdb_title_id, title, director, and date_published.

After a quick dataset exploration, we realise the sequence is expressed in several ways. Below are a few examples.

Figure 1 — sequence at the end of the title
Figure 2 — sequence written as a roman numeral
Figure 3 — sequence in the middle
Figure 4 — sequence at the beginning of the title
Figure 5 — sequence as the year at the end

Data engineering involves a lot of customisation of the data we work with. To extract all sequences from the various titles, we will need to adjust our query to catch all of the cases.

As already mentioned, the Levenshtein distance checks how many characters we need to change in one string to make it identical to the other. What we strive for is the golden mean: how many characters we allow to differ without negatively impacting the final result. We do not want to set it too high, as we do not expect the next sequel to have a totally different title; it is just the sequence number that will differ. We also want to minimise the number of comparisons, to avoid extra query-processing costs and to make the query run faster.

And here comes the problem… If we do not set the distance high enough, we will miss plenty of movies that are actually sequential, e.g. those in Figure 3, which list a subtitle as well as the title. Fortunately, there is a simple workaround: we just have to REPLACE() these substrings with empty strings. Taking Figure 3 as an example, “Destination Dubai” is deleted, leaving just “The Wedding Party 2”. This works perfectly well, removing what is unnecessary and keeping only the string with the valuable number.
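Taking the Figure 3 example, the clean-up can be sketched with a generic REGEXP_REPLACE that drops everything after the colon (a simplification; the production query uses targeted REPLACE() calls, and real data needs more patterns than this one):

```sql
-- Drop a subtitle after the colon so the sequence number stays at the end.
SELECT REGEXP_REPLACE('The Wedding Party 2: Destination Dubai', r':.*$', '') AS cleaned
-- cleaned = 'The Wedding Party 2'
```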

Roman numerals to Arabic numbers:

Here comes another tricky bit. We cannot simply catch a sequence written as a Roman numeral and REPLACE() it with the Arabic one, because that only passes muster in the very simplest situations, e.g. when the Roman numeral is one of I, V, X, etc. What can we do then? Well, this is another place where we can make use of UDFs and create a method that locates the Roman numeral, finds its Arabic representation, and finally replaces it. That is exactly what Mikhail did in this StackOverflow post:

After a little modification of Mikhail’s code to suit our needs, this is what we are going to use:

and subsequently this…
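Since Mikhail’s snippet is linked rather than reproduced here, the idea can be sketched standalone. The version below is our own simplified illustration (hypothetical function names), not his exact code:

```javascript
// Convert a Roman numeral string to an Arabic number.
function romanToArabic(s) {
  const vals = { I: 1, V: 5, X: 10, L: 50, C: 100, D: 500, M: 1000 };
  let total = 0;
  for (let i = 0; i < s.length; i++) {
    const v = vals[s[i]];
    const next = vals[s[i + 1]] || 0;
    total += v < next ? -v : v; // subtractive notation, e.g. IV = 4
  }
  return total;
}

// Replace any whitespace-separated token that is a valid Roman numeral.
// Caveat: a lone "I" (as in "I Am Legend") would also be converted,
// so a production version needs extra guards.
function replaceRoman(title) {
  const roman = /^(?=[IVXLCDM])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$/;
  return title
    .split(' ')
    .map(tok => (roman.test(tok) ? String(romanToArabic(tok)) : tok))
    .join(' ');
}
```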

Sequence extraction:

Contrary to what you may have thought about extracting sequences, we do not need to define another fancy UDF :) Multiple CASE statements with a simple Regex do this job pretty well. As mentioned before, they just need to be properly adjusted to catch every way the sequence is expressed.
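A sketch of the idea, with illustrative patterns and table names (the real query needs a branch for each of the cases in Figures 1–5):

```sql
SELECT
  title,
  CASE
    -- Figure 1 style: number at the end of the title
    WHEN REGEXP_CONTAINS(title, r'\s\d{1,2}$')
      THEN CAST(REGEXP_EXTRACT(title, r'\s(\d{1,2})$') AS INT64)
    -- Figure 3 style: number just before a subtitle colon
    WHEN REGEXP_CONTAINS(title, r'\s\d{1,2}:')
      THEN CAST(REGEXP_EXTRACT(title, r'\s(\d{1,2}):') AS INT64)
    ELSE 0  -- no explicit sequence number found
  END AS seq
FROM `your_project.your_dataset.imdb_movies`
```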

Levenshtein distance in practice:

As we already have the Levenshtein distance method, it is now time to use it in practice. To compare every title against every other one, we join the table to itself using CROSS JOIN.

This operation can be very computationally heavy and take ages to run, hence query optimisation is key; that is why we did not want to compute the precise Levenshtein distance once it crosses our max_boundary.

Although an INNER JOIN would produce the result set quicker, we would need to specify which columns to join the tables on. The only candidates are director or writer. However, there are lots of sequels where both of these differ, so joining on them would likely return only a fraction of our result set. We also cannot join on movie ids, as that would only compare each movie to itself.
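Put together, the comparison step can be sketched as follows, where levenshtein is the bounded UDF discussed earlier and the table and column aliases are illustrative:

```sql
SELECT
  a.title AS title_1,
  b.title AS title_2,
  levenshtein(a.title, b.title, 3) AS distance
FROM movies AS a
CROSS JOIN movies AS b
WHERE a.imdb_title_id != b.imdb_title_id   -- never compare a movie to itself
  AND levenshtein(a.title, b.title, 3) < 3 -- keep only close pairs
```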

As was said before, there is no one-size-fits-all value for Levenshtein distance, and it should always be determined with regard to the data and the problem we are working on. For this specific case, I found allowing up to 3 characters to be different a reasonable number but I highly recommend experimenting with this value on your own, if only to see how the final output can be dramatically different!

Determining sequentiality:

To clearly indicate whether a movie is a subsequent sequel, it would be nice to add a column expressing sequentiality as a simple Boolean: TRUE means this movie is the continuation of the previous one; FALSE means it is not. To do so, we just need to form specific case statements determining how sequentiality is derived from seq_1, seq_2, and next_by_date.

We know for sure that where seq_2 is greater than seq_1 by exactly one, we want to mark that movie as the next part. Not having first parts explicitly marked as 1 slightly complicates this task and forces us to form a slightly more complex WHEN statement: WHEN seq_1 = 0 THEN seq_1 + 1 = seq_2.

Although this should be enough in most cases, a subset of such movies may exist where the titles are near identical, the movies have been produced by the same director, and contain a sequence in the title. This is a dangerous case, where a totally different movie may be declared as a continuation of another one.

To prevent that, we are going to use the next_by_date parameter, which tells us which movie by the same director came next by release date. To be clear: we cannot rely on this parameter alone for sequence determination, as a director may have produced a couple of other movies in the gap between two sequels. It works well, however, as an additional layer of certainty that the next sequel is the correct one. After incorporating it, our statement looks like this:

WHEN seq_1 = 0 AND next_by_date = 1 THEN next_by_date + 1 = seq_2 OR seq_1 + 1 = seq_2

Finally, if neither of these two statements above returns TRUE, we want to classify that movie as non-sequential. Expression seq_1 = next_by_date will do the job.
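Assembled into one expression, the logic described above might look like this. This is a sketch faithful to the conditions in the text, not the exact production statement:

```sql
CASE
  -- explicit numbering: part n followed by part n + 1
  WHEN seq_1 + 1 = seq_2 THEN TRUE
  -- unnumbered first part, confirmed by release order
  WHEN seq_1 = 0 AND next_by_date = 1
    THEN next_by_date + 1 = seq_2 OR seq_1 + 1 = seq_2
  -- otherwise: non-sequential
  ELSE seq_1 = next_by_date
END AS is_sequel
```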

In the end our code should look something like this:

Final result:

Once we finally have all the bits done, we can combine the pieces using WITH … AS clauses and produce the table. In the final SELECT statement, we select all columns from the last subquery, determine the Levenshtein distance value, and eliminate rows whose titles became empty strings (a side effect of our earlier REPLACE statements). For readability, we can also ORDER the records BY title_1.
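Schematically, the final query has this shape. The subquery names are our own, and the bodies are condensed placeholders for the steps built up in the previous sections:

```sql
WITH cleaned AS (
  -- useful columns only; REPLACE() subtitles and convert Roman numerals here
  SELECT imdb_title_id, title, director, date_published
  FROM `your_project.your_dataset.imdb_movies`
),
compared AS (
  -- sequence extraction, CROSS JOIN, and the bounded Levenshtein UDF
  SELECT a.title AS title_1, b.title AS title_2, a.director,
         levenshtein(a.title, b.title, 3) AS distance
  FROM cleaned AS a
  CROSS JOIN cleaned AS b
  WHERE a.imdb_title_id != b.imdb_title_id
)
SELECT *
FROM compared
WHERE title_1 != '' AND distance < 3  -- drop emptied titles, keep close pairs
ORDER BY title_1
```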

Our table of sequential movies should look something like this, and it has well over 1,000 records:

Figure 6 — Final table

To find out more about Datasparq, please get in touch or go to our website. We would be delighted to hear from you!!
