Research on the quality of localization of movie titles

Nov 5, 2022

Do movie titles get distorted during localization? How similar are original titles to their “translated” versions? Let’s find out!

Everyone around me is talking about their pet projects, so I decided to start one of my own. That’s how it all started.

As it happened, I chose to analyse information about shows (movies and series).

How to improve skills using pet-projects by Mark Tenenholtz

Following the advice from Mark Tenenholtz, I’ve scraped the data, cleaned it and started exploration.

The scraped dataset contains the following information:

Show ID
Russian Title
Original Title
Actors
Show Info
Ratings
Synopsis
Critic’s Scores

The first thing that came to my mind was to find the differences between Russian and Original titles for every show, because localizers (in every country, I suppose) sometimes tend to approach the process of title translation in a creative way, losing the original meaning of the title.

At the very beginning of this analysis, I set myself two tasks:

How similar are Russian titles and original titles in general?
Is it possible to split dissimilar pairs (russian_title :: original_title) into groups according to the root cause?

Keeping in mind the goal of the research, I’ve started the analysis.

Reading data

The first step is always reading the data.

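import os
import pandas as pd

# relative_path is not defined in this snippet; it should point to the folder with the scraped data.
movie_df = pd.read_parquet(
    os.path.join(relative_path, "movies_info.parquet"),
    columns=["russian_title", "original_title", "country"],
)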

The dataset looks like this:

At the beginning we have 984 movies.

Data preprocessing

Filtering based on the country

During the analysis I found that movies from countries like Japan, South Korea and so on use a transcription of the title from its native script as the original title.

Some movies have original titles, for which we cannot produce quality text embeddings

After some experiments with the data, I’ve decided to drop such rows:

import numpy as np

# Countries whose original titles are transcriptions.
filters = ["Япония", "Корея Южная", "Гонконг", "Китай"]

# check_countries is a helper from the project code (not shown here).
movie_df = (
    movie_df.loc[
        movie_df["country"].apply(lambda country: check_countries(country, filters))
    ]
    if "country" in movie_df.columns
    else movie_df
)

# Treat empty or whitespace-only strings as missing values.
movie_df = movie_df.replace(r"^\s*$", np.nan, regex=True)
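The check_countries helper is not shown in the post. Here is a minimal sketch of what it might look like, assuming the country column holds a comma-separated string of country names (possibly missing) and that the helper returns True for rows we want to keep. The name comes from the code above; the body is my guess, not the original implementation.

```python
def check_countries(country, filters):
    """Return True if the row should be kept.

    Hypothetical helper: assumes `country` is a comma-separated string
    such as "США, Япония" (or a missing value).
    """
    if not isinstance(country, str):
        # Missing value: nothing to filter on, so keep the row.
        return True
    countries = {name.strip() for name in country.split(",")}
    # Keep the row only if none of the filtered countries is listed.
    return countries.isdisjoint(filters)


filters = ["Япония", "Корея Южная", "Гонконг", "Китай"]
print(check_countries("США", filters))          # True
print(check_countries("США, Япония", filters))  # False
print(check_countries(None, filters))           # True
```

With this version, a co-production is dropped as soon as one of its countries is in the filter list. Whether the real helper behaves the same way is an open question.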

Then we delete the country column, because we won’t need it anymore.

movie_df = movie_df.drop("country", axis=1, errors="ignore")

We are left with 877 movies.

Now, we see several problems with the dataset.

Russian titles have the movie release year — we need to remove it.
Not all movies have original_title, because some movies are Russian and have only Russian title — we don’t need them in our analysis.

Cleaning Russian title

Approach 1: Removing last 6 characters

In the previous section, we found out that the russian_title contains the release year of the film.

Let’s check that the last six characters from the russian_title are always the same and look like (year).

six_chars = movie_df["russian_title"].apply(
    lambda s: s[-6:].replace("(", "").replace(")", "")
)
six_chars.value_counts(ascending=True).iloc[:15]

Counts of whitespaces inside parenthesis for every russian title

We can see that among the least frequent values there are a few errors: the “year” contains extra whitespace.

I’m going to check the whole title for this case.

indices = [i for i, year in enumerate(six_chars.values) if " " in year]
movie_df.iloc[indices]

Rows, for which the first approach failed

Aha!

Approach 2: Removing the whole parenthesis

Let’s switch to another strategy: check whether every title contains a substring like (smth) and, if so, remove that substring.

Every title contains some information in brackets — we don’t really care what’s inside them. Our goal is to clean the titles, so, we’ll just delete the brackets with their contents.

import re

movie_df["russian_title"] = movie_df["russian_title"].apply(
    lambda s: re.sub(r"\([^()]*\)", "", s).strip()
)
movie_df["russian_title"]
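To see why the regular expression is more robust than slicing off the last six characters, here is a small self-contained check of both approaches. The two titles are made-up illustrations, not rows from the dataset.

```python
import re

# Hypothetical titles, for illustration only (not rows from the dataset).
titles = ["Борат (2006)", "Шерлок (сериал 2010 - 2017)"]

results = []
for title in titles:
    # Approach 1: assume the title always ends with "(year)".
    last_six = title[-6:].replace("(", "").replace(")", "")
    # Approach 2: remove any "(...)" group, whatever it contains.
    cleaned = re.sub(r"\([^()]*\)", "", title).strip()
    results.append((last_six, cleaned))
    print(repr(last_six), "->", repr(cleaned))
```

The first title yields a clean year under approach 1, while the second leaves a leading space. Approach 2 returns just the name in both cases. Note that the pattern removes every parenthesized group, so a title that legitimately contains parentheses would lose that part as well.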

Checking for missing values

movie_df.isna().any()

We can see that the original title column contains NaN values.

This is probably because not every movie has an original title: if a movie is Russian-made, for example, it may not have an English-translated title.

I think it is reasonable to drop such rows.

movie_df = (movie_df.dropna(axis=0, inplace=False) if movie_df.isna().any().any() else movie_df)

We are left with 744 movies. And we are ready for the analysis.

Semantic similarity

We are going to use multilingual models from the SentenceTransformers framework.

from sentence_transformers import SentenceTransformer

distil_use_v1 = SentenceTransformer("distiluse-base-multilingual-cased-v1")
distil_use_v2 = SentenceTransformer("distiluse-base-multilingual-cased-v2")
minilm = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
mpnet = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
labse = SentenceTransformer("LaBSE")

The quality of the embeddings I’ve used was evaluated here by David Dale. Almost all of them are among the top performers on average across sentence tasks, so for now I will limit the analysis to them.

models = {
    "distiluse-base-multilingual-cased-v1": distil_use_v1,
    "distiluse-base-multilingual-cased-v2": distil_use_v2,
    "paraphrase-multilingual-MiniLM-L12-v2": minilm,
    "paraphrase-multilingual-mpnet-base-v2": mpnet,
    "LaBSE": labse,
}

Before moving on, I’d like to sanity-check the models: assess how well they predict the similarity between the Russian and the original titles.

similarities = {}
for model_name, model in models.items():
    similarity_df = get_similarity_dataframe(model, russian_titles, original_titles)
    similarities[model_name] = similarity_df["similarity"]

    print(model_name)

    with pd.option_context("display.max_rows", None, "display.max_columns", None):
        display(similarity_df.sample(10, random_state=SEED))
        
    print("\n")

The code above prints the name of each embedding model and displays 10 random rows from the dataset with a new similarity column, which contains the similarity between the Russian and the original title.
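get_similarity_dataframe is a helper from the project code and is not shown here. Assuming it scores each pair with the cosine similarity between the two title embeddings (a common choice with SentenceTransformers, but an assumption about this particular code), the core computation can be sketched in plain Python, with toy vectors standing in for real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings"; real models output hundreds of dimensions.
russian_vec = [1.0, 2.0, 0.0]
close_vec = [2.0, 4.0, 0.0]  # same direction as russian_vec
far_vec = [0.0, 0.0, 3.0]    # orthogonal to russian_vec

print(cosine_similarity(russian_vec, close_vec))  # very close to 1
print(cosine_similarity(russian_vec, far_vec))    # 0.0
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, which is why the score can be read as “how similar the meanings are” rather than “how long the titles are”.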

We have five embedding models, which means that we can calculate five similarity scores for every title pair. Which one should we choose?

I’ve decided to define a single similarity score as the median of all five similarity scores.

Calculating single similarity score

similarity_df.drop("similarity", axis=1, inplace=True, errors="ignore")

for model_name, similarity_col in similarities.items():
    similarity_df[model_name] = similarity_col

similarity_df.insert(2, "median_sim", similarity_df[similarities.keys()].median(axis=1))
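One property that may make the median a sensible way to combine the scores: it ignores a single model that disagrees strongly with the others, whereas the mean gets pulled toward it. The five scores below are invented for the example and do not come from the dataset.

```python
from statistics import mean, median

# Hypothetical scores from five models for one title pair.
scores = [0.81, 0.79, 0.84, 0.80, 0.20]  # the last model is an outlier

print(median(scores))          # 0.8
print(round(mean(scores), 3))  # 0.688
```

Four of the five models agree that the pair is similar, and the median reflects that, while the mean drops noticeably because of one low score.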

Results

After we’ve calculated the similarity score, we can analyse what we’ve got.

Our first question was:

How similar are Russian titles and original titles in general?

I think we can answer this question.

similarity_df["median_sim"].describe().round(2)

Descriptive statistics for similarity score

We can see that the title similarity distribution is skewed left: the average similarity is 0.73, below the median of 0.78.

It means that, on average, the titles are somewhat similar, but there are cases where similarity is very low.

And that brings us to the second question:

Is it possible to split dissimilar pairs (russian_title :: original_title) into groups according to the root cause?

I’ve narrowed down dissimilarity cases to three main ones:

Russian title is a cropped version of original title
Another problem in this case may be that embeddings don’t work very well with proper names like Borat :: Борат, Dolittle :: Дулиттл, and so on.
Examples:
— Борат (Borat) :: Borat: Cultural Learnings of America for Make Benefit Glorious Nation of Kazakhstan
— Веном 2 (Venom 2) :: Venom: Let There Be Carnage
— Бёрдмэн (Birdman) :: Birdman or (The Unexpected Virtue of Ignorance)
— Амели (Amelie) :: Le Fabuleux destin d’Amélie Poulain
Russian title is an extended version of original title
The remark about proper names applies to this case too.
Examples:
— Удивительное путешествие доктора Дулиттла (The Amazing Journey of Doctor Dolittle) :: Dolittle
— Пол: Секретный материальчик (Paul: Secret material) :: Paul
— Рапунцель: Запутанная история (Rapunzel: Tangled) :: Tangled
Russian title was localized (made up) by translators/localizers
Sometimes it is better to localize the title due to cultural and other peculiarities, but sometimes it goes too far.
Examples:
— Невероятный мир глазами Энцо (Incredible world through the eyes of Enzo) :: The Art of Racing in the Rain
— Человек, который изменил всё (The man who changed everything) :: Moneyball
— Области тьмы (Areas of darkness) :: Limitless
— Одинокий волк (Lone wolf) :: Clean

Code to perform the research can be found here: