Analysing Subtitles with Natural Language Processing

Richard Ashworth
Published in ITV Technology · Sep 16, 2021
What does our content say about climate change?

Albert, a sustainability project adopted by BAFTA, helps the UK’s major broadcasters answer this question through their Subtitles to Save the World initiative. In this post, we’ll look at how data at ITV is used to support this, and how we can expand on the analysis using Natural Language Processing (NLP).

Elasticsearch and Kibana enable different users to search for particular words or phrases within our subtitled content. For example, a clip sales agent might want to find programmes that reference a particular location, or we might be interested to see how frequently people talk about climate change on screen:

Searching subtitle data for mentions of a phrase using Kibana

Subtitle data is also available as a BigQuery dataset, which enables further analysis by combining it with other data sources. For example, we can use catalogue data to determine which genres contain the most mentions of climate change:

Mentions of climate change by Ofcom genre (taken from a sample dataset)

Similarly, we can use this data to identify trends in our content, comparing the frequency of ‘Covid’ and ‘climate change’ in programmes broadcast during 2020:

Frequency of different words in our content over time (taken from a sample dataset)

Limitations and the need for NLP

When we know what we’re looking for, Kibana and SQL queries work well. When we don’t know what to expect in our results, things get more difficult: for example, we might want to identify which people are mentioned most frequently in the context of climate change.

This is where Natural Language Processing comes in. In the following examples, we’ll show how the open-source spaCy and TextBlob libraries can be used to perform this analysis.

Installing spaCy and loading models

After installing spaCy as a Python package, we'll need to provide it with some trained data, known as a model. This is used to identify and categorise entities in our input text, and although it's possible to customise this, the general-purpose en_core_web_trf model is sufficiently accurate for our needs.

Named Entity Recognition

With spaCy now set up and the model loaded, we can use the library to tag different entities that appear in our subtitle text. For example, using the following sentence:

“We can speak to the UK’s own Greta Thunberg, because Amy Bray is a climate change activist from Cumbria, who is getting great acclaim for saying young people should be very worried about their future.”

we can read spaCy’s ents property to extract the named entities and render them using the built-in displacy visualiser:

As well as identifying people and places, spaCy supports a host of other entity types, including products, organisations and even works of art! The spaCy documentation includes a complete list.

Tokenisation

Given the fragmented nature of subtitle segments, it can be difficult to infer the surrounding context from an individual segment. Treating the subtitles for a particular programme as a single document and then splitting (tokenising) this into sentences makes analysis far easier. This can be done using spaCy’s sents property. For example, given a set of subtitle segments from an episode of Jonathan Ross:

we can run the following code to convert this into sentences:

Lemmatisation

Converting words into their canonical form (lemmatisation) is also useful when searching subtitles for a particular phrase. For example, we may be interested in segments containing any of “climate change”, “climate changes”, “climate-changing”, etc. To avoid having to include all the inflected forms of a word in our search query, we can generate the lemmas for each excerpt in our subtitles and pattern match these with our search terms instead:

Spelling correction is also something we could use here to further improve search results.

Analysing Sentiment

An interesting area of NLP is sentiment analysis, widely used to track customer reviews and comments on social media. We can use TextBlob to derive a score between -1 and 1 to reflect the sentiment in a piece of text. This helps us measure the tone of our content, as well as analyse the language being used to describe different entities. For example, in the following sentence:

“It’s great that people are talking about climate change”

we can assess the sentiment using the code below:

Sentiment(polarity=0.8, subjectivity=0.75, assessments=[(['great'], 0.8, 0.75, None)])

These assessments provide a crude measure of our tone of voice, but care is needed when drawing inferences. In the above example, the adjective ‘great’ yields a positive score, yet it is not being used to describe our search term, “climate change”. A deeper analysis of the text is needed to extract the language used to express sentiment about a specific word or phrase.

Using Parts-Of-Speech and Dependency Tags

To perform such a task, we need to examine the grammar of the sentences containing our search terms. We can do this with spaCy using parts-of-speech (POS) and dependency tags:

We can now use the POS tags to extract adjectives from our sentence, and determine which nouns they describe using the dependency tags:

Applying this function to our example sentences enables us to differentiate the use of ‘great’ in the first sentence (which is not being used to describe climate change) from ‘rapid’ and ‘bad’ in the second. A ‘relevant sentiment’ score can then be calculated using only those assessments that are connected directly to “climate change” in the dependency graph:

Bringing it all together

Since named-entity recognition, tokenisation and dependency tagging are not performed in the context of a particular search term, it is more efficient to do this ahead of time and persist the results. We’re using Google Dataflow (a runner for Apache Beam pipelines) to apply some of the functions described above to our subtitle data.

With the results of this analysis at our disposal, we can harness the techniques described above to answer a broader set of questions about our content. Some examples:

Which people do we mention most frequently in the context of climate change?

Results from a sample dataset

Which places do we mention most frequently when we talk about climate change?

Results from a sample dataset

What language do we use when discussing climate change on screen?

Results from a sample dataset

Summary

Although analysing natural language is a difficult problem, NLP libraries like spaCy and TextBlob provide us with powerful tools. As we have shown in the examples, these tools can be leveraged to glean deeper insights from our content through entity extraction, dependency tagging and sentiment analysis.

As well as investigating how different topics are represented on-screen, applying NLP to subtitle data could also be useful in other applications: a compliance team may want to check which entities appear in our scheduled content following a particular news story, for example.

In this post, we have only scratched the surface of what can be achieved with spaCy and TextBlob. Both libraries come with excellent documentation, which I would thoroughly recommend as a starting point for more advanced examples and tutorials.
