How Data Analytics Can Help You to Learn a New Language Faster

Fernando Campelo
8 min readSep 27, 2022

--

Figure 1 — HIMYM Spanish subtitles

Introduction

Have you heard about the 80/20 theory? it states that everything in the World is imbalanced. According to the Pareto principle, for many outcomes, roughly 80% of the consequences result from 20% of the causes. In other words, a small percentage of causes have a disproportionate impact. This concept is important to grasp because it can assist you in determining which initiatives to prioritise in order to have the greatest impact. It is easier to get the idea from the examples:

  • 80% of the World’s wealth is held by 20% of the people
  • 80% of results are caused by 20% of thinking and planning
  • 80% of retail sales are produced by 20% of a store’s brand
  • 20% of your phone apps get 80% usage
  • 20% of your TV channels/shows are watched 80% of the time
  • 20% of the cities have 80% of the population

I’m pretty sure you got the idea. Having this theory in mind, I have been applying this method to improve my Spanish skills. According to Lingvist (2019), there are about 93,000 Spanish words in the Real Academia dictionary. That is, according to this theory, if I learn the top 20% of the most commonly used words in Spanish, I should be able to have proficient conversations with Spanish speakers.

For instance, I have recently combined two of my interests: data analytics and learning to speak Spanish. I also decided to use the first season of one of my favourite shows, How I Met Your Mother (HIMYM), to spice things up. The reason why I’ve chosen HIMYM is that I am very familiar with the vocabulary used on the show, and it represents some of the day-to-day dialogues in life. That’s what I did:

  1. Downloaded the HIMYM Spanish subtitles from a streaming service
  2. Scrapped a table with the 1000 most spoken words in Spanish with English translation
  3. Cleansed the data in Power BI
  4. Created a relationship model between both tables
  5. Analysed the most common words in the first season of the show and their translation into English

It seems simple but there were some challenges during this journey that I would like to share in this article. Let’s get started!

Getting Spanish subtitles

Firstly, as far as I am concerned, it isn’t illegal to use subtitles for non-profit projects like this. However, in order to avoid any confusion, I will only provide the tutorials that I have used to obtain the files and discuss some of the difficulties that I encountered.

Unfortunately, I couldn’t find a way to obtain the subtitles for all episodes or seasons at once. Therefore, I had to download them one by one. For that, you can use either of the following methods:

  1. Dev tools in Chrome/Firefox
  2. Chrome extension + script

After downloading them, usually in a .srt file, I recommend converting to .XLS which can then be easily manipulated in Excel or Power BI via DAX. I’ve tried a few online converters, but this one has converted in a nicer way to manipulate the data. Then, open and delete all columns that you are not going to use in the data analysis. In my case, I deleted everything but the subtitles column.

Assuming that everything went well until this point, it is time to combine all .XLS files into a single file. It can be done in Power BI, by getting all data from the same folder and clicking on “Combine & Transform Data”

Figure 2 — Combine several .xls files into a single file

Data cleansing

Once the table has been imported into Power BI, it is time to clean the data in order to extract all words from the subtitles. The “Subtitles” column in the.xls table contains Spanish sentences from all dialogues in the show, which is not very useful for our analysis. As a result, we must perform the following transformations in Power Query:

  • Remove “<i>” and “</i>” strings
  • Remove symbols (such as -,.?!¿¡””)
  • Split columns based delimited by space — so you can get the words instead of sentences
  • Change every word to lower or uppercase to keep consistency
  • Remove blank cells

You can perform these transformations by using either the command below in the advanced editor query or via the Power Query UI, as shown in Figure 3.

let Source = Folder.Files(“C:\Users\youruser\Documents\subtitles\es\testcsv”), #”Filtered Hidden Files1" = Table.SelectRows(Source, each [Attributes]?[Hidden]? <> true), #”Invoke Custom Function1" = Table.AddColumn(#”Filtered Hidden Files1", “Transform File”, each #”Transform File”([Content])), #”Renamed Columns1" = Table.RenameColumns(#”Invoke Custom Function1", {“Name”, “Source.Name”}), #”Removed Other Columns1" = Table.SelectColumns(#”Renamed Columns1", {“Source.Name”, “Transform File”}), #”Expanded Table Column1" = Table.ExpandTableColumn(#”Removed Other Columns1", “Transform File”, Table.ColumnNames(#”Transform File”(#”Sample File”))), #”Changed Type” = Table.TransformColumnTypes(#”Expanded Table Column1",{{“Source.Name”, type text}, {“Column1”, type text}}), #”Replaced Value” = Table.ReplaceValue(#”Changed Type”,”<i>”,””,Replacer.ReplaceText,{“Column1”}), #”Replaced Value1" = Table.ReplaceValue(#”Replaced Value”,”</i>”,””,Replacer.ReplaceText,{“Column1”}), #”Replaced Value2" = Table.ReplaceValue(#”Replaced Value1",”-”,””,Replacer.ReplaceText,{“Column1”}), #”Replaced Value3" = Table.ReplaceValue(#”Replaced Value2",”?”,””,Replacer.ReplaceText,{“Column1”}), #”Replaced Value4" = Table.ReplaceValue(#”Replaced Value3",”.”,””,Replacer.ReplaceText,{“Column1”}), #”Replaced Value5" = Table.ReplaceValue(#”Replaced Value4",”,”,””,Replacer.ReplaceText,{“Column1”}), #”Replaced Value6" = Table.ReplaceValue(#”Replaced Value5",”!”,””,Replacer.ReplaceText,{“Column1”}), #”Replaced Value7" = Table.ReplaceValue(#”Replaced Value6",”¿”,””,Replacer.ReplaceText,{“Column1”}), #”Replaced Value8" = Table.ReplaceValue(#”Replaced Value7",”¡”,””,Replacer.ReplaceText,{“Column1”}), #”Replaced Value9" = Table.ReplaceValue(#”Replaced Value8",””””,””,Replacer.ReplaceText,{“Column1”}), #”Split Column by Delimiter” = Table.ExpandListColumn(Table.TransformColumns(#”Replaced Value9", {{“Column1”, Splitter.SplitTextByDelimiter(“ “, QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), “Column1”), #”Changed Type1" = Table.TransformColumnTypes(#”Split Column by Delimiter”,{{“Column1”, type text}}), #”Filtered Rows” = Table.SelectRows(#”Changed Type1", each [Column1] <> null and [Column1] <> “”), #”Lowercased Text” = Table.TransformColumns(#”Filtered Rows”,{{“Column1”, Text.Lower, type text}}), #”Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#”Lowercased Text”, {{“Column1”, Splitter.SplitTextByDelimiter(“#(lf)”, QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), “Column1”), #”Changed Type2" = Table.TransformColumnTypes(#”Split Column by Delimiter1",{{“Column1”, type text}}), #”Filtered Rows1" = Table.SelectRows(#”Changed Type2", each [Column1] <> null and [Column1] <> “”) in #”Filtered Rows1"

Figure 3 - Data cleansing in Power Query

Finally, close and apply the transformations to the table to start the data analysis.

Data scraping

In addition to the previous steps, I have also scrapped data from the top 1000 most common Spanish words by using this app. It is quite handy and it works for several websites that you may be interested in getting data from. My intention in downloading this table was to make a comparison between these top common words with the vocabulary used in the show and to generate an intersection table. Hence, by creating a relationship between both tables, I could translate the Spanish words from HIMYM into English.

Figure 4 — Data Model

Data analysis

As shown in the figure below, those are the top 20 most spoken Spanish words in HIMYM season 1. Interestingly, the same word may have multiple translations depending on the context, such as no, que, en, which is pretty similar to my first language, Portuguese. Additionally, most of these words are connecting words which are essential to link ideas.

Figure 5 — Most spoken Spanish words on HIMYM

Another interesting graph is displayed in Figure 6, which shows that the top 100 most spoken words in the show represent 1.44% of the total unique words spoken in all episodes in the first season. On the other hand, if we consider the non-unique total of words spoken in the first season, they account for nearly 35% of the total words in the first season. That’s insane!

Figure 6 — Top 100 most spoken words in HIMYM comparison

Further, to aggregate more value to the research, I’ve done a second analysis by removing the stop words from my table. From the NLP viewpoint, stop words are basically a set of commonly used words in any language. Kavita explains in an outstanding manner what are stop words in this article. In her words:

The reason why stop words are critical to many applications is that, if we remove the words that are very commonly used in a given language, we can focus on the important words instead. For example, in the context of a search engine, if your search query is “how to develop information retrieval applications”, If the search engine tries to find web pages that contained the terms “how”, “to” “develop”, “information”, ”retrieval”, “applications” the search engine is going to find a lot more pages that contain the terms “how”, “to” than pages that contain information about developing information retrieval applications because the terms “how” and “to” are so commonly used in the English language. If we disregard these two terms, the search engine can actually focus on retrieving pages that contain the keywords: “develop” “information” “retrieval” “applications” — which would bring up pages that are actually of interest. This is just the basic intuition for using stop words.

As a result, without the stop words, those are the top Spanish words in HIMYM. “Ted,” the series’ main character, is the most frequently used word. Naturally, “Hola” (hello) is the second most frequently used, followed by:

  • — Know
  • Marshall — Series’ character
  • Noche — night
  • Robin/Lily/Barney- Series characters
  • Gracias- thanks
  • Creo- I believe
  • Chica- girl
  • Genial- brilliant
  • Hacer- to do
  • Dios- God
  • Quiero — I want
  • Tienes- you have
  • Siento- I feel
  • Amigo- friend
  • Fiesta- party
Figure 7 — Most spoken Spanish words without stop words

Conclusion

Bueno mis amigos, yo espero que os haya gustado de ese articulo. To sum up, this is a great Logarithmic function that displays the percentage of the total words spoken in the show the top 100 words represent. It’s important to highlight that this is a very small sample (100 words out of 6922 unique words) in this research. This outcome though, suggests that by learning the Top 4000 or 5000 words out of the 93,000 in Spanish, you can actually have daily conversations (Intermediary). In my view, this result supports the 80/20 theory. Please, hope you guys had fun reading this article and can apply it to other shows popular shows (a.k.a. Friends) and languages. Cheers!

Figure 8 — Logarithmic function

--

--