Learn Language through Lyrics

Scraping the web for Mandarin lyrics and scoring their difficulty

One of the hardest parts about learning a language is staying committed to the cause. As a beginner Mandarin student I’m constantly looking for new mediums to practice with, both to stay engaged and to reduce the likelihood of gaps in my knowledge. Just recently I’ve discovered Chinese music and there’s a lot that I like, but often the lyrics are very complicated and learning them can just be too difficult. I searched for recommendations on songs to learn, but the majority were outdated and gave little indication of the actual difficulty. With this in mind, I set out to classify the songs and give myself something to practice with.

This article describes how to crawl the web for Chinese songs and apply basic language processing techniques to judge their difficulty and finally make recommendations for learning.

If you’re just looking for songs to learn, you can check this article with the results: https://medium.com/@jyesawtellrickson/6-great-songs-for-chinese-beginners-1a679b4c6392

Learning with (Fun) Repetition

Before we proceed any further, we should formulate the problem in a little more detail. In order to increase learning efficiency, we need songs which contain new characters, but not too many. New characters are far more likely to stick when they are learned in the context of familiar ones, which is what this process should achieve. Our ultimate goal will thus be a list of songs that fit a given level of Chinese studies. Based on what most learning apps tend to recommend, let’s take this to be equivalent to around 10 new characters per song.

Finding the Data

In order to analyse the songs, first we’ll need to collect them. There are a few APIs out there for song lyrics, including the most popular, MusixMatch, but in general these are made for a Western audience and thus their collection of Mandarin songs isn’t perfect. There may exist one fully in Chinese but I wasn’t able to find one, so instead we will turn to scraping.

Thankfully there are many websites with Mandarin lyrics, including KPop Scene and my favourite, Lyrics Translate. So in order to get the lyrics to build our database we can scrape these websites, iterating through the list of all Mandarin lyrics. This process is fairly simple using the ever-helpful scrapy library (you can read more about how I’ve used this in the past here).

It’s important to note here that we don’t want to collect just the lyrics but also as much information about the songs as we can. In particular, we need the title and the artist as a unique identifier for each song, which will allow us to combine it with other data sources. Any further information such as release date and genre is also helpful since, for example, we may be more interested in recent songs from the rap genre.
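As a rough sketch of how a (title, artist) pair can act as the identifier when combining sources, consider the following. The `Song` class, the `merge_sources` helper and the normalisation rule (trimming whitespace and ignoring case) are my own assumptions for illustration, not the exact code behind this article.

```python
from dataclasses import dataclass, field


@dataclass
class Song:
    """Hypothetical record for one scraped song."""
    title: str
    artist: str
    lyrics: str = ""
    extra: dict = field(default_factory=dict)  # e.g. release date, genre

    @property
    def key(self) -> tuple:
        # Assumption: trimming whitespace and ignoring case is enough for
        # the same song scraped from two sites to get the same identifier.
        return (self.title.strip().casefold(), self.artist.strip().casefold())


def merge_sources(*sources):
    """Combine song lists from several sources, keyed on (title, artist)."""
    merged = {}
    for songs in sources:
        for song in songs:
            if song.key not in merged:
                merged[song.key] = song
            else:
                kept = merged[song.key]
                # Keep the first lyrics seen, but collect any new metadata.
                kept.lyrics = kept.lyrics or song.lyrics
                kept.extra = {**song.extra, **kept.extra}
    return list(merged.values())


# Placeholder data: one site has the lyrics, the other has the genre.
site_a = [Song(" Song A ", "Artist X", lyrics="我爱你")]
site_b = [Song("song a", "artist x", extra={"genre": "pop"})]
print(merge_sources(site_a, site_b))
```

In practice titles often differ by more than case and spacing (for example Traditional versus Simplified spellings), so a real key would probably need further normalisation.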

The songs were cleaned by removing all punctuation and English words and, where necessary, converting Traditional Chinese characters to Simplified Chinese. A stopword list was also used when looking for the most popular characters.
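A minimal sketch of this cleaning step, using only the standard library, might look like the following. The three-entry Traditional-to-Simplified table and the stopword set are tiny illustrative stand-ins (a real pipeline would use a full conversion table and stopword list), and the function names are hypothetical.

```python
import re

# Deliberately tiny, illustrative Traditional -> Simplified table.
T2S = str.maketrans({"愛": "爱", "夢": "梦", "們": "们"})

# Anything outside the main CJK Unified Ideographs block is treated as
# noise: punctuation, digits, Latin letters (English words), and so on.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")

# Illustrative stopwords only.
STOPWORDS = {"的", "了", "是"}


def clean_lyrics(text: str) -> str:
    """Convert to Simplified, then replace non-Chinese runs with a space."""
    simplified = text.translate(T2S)
    return NON_CHINESE.sub(" ", simplified).strip()


def content_characters(text: str) -> list:
    """Characters left after cleaning and dropping stopwords."""
    cleaned = clean_lyrics(text).replace(" ", "")
    return [ch for ch in cleaned if ch not in STOPWORDS]


print(clean_lyrics("我愛你, baby! 我的夢…"))        # 我爱你 我的梦
print(content_characters("我愛你, baby! 我的夢…"))  # ['我', '爱', '你', '我', '梦']
```

Replacing the removed runs with a space, rather than deleting them outright, keeps separate lines from being glued together before segmentation.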

For the analysis in this article, approximately 4000 songs were sourced. The songs are summarised below. The median song is around 300 characters long and contains 30 unique characters. The median for the average HSK level of the songs is 3, so on average the characters aren’t too difficult (more on this later).

Plots showing histograms of the Unique Characters, Total Characters, Uses per Word and the Average HSK Level for the songs used in the analysis. The median is shown in a black dotted line.

Let’s also have a look at the artists in our collection, ranked by the number of songs that we have. Pop artists are very popular (it’s in the name, really) and it’s no surprise they dominate the list, but it’s interesting to see that five out of the top seven are solo male artists and the other two are all-male groups, with the first female artist coming in at position eight.

The top artists in our collection of songs.

If you’re familiar with Chinese songs, then the following will come as no surprise: the top words used throughout the song titles include love, heart, feeling and flower. How nice. A selection of the song titles that use these characters can be seen below.

Song titles including some of the most popular characters.

Measuring Difficulty

Now that we have songs, how do we rate them? For learning Mandarin the typical rating system is the 汉语水平考试, also known as the HSK system. With this system, Mandarin words are separated into different levels based on their usage in everyday life. The levels range from 1 to 6, with many characters also existing outside the system. For reference, in order to study at most Chinese universities one must pass HSK 4 and to attain fluency, HSK 6 is a good starting point. For each of our songs, we’d like to sort the words into their HSK levels and then we can do something meaningful with them.

One key difference between English and Mandarin is that Mandarin doesn’t put spaces between its words. For example, 这个东西真心很赞 should be separated as 这个|东西|真心|很|赞, but 西 also appears in other words such as 西海岸, in which case the separation would be different. The way characters are grouped into words can be subtle and is non-trivial. In order to split the sentences we will use SnowNLP, which is a neat library for working with Mandarin. With this we can split our sentences using natural language processing algorithms that take the context of the sentence into account.
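To get a feel for why segmentation needs a model or a dictionary at all, here is a toy forward maximum matching segmenter. To be clear, this is not SnowNLP’s algorithm (with SnowNLP the split comes from `SnowNLP(text).words`); it is a much simpler, swapped-in technique, and the six-entry vocabulary is a made-up stand-in for a real dictionary.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word
        # matches; a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary, for illustration only.
VOCAB = {"这个", "东西", "真心", "很", "赞", "西海岸"}

print(forward_max_match("这个东西真心很赞", VOCAB))  # ['这个', '东西', '真心', '很', '赞']
print(forward_max_match("西海岸", VOCAB))            # ['西海岸']
```

A greedy approach like this breaks down on ambiguous sentences, which is exactly where a statistical segmenter earns its keep.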

Now that we’ve split our lyrics into individual characters, or words, we can cross-check them against our HSK dictionary and classify each one. With this classification we then need some sort of scoring mechanism. We will use two main metrics for this:

  • Average HSK Level
  • Readability (and the related count of unknown characters)

Readability is simply defined as the number of words that the reader can, or should, understand divided by the total number of words. For example, if the reader is on HSK 3 the calculation would be (num HSK 1 words + num HSK 2 words + num HSK 3 words) / total words, which gives us a percentage. We can get the number of unknown characters from this by subtracting readability from 1 and multiplying by the total words: (1 - readability) * total words. This will come in handy in a bit.
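The readability calculation can be sketched in a few lines. The function names and the small word-to-level mapping are hypothetical and only illustrative (it is not the real HSK list); words missing from the mapping are treated as outside the HSK system and therefore unknown at every level.

```python
def readability(words, hsk_levels, learner_level):
    """Share of words whose HSK level is at or below the learner's level."""
    if not words:
        return 0.0
    known = sum(
        1 for w in words
        if hsk_levels.get(w, float("inf")) <= learner_level
    )
    return known / len(words)


def unknown_count(words, hsk_levels, learner_level):
    """(1 - readability) * total words, rounded to a whole number of words."""
    score = readability(words, hsk_levels, learner_level)
    return round((1 - score) * len(words))


# Illustrative levels only; 孤单 is deliberately left out of the mapping.
TOY_HSK = {"我": 1, "爱": 1, "你": 1, "世界": 3}
lyrics = ["我", "爱", "你", "世界", "孤单", "我"]

print(readability(lyrics, TOY_HSK, learner_level=3))    # 5 of 6 words known
print(unknown_count(lyrics, TOY_HSK, learner_level=3))  # 1
print(unknown_count(lyrics, TOY_HSK, learner_level=1))  # 2
```

Note that this follows the formula above and counts every occurrence of a word, so an unknown word repeated in the chorus is counted each time.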

Average HSK Level is simply the average of the HSK level of the different characters used in the song.
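As a sketch, the average can be taken over the distinct words in a song. One thing the definition leaves open is what to do with words outside the HSK system. Here I assume they count as a notional level 7, which is purely my choice for illustration; dropping them instead would be equally defensible. The mapping is again a toy one.

```python
from statistics import mean

# Assumption: words outside the HSK lists are scored as a notional level 7.
OUT_OF_HSK = 7


def average_hsk_level(words, hsk_levels):
    """Mean HSK level over the distinct words used in a song."""
    unique_words = set(words)
    if not unique_words:
        return 0.0
    return mean(hsk_levels.get(w, OUT_OF_HSK) for w in unique_words)


# Illustrative levels only; 孤单 is deliberately left out of the mapping.
TOY_HSK = {"我": 1, "爱": 1, "你": 1, "世界": 3}
lyrics = ["我", "爱", "你", "世界", "孤单", "我"]

# Distinct words score 1, 1, 1, 3 and 7, so the mean is 13 / 5.
print(average_hsk_level(lyrics, TOY_HSK))
```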

Now we have two metrics which we can use in identifying the best songs for learning.

The Results

It’s instructive to look at the data and see how difficult the songs will be to learn. In particular, we’d like to know at what HSK level it becomes possible to start learning from Chinese songs. For this, let’s say that a song should have no more than 10 new characters. We can then plot a histogram of the new characters in each song based on what one should know at a certain HSK level. Such a plot is shown below.

Histogram of New Characters (horizontal axis) across different HSK levels (vertical axis).

It can be seen that at HSK 1 the vast majority of songs have more than 10 new characters and would be a struggle to learn. It’s not until one has grasped HSK 3 that a large number of songs become available for learning. Thus, as a learning method this is recommended for those who are at HSK 3 or above (lucky for me, that’s my exact level!).
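The same question can be asked numerically: for each HSK level, what share of songs stays within the new-character budget? The sketch below assumes each song is already a list of segmented words; the helper names, the toy mapping and the two tiny songs are all made up, and the budget is lowered to 1 only so the toy data shows a difference.

```python
def new_word_count(words, hsk_levels, learner_level):
    """Occurrences of words above the learner's level (or outside HSK)."""
    return sum(
        1 for w in words
        if hsk_levels.get(w, float("inf")) > learner_level
    )


def share_within_budget(songs, hsk_levels, budget=10):
    """For HSK levels 1 to 6, the fraction of songs with at most `budget` new words."""
    shares = {}
    for level in range(1, 7):
        within = sum(
            1 for words in songs
            if new_word_count(words, hsk_levels, level) <= budget
        )
        shares[level] = within / len(songs) if songs else 0.0
    return shares


# Illustrative levels and songs only.
TOY_HSK = {"我": 1, "爱": 1, "你": 1, "世界": 3}
toy_songs = [["我", "爱", "你"], ["我", "世界", "孤单"]]

print(share_within_budget(toy_songs, TOY_HSK, budget=1))
```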

Learning the Songs

In order to put this all to use we can begin by filtering our songs according to our HSK level, keeping those with readability over 60% (we should know more than half the characters at least) and no more than 10 new characters. This will leave us with a list of songs we can begin to study. We can then use the following two tools to actually learn the songs:

  • MandarinSpot: this tool annotates Mandarin text with convenient tooltips that explain any unknown characters.
  • YouTube: with the trusty YouTube, it’s possible to search for the songs according to the name and artist stored earlier and listen to our heart’s content.
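The filtering step described above, together with a search link for each pick, can be sketched as follows. The song dictionaries, the `recommend` and `youtube_search_url` helpers and the toy HSK mapping are assumptions for illustration; the thresholds default to the 60% readability and 10 new characters used in this article.

```python
from urllib.parse import quote_plus


def recommend(songs, hsk_levels, learner_level,
              min_readability=0.6, max_new=10):
    """Songs readable enough for the learner, easiest (fewest new words) first."""
    picks = []
    for song in songs:
        words = song["words"]
        if not words:
            continue
        unknown = sum(
            1 for w in words
            if hsk_levels.get(w, float("inf")) > learner_level
        )
        score = 1 - unknown / len(words)
        if score > min_readability and unknown <= max_new:
            picks.append((song["title"], song["artist"], round(score, 2), unknown))
    picks.sort(key=lambda pick: pick[3])
    return picks


def youtube_search_url(title, artist):
    """Build a YouTube search link from the stored title and artist."""
    return "https://www.youtube.com/results?search_query=" + quote_plus(f"{title} {artist}")


# Placeholder songs and illustrative levels only.
TOY_HSK = {"我": 1, "爱": 1, "你": 1, "世界": 3}
toy_songs = [
    {"title": "Song A", "artist": "Artist X", "words": ["我", "爱", "你", "世界"]},
    {"title": "Song B", "artist": "Artist Y", "words": ["孤单", "世界", "我"]},
]

for title, artist, score, unknown in recommend(toy_songs, TOY_HSK, learner_level=1):
    print(title, artist, score, unknown, youtube_search_url(title, artist))
```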

Conclusion

With this method it’s possible to discover new songs, learn them at a much faster rate and improve your overall Mandarin ability. As ability continues to improve, the readability threshold and the HSK level can be adjusted to give access to more and more songs. Further steps would be categorisation of songs according to genre and year, which would allow for the choice of songs that are of more interest.

Going further, it would be great to integrate this technology into a learning app. In doing so, it would be possible to maintain a record of the exact words that the user is comfortable with and then base the scoring and recommendations on this, rather than on the HSK system alone.
