Learn Language through Lyrics

Scraping the web for Mandarin lyrics and scoring their difficulty

One of the hardest parts about learning a language is staying committed to the cause. As a beginner Mandarin student I’m constantly looking for new media to practice with, both to stay engaged and to reduce the likelihood of gaps in my knowledge. I recently discovered Chinese music and there’s a lot that I like, but the lyrics are often very complicated and learning them can be too difficult. I searched for recommendations on songs to learn, but the majority were outdated and gave little indication of the actual difficulty. With this in mind, I set out to classify the songs and give myself something to practice with.

This article describes how to crawl the web for Chinese songs and apply basic language processing techniques to judge their difficulty and finally make recommendations for learning.

If you’re just looking for songs to learn, you can check this article with the results: https://medium.com/@jyesawtellrickson/6-great-songs-for-chinese-beginners-1a679b4c6392

Learning with (Fun) Repetition

Finding the Data

Thankfully there are many websites with Mandarin lyrics, including KPop Scene and my favourite, Lyrics Translate. To build our database of lyrics we can scrape these websites, iterating through each site’s list of Mandarin songs. This process is fairly simple using the ever-helpful Scrapy library (you can read more about how I’ve used this in the past here).

It’s important to note here that we don’t want to collect just the lyrics but also as much information about the songs as we can. In particular, we need the title and the artist as a unique identifier for the song which will allow us to combine them with other data sources. Any further information such as release date and genre are also helpful since, for example, we may be more interested in recent songs from the rap genre.
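As a rough illustration of why the title and artist matter, here is a minimal sketch of merging records from two sources on a normalised (title, artist) key. The `Song` class, `song_key` and `merge_sources` are hypothetical names made up for this example, not part of the original scraper, and the sample records are placeholders.

```python
from dataclasses import dataclass, field


def song_key(title: str, artist: str) -> tuple[str, str]:
    """Normalise title and artist so the same song matches across sources."""
    return (title.strip().casefold(), artist.strip().casefold())


@dataclass
class Song:
    title: str
    artist: str
    lyrics: str = ""
    metadata: dict = field(default_factory=dict)  # e.g. release date, genre


def merge_sources(*sources: list[Song]) -> dict[tuple[str, str], Song]:
    """Combine song lists, keeping the first non-empty lyrics and all metadata."""
    merged: dict[tuple[str, str], Song] = {}
    for source in sources:
        for song in source:
            key = song_key(song.title, song.artist)
            if key not in merged:
                merged[key] = Song(
                    song.title, song.artist, song.lyrics, dict(song.metadata)
                )
            else:
                existing = merged[key]
                existing.lyrics = existing.lyrics or song.lyrics
                # Earlier sources win when metadata fields conflict.
                existing.metadata = {**song.metadata, **existing.metadata}
    return merged


# Placeholder records: one source has the lyrics, the other has the genre.
lyrics_site = [Song("Song A", "Artist X", lyrics="我爱你")]
charts_site = [Song(" song a ", "Artist X", metadata={"genre": "pop"})]
combined = merge_sources(lyrics_site, charts_site)
```

Normalising case and whitespace is only a starting point; real data would likely also need handling for alternate spellings and romanised names.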

The songs were cleaned by removing all punctuation and English words and where necessary, converting Traditional Chinese characters to Simplified Chinese. For the purpose of discovering the popular characters a stopword list was also used.
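The cleaning step could look something like the sketch below. It is an assumption about the approach rather than the exact code used: it keeps only characters in the basic CJK Unified Ideographs range, which drops punctuation, digits, whitespace and English in one pass. The stopword list is a tiny made-up sample, and Traditional-to-Simplified conversion is left out because it needs a dedicated library.

```python
import re

# Basic CJK Unified Ideographs block; rarer extension blocks are ignored here.
HAN = re.compile(r"[\u4e00-\u9fff]+")

# Hypothetical, tiny stopword list purely for illustration.
STOPWORDS = {"的", "了", "是"}


def clean_lyrics(raw: str) -> str:
    """Keep only Chinese characters, dropping punctuation, digits and English."""
    return "".join(HAN.findall(raw))


def remove_stopwords(words: list[str], stopwords: set[str] = STOPWORDS) -> list[str]:
    """Drop very common words before counting the popular ones."""
    return [w for w in words if w not in stopwords]


cleaned = clean_lyrics("我爱你, baby! 我的心。(x2)")
filtered = remove_stopwords(["我", "的", "心"])
```

Note that this also removes line breaks, so any line-level structure should be captured before cleaning if it is needed later.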

For the analysis in this article, approximately 4000 songs were sourced. The songs are summarised below. The median song is around 300 characters long and contains 30 unique characters. The median of the songs’ average HSK level is 3, so on average the characters aren’t too difficult (more on this later).

Plots showing histograms of the Unique Characters, Total Characters, Uses per Word and the Average HSK Level for the songs used in the analysis. The median is shown in a black dotted line.

Let’s also have a look at the artists in our collection, ranked by the number of songs that we have. Pop artists are very popular (it’s in the name, really) and it’s no surprise they dominate the list, but it’s interesting to see that five out of the top seven are solo male artists and the other two are all-male groups, with the first female artist coming in at position eight.

The top artists in our collection of songs.

If you’re familiar with Chinese songs, then the following will come as no surprise: the top words used throughout the song titles include love, heart, feeling and flower. How nice. A selection of the song titles that use these characters can be seen below.

Song titles including some of the most popular characters.

Measuring Difficulty

One key difference between English and Mandarin is that Mandarin doesn’t have spaces between its words. For example, 这个东西真心很赞 should be separated as 这个|东西|真心|很|赞, but 西 also appears in many other words such as 西海岸, in which case the separation would be different. The way characters are separated is often subtle and non-trivial. In order to split the sentences we will be using SnowNLP, which is a neat library for working with Mandarin. With this we can split our sentences using natural language processing algorithms that take into account the grammar and context of the sentence.
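To get a feel for the problem, here is a deliberately simple segmenter. It is not what SnowNLP does internally: it is plain forward maximum matching against a dictionary, shown only to illustrate how the same character can end up in different words. The `forward_max_match` function and the toy `VOCAB` are assumptions made up for this example. (With SnowNLP itself, the segmented words should be available as the `words` attribute of a `SnowNLP(text)` object.)

```python
def forward_max_match(text: str, vocab: set[str]) -> list[str]:
    """Greedy segmentation: at each position take the longest word in vocab."""
    max_len = max((len(w) for w in vocab), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy vocabulary, just large enough for the two examples above.
VOCAB = {"这个", "东西", "真心", "很", "赞", "西海岸"}

sentence_words = forward_max_match("这个东西真心很赞", VOCAB)
coast_words = forward_max_match("西海岸", VOCAB)
print("|".join(sentence_words))  # 这个|东西|真心|很|赞
print("|".join(coast_words))     # 西海岸
```

A greedy dictionary lookup like this fails on genuinely ambiguous sentences, which is exactly why a statistical library is the better choice for real lyrics.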

Now that we’ve split our lyrics into individual characters, or words, we can cross-check them against our HSK dictionary and classify each one. With this classification we then need some sort of scoring mechanism. We will use two main metrics for this:

  • Average HSK Level
  • Readability (and, derived from it, the count of unknown characters)

Readability will simply be defined as the number of words that the reader can, or should, understand divided by the total number of words. For example, if the reader is on HSK 3 the calculation would be (num HSK 1 words + num HSK 2 words + num HSK 3 words) / total words, which gives us a value between 0 and 1. We can get the number of unknown characters from this by subtracting it from 1 and multiplying by the total words: (1 - readability) * total words. This will come in handy in a bit.

Average HSK Level is simply the average of the HSK level of the different characters used in the song.
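The two metrics can be sketched as follows. The function names and the sample `HSK` mapping are assumptions for illustration, and so are two choices the text leaves open: words missing from the HSK list are treated as unknown to every learner, and they are skipped when averaging.

```python
def readability(words: list[str], hsk: dict[str, int], learner_level: int) -> float:
    """Share of words whose HSK level is at or below the learner's level."""
    if not words:
        return 0.0
    known = sum(1 for w in words if hsk.get(w, float("inf")) <= learner_level)
    return known / len(words)


def unknown_count(words: list[str], hsk: dict[str, int], learner_level: int) -> int:
    """Equivalent to (1 - readability) * total words, computed without rounding."""
    return sum(1 for w in words if hsk.get(w, float("inf")) > learner_level)


def average_hsk_level(words: list[str], hsk: dict[str, int]):
    """Mean HSK level of the distinct words that appear in the HSK list."""
    levels = [hsk[w] for w in set(words) if w in hsk]
    return sum(levels) / len(levels) if levels else None


# Illustrative levels only; a real run would load the full HSK word lists.
HSK = {"我": 1, "爱": 1, "你": 1, "心": 2, "花": 3, "梦": 4}
lyrics_words = ["我", "爱", "你", "我", "心", "花", "梦", "龘"]

score = readability(lyrics_words, HSK, learner_level=3)      # 6 of 8 words known
new_words = unknown_count(lyrics_words, HSK, learner_level=3)
avg_level = average_hsk_level(lyrics_words, HSK)
```

Here 梦 (above the learner's level) and 龘 (not in the list at all) are the two unknown words, so readability is 0.75.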

Now we have two metrics which we can use in identifying the best songs for learning.

The Results

Histogram of New Characters (horizontal axis) across different HSK levels (vertical axis).

It can be seen that at HSK 1 the vast majority of songs have more than 10 new characters and would be a struggle to learn. It’s not until one has grasped HSK 3 that a large number of songs become available for learning. Thus, as a learning method this is recommended for those who are at HSK 3 or above (lucky for me, that’s my exact level!).
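The same idea can be expressed as a count of songs "within reach" at each level. This is a sketch under assumptions: `songs_within_reach` is a hypothetical helper, the threshold of 10 new words follows the rule of thumb above, levels are taken to run from 1 to 6, and every occurrence of an unknown word is counted, as in the readability formula (counting distinct words would be an equally reasonable choice).

```python
def count_new(words: list[str], hsk: dict[str, int], learner_level: int) -> int:
    """Number of words above the learner's level or missing from the HSK list."""
    return sum(1 for w in words if hsk.get(w, float("inf")) > learner_level)


def songs_within_reach(
    songs: dict[str, list[str]],
    hsk: dict[str, int],
    max_new: int = 10,
    levels: range = range(1, 7),
) -> dict[int, int]:
    """For each learner level, count songs with at most max_new unknown words."""
    return {
        level: sum(
            1 for words in songs.values()
            if count_new(words, hsk, level) <= max_new
        )
        for level in levels
    }


# Toy data with a strict threshold so the effect is visible on tiny songs.
toy_hsk = {"我": 1, "你": 1, "心": 2, "花": 3}
toy_songs = {"song_a": ["我", "你"], "song_b": ["我", "心", "花"]}
reach = songs_within_reach(toy_songs, toy_hsk, max_new=0, levels=range(1, 4))
```

On the toy data only one song is fully readable at levels 1 and 2, and both are at level 3, mirroring the jump seen in the histogram.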

Learning the Songs

  • MandarinSpot: this tool annotates Mandarin text with convenient tooltips for any unknown characters.
  • YouTube: with the trusty YouTube, it’s possible to search for the songs by the name and artist stored earlier and listen to our heart’s content.


Going further, it would be great to integrate this technology into a learning app. In doing so, it would be possible to maintain a record of the exact words that the user is comfortable with and then base the scoring and recommendations on this, rather than basing it off just the HSK system.
