Sing a song of data: Markov chains, part II

Andy Elmsley
Published in The Sound of AI · 3 min read · Mar 20, 2019

Welcome back, machine learners. This week we’ll continue with Markov chains, and use our first dataset to do what you’ve all been waiting for: machine learning. Again, it’s better if you warm up with the previous post, as we’re picking up from where we left off.

Meet HRP-4, a singing robot from Japan

Scrambled eggs

First, let’s look at the solution to last week’s tasks. You were asked to automatically create a Markov chain transition matrix from the famous nursery rhyme “Humpty Dumpty”.

To use this text in a Markov chain, we first need to tokenise it. This is where we split the single string of text into a list of individual words. We can do this with a simple regular expression:

Running the above code will output the list of individual word tokens.
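As a sketch of that step (the exact regular expression is an assumption on my part; `[\w']+` keeps contractions like "couldn't" in one piece):

```python
import re

rhyme = (
    "Humpty Dumpty sat on a wall, "
    "Humpty Dumpty had a great fall. "
    "All the king's horses and all the king's men "
    "Couldn't put Humpty together again."
)

# Split the single string into word tokens, keeping apostrophes inside words.
tokens = re.findall(r"[\w']+", rhyme)
print(tokens)
# → ['Humpty', 'Dumpty', 'sat', 'on', 'a', 'wall', 'Humpty', 'Dumpty', ...]
```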

Next, we can create our transition matrix by iterating through this list and updating a discrete distribution for each word, conditioned on the word that precedes it. Here’s a function that implements the full procedure.
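One way to sketch such a function, representing the matrix as a nested dictionary of probabilities (the function name and representation are my own, not necessarily the original code):

```python
from collections import defaultdict

def build_transition_matrix(tokens):
    """Count word bigrams, then normalise each row into a probability distribution."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, curr in zip(tokens, tokens[1:]):
        counts[prev][curr] += 1
    # Normalise each word's follower counts into probabilities that sum to 1.
    matrix = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        matrix[prev] = {word: n / total for word, n in nexts.items()}
    return matrix

tokens = ["Humpty", "Dumpty", "sat", "on", "a", "wall",
          "Humpty", "Dumpty", "had", "a", "great", "fall"]
matrix = build_transition_matrix(tokens)
print(matrix["Humpty"])  # {'Dumpty': 1.0}
print(matrix["a"])       # {'wall': 0.5, 'great': 0.5}
```

Note that "Humpty" is always followed by "Dumpty", so its row collapses to a single certain transition, while "a" splits its probability between "wall" and "great".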

Finally, we need an initial distribution, which in our case is quite simple, because we know that we always want to start with the word “Humpty”.
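Since the rhyme always opens with "Humpty", the initial distribution can be sketched as a single certain outcome:

```python
# The rhyme always starts with "Humpty", so the initial distribution
# assigns all probability mass to that one word.
initial_distribution = {"Humpty": 1.0}
```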

Now that we’ve constructed our Markov chain, we can use it to generate a new version of Humpty Dumpty:

Here’s an example output from our Humpty Dumpty generator:

Humpty Dumpty sat on a great fall All the king’s horses and all the king’s horses and all the king’s horses and all the king’s men couldn’t put Humpty together again

Data wrangling

Many AI algorithms rely on creating a statistical model of some dataset. This subset of AI is what’s referred to as Machine Learning (ML) and — guess what — the Humpty Dumpty example above is already a simple application of ML. (How easy was that?)

However, most ML tasks are a little trickier — the dataset will often contain irrelevant information, or need to be cleaned or transformed in some way before it can usefully be given to the algorithm. This process is called data wrangling, and the success or failure of an ML project often comes down to it.

Machine Learning is the process of manipulating your dataset so that the algorithm does what it’s supposed to do — Me

Let’s start by looking at a simple example of data wrangling. We’ll use a dataset from the data science portal Kaggle, mousehead’s Song Lyrics dataset. This dataset contains lyrics for 55000+ songs in English. We’ll use data wrangling to count how many songs this dataset contains for a specific artist.

The dataset is in a simple CSV format, similar to a spreadsheet, and contains 4 columns:

  1. Artist
  2. Song name
  3. Link to a webpage
  4. Lyrics of the song

So to find the number of songs a specific artist has, we need to loop through the rows in the dataset, check the first column against the artist we’re looking for, and increment the song count if it’s a match.
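That loop can be sketched with Python's built-in csv module. Note the file name songdata.csv and the sample rows below are assumptions for illustration — check the column layout of your actual Kaggle download:

```python
import csv
import io

def count_songs(rows, artist):
    """Count lyric rows whose first (artist) column matches the given name."""
    reader = csv.reader(rows)
    next(reader)  # skip the header row
    return sum(1 for row in reader if row[0] == artist)

# Tiny stand-in for the real file; with the Kaggle download you would call
# count_songs(open("songdata.csv", newline="", encoding="utf-8"), "ABBA").
sample = io.StringIO(
    "artist,song,link,text\n"
    "ABBA,Waterloo,/a/abba/waterloo,My my at Waterloo...\n"
    "ABBA,SOS,/a/abba/sos,Where are those happy days...\n"
    "Queen,Bohemian Rhapsody,/q/queen/bohemian,Is this the real life...\n"
)
print(count_songs(sample, "ABBA"))  # 2
```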

Outro

Let’s leave it there for this week. We’ve had our first crack at machine learning, and learned how to perform some simple data wrangling to extract useful information from a dataset.

I’ll leave you with a fun challenge for next week:

  1. Can you use this week’s code to read the Song Lyrics dataset, create a Markov chain for the entire discography of a single artist, and generate a new song with that Markov chain?

Good luck and, as always, you can get the source code for this week on our GitHub.

Continue your training to supreme master-level AI coding here.

And give us a follow to receive updates on our latest posts.

