Of Chains and Linguistics: Applying the Markov Model to Language

Kai Wei Tan
7 min read · Aug 13, 2021


As I transition from student to full-time member of the workforce, I have been reflecting a lot on my moments at NUS and the various takeaways. My final-year project, or Honors Dissertation, was a defining moment of my four years, especially since it was rare for a Business student to choose the research route to meet degree requirements.

Because of my Business Analytics specialization, my coursework was closely tied to data science. Interestingly, most of my projects ended up revolving around natural language processing (NLP) and/or text analytics.

For my final-year project, I embarked on the ambitious objective of analyzing the grammar of Singlish (Singapore colloquial English), something not yet widely explored in the field. It turned out to be a really meaningful way to cap off my final semester, giving me an opportunity to apply the various data science skills I learnt as well as the experience from previous projects.

One of the models I used was the Markov chain, which I first learnt about in early 2020 as part of a stochastic modelling module. After using it previously in a text analytics project, I decided to take it further and explore its potential for this project.

State to state

Simply put, a Markov chain models a set of distinct, known states (together known as the state space) and the probabilities that a system transitions from one state to another.

Example illustration by V. Powell

The defining feature of a Markov chain is that the next state depends only on the current state: the probability of each transition is unaffected by the path the system took to reach its current state. This is referred to as the Markov property, or “memorylessness”.
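
Formally, the Markov property can be written as follows (a standard textbook statement, not taken from the dissertation itself):

    P(X_{n+1} = x | X_n = x_n, X_{n-1} = x_{n-1}, …, X_0 = x_0) = P(X_{n+1} = x | X_n = x_n)

That is, conditioning on the entire history gives the same transition probability as conditioning on the current state alone.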

Markov chains are also categorized into discrete-time or continuous-time, depending on whether the system’s transitions follow discrete or continuous time steps.

A discrete-time Markov chain could be a simple model tracking the weather each day. A continuous-time Markov chain, on the other hand, is usually more complicated. An example is a queueing model, where people join and leave the queue at random times, say with exponentially distributed interarrival and service times with means of 5 and 3 minutes respectively.
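
As a toy illustration of the discrete-time case, here is a minimal Python sketch of a daily weather chain (all states and probabilities here are made up for illustration):

    import random

    # Made-up transition probabilities: each row gives P(next state | current state).
    transition_probs = {
        "sunny": {"sunny": 0.8, "rainy": 0.2},
        "rainy": {"sunny": 0.4, "rainy": 0.6},
    }

    def simulate(start, steps):
        """Walk the chain for a given number of steps, sampling each transition."""
        state, path = start, [start]
        for _ in range(steps):
            options = transition_probs[state]
            state = random.choices(list(options), weights=list(options.values()))[0]
            path.append(state)
        return path

    print(simulate("sunny", 7))  # e.g. ['sunny', 'sunny', 'rainy', ...]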

Applying this to text

We can consider text as a discrete-time system — the state space being all possible words, and at each time step, the system “jumps” from one word to another, eventually forming phrases, sentences, and so on.

Let’s see how this works with a small set of words! I’ve also added some made-up probabilities: just imagine that if the current word is “he”, the probability that the next word is “is” would be 100%.

Illustration of a discrete-time Markov chain for text
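
In code, the same idea might look something like this (only the “he” → “is” probability of 100% comes from the example above; the rest of the table is made up for illustration):

    import random

    # Word-to-word transition probabilities (hypothetical values).
    word_probs = {
        "he": {"is": 1.0},
        "is": {"happy": 0.5, "tired": 0.5},
    }

    def next_word(current):
        """Sample the next word given the current word."""
        options = word_probs[current]
        return random.choices(list(options), weights=list(options.values()))[0]

    # Starting from "he", the chain jumps word to word until it hits a dead end.
    word, sentence = "he", ["he"]
    while word in word_probs:
        word = next_word(word)
        sentence.append(word)
    print(" ".join(sentence))  # e.g. "he is happy"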

Interestingly, this forms the basis of predictive text! (I also explored this in my previous project.) Of course, modern predictive text software has grown much more complex, using even more cutting-edge models.

Here comes the big data

Now that we have a better idea of how we can model text as transitions between words, we can then apply this to large amounts of text.

In particular, since my project aimed to provide a more quantitative analysis of Singlish, I had to work out a way to parse through large amounts of text, and then get the transition probabilities between each and every word. As such, I wrote code that did the following:

  • Record every pair of back-to-back words as a transition, each with a start_word and an end_word.
  • Count the number of occurrences for each and every transition and save it as transitions. (This will be important for later.)
  • The transition probability of start_word X to end_word Y, or prob, could then be calculated as: number of transitions with start_word X and end_word Y ÷ total number of transitions with start_word X. Do this for all start_word and end_word.
  • Output the entire result set as a dataframe (i.e. a table).

(And, of course, not forgetting the pre-processing, which involved tasks like standardizing everything to lowercase, as well as manually correcting commonly misspelled words. A code sketch of these steps follows below.)
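
Put together, a minimal sketch of those steps might look like this, assuming the corpus is already pre-processed into a list of sentences (illustrative code only, not the project's actual implementation, which lives in the repository linked at the end):

    from collections import Counter
    import pandas as pd

    # A toy pre-processed corpus; the real one had 30,000 sentences.
    corpus = ["he got a lot of work today", "he has a deadline today"]

    # Steps 1-2: record every pair of back-to-back words and count occurrences.
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for start_word, end_word in zip(words, words[1:]):
            counts[(start_word, end_word)] += 1

    # Steps 3-4: compute transition probabilities and output as a dataframe.
    transitions = pd.DataFrame(
        [(s, e, n) for (s, e), n in counts.items()],
        columns=["start_word", "end_word", "count"],
    )
    # prob = count(start_word -> end_word) / total count for that start_word
    totals = transitions.groupby("start_word")["count"].transform("sum")
    transitions["prob"] = transitions["count"] / totals
    print(transitions)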

The result would look something like this:

Using the same example from the previous section

My project dataset comprised 20,000 sentences from a dataset of text messages contributed by NUS students and compiled by the School of Computing (which I call the NUS corpus), as well as 10,000 sentences scraped randomly from Singaporean local forums (Sammyboy and HardwareZone; do note that many of their sub-forums are NSFW).

It took about 20 minutes for my code to process all the text data, perhaps showing how much more processing text data tends to require compared to, say, numeric data. I have since improved the code to run more efficiently, which will hopefully make things faster!

A more quantitative approach to linguistics

Extensive research has been done on the “patterns” of Singlish, and the literature identifies particular grammatical features as common in Singlish. Can I compute the probability of these features appearing, based on my data?

Yes! Enter the parts_of_speech table, from the Moby Project by Grady Ward:

Using the same example from the previous section

The next piece of the puzzle comes in modelling grammatical features as, once again, transition probabilities. For example, let’s look at one of the features I investigated: the use of “got” in a possessive function (as opposed to “have” as in standard English):

  • English: He has a lot of work today…
  • Singlish: He got a lot of work today…

Remember the transitions table? Since it contains all word-to-word transitions and their frequencies, we can then use it alongside the parts_of_speech table, and apply what I call Join, Filter, Aggregate (which should ring a bell for those familiar with analytics)!

Let’s see how this approach works:

  • Join both tables to get a combined table of all start_word and end_word, as well as their parts of speech and transition probabilities.
  • Filter the result, keeping only the parts of speech or particular words that we are considering for the problem.
  • Aggregate the results to get an overall probability.

And let’s see how we can apply it to our previous example of “got” as a possessive function:

  • Join transitions and parts_of_speech.
  • Filter the combined table, such that we only keep rows where start_word is a pronoun.
  • For each pronoun, we are interested in how many transitions lead to “got” (Singlish) or “has” (standard English). In other words, we are considering the probability of “got”, given that the transition leads to either “got” or “has”: we take the occurrences of “got” and divide them by the total occurrences of “got” plus “has”. Do this for each pronoun (hence the Aggregate part). A code sketch follows this list.
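
Continuing the earlier sketch, the three steps might look like this in pandas (the parts_of_speech table below is a made-up stand-in for the actual Moby Project data):

    # Made-up stand-in for the Moby Project parts-of-speech table.
    parts_of_speech = pd.DataFrame({
        "word": ["he", "it", "got", "has"],
        "pos": ["pronoun", "pronoun", "verb", "verb"],
    })

    # Join: attach each start_word's part of speech to the transitions table.
    combined = transitions.merge(parts_of_speech, left_on="start_word", right_on="word")

    # Filter: keep rows where start_word is a pronoun and end_word is "got" or "has".
    filtered = combined[
        (combined["pos"] == "pronoun") & combined["end_word"].isin(["got", "has"])
    ]

    # Aggregate: for each pronoun, P("got" | transition leads to "got" or "has").
    got = filtered[filtered["end_word"] == "got"].groupby("start_word")["count"].sum()
    total = filtered.groupby("start_word")["count"].sum()
    print((got / total).fillna(0))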

And there we have it! For my study, I got some interesting results for this particular grammar feature:

  • For the pronoun “he”, the probability of the Singlish feature appearing was about 0.6, an example being “he got a lot of work to do leh”.
  • For “it”, however, everyone in the data used it in the standard English manner, for instance “it has many uses”. That is, the probability of the Singlish feature appearing was, surprisingly, 0.

At first glance, the probabilities were lower than expected. However, this appeared to align with existing research, which proposes that the use of Singlish lies on a “spectrum”. That is, the Singlish seen in everyday usage actually lies somewhere in between perfectly standard English and a hypothetical language with maximal Singlish features.

My data was something like this

For this project, a large proportion of the data came from students, who tend to use relatively more standard English even when texting. Users of the other data source, local forums, may also tend to largely follow standard English grammar, though their extensive usage of Singaporean vernacular makes the text easily identifiable as Singlish.

Conclusion

I concluded my project by proposing that if a “uniquely Singaporean” text generation model were to be made, the Markov model might be one way to model the specific grammatical features that make Singlish, well, Singlish.

As a closing remark, you may also remember the key assumption of memorylessness in a Markov model. I acknowledge that this may not hold for text, as the usage of certain words may affect the distribution of other words further down the sentence. This could be mitigated with a model that works on phrases (groups of more than one word) rather than single words, which I also explored and wrote code for, though at least for my project the results were not as promising. (Come to think of it, this may be a topic for another post!)
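
As a rough sketch of that idea, a second-order chain treats pairs of consecutive words as states, so the next word depends on the previous two (a generic illustration reusing the toy corpus and Counter from the earlier sketch, not the project's actual phrase-level model):

    # States are now (word, word) pairs instead of single words.
    pair_counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            pair_counts[((w1, w2), w3)] += 1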

Nonetheless, this was my way of bringing a more quantitative approach to a line of research which has always been, for the most part, qualitative.

Project repository

RPubs version

If this has been interesting to you, I am happy to discuss this further with you via LinkedIn or kaiwei@u.nus.edu!
