Gibberish Text Detection Using a Markov Model

Let’s understand an application of Markov Models

Gagandeep Singh
School of ML
4 min read · Aug 30, 2020


Before we begin, we need to understand two things:

  1. What is gibberish text?
  2. What is a Markov model?

Gibberish

Concept

Any text or word that is nonsense or doesn’t exist is gibberish. Detection is mostly done on individual words, because to call a whole sentence nonsense we would also have to consider whether its grammar is correct.

Example

  1. asdgasdsd
  2. hfihdfugud

Applications

Gibberish detection helps improve the quality of data: we can filter out words and sentences that carry no meaning.

Markov Models

Concept

According to Wikipedia — “In probability theory, a Markov model is a stochastic model used to model randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Markov property).”

That looks way too complicated. Let’s try to understand it with an example.

[Diagram: two-state Markov chain for Sunny and Rainy days]

There are two states: Sunny and Rainy.

  1. After a sunny day, the probability that the next day is also sunny is 0.8. That leaves a probability of 0.2 that the next day is rainy.
  2. After a rainy day, the probability that the next day is also rainy is 0.4. That leaves a probability of 0.6 that the next day is sunny.

The diagram above presents the same concept.
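
To make this concrete, here is a minimal sketch in Python (using numpy) that encodes the transition probabilities above and simulates a ten-day forecast. The variable names are my own, not from the article:

```python
import numpy as np

# Transition matrix for the two-state weather example.
# Rows are the current state, columns the next state: [Sunny, Rainy].
P = np.array([
    [0.8, 0.2],  # Sunny -> Sunny, Sunny -> Rainy
    [0.6, 0.4],  # Rainy -> Sunny, Rainy -> Rainy
])

states = ["Sunny", "Rainy"]
rng = np.random.default_rng(0)

# Simulate ten days, starting from a sunny day. Each next state
# depends only on the current one: the Markov property.
day = 0
forecast = [states[day]]
for _ in range(9):
    day = rng.choice(2, p=P[day])
    forecast.append(states[day])

print(" -> ".join(forecast))
```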

Markov models also underpin reinforcement learning: a Markov decision process provides a reward and a new state based on the actions of the agent. So, in reinforcement learning, we do not teach an agent how it should do something; we present it with rewards, positive or negative, based on its actions.

Now, let’s see how we can use a Markov model for gibberish detection.

  1. First, we need to define a list of accepted characters. For that we will choose the letters a–z and ‘ ’ (space).
  2. Next, we want to estimate the probability of seeing one character after another. For example, what is the probability of seeing ‘a’ after ‘b’?
  3. We create a 27x27 matrix that tells us the probability of seeing each character after each other character.
[Figure: the probability matrix]

We initialize every cell with a small starting value (additive smoothing) to make sure the model can handle character pairs it has never seen.
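
A minimal sketch of this setup, assuming lowercase a–z plus space and a starting count of 1 per cell; names like ACCEPTED_CHARS and counts are mine, not the article’s:

```python
import string

import numpy as np

# Accepted characters: the 26 lowercase letters plus space.
ACCEPTED_CHARS = string.ascii_lowercase + " "
CHAR_INDEX = {ch: i for i, ch in enumerate(ACCEPTED_CHARS)}

# 27x27 count matrix. Starting every cell at 1 instead of 0
# (additive smoothing) keeps unseen bigrams from ending up with
# a probability of exactly zero after normalization.
counts = np.ones((len(ACCEPTED_CHARS), len(ACCEPTED_CHARS)))
```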

4. Take a large corpus of text and iterate over every line.

5. For each line, compute the n-grams and increment the corresponding counts in the matrix. For example, for the bigram ‘ab’ we increase the value of probability_matrix(a, b) by 1.

6. After counting, we need to normalize the values. For that, I’ve divided every row by its sum and taken the log of it, so each row holds the log probability of the next character given the current one. A sketch of the full training loop follows below.
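
Putting steps 4 through 6 together, the loop might look roughly like this. The one-line corpus and the helper names normalize and train are illustrative, not the article’s actual code:

```python
import string

import numpy as np

ACCEPTED_CHARS = string.ascii_lowercase + " "
CHAR_INDEX = {ch: i for i, ch in enumerate(ACCEPTED_CHARS)}

def normalize(line):
    """Lowercase a line and drop every character we do not model."""
    return "".join(ch for ch in line.lower() if ch in CHAR_INDEX)

def train(lines):
    """Count character bigrams, then row-normalize and take the log."""
    n = len(ACCEPTED_CHARS)
    counts = np.ones((n, n))  # smoothed counts, as above
    for line in lines:
        text = normalize(line)
        # Slide over consecutive character pairs (bigrams).
        for a, b in zip(text, text[1:]):
            counts[CHAR_INDEX[a], CHAR_INDEX[b]] += 1
    # Each row becomes log P(next char | current char).
    return np.log(counts / counts.sum(axis=1, keepdims=True))

# Toy one-line "corpus"; a real model needs a large text corpus.
log_probs = train(["the quick brown fox jumps over the lazy dog"])
```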

7. After normalization, the probability matrix will look like this.

[Figure: the normalized (log) probability matrix]

8. How do we decide whether a word is gibberish or not? For that, we can try two approaches:

  • Multiply the probabilities of n-grams in Markov Chain fashion.
  • Add the probabilities of n-grams.

Adding the probabilities seems the better option: if we multiply, a single low-probability n-gram drags the whole score down drastically, whereas with addition it does not have nearly as much impact, as the small numeric illustration below shows.
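
A quick numeric illustration of the difference, using raw (not log) probabilities and made-up numbers so the contrast is easy to see:

```python
probs = [0.5, 0.4, 0.001]  # one rare bigram among common ones

product = 1.0
for p in probs:
    product *= p
total = sum(probs)

print(product)  # ~0.0002: the rare bigram dominates the product
print(total)    # ~0.901:  the rare bigram barely moves the sum
```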

9. For prediction, we generate the n-grams of the input word and add their probabilities, as in the sketch below.
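
A sketch of this scoring step, reusing log_probs and CHAR_INDEX from the training sketch above. Here I average the log probabilities over the number of bigrams so the threshold does not depend on word length (the article simply adds them); the threshold value is illustrative and would need tuning on labeled examples:

```python
def score(word, log_probs, char_index):
    """Average log probability of the word's character bigrams."""
    text = "".join(ch for ch in word.lower() if ch in char_index)
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return float("-inf")  # nothing to score
    return sum(log_probs[char_index[a], char_index[b]]
               for a, b in pairs) / len(pairs)

# Illustrative cutoff; with a model trained on a real corpus,
# real words should score above it and gibberish below it.
THRESHOLD = -3.0
for word in ["hello", "asdgasdsd"]:
    label = "ok" if score(word, log_probs, CHAR_INDEX) > THRESHOLD else "gibberish"
    print(word, "->", label)
```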

Drawbacks of this method?

  1. Words that are named entities will not be recognized properly, for example ‘McDonalds’.
  2. We have not taken letter case into consideration, which might affect the results.

Future work

  1. Using trigrams instead of bigrams, with a 3D matrix, and evaluating the same approach (sketched below).
  2. Adding exceptions for rarely occurring words.
  3. Evaluating machine learning models to classify the words.
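
For the trigram idea, the count matrix would gain a dimension; a hypothetical sketch of the shape, not working training code:

```python
import numpy as np

# Trigram variant: counts[i, j, k] is how often character k
# follows the pair (i, j); normalization happens per (i, j) row.
n = 27  # a-z plus space
tri_counts = np.ones((n, n, n))
```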

GitHub link

References

  1. https://towardsdatascience.com/reinforcement-learning-demystified-markov-decision-processes-part-1-bf00dda41690
