Gibberish Text Detection Using a Markov Model

Let’s understand an application of Markov Models

Gagandeep Singh
School of ML
4 min read · Aug 30, 2020


Before we begin, we need to understand two things:

  1. What is gibberish text?
  2. What is a Markov model?

Gibberish

Concept

Any text or word that is nonsense or doesn’t exist is gibberish. Detection is mostly done on individual words, because to call a whole sentence nonsense we would also have to consider whether its grammar is correct.

Example

  1. asdgasdsd
  2. hfihdfugud

Applications

Gibberish detection helps improve the quality of data: we can filter out words and sentences that carry no meaning.

Markov Models

Concept

According to Wikipedia — “In probability theory, a Markov model is a stochastic model used to model randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Markov property).”

That looks way too complicated. Let’s try to understand it with an example.

[Diagram: two-state Markov chain for Sunny and Rainy days]

There are two states: Sunny and Rainy.

  1. After a sunny day, the probability that the next day is also sunny is 0.8. That leaves a probability of 0.2 that the next day is rainy.
  2. After a rainy day, the probability that the next day is also rainy is 0.4. That leaves a probability of 0.6 that the next day is sunny.

The diagram above presents the same concept.
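
To make this concrete, here is a minimal sketch in Python (using numpy) that encodes the transition probabilities above and simulates a ten-day forecast. The variable names are my own, not from the article:

```python
import numpy as np

# Transition matrix for the two-state weather example.
# Rows are the current state, columns the next state: [Sunny, Rainy].
P = np.array([
    [0.8, 0.2],  # Sunny -> Sunny, Sunny -> Rainy
    [0.6, 0.4],  # Rainy -> Sunny, Rainy -> Rainy
])

states = ["Sunny", "Rainy"]
rng = np.random.default_rng(0)

# Simulate ten days, starting from a sunny day. Each next state
# depends only on the current one: the Markov property.
day = 0
forecast = [states[day]]
for _ in range(9):
    day = rng.choice(2, p=P[day])
    forecast.append(states[day])

print(" -> ".join(forecast))
```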

Markov models also underpin reinforcement learning: a Markov decision process provides a reward and a new state based on the actions of the agent. So, in reinforcement learning, we do not teach an agent how it should do something; we present it with rewards, positive or negative, based on its actions.

Now, let’s see how we can use a Markov model for gibberish detection.

  1. First, we need to define a list of accepted characters. For that we will choose the letters a–z and ‘ ’ (space).
  2. Next, we want to estimate the probability of seeing one character after another. For example, what is the probability of seeing ‘a’ after ‘b’?
  3. We create a 27x27 matrix that tells us the probability of seeing each character after each other character.
[Figure: the probability matrix]

We initialize every cell with a small starting value (additive smoothing) to make sure the model can handle character pairs it has never seen.
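
A minimal sketch of this setup, assuming lowercase a–z plus space and a starting count of 1 per cell; names like ACCEPTED_CHARS and counts are mine, not the article’s:

```python
import string

import numpy as np

# Accepted characters: the 26 lowercase letters plus space.
ACCEPTED_CHARS = string.ascii_lowercase + " "
CHAR_INDEX = {ch: i for i, ch in enumerate(ACCEPTED_CHARS)}

# 27x27 count matrix. Starting every cell at 1 instead of 0
# (additive smoothing) keeps unseen bigrams from ending up with
# a probability of exactly zero after normalization.
counts = np.ones((len(ACCEPTED_CHARS), len(ACCEPTED_CHARS)))
```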

4. Take a large corpus of text and iterate over every line.

5. For each line, compute the n-grams and increment the corresponding counts in the matrix. For example, for the bigram ‘ab’ we increase the value of probability_matrix(a, b) by 1.

6. After counting, we need to normalize the values. For that, I’ve divided every row by its sum and taken the log of it, so each row holds the log probability of the next character given the current one. A sketch of the full training loop follows below.
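
Putting steps 4 through 6 together, the loop might look roughly like this. The one-line corpus and the helper names normalize and train are illustrative, not the article’s actual code:

```python
import string

import numpy as np

ACCEPTED_CHARS = string.ascii_lowercase + " "
CHAR_INDEX = {ch: i for i, ch in enumerate(ACCEPTED_CHARS)}

def normalize(line):
    """Lowercase a line and drop every character we do not model."""
    return "".join(ch for ch in line.lower() if ch in CHAR_INDEX)

def train(lines):
    """Count character bigrams, then row-normalize and take the log."""
    n = len(ACCEPTED_CHARS)
    counts = np.ones((n, n))  # smoothed counts, as above
    for line in lines:
        text = normalize(line)
        # Slide over consecutive character pairs (bigrams).
        for a, b in zip(text, text[1:]):
            counts[CHAR_INDEX[a], CHAR_INDEX[b]] += 1
    # Each row becomes log P(next char | current char).
    return np.log(counts / counts.sum(axis=1, keepdims=True))

# Toy one-line "corpus"; a real model needs a large text corpus.
log_probs = train(["the quick brown fox jumps over the lazy dog"])
```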

7. After normalization, the probability matrix will look like this.

[Figure: the normalized (log) probability matrix]

8. How do we decide whether a word is gibberish or not? For that, we can try two approaches:

  • Multiply the probabilities of n-grams in Markov Chain fashion.
  • Add the probabilities of n-grams.

Adding the probabilities seems the better option: if we multiply, a single low-probability n-gram drags the whole score down drastically, whereas with addition it does not have nearly as much impact, as the small numeric illustration below shows.
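
A quick numeric illustration of the difference, using raw (not log) probabilities and made-up numbers so the contrast is easy to see:

```python
probs = [0.5, 0.4, 0.001]  # one rare bigram among common ones

product = 1.0
for p in probs:
    product *= p
total = sum(probs)

print(product)  # ~0.0002: the rare bigram dominates the product
print(total)    # ~0.901:  the rare bigram barely moves the sum
```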

9. For prediction, we generate the n-grams of the input word and add their probabilities, as in the sketch below.
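
A sketch of this scoring step, reusing log_probs and CHAR_INDEX from the training sketch above. Here I average the log probabilities over the number of bigrams so the threshold does not depend on word length (the article simply adds them); the threshold value is illustrative and would need tuning on labeled examples:

```python
def score(word, log_probs, char_index):
    """Average log probability of the word's character bigrams."""
    text = "".join(ch for ch in word.lower() if ch in char_index)
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return float("-inf")  # nothing to score
    return sum(log_probs[char_index[a], char_index[b]]
               for a, b in pairs) / len(pairs)

# Illustrative cutoff; with a model trained on a real corpus,
# real words should score above it and gibberish below it.
THRESHOLD = -3.0
for word in ["hello", "asdgasdsd"]:
    label = "ok" if score(word, log_probs, CHAR_INDEX) > THRESHOLD else "gibberish"
    print(word, "->", label)
```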

Drawbacks of this method?

  1. Words that are named entities will not be recognized properly, for example ‘McDonalds’.
  2. We have not taken letter case into consideration, which might affect the results.

Future work

  1. Using trigrams instead of bigrams, with a 3D matrix, and evaluating the same approach (sketched below).
  2. Adding exceptions for rarely occurring words.
  3. Evaluating machine learning models to classify the words.
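
For the trigram idea, the count matrix would gain a dimension; a hypothetical sketch of the shape, not working training code:

```python
import numpy as np

# Trigram variant: counts[i, j, k] is how often character k
# follows the pair (i, j); normalization happens per (i, j) row.
n = 27  # a-z plus space
tri_counts = np.ones((n, n, n))
```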

GitHub link

References

  1. https://towardsdatascience.com/reinforcement-learning-demystified-markov-decision-processes-part-1-bf00dda41690
