Remaking of shortened (SMS/tweet/Post) slangs and word contraction into Sentences NLP

Indresh Bhattacharyya
Coinmonks
Mar 4, 2019


A big problem in NLP on tweets and posts is shortened words: “but” becomes “bt”, “for” becomes “fr”, “not” becomes “nt”, “there” becomes “thre”, and so on. What makes it worse is that there is no single dictionary that maps every slang word to its original English counterpart.

But for sentiment analysis we need the full sentence with its proper meaning, or else the predicted sentiment may not be what we expect.

Let's take an example and understand.

1. Only he told his mistress that he loved her. (Nobody else did)
2. He only told his mistress that he loved her. (He didn’t show her)
3. He told only his mistress that he loved her. (Kept it a secret from everyone else)
4. He told his only mistress that he loved her. (Stresses that he had only ONE!)
5. He told his mistress only that he loved her. (Didn’t tell her anything else)
6. He told his mistress that only he loved her. (“I’m all you got, sweetie — nobody else wants you.”)
7. He told his mistress that he only loved her. (Not that he wanted to marry her.)
8. He told his mistress that he loved only her. (Yeah, don’t they all…).

In the above sentences, the placement of “only” changes the meaning of the sentence completely. So we need to keep each sentence's semantic meaning intact in order to get the correct sentiment.

Correcting SMS Slangs:

In my endless search for a way to correct this, I came across many solutions. Let's try them one by one.

1. Dictionary:

We can create a dictionary of the shortened slangs and use it to replace the shortened text with the original words.

I made two files, one for contractions and another for SMS slangs.

1.1. I found the contraction file on the web. It maps values like “doesn’t” to “does not” and “can’t’ve” to “cannot have”, etc.
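As a rough sketch of how such a contraction map could be applied (not the file from the web itself): the three entries below are only a sample, and the names CONTRACTIONS and expand_contractions are made up for this illustration.

```python
import re

# Tiny illustrative sample; a real contraction file is much larger.
CONTRACTIONS = {
    "doesn't": "does not",
    "can't've": "cannot have",
    "can't": "cannot",
}

# Try longer keys first so that "can't've" wins over "can't".
_alternatives = "|".join(
    re.escape(key) for key in sorted(CONTRACTIONS, key=len, reverse=True)
)
_PATTERN = re.compile(r"\b(?:" + _alternatives + r")\b", flags=re.IGNORECASE)


def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded form."""
    # Tweets and posts often use the curly apostrophe, so normalise it first.
    text = text.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("He doesn't care"))  # He does not care
```

Sorting the keys by length matters here: without it, “can’t’ve” would be cut short at “can’t” and leave a stray “’ve” behind.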

1.2. Next come the SMS slangs, that is, changing “bt” to “but”, “hv” to “have”, and “nt” to “not”. Building that list by hand is not feasible because there are so many of them, so I searched the web and found a site, noslang.com, that lists a huge number of slangs.

So let's scrape the website and get the list:

Let us understand the code

r = http.request('GET', 'https://www.noslang.com/dictionary/' + alpha)
soup = BeautifulSoup(r.data, 'html.parser')

This creates a request for the web page and gives us the HTML from which we will extract the slang entries. “alpha” takes each letter in [a-z], because the pages on the site are linked as follows:

For all slangs starting with ‘A’: https://www.noslang.com/dictionary/a

For all slangs starting with ‘B’: https://www.noslang.com/dictionary/b

{'class': 'dictionary-word'} is the class which contains the words and their abbreviations:

for i in soup.findAll('div', {'class': 'dictionary-word'}):
    abbr = i.find('abbr')['title']
    Abbr_dict[i.find('span').text[:-2]] = abbr

The “title” attribute contains the original words and the “span” contains the SMS slang, and we save the pair in a dictionary. The [:-2] slice is there because the last two characters of the span text make no sense to us.

for one in range(97, 123):
    linkDict.append(chr(one))

This creates the alphabet [a-z], which is appended to the link.
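The same list of letters is also available from the standard library, which avoids the ASCII codes. A small equivalent sketch, where BASE_URL and page_urls are names chosen here for illustration:

```python
import string

BASE_URL = "https://www.noslang.com/dictionary/"

# One page per starting letter, 'a' through 'z'.
page_urls = [BASE_URL + letter for letter in string.ascii_lowercase]

print(page_urls[0])    # https://www.noslang.com/dictionary/a
print(len(page_urls))  # 26
```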

with open('ShortendText.json', 'w') as file:
    json.dump(Abbr_dict, file)

And finally, we save the dictionary for later use.
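Once the dictionary is saved, using it comes down to a word-by-word lookup. The sketch below is one plausible way to do it, not necessarily the exact code used for the test that follows. The function name expand_slang and the three-entry sample_slang dictionary are assumptions; in practice the dictionary would be loaded from the saved JSON file with json.load().

```python
import re

# Small in-memory stand-in for the scraped dictionary (assumed sample).
sample_slang = {"bt": "but", "hv": "have", "nt": "not"}


def expand_slang(text: str, slang: dict) -> str:
    """Replace every whole word found in `slang` with its expansion."""

    def swap(match):
        word = match.group(0)
        # Look the word up in lower case; unknown words are left untouched.
        return slang.get(word.lower(), word)

    return re.sub(r"[A-Za-z0-9']+", swap, text)


print(expand_slang("i hv nt seen it bt ok", sample_slang))
# i have not seen it but ok
```

A lookup like this ignores context entirely: a short token that is both a slang and a real word is always replaced the same way.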

Let's put the whole thing to the test:

Let's take an example text and see what happens. This was one of the shortened tweets I had:

After applying the code we get something like this:

Changes are as follows:

Conclusion: Though it performed well in some cases, the majority of the replacements were wrong.

In terms of accuracy: correct predictions / total

accuracy = 2/8 = 25%
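The same measure can be written as a small helper that compares the corrected words against the expected ones position by position. This is only a sketch; the function name and the word lists in the usage example are invented, not taken from the tweet.

```python
def accuracy(predicted, expected):
    """Share of positions where the predicted word equals the expected one."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("both lists must be non-empty and of equal length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Hypothetical example: two of four replacements are right.
print(accuracy(["but", "have", "nit", "fur"], ["but", "have", "not", "for"]))  # 0.5

# The figure quoted above: 2 correct out of 8.
print(f"{2 / 8:.0%}")  # 25%
```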

2. TextBlob spell correction:

TextBlob is a Python library that offers a simple API for performing basic NLP tasks.

A good thing about TextBlob objects is that they behave just like Python strings, so you can transform and play with them the same way you would in Python. Below, I have shown some basic tasks. Don’t worry about the syntax; it is just to give you an intuition of how closely TextBlob relates to Python strings.

There is a function in TextBlob called correct() that helps in correcting the spelling. Let us try that:

Output:

Actual:

Using TextBlob:

Changes are as follows:

In terms of accuracy: 1/9 ≈ 11.11%

3. Finally, Ginger grammar check:

Ginger is an online tool for checking spelling and grammatical errors. There is an API for it on GitHub, so let's try that as well.

Or you can use:

pip3 install gingerit

Result:

Actual:

After gingerIt:

Changes are as follows:

In terms of accuracy:

accuracy = 6/9 ≈ 66.67%

Conclusion: Of the three approaches we tried, Ginger gave by far the best result.

FULL CODE IS AT:

Thanks for reading this post.

If you have any other suggestions, please let me know in the comments section.
