Expanding shortened (SMS/tweet/post) slang and word contractions into full sentences with NLP
A major problem in NLP is shortened words in tweets and posts, such as “but” to “bt”, “for” to “fr”, “not” to “nt”, and “there” to “thre”. What makes it harder is that no single dictionary maps all slang words to their original English counterparts.
But for sentiment analysis we need the full sentence with its proper meaning, or else the sentiment may not come out as we expect.
Let's take an example and understand.
1. Only he told his mistress that he loved her. (Nobody else did)
2. He only told his mistress that he loved her. (He didn’t show her)
3. He told only his mistress that he loved her. (Kept it a secret from everyone else)
4. He told his only mistress that he loved her. (Stresses that he had only ONE!)
5. He told his mistress only that he loved her. (Didn’t tell her anything else)
6. He told his mistress that only he loved her. (“I’m all you got, sweetie — nobody else wants you.”)
7. He told his mistress that he only loved her. (Not that he wanted to marry her.)
8. He told his mistress that he loved only her. (Yeah, don’t they all…).
In the above sentences, the placement of “only” changes the meaning of the sentence completely. So we need to keep the semantic meaning of the sentence intact in order to get the correct sentiment.
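To make the point concrete, here is a minimal sketch (standard library only) showing that a bag-of-words view, which ignores word order, cannot tell the eight sentences above apart. The helper name `bag_of_words` is just an illustration and not part of any library.

```python
import re
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Lowercase the sentence and count its words, ignoring their order."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


sentences = [
    "Only he told his mistress that he loved her.",
    "He only told his mistress that he loved her.",
    "He told only his mistress that he loved her.",
    "He told his only mistress that he loved her.",
    "He told his mistress only that he loved her.",
    "He told his mistress that only he loved her.",
    "He told his mistress that he only loved her.",
    "He told his mistress that he loved only her.",
]

bags = [bag_of_words(s) for s in sentences]

# All eight sentences use exactly the same words, so every bag is equal
all_identical = all(bag == bags[0] for bag in bags)
print(all_identical)
```

This prints True: once word order is thrown away, eight different meanings collapse into one representation, which is why the restored sentence has to keep its words in place.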
Correcting SMS Slangs:
In my endless search for how to correct this, I came across many solutions. Let's try them one by one.
1. Dictionary:
We can create a dictionary of the shortened slang and use it to replace the shortened text with the original words.
I made two files, one for contractions and another for SMS slang.
1.1. I found the contraction file on the web. It changes values like “doesn’t” to “does not” and “can’t’ve” to “cannot have”.
1.2. Next come the SMS slang terms, that is, changing “bt” to “but”, “hv” to “have”, and “nt” to “not”. Compiling them manually is not feasible because there are so many of them, so I searched the web and found a site, noslang.com, that lists a huge number of them.
So let's scrape the website and get the list:
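The full listing is not reproduced here, so what follows is a hedged reconstruction assembled from the fragments explained below, not the original script. It assumes that `http` is a `urllib3.PoolManager()` (the fragments call `http.request('GET', ...)` and read `r.data`), and the function names `parse_slang_page` and `scrape_all` are hypothetical. The page layout (a `div` with class `dictionary-word` holding an `abbr` with a `title` and a `span`) is taken from the fragments and may have changed on the live site.

```python
import json
import string

import urllib3                 # third-party: pip install urllib3
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

BASE_URL = "https://www.noslang.com/dictionary/"


def parse_slang_page(html: str) -> dict:
    """Extract {slang: meaning} pairs from one dictionary page."""
    soup = BeautifulSoup(html, "html.parser")
    slang_map = {}
    for entry in soup.find_all("div", {"class": "dictionary-word"}):
        abbr = entry.find("abbr")
        span = entry.find("span")
        if abbr is None or span is None or not abbr.has_attr("title"):
            continue  # skip entries that do not match the assumed layout
        # Drop the last two characters of the label, as the article does
        slang_map[span.text[:-2]] = abbr["title"]
    return slang_map


def scrape_all(path: str = "ShortendText.json") -> dict:
    """Fetch the pages for a to z and save the merged dictionary as JSON."""
    http = urllib3.PoolManager()
    slang_map = {}
    for letter in string.ascii_lowercase:
        response = http.request("GET", BASE_URL + letter)
        html = response.data.decode("utf-8", "replace")
        slang_map.update(parse_slang_page(html))
    with open(path, "w") as file:
        json.dump(slang_map, file)
    return slang_map
```

Calling `scrape_all()` needs network access and makes 26 requests. Keeping the parsing in its own function means it can be tried on a saved page first.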
Let us walk through the code.
r = http.request('GET', 'https://www.noslang.com/dictionary/' + alpha)
soup = BeautifulSoup(r.data, 'html.parser')
This sends a request for the web page and parses the returned HTML, from which we will extract the entries. “alpha” takes each letter in [a-z], because the pages on the website are linked as follows:
For all slangs starting with ‘A’: https://www.noslang.com/dictionary/a
For all slangs starting with ‘B’: https://www.noslang.com/dictionary/b
{'class': 'dictionary-word'} selects the div elements that contain each slang term and its meaning:
for i in soup.findAll('div', {'class': 'dictionary-word'}):
    abbr = i.find('abbr')['title']
    Abbr_dict[i.find('span').text[:-2]] = abbr
The “title” attribute holds the original words and the “span” holds the SMS slang; we save the pair in a dictionary. The [:-2] slice drops the last two characters of the span text, which are not part of the slang.
for one in range(97, 123):
    linkDict.append(chr(one))
This builds the letters a to z (character codes 97 to 122), which are appended to the link one at a time.
with open('ShortendText.json', 'w') as file:
    json.dump(Abbr_dict, file)
And finally, we save the dictionary for later use.
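Once the dictionary is saved, applying it is a word-by-word lookup. The sketch below is one possible way to do the replacement, not necessarily how the original script does it. `SAMPLE_MAP` is a tiny hand-written stand-in for the contents of ShortendText.json and the contraction file, and `expand` is a hypothetical helper name.

```python
import re

# Tiny hand-written sample; in practice the mapping would be loaded
# from ShortendText.json and the contraction file.
SAMPLE_MAP = {"bt": "but", "hv": "have", "nt": "not", "doesn't": "does not"}

# A word is a run of letters and straight apostrophes
TOKEN = re.compile(r"[A-Za-z']+")


def expand(text: str, mapping: dict) -> str:
    """Replace every whole word found in `mapping`; leave the rest as is."""
    def swap(match):
        word = match.group(0)
        return mapping.get(word.lower(), word)
    return TOKEN.sub(swap, text)


print(expand("I hv tried bt it doesn't work, nt even once", SAMPLE_MAP))
```

This prints "I have tried but it does not work, not even once". Matching whole words keeps "bt" inside a word such as "obtain" untouched. Note that the pattern only recognises the straight apostrophe, so text with curly apostrophes would need normalising first.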
Let's put the whole thing to the test:
Let's take an example text and see what happens. This was one of the shortened tweets I had:
#Americans came in wd full support fr #India after d #PulwamaAttack , bt wait der support hs a rider.US don't want India to go on full war agnst #Pakistan ,rather they want something like #SurgicalStrike or Lil more than dt bt nt much.
After applying the code, we get something like this:
#Americans came in well done full support fr #India after d #PulwamaAttack , bit torrent wait there support headshot a rider.US do not want India to go on full war agnst #Pakistan ,rather they want something like #SurgicalStrike or Lil more than double team bit torrent nice try much.
Changes are as follows:
SMS slang | What we got | Expected
--------- | ----------- | --------
wd | well done | with
bt | bit torrent | but
der | there | there
hs | headshot | has
don't | do not | do not
dt | double team | that
bt | bit torrent | but
nt | nice try | not
Conclusion: Although it worked in some cases, most of the replacements were wrong.
In terms of accuracy (correct predictions / total changes):
accuracy = 2/8 = 25%
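The accuracy figure can be checked with a few lines of Python. This is only a sketch of the arithmetic: the rows are copied from the table above, and `accuracy` is an illustrative helper that counts a row as correct when the replacement matches the expected word exactly.

```python
# (slang, what we got, expected) rows copied from the table above
rows = [
    ("wd", "well done", "with"),
    ("bt", "bit torrent", "but"),
    ("der", "there", "there"),
    ("hs", "headshot", "has"),
    ("don't", "do not", "do not"),
    ("dt", "double team", "that"),
    ("bt", "bit torrent", "but"),
    ("nt", "nice try", "not"),
]


def accuracy(rows) -> float:
    """Fraction of rows where the replacement equals the expected word."""
    correct = sum(1 for _, got, expected in rows if got == expected)
    return correct / len(rows)


print(f"{accuracy(rows):.2%}")
```

Only "der" and "don't" come out right, so the script prints 25.00%.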
2. TextBlob spell correction:
TextBlob is a Python library that offers a simple API for performing basic NLP tasks.
A good thing about TextBlob objects is that they behave just like Python strings, so you can transform and play with them the same way you would in plain Python. Below are some basic tasks. Don’t worry about the syntax; it is just meant to give you an intuition for how closely TextBlob is related to Python strings.
TextBlob has a method called correct() that attempts to correct the spelling. Let us try that:
Output:
Actual:
"#Americans came in wd full support fr #India after d #PulwamaAttack , bt wait der support hs a rider.US don't want India to go on full war agnst #Pakistan ,rather they want something like #SurgicalStrike or Lil more than dt bt nt much."
Using Textblob:
"#Americans came in we full support fr #India after d #PulwamaAttack , it wait der support he a rider.of don't want India to go on full war against #Pakistan ,rather they want something like #SurgicalStrike or Oil more than it it it much."
Changes are as follows:
SMS slang | What we got | Expected
--------- | ----------- | --------
wd | we | with
bt | it | but
hs | he | has
US | of | US
agnst | against | against
Lil | Oil | Little
dt | it | that
bt | it | but
nt | it | not
In terms of accuracy:
accuracy = 1/9 = 11.11%
3. Finally, the Ginger grammar checker:
Ginger is an online tool that checks spelling and grammar. There is an API for it on GitHub, so let's try that as well.
Or you can install it with pip:
pip3 install gingerit
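A hedged sketch of how the package can be called. `GingerIt().parse(text)` returns a dictionary whose "result" entry holds the corrected text. The wrapper `ginger_correct` and its optional `parser` argument are my own additions for illustration, so that the helper can be tried without the package or a network connection.

```python
def ginger_correct(text: str, parser=None) -> str:
    """Return Ginger's corrected version of `text`.

    `parser` is any object with a parse(text) method that returns a dict
    containing a "result" key; by default a GingerIt instance is created.
    """
    if parser is None:
        # Imported lazily so the helper can be exercised without the package
        from gingerit.gingerit import GingerIt  # third-party: pip3 install gingerit
        parser = GingerIt()
    return parser.parse(text)["result"]
```

Calling `ginger_correct(tweet)` with no parser sends the text to Ginger's online service, so it needs network access, and the result depends on that service.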
Result:
Actual:
#Americans came in wd full support fr #India after d #PulwamaAttack , bt wait der support hs a rider. US don't want India to go on full war agnst #Pakistan , rather they want something like #SurgicalStrike or Lil more than dt bt nt much.
After gingerIt:
"#Americans came in wild full support for #India after d #PulwamaAttac , but wait dear support has a rider. US don't want India to go to full war against #Pakista , rather they want something like #SurgicalStrike or Lil more than diet but not much"
Changes are as follows:
SMS slang | What we got | Expected
--------- | ----------- | --------
wd | wild | with
fr | for | for
bt | but | but
der | dear | there
hs | has | has
agnst | against | against
dt | diet | that
bt | but | but
nt | not | not
In terms of accuracy:
accuracy = 6/9 = 66.67%
Conclusion: Of the three approaches we tried, Ginger gave by far the best result.
FULL CODE IS AT:
Thanks for reading this post.
If you have any other suggestions, please let me know in the comments section.
More resources: you can check out this link on sentiment analysis for more info.