Basic Tweet Preprocessing Method With Python

Anil Emrah
Published in Analytics Vidhya
5 min read · Aug 31, 2019

Hi guys! This story is about my tweet preprocessing method, which cleans tweets so that they are easier to process in NLP tasks. I want to explain why we need preprocessing and how we can do it.

NLP tasks mainly try to make sense of text that people produce in their everyday lives; in this story, that text is published tweets. In general, we try to predict labels for tweets: for example, whether a tweet is positive or negative, whether it was written by a human or generated automatically, or whether it is fraudulent. We infer those labels from the words in the tweet, but that raises some problems. For instance, as you can guess, nobody checks spelling before publishing a tweet, and people use a lot of slang and repeat letters inside words. There are many examples like these.

Let’s look at this with an example. Here is our example tweet, which is about the Nepal earthquake.

@john Prayers, For Nepalll!!! http://t.co/XYZ #Nepal #Earthquake

Let’s assume that we are processing thousands of tweets like this one and trying to find a correlation between them and the label column. First of all, we need to find every unique word. To do that, we need to get rid of the punctuation inside them…
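To make that first step concrete, here is a minimal sketch of stripping punctuation and collecting unique words from the example tweet. It is only an illustration and not necessarily the code used in this story: the helper names `strip_punctuation` and `unique_words` are hypothetical, and lowercasing the words and splitting on whitespace are assumptions made for the example.

```python
import string

# The example tweet from this story.
tweet = "@john Prayers, For Nepalll!!! http://t.co/XYZ #Nepal #Earthquake"


def strip_punctuation(text):
    """Remove every ASCII punctuation character from the text.

    Hypothetical helper: it relies only on the standard library's
    string.punctuation and str.translate.
    """
    return text.translate(str.maketrans("", "", string.punctuation))


def unique_words(tweets):
    """Collect the set of distinct words across an iterable of tweets.

    Assumption for this sketch: words are lowercased and split on whitespace.
    """
    words = set()
    for t in tweets:
        words.update(strip_punctuation(t).lower().split())
    return words


print(strip_punctuation(tweet))
# john Prayers For Nepalll httptcoXYZ Nepal Earthquake

print(sorted(unique_words([tweet])))
# ['earthquake', 'for', 'httptcoxyz', 'john', 'nepal', 'nepalll', 'prayers']
```

Notice that removing punctuation alone still leaves "nepalll" and "nepal" as two different words, and the link collapses into the meaningless token "httptcoxyz", so punctuation removal cannot be the only cleaning step.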

Anil Emrah
Analytics Vidhya

Sweden-based full-time developer, and lifelong learner. Loves to learn and experience. Motto: Stay Hungry, Stay Foolish | @anilemrah_