Tweets Classification and Clustering in Python.

Getting started with KMeans Clustering on text data.

Ada kibet
The Startup
8 min read · Jul 25, 2020


Introduction.

Twitter is a social networking and microblogging service on which users post and interact with each other through short messages known as “tweets”. As of April 2020, Dream Grow ranked it as the 6th most popular social networking site and app, with an average of 330 million monthly active users.

Unlike platforms such as Facebook, whose main role is to help people ‘catch up’ with friends, Twitter is where people let loose and engage with personalities from all walks of life on all sorts of matters. This atmosphere makes it an ideal platform for marketers, politicians and anyone else whose success depends on a deep understanding of people’s views.

Through sentiment analysis, interested parties can understand what users are talking about and, from the insights, make appropriate decisions. This post focuses on classifying tweets into 4 major categories: Economic, Social, Cultural and Health, and then performing KMeans cluster analysis on the groups.

Data set

The data used is scraped from Twitter using Tweepy, a Python library for accessing the Twitter API. It contains 197,802 tweets from different users in Kenya. The code to scrape the data is available in this repository.
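
The actual scraping script lives in the repository linked above; purely as an illustration, collecting such tweets with Tweepy 3.x might look roughly like the sketch below, with placeholder credentials and a hypothetical query and geocode around Nairobi.

import tweepy
import pandas as pd

# Placeholder credentials from the Twitter developer portal (hypothetical).
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

rows = []
# Hypothetical query: recent English tweets geotagged within ~500km of Nairobi.
for tweet in tweepy.Cursor(api.search, q="kenya", geocode="-1.2921,36.8219,500km",
                           lang="en", tweet_mode="extended").items(1000):
    rows.append({"screen_name": tweet.user.screen_name, "tweets": tweet.full_text})

tweets_bowl = pd.DataFrame(rows)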

The data set is called tweets_bowl.

A random sample:

Sample data containing usernames and tweets.

Libraries

The following libraries will be used throughout the post.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import spacy
from sklearn.model_selection import train_test_split
import nltk
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string
from string import punctuation
import collections
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
import en_core_web_sm
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score

Preprocessing

Tweets Cleaning

Tweets contain unnecessary elements like hashtags, mentions, links and punctuation that can hurt the performance of an algorithm, so they have to be removed. All text is also converted to lowercase so that the algorithm does not treat the same word with different casing as different words.

# remove the hashtags, mentions and unwanted characters.
def clean_text(df, text_field):
    df[text_field] = df[text_field].str.lower()
    df[text_field] = df[text_field].apply(
        lambda elem: re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", elem))
    return df

tweets_bowl = clean_text(tweets_bowl, 'tweets')
tweets_bowl.head()

Tokenization, Lemmatization and removing stopwords

Stopwords are commonly used words that carry little weight in a sentence compared to other words. They include words like ‘and’, ‘or’, ‘has’, etc.

Tokenization is the process of splitting a string into a list of tokens. A sentence can be reduced to words and a word can be reduced to letters using the appropriate tokenizers.

Lemmatization is reducing a word to its root form. For instance, the root form of ‘rocks’ is ‘rock’.
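
As a quick illustration of both ideas (assuming NLTK’s wordnet data has already been downloaded):

from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

RegexpTokenizer(r'\w+').tokenize("Tweets, hashtags & links!")  # ['Tweets', 'hashtags', 'links']
WordNetLemmatizer().lemmatize('rocks')  # 'rock'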

The tweets are mainly in English and Swahili. The latter is not supported by these tools, so we’ll only work with the former. This limits the analysis somewhat, since the Swahili texts are effectively ignored.
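
If one wanted to drop non-English tweets explicitly, one option (not part of the original pipeline) is the langdetect package; a minimal sketch:

from langdetect import detect

def is_english(text):
    try:
        return detect(text) == 'en'
    except Exception:  # detection can fail on very short or empty strings
        return False

# tweets_bowl = tweets_bowl[tweets_bowl.tweets.apply(is_english)]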

nlp = en_core_web_sm.load()
tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words('english'))
punctuation = list(string.punctuation)  # already taken care of by the cleaning function.
stop.update(punctuation)
w_tokenizer = WhitespaceTokenizer()

def furnished(text):
    final_text = []
    for i in w_tokenizer.tokenize(text):
        if i.lower() not in stop:
            word = lemmatizer.lemmatize(i)
            final_text.append(word.lower())
    return " ".join(final_text)

tweets_bowl.tweets = tweets_bowl.tweets.apply(furnished)

The data set after preprocessing:

Tweets Classification

This approach creates, for each of the 4 classes (Economic, Social, Cultural and Health), a set of words that can be confidently said to belong to that category.

Each tweet is compared with the 4 sets and assigned a similarity score. There are two popular techniques for computing a similarity score between documents:

1. Cosine Similarity: Cosine similarity is a metric used to measure how similar documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. This would involve creating word vectors for the sets of words and for all the tweets, then computing the cosine similarity between them. A TF-IDF (weighted bag-of-words) vectorizer would be ideal for the vectorization; a rough sketch of this approach appears after the comparison below.

2. Jaccard Similarity: Jaccard similarity, or intersection over union, is defined as the size of the intersection divided by the size of the union of two sets.

Jaccard similarity considers only the unique set of words in each document, while cosine similarity works on the full term vectors, so repeated words count. Jaccard similarity is good for cases where duplication does not matter; cosine similarity is good for cases where it does. In our case, whether a tweet mentions category-related words matters more than how often it repeats them, so Jaccard similarity is the technique to use.
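
For comparison, a cosine-similarity version of the scoring (not used further in this post) could look roughly like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_scores(group, tweets):
    # fit TF-IDF on the category word set plus all the tweets,
    # then compare each tweet vector against the category vector
    tfidf = TfidfVectorizer().fit_transform([group] + tweets)
    return cosine_similarity(tfidf[0:1], tfidf[1:]).flatten()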

Sets of words

The block below represents economy-related words. There are 3 other such sets (social_related_words, health_related_words and culture_related_words) for the remaining groups.

economy_related_words = '''agriculture infrastructure capitalism trading service sector technology  economical supply industrialism efficiency frugality retrenchment downsizing   credit debit value economize   save  economically economies sluggish rise   rising spending conserve trend low-management  decline   industry impact poor  profession    surplus   fall declining  accelerating interest sectors balance stability productivity increase rates pushing expanding stabilize  rate industrial borrowing strugglingdeficit predicted    increasing  data economizer analysts investment market-based economy   debt free enterprise medium  exchange metric savepoint scarcity capital bank company stockholder fund business  
asset treasury tourism incomes contraction employment jobs upturn deflation macroeconomics bankruptcies exporters hyperinflation dollar entrepreneurship upswing marketplace commerce devaluation quicksave deindustrialization stockmarket reflation downspin dollarization withholder bankroll venture capital mutual fund plan economy mortgage lender unemployment rate credit crunch central bank financial institution bank rate custom duties mass-production black-market developing-countries developing economic-growth gdp trade barter distribution downturn economist'''

Just like the tweets, these sets have to undergo some preprocessing. The furnished function used on the tweets is applied to the sets as well.

economy = furnished(economy_related_words)
social = furnished(social_related_words)
culture = furnished(culture_related_words)
health = furnished(health_related_words)

The duplicates are also dropped:

def drop_duplicate_words(text):
    words = text.split()
    return " ".join(sorted(set(words), key=words.index))

economy = drop_duplicate_words(economy)
social = drop_duplicate_words(social)
health = drop_duplicate_words(health)
culture = drop_duplicate_words(culture)

Jaccard Similarity Scores

def jaccard_similarity(query, document):
    # split into words so the score is computed over word sets, as described above
    query = query.split()
    document = document.split()
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection) / len(union)

def get_scores(group, tweets):
    scores = []
    for tweet in tweets:
        s = jaccard_similarity(group, tweet)
        scores.append(s)
    return scores

e_scores = get_scores(economy, tweets_bowl.tweets.to_list())
s_scores = get_scores(social, tweets_bowl.tweets.to_list())
c_scores = get_scores(culture, tweets_bowl.tweets.to_list())
h_scores = get_scores(health, tweets_bowl.tweets.to_list())

There might be a thin line between the economic and social scores depending on the sets of words used.

Clustered Data Frame

We wish to create a data frame containing the total number of tweets per category per user: the index populated with usernames, and four columns containing the total number of each user’s tweets under the economic, social, cultural and health classes.

This can be achieved by first creating a data frame containing the Jaccard scores of each tweet for each category, then assigning each tweet to the category with the highest score, and finally grouping the tweets by username and summing.

# create a jaccard scored df.
data = {'names': tweets_bowl.screen_name.to_list(), 'economic_score': e_scores,
        'social_score': s_scores, 'culture_score': c_scores, 'health_scores': h_scores}
scores_df = pd.DataFrame(data)

# assign classes based on highest score
def get_classes(l1, l2, l3, l4):
    econ = []
    socio = []
    cul = []
    heal = []
    for i, j, k, l in zip(l1, l2, l3, l4):
        m = max(i, j, k, l)
        econ.append(1 if m == i else 0)
        socio.append(1 if m == j else 0)
        cul.append(1 if m == k else 0)
        heal.append(1 if m == l else 0)
    return econ, socio, cul, heal

l1 = scores_df.economic_score.to_list()
l2 = scores_df.social_score.to_list()
l3 = scores_df.culture_score.to_list()
l4 = scores_df.health_scores.to_list()
econ, socio, cul, heal = get_classes(l1, l2, l3, l4)

data = {'name': scores_df.names.to_list(), 'economic': econ, 'social': socio,
        'culture': cul, 'health': heal}
class_df = pd.DataFrame(data)

# grouping the tweets by username
new_groups_df = class_df.groupby(['name']).sum()

# add a new totals column
new_groups_df['total'] = new_groups_df['health'] + new_groups_df['culture'] + new_groups_df['social'] + new_groups_df['economic']

# add a new totals row
new_groups_df.loc["Total"] = new_groups_df.sum()

The final data frame:

Below is a pie chart to show the tweets volumes in the different categories:

fig = plt.figure(figsize=(10, 7))
a = new_groups_df.drop(['total'], axis=1)
plt.pie(a.loc['Total'], labels=a.columns)
plt.title('A pie chart showing the volumes of tweets under different categories.')
plt.show()

Health has the largest share. This could be a result of the current pandemic that everyone is talking about.

The data lends itself to plenty of further analysis and visualisations, but the focus of this post is cluster analysis.

KMeans Clustering.

Distance computation in KMeans weighs each dimension equally, so care must be taken to ensure that the units of the dimensions do not distort the relative nearness of observations. A common remedy is to standardize each dimension individually.

The variables of interest share the same unit (number of tweets), so there is no need for standardization here. If a column ’a’ did need it, the code below would standardize it:

from sklearn.preprocessing import StandardScaler
df.a = StandardScaler().fit_transform(df.a.values.reshape(-1, 1))

We will work with 2D clustering, i.e. clustering on two variables at a time. There are different methods to determine the optimal number of clusters; one of them is the elbow method. It consists of looking for an elbow in the plot of the within-cluster sum of squares (WCSS) against the number of clusters: the part of the curve before the elbow declines steeply, while the part after it flattens out.

from sklearn.cluster import KMeans

# Elbow Method: the summary 'Total' row is excluded so it doesn't distort the clusters
X = new_groups_df.drop('Total')[['economic', 'social']].values

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=300, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

Taking k = 3.

# fitting kmeans to dataset
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, random_state=0)
Y_kmeans = kmeans.fit_predict(X)
# Visualising the clusters
plt.scatter(X[Y_kmeans==0, 0], X[Y_kmeans==0, 1], s=70, c='violet', label= 'Cluster 1')
plt.scatter(X[Y_kmeans==1, 0], X[Y_kmeans==1, 1], s=70, c='cyan', label= 'Cluster 2')
plt.scatter(X[Y_kmeans==2, 0], X[Y_kmeans==2, 1], s=70, c='green', label= 'Cluster 3')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=70, c='black', label='Centroids' )
plt.title('Clusters of tweets in economic and social groups')
plt.xlabel('economic tweets')
plt.ylabel('social tweets')
plt.legend()
plt.show()

The plot above indicates that most users share more economy-centred tweets than social tweets. There are a few who try to maintain a balance between the two categories.

The same method can be applied to the other pairs of categories to observe how they relate and to draw interpretations.
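
For example, switching to the economic and health pair only requires changing the column selection before re-running the elbow and clustering code above:

X = new_groups_df.drop('Total')[['economic', 'health']].values
# ...then repeat the elbow method and the KMeans fitting/plotting steps.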

Conclusion

Natural Language Processing is a vast field and there’s so much more that could be done on the data to get more precise and useful insights. It’s worth exploring!
