Disaster Tweets Classification Using GloVe

Sidharth Pandita · Published in hackerdawn · May 25, 2021

Disasters can be unexpected and life-threatening. We will analyze and classify disaster tweets in this tutorial. For this purpose, we will use the Disaster Tweets dataset from Kaggle.

Importing Libraries

Let’s first import the required libraries. If you don’t have a particular library installed, run the command ‘pip install <package_name>’ to install it.

import re
import string
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from collections import defaultdict
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, SpatialDropout1D
from keras.initializers import Constant
from keras.optimizers import Adam

Additional Requirements

NLTK also needs a couple of resources (the stopword list and the punkt tokenizer). Let’s download them.

nltk.download('stopwords')
nltk.download('punkt')

Loading the Dataset

With the dataset downloaded from Kaggle, let’s load the train and test files.

train = pd.read_csv('./nlp-getting-started/train.csv')
test = pd.read_csv('./nlp-getting-started/test.csv')
train.head()

Initial Exploration

Let’s see the distribution of target values.

plt.style.use('ggplot')
sns.countplot(x = 'target', data=train)
plt.show()

Let’s see the distribution of the number of words in tweets.

train['total_words'] = train['text'].apply(lambda x: len(str(x).split()))
sns.histplot(x=train[train['target'] == 0]['total_words'], label='Not Disaster', kde=True, color='#3398FF')
sns.histplot(x=train[train['target'] == 1]['total_words'], label='Disaster', kde=True, color='#FF3333')
plt.legend()
plt.show()

Let’s see the distribution of the number of characters in tweets.

train['total_chars'] = train['text'].apply(lambda x: len(x))
sns.histplot(x=train[train['target'] == 0]['total_chars'], label='Not Disaster', kde=True, color='#3398FF')
sns.histplot(x=train[train['target'] == 1]['total_chars'], label='Disaster', kde=True, color='#FF3333')
plt.legend()
plt.show()

We will concatenate train and test to create a dataframe df.

df = pd.concat([train,test])
df.shape

Data Cleaning

We’ll convert the text to lowercase, then remove text in square brackets, links, HTML tags, punctuation, newlines, words containing numbers, and the ‘amp’ left over from HTML-escaped ampersands.

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                              # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                # links
    text = re.sub(r'<.*?>+', '', text)                               # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                                   # newlines
    text = re.sub(r'\w*\d\w*', '', text)                             # words containing numbers
    text = re.sub('amp', '', text)                                   # leftover from '&amp;'
    return text
df['text'] = df['text'].apply(lambda x: clean_text(x))
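
To confirm the cleaning works as expected, here is a quick sanity check on a made-up example tweet (the string below is illustrative, not from the dataset):

sample = "Forest FIRE near La Ronge [photo] http://t.co/example 3 people hurt &amp; more"
print(clean_text(sample))
# prints roughly: 'forest fire near la ronge   people hurt  more'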

Let’s also remove emojis from the text and see what the data looks like.

def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
df['text'] = df['text'].apply(lambda x: remove_emoji(x))
df.head()
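
As a quick check, applying the function to a made-up string (again, not from the dataset) strips the emojis and keeps the words:

print(remove_emoji("Flood warning downtown 🌊🚨 stay safe"))
# prints: 'Flood warning downtown  stay safe'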

Data Visualization

Let’s define a function that creates a corpus for a specified target.

def create_corpus(target):
    corpus = []
    for x in df[df['target'] == target]['text'].str.split():
        for i in x:
            corpus.append(i)
    return corpus

We will plot the top 10 stop words in non-disaster tweets.

stop_words = set(stopwords.words('english'))
corpus0 = create_corpus(0)
dic = defaultdict(int)
for word in corpus0:
    if word in stop_words:
        dic[word] += 1
top = sorted(dic.items(), key=lambda x: x[1], reverse=True)[:10]
plt.figure(figsize=(10, 5))
x, y = zip(*top)
plt.bar(x, y, color='#3398FF')
plt.title('Top 10 stop words in Non-Disaster Tweets', fontsize=20)

We will plot the top 10 stop words in disaster tweets.

corpus1 = create_corpus(1)
dic1 = defaultdict(int)
for word in corpus1:
    if word in stop_words:
        dic1[word] += 1
top1 = sorted(dic1.items(), key=lambda x: x[1], reverse=True)[:10]
plt.figure(figsize=(10, 5))
x, y = zip(*top1)
plt.bar(x, y, color='#FF5E33')
plt.title('Top 10 stop words in Disaster Tweets', fontsize=20)

Let’s find the most common non-stop words in the overall corpus.

counter = Counter(corpus0 + corpus1)
most = counter.most_common()
x = []
y = []
for word, count in most[:60]:
    if word not in stop_words:
        x.append(word)
        y.append(count)
plt.figure(figsize=(7, 7))
sns.barplot(x=y, y=x)
plt.title('Common words in Overall Corpus', fontsize=20)
plt.show()

Below, we will define a function to get the top n-grams. An n-gram is a contiguous sequence of n words; counting the most frequent n-grams shows which short phrases dominate the corpus.

def get_top_ngrams(corpus, n_grams, n=None):
    vec = CountVectorizer(ngram_range=(n_grams, n_grams)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

Let’s find the top 10 bigrams in the overall corpus.

plt.figure(figsize=(7, 7))
top_tweet_bigrams = get_top_ngrams(df['text'], 2, 10)
x, y = map(list, zip(*top_tweet_bigrams))
plt.title('Top Bigrams in Overall Corpus', fontsize=20)
sns.barplot(x=y, y=x)

Let’s find the top 10 trigrams in the overall corpus.

plt.figure(figsize=(7, 7))
top_tweet_trigrams = get_top_ngrams(df['text'], 3, 10)
x, y = map(list, zip(*top_tweet_trigrams))
plt.title('Top Trigrams in Overall Corpus', fontsize=20)
sns.barplot(x=y, y=x)

We’ll define a function for creating a word cloud. We will then segregate the non-disaster & disaster tweets and create word clouds for them.

def wordcloud_draw(data, colormap):
    words = ' '.join(data)
    wordcloud = WordCloud(stopwords=stopwords.words('english'),
                          colormap=colormap,
                          width=2500,
                          height=2000
                          ).generate(words)
    plt.figure(1, figsize=(10, 7))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()
non_disaster_tweets = df[df['target'] == 0]['text']
disaster_tweets = df[df['target'] == 1]['text']
print("For Non-Disaster tweets")
wordcloud_draw(non_disaster_tweets, colormap='Wistia')
print("For Disaster tweets")
wordcloud_draw(disaster_tweets, colormap='tab20c')

Preparing the Data

Here we redefine create_corpus so that it simply collects all the tweets from the combined train and test dataframe df.

def create_corpus(df):
    corpus = []
    for tweet in df['text']:
        corpus.append(tweet)
    return corpus
corpus = create_corpus(df)

GloVe stands for “Global Vectors”: pre-trained word embeddings that capture both the global co-occurrence statistics and the local context of a corpus. We’ll build an embedding dictionary from the pre-trained GloVe file (glove.6B.100d.txt, available from the Stanford NLP GloVe page), then convert the texts to integer sequences and pad them to a fixed length.

embedding_dict = {}
with open('glove/glove.6B.100d.txt', 'r') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:], 'float32')
        embedding_dict[word] = vectors
MAX_LEN = 50
tokenizer_obj = Tokenizer()
tokenizer_obj.fit_on_texts(corpus)
sequences = tokenizer_obj.texts_to_sequences(corpus)
tweet_pad = pad_sequences(sequences, maxlen=MAX_LEN, truncating='post', padding='post')
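
As a quick sanity check, each GloVe vector should have 100 dimensions and tweet_pad should have MAX_LEN columns. The lookup word 'fire' is just an illustrative choice (it is present in the glove.6B vocabulary):

print(len(embedding_dict))           # vocabulary size of the GloVe file (~400k for glove.6B)
print(embedding_dict['fire'].shape)  # (100,) — one 100-dimensional vector per word
print(tweet_pad.shape)               # (number of tweets, MAX_LEN)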

Let’s see how many unique words we have in the corpus.

word_index=tokenizer_obj.word_index
print('Number of unique words:',len(word_index))

We’ll now create an embedding matrix.

num_words = len(word_index) + 1
embedding_matrix = np.zeros((num_words, 100))
for word, i in word_index.items():
    if i > num_words:
        continue
    emb_vec = embedding_dict.get(word)
    if emb_vec is not None:
        embedding_matrix[i] = emb_vec
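
It can be useful to know how much of the tweet vocabulary is actually covered by GloVe; any row left as all zeros corresponds to an out-of-vocabulary word. A minimal check, reusing the names defined above:

covered = sum(1 for word in word_index if word in embedding_dict)
print(f'{covered} of {len(word_index)} words have a GloVe vector '
      f'({covered / len(word_index):.1%} coverage)')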

Creating the Model

It’s time to create our model. We’ll use our embedding_matrix and LSTM to create the model.

model = Sequential()
embedding = Embedding(num_words, 100,
                      embeddings_initializer=Constant(embedding_matrix),
                      input_length=MAX_LEN, trainable=False)
model.add(embedding)
model.add(SpatialDropout1D(0.2))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
optimizer = Adam(learning_rate=1e-5)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
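
Before training, it can help to print a summary of the architecture; since the embedding layer is frozen (trainable=False), only the LSTM and Dense weights should appear as trainable parameters:

model.summary()  # the Embedding layer's parameters are listed as non-trainable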

We’ll split the padded sequences back into their train and test portions.

train_final=tweet_pad[:train.shape[0]]
test_final=tweet_pad[train.shape[0]:]

train_final is then split further into training and validation sets.

X_train,X_test,y_train,y_test=train_test_split(train_final,train['target'].values,test_size=0.15)
print('Shape of train',X_train.shape)
print("Shape of Validation ",X_test.shape)

Let’s fit our model now. We can see in the output that our model achieves a validation accuracy of 0.798 after the last (15th) epoch.

history=model.fit(X_train,y_train,batch_size=4,epochs=15,validation_data=(X_test,y_test),verbose=2)
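
To see whether the model is still improving or starting to overfit, you can optionally plot the training history returned by fit. This is a small sketch; note that older Keras versions use the keys 'acc'/'val_acc' instead of 'accuracy'/'val_accuracy'.

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(history.history['accuracy'], label='train')
axes[0].plot(history.history['val_accuracy'], label='validation')
axes[0].set_title('Accuracy per epoch')
axes[0].legend()
axes[1].plot(history.history['loss'], label='train')
axes[1].plot(history.history['val_loss'], label='validation')
axes[1].set_title('Loss per epoch')
axes[1].legend()
plt.show()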

Prediction

Let’s make predictions on the held-out validation data (X_test) now. The raw output is a probability between 0 and 1 for each tweet.

y_pre = model.predict(X_test)
print(y_pre)

We will round off the predictions and reshape the array. As you can see in the output, the predictions are now an array of 0s and 1s, where 1 represents a disaster tweet and 0 a non-disaster tweet.

y_pre = np.round(y_pre).astype(int).reshape(len(y_pre))
print(y_pre)
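
If you also want predictions for the Kaggle test set, the same steps can be applied to test_final. This is a sketch assuming the competition’s usual submission format with id and target columns:

test_pred = np.round(model.predict(test_final)).astype(int).reshape(-1)
submission = pd.DataFrame({'id': test['id'], 'target': test_pred})
submission.to_csv('submission.csv', index=False)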

We are done with Disaster Tweets Classification using GloVe. Happy learning!
