Abdalsamad Keramatfar
Sep 27, 2019


Text classification in just a few minutes!

In this post, I will show a very simple approach to implementing a text classification task. As a special case of text classification, I will focus on sentiment analysis, i.e. classifying a piece of text as positive or negative. For more information about sentiment analysis, see here. We use a data set that is very popular in the sentiment analysis literature, available here.

So, first we import the libraries we need:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Then we need to load our data and do some preprocessing:

train = pd.read_csv('C:/Users/G10/Desktop/gcn/code/data/hcr/train/orig/hcr-train.csv')
train['sentiment'] = train['sentiment'].str.strip()
dev = pd.read_csv('C:/Users/G10/Desktop/gcn/code/data/hcr/dev/orig/hcr-dev.csv')
dev['sentiment'] = dev['sentiment'].str.strip()
test = pd.read_csv('C:/Users/G10/Desktop/gcn/code/data/hcr/test/orig/hcr-test.csv')
test['sentiment'] = test['sentiment'].str.strip()
test.head()

Then:

# concatenate the three splits and keep only rows with a usable label
data = pd.concat([train, dev, test])
ndata = data[(data['sentiment'].notnull()) &
             (data['sentiment'] != 'unsure') &
             (data['sentiment'] != 'neutral') &
             (data['sentiment'] != 'irrelevant')]
len(data), len(ndata)

The above code just concatenates all parts of the data and filters out rows whose label is missing, unsure, neutral, or irrelevant. Now, we split our data into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(
    ndata['content'], ndata['sentiment'], test_size=0.2, random_state=42)
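Before vectorizing, it is worth a quick sanity check of the class balance in the split (this check is an optional extra, not part of the pipeline above):

print(y_train.value_counts())
print(y_test.value_counts())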

Since a machine cannot read text directly, we need to represent it in a form it can work with. The bag-of-words representation is a very popular approach for this. We also enrich it with the TF-IDF value of each feature, which gives more weight to terms that are common in a specific tweet but not across all of them. So:

vectorizer = TfidfVectorizer()
vX_train = vectorizer.fit_transform(X_train)
vX_test = vectorizer.transform(X_test)
print(vX_train.shape)
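To get a feel for what TF-IDF is doing, here is a minimal, self-contained sketch on a made-up toy corpus (the three example sentences are purely illustrative; on scikit-learn versions before 1.0 the method is get_feature_names instead of get_feature_names_out):

from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["the bill is good", "the bill is bad", "the good good news"]
toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy)
# vocabulary learned from the toy corpus
print(toy_vec.get_feature_names_out())
# each row is a TF-IDF weighted bag-of-words vector for one sentence
print(toy_matrix.toarray().round(2))

In the third row, "good" repeats within the sentence and so gets the largest weight, while "the", which appears in every sentence, gets the lowest IDF.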

Now we can do the main part: training the classifier.

clf = LogisticRegression().fit(vX_train, y_train)
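The default LogisticRegression settings already work reasonably well here. If you want to push the score a bit further, one common next step is a small grid search over the regularization strength C; the snippet below is just a sketch of that idea, not part of the original pipeline:

from sklearn.model_selection import GridSearchCV

# try a few values of the inverse regularization strength C with 5-fold CV
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.01, 0.1, 1, 10]},
                    cv=5, scoring='accuracy')
grid.fit(vX_train, y_train)
print(grid.best_params_, grid.best_score_)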

Finally, we need to test our trained classifier:

preds = clf.predict(vX_test)
accuracy_score(y_test, preds)
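Accuracy is a single number and can hide how the classifier behaves on each class, so it is often worth printing a per-class breakdown as well (an optional extra, not part of the original code):

from sklearn.metrics import classification_report, confusion_matrix

# precision, recall and F1 per class, plus the raw confusion matrix
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))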

Done! An accuracy of 0.7610. Not bad. You can compare this result with recently published papers!
