Quick & Dirty Sentiment

A short write-up on quick and dirty sentiment classification and modelling. All you need is:

  1. Python
  2. Data with labels

The basic idea of what we will be doing:

  1. Make word vectors out of the corpus text
  2. Use the word vectors to form sentence/document vectors
  3. Apply an ML classifier to the document vectors to build a model
  4. Predict

I will be making use of a dataset found on Kaggle; you can download it here. Since this is a quick and dirty approach, we will skip text preprocessing steps such as stop-word removal. But just a reminder:

In Natural Language Processing (NLP) 90% of the work is preprocessing — Some Researcher

Moving on…

Let's start by reading the data from the CSV/TSV file.

# Read the csv file using pandas
import pandas as pd

train = pd.read_csv("data/train.csv")
# Build the corpus: one list of tokens per phrase
corpus = []
for p in train.Phrase:
    corpus.append(p.split())
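
To sanity-check, peek at the first entry (the exact tokens depend on your dataset; this output is just illustrative):

# Each corpus entry is a list of tokens
corpus[0][:4]
# e.g. ['A', 'series', 'of', 'escapades']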

Once the data is read, we store the tokens in a list (corpus). The next step is to create word vectors; we will be using the gensim package for this purpose. Have a look at the code below:

# Create a word vector model with vectors of dimension 25
from gensim.models import Word2Vec

model = Word2Vec(corpus, min_count=1, size=25)  # gensim 4.x renamed size to vector_size
# Save the model in a file
model.save("word2vec.model")
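
Saving the model means you can reload the trained vectors in a later session without retraining:

# Reload the saved model
model = Word2Vec.load("word2vec.model")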

The vector space model gives us vectors that look something like this:

# Output showing vector for word 'escapades'
In [35]: model["escapades"]
array([ 0.01912756,  0.11313001,  0.05706277,  0.05470243, -0.07171227,
-0.00395091,  0.01398386,  0.01066697,  0.01835706,  0.16320878,
-0.09950776,  0.02733932,  0.0118545 , -0.00124337,  0.02434457,
-0.11922658, -0.00507172, -0.12057459, -0.00341248, -0.01090243,
-0.00488957,  0.0275436 , -0.0614472 ,  0.05964575, -0.00052632], dtype=float32)

The vector is nothing but a mathematical representation of the contexts in which the word ‘escapades’ occurs in the corpus.

The best part about having word vectors is that we can play with them, simply by adding or subtracting them, or performing other algebraic operations between multiple vectors.
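
Here is a quick sketch of that kind of arithmetic (the words are just examples; with min_count=1, any word that appears in the training text is in the vocabulary):

import numpy as np

# Element-wise addition of two word vectors
v = model["escapades"] + model["adage"]  # swap in any words from your corpus

# Cosine similarity between two words, computed by hand
a, b = model["escapades"], model["adage"]
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# gensim can also report a word's nearest neighbours in the vector space
model.wv.most_similar("escapades")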

The next step is to build document vectors. There can be many ways to do this; I will simply be adding the word vectors to form a document vector.

For example, for the phrase “This is a sentence” we simply add the individual vectors for each word, as seen below.
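
A minimal sketch (it assumes each word appears in the Word2Vec vocabulary):

# Document vector = element-wise sum of the word vectors
doc_vec = model["This"] + model["is"] + model["a"] + model["sentence"]
# or, equivalently:
doc_vec = sum(model[w] for w in "This is a sentence".split())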

Now we store these in a dictionary, keyed by SentenceId.

# Map each SentenceId to the sum of its phrase's word vectors
pvecs = dict()
for r in train.iterrows():
    sid = r[1]["SentenceId"]
    phvec = sum([model[x] for x in r[1]["Phrase"].split()])
    pvecs[sid] = phvec  # note: later phrases for the same sentence overwrite earlier ones

The above code calculates the sentence/phrase vectors and stores them in a dictionary called pvecs.

Sentence/Phrase Vectors Stored in pvecs
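
To see what ended up in pvecs, peek at a single entry:

# Each entry maps a SentenceId to a 25-dimensional vector
sid, vec = next(iter(pvecs.items()))
print(sid, vec.shape)  # e.g. 1 (25,)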

After we have pvecs, let's create a data frame with feature columns and a label column, which is more intuitive to a data scientist. The code below does the job of creating the pvdf data frame.

# Convert the dictionary into a dataframe (the index is the SentenceId)
pvdf = pd.DataFrame.from_dict(pvecs, orient='index')
# Rename columns
pvdf.columns = ["feat_" + str(x) for x in range(1, 26)]
# Add the sentiment label to pvdf, aligned on SentenceId; the loop above
# kept the last phrase per sentence, so take the matching last label
pvdf["label"] = train.groupby("SentenceId").Sentiment.last()

The data frame will look something like this.

Final Data Frame with Columns as Features and Label
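
You can take a quick look yourself:

# Inspect the first few rows: feat_1 ... feat_25 plus the label
print(pvdf.head())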

Now we can use our favourite Python package, scikit-learn, to build a classification model. Let's do that…

# Define and train a classifier
from sklearn import svm

clf = svm.SVC()
clf.fit(pvdf[pvdf.columns[:25]], pvdf.label)
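
If you want a rough accuracy number first, one option (not part of the original recipe, just a sanity check) is to hold out a slice of the training frame:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the sentences for validation
X, y = pvdf[pvdf.columns[:25]], pvdf.label
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(accuracy_score(y_val, svm.SVC().fit(X_tr, y_tr).predict(X_val)))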

That feels good, doesn't it? Awesome, our classifier is trained; now it's time to test it. Let's get our test dataset out.

Before we do that, there is one thing we missed: we had our labels as 0, 1, 2, 3, 4, but we never discussed what they mean:

0 — negative
1 — somewhat negative
2 — neutral
3 — somewhat positive
4 — positive
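
For convenience, here is that mapping as a small dictionary (a helper I am adding, not part of the original code) so we can decode predictions later:

# Map numeric predictions back to readable sentiment names
label_names = {0: "negative", 1: "somewhat negative", 2: "neutral",
               3: "somewhat positive", 4: "positive"}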

Time to test. The code below takes a test phrase and classifies its sentiment.

# Read the test data (the path is assumed to mirror the training file)
test = pd.read_csv("data/test.csv")
# test.Phrase[0]
# 'An intermittently pleasing but mostly routine effort .'
# Lets create a phrase vector (assumes every word was seen during training)
y = pd.DataFrame(sum([model[x] for x in test.Phrase[0].split()])).transpose()
# Make the final prediction
clf.predict(y)
# outputs 2 => neutral

You can go ahead and try out a few more examples. The code can be found on GitHub here.
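
To make trying examples easier, here is a tiny helper of my own (it reuses model, clf and the label_names dictionary from above, and will raise a KeyError for words outside the training vocabulary):

# Classify an arbitrary phrase with the trained model and classifier
def predict_sentiment(phrase):
    vec = pd.DataFrame(sum(model[w] for w in phrase.split())).transpose()
    return label_names[int(clf.predict(vec)[0])]

predict_sentiment("An intermittently pleasing but mostly routine effort .")
# => 'neutral'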

That's all folks!