Machine Learning on Text & Generative vs. Discriminative Modeling

Yudi Xu
Just another data scientist
5 min read · Sep 19, 2018

Have you ever wondered how to do a quick machine learning analysis on text data? For example, sentiment analysis, or predicting whether something will happen based on historical text descriptions.

Thanks to Mogady from Kaggle, who provided a nice (easy and clean) dataset that was just right for trying some quick text machine learning analytics. I also used some of his code :-P

Kickstarter project dataset (descriptions and results): the first 5 rows of data

If your data looks like the table above, congrats! You can do a quick machine learning analysis to predict the state in 5 minutes.

So we have only two useful columns in the picture. blurb: the description of the kickstarter project. state: the result of the project (failed or successful).

I am using Pandas to handle the dataset (the most common way with Python). When playing with a data frame, please never iterate over rows with a for or while loop... it is painfully slow and can eat up your memory :-(

Trick 1

Use the lambda apply function, NOT a for loop.

apply a function to remove punctuation from the text in each row

I used a bit of regular expression; I won't explain it in detail here, and you can use my shared Colab notebook directly.

The lambda function always looks very similar: df['column'].apply(lambda x: function(x)). df['column'] is the column of the dataframe you want to modify, and x is just a variable name; it can be anything.
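A minimal sketch of the pattern (the blurb column matches the dataset; the two example rows are made up):

```python
import re
import pandas as pd

# A tiny made-up frame; the real data has the same "blurb" column.
df = pd.DataFrame({"blurb": ["Fund my robot!!!", "Best coffee, ever?"]})

# Strip everything that is not a word character or whitespace, then lowercase.
df["blurb"] = df["blurb"].apply(lambda x: re.sub(r"[^\w\s]", "", x).lower())

print(df["blurb"].tolist())  # ['fund my robot', 'best coffee ever']
```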

Machine learning models cannot take text data directly, hence we need to transform the text into numbers. There are a few methods we can use. Thanks to Jason Brownlee, here are 3 ways of preparing text data for machine learning. See his article here

1. Word count with CountVectorizer

To use CountVectorizer, we need to remove stop words (a, the, an, etc.). Otherwise it will count everything, which is no good for the result.

2. Word frequency with TfidfVectorizer

TF-IDF is an acronym that stands for "Term Frequency - Inverse Document Frequency", the two components of the resulting score assigned to each word.

Term Frequency: This summarizes how often a given word appears within a document.

Inverse Document Frequency: This downscales words that appear a lot across documents.

3. Hashing with HashingVectorizer

Counts and frequencies can be very useful, but one limitation of these methods is that the vocabulary can become very large.

The HashingVectorizer class implements this approach: it hashes words consistently, then tokenizes and encodes documents as needed.
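A short sketch of the idea (the documents and the choice of 1024 features are arbitrary):

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["fund my robot", "best robot arm"]  # invented documents

# n_features fixes the output width up front; no vocabulary is stored,
# so memory stays bounded however many distinct words the corpus has.
vec = HashingVectorizer(n_features=2**10)
X = vec.transform(docs)

print(X.shape)  # (2, 1024)
```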

We used the 2nd method (TfidfVectorizer) just for demo purposes.

transform the text into TF-IDF frequencies

We also use the train_test_split function to split the data into training and testing datasets.
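These two steps can be sketched together (the texts and labels below are hypothetical stand-ins for the blurbs and states):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the blurb texts and their states.
texts = ["fund my robot", "best coffee ever", "new board game", "solar charger"] * 25
labels = ["successful", "failed"] * 50

X = TfidfVectorizer(stop_words="english").fit_transform(texts)  # text -> TF-IDF matrix
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

print(X_train.shape[0], X_test.shape[0])  # 80 training rows, 20 testing rows
```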

Trick 2

To make your model training more manageable, I recommend writing a simple list and function to automate training for multiple models at once. Thanks again to Mogady for this pattern; I wanted to build this behavior as well :-)

model training

Write a list entries = [] to store all the prediction results for display purposes.

It's ok to use a for loop here; we only loop over the models we want to test. Here we test two models: LogisticRegression and Multinomial Naive Bayes.
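Mogady's original loop is in the Colab notebook; a minimal reconstruction of the idea, on toy stand-in data, might look like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in data; the real features come from the TF-IDF step above.
texts = ["fund my robot", "best coffee ever", "new board game", "solar charger"] * 25
labels = ["successful", "failed"] * 50

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

entries = []  # (model name, accuracy) pairs for display later
for model in [LogisticRegression(max_iter=1000), MultinomialNB()]:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    entries.append((type(model).__name__, acc))

print(entries)
```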

The reason for picking these two models is the other topic of this article: generative models vs. discriminative models.

Generative model VS. Discriminative Model

If you don't know these two, the first thing you may do is Google them. I did the same and found various answers. I drew some conclusions, and have translated the mathematical jargon into human language :-)

  1. Both are statistical classification approaches.
  2. A generative model trains a function f(x,y) whose score is determined by x and y together (it models the joint distribution of the data and the labels). To predict y for a given x, we pick the y that gives f(x,y) the biggest score. This means you know how the data is actually modeled, and the result y can be explained with the function and the given x. Naive Bayes is a good example of a generative model. It does not draw a separating line; it models each class of the data directly.
  3. A discriminative model trains a function that, given an x, finds the most likely y (it models the conditional probability of y given x). Basically, the model learns to discriminate between classes by drawing a single boundary. Even if you can classify the data very accurately, you have no notion of how the data might have been generated. Neural networks, logistic regression, etc. are discriminative models. On large datasets, discriminative models tend to be more accurate than generative ones.
best explanation pic I have found :-)
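The contrast also shows up in what the fitted models expose. Below is a tiny sketch on an invented word-count matrix: Naive Bayes stores a word distribution per class (its picture of how each class generates text), while logistic regression stores only the coefficients of one decision boundary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Tiny invented word-count matrix: 4 documents, 2 vocabulary words.
X = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
y = np.array([0, 0, 1, 1])

nb = MultinomialNB().fit(X, y)       # generative: learns P(x|y) and P(y)
lr = LogisticRegression().fit(X, y)  # discriminative: learns P(y|x) directly

print(nb.feature_log_prob_.shape)  # (2, 2): one word distribution per class
print(lr.coef_.shape)              # (1, 2): a single separating hyperplane
```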

So we tested LogisticRegression and Naive Bayes:

model accuracy and visualization

Accuracy: [('LogisticRegression', 0.6809270816416491), ('Multinomial', 0.6732710020184209)]

model accuracy visualization

Although there is no big difference in the picture, for such a dataset around 70% accuracy is already quite high, and even a small improvement in accuracy is very difficult.
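The visualization in the notebook can be reproduced with a simple bar chart of the two accuracy figures reported above (the plotting details here are my own sketch, not the notebook's exact code):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# The accuracy figures reported above.
entries = [("LogisticRegression", 0.6809270816416491),
           ("Multinomial", 0.6732710020184209)]

names, scores = zip(*entries)
plt.bar(names, scores)
plt.ylim(0, 1)
plt.ylabel("accuracy")
plt.title("Model accuracy")
plt.savefig("accuracy.png")
```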

Code is here!

Sharing my Colab code: https://colab.research.google.com/drive/1QXD2HtMhnRX1GZaqc-UBkDuBKODGzz5x

Please download the data from here: https://drive.google.com/file/d/1dedHvEms8d00ZPYQ4Tx6z0fzLIJdRrvC/view?usp=sharing

Author: Yudi Xu

CTO Consultant for Emerging digital technology

Project manager and expert in IoT

Former data scientist in Shell

Linkedin: https://www.linkedin.com/in/xuyudi/
