A Universe of Learning: Part 1

Believe it or not, there is more to life than deep learning. There is, in fact, a whole universe of machine learning, from probabilistic methods (Bayesian networks, Markov chains) to deterministic modeling (logistic regression, support vector machines) to ensemble learning (random forests, stacking). If we want to be the best engineers and data scientists we can be, we need to explore that whole universe, learn the pros and cons of every tool in the toolbox, and understand which problems each tool is best at solving.

Thanks to a large (and growing) number of easy-to-use frameworks, it has become rather trivial to play in this universe. scikit-learn, MLlib, NLTK, PyBrain, and a million other packages will let you plumb data through any learner with just a few lines of Python. A competent developer could get something working, however inaccurate or insensitive it may be, within mere minutes. But to understand how to optimize, and why you might choose a simple option, like naive Bayes, over something more complex, like a bucket of models, you need to consider the benefits and drawbacks of each option, and how they might be relevant to your particular problem or goal.

So, let’s dig in, one learning method at a time, starting with:

Naive Bayes

This learner is based on Bayes' theorem of probability, P(A|B) = P(B|A) x P(A) / P(B), and is beloved for data classification, especially when trained against very large data sets.

It is fast, efficient, performs very well for multi-class prediction, and works well with non-numeric inputs. It gets that speed by assuming that features are unrelated and independent of one another (the "naive" part of the name). However, it is, of course, not perfect; this learner suffers from the Zero Frequency Problem (a feature value that was never observed with a class in the training data gets a probability of zero for that class, which wipes out every other piece of evidence), is not a particularly good estimator (trust which class it ranks highest, but pay little attention to the probability scores themselves), and requires careful feature engineering so that the independence assumption roughly holds.
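To make the Zero Frequency Problem concrete, here is a minimal sketch in plain Python. The word lists are made up for illustration, and the `likelihood` helper and its `alpha` parameter are hypothetical names, not part of any library. It shows how an unseen word produces a likelihood of exactly zero, and how additive (Laplace) smoothing, one common remedy, avoids that.

```python
from collections import Counter

# Hypothetical toy training data: the words seen in each class (made up).
training = {
    "spam": ["prince", "money", "money", "urgent"],
    "non-spam": ["meeting", "lunch", "money"],
}

# Every distinct word seen anywhere in the training data.
vocabulary = sorted({word for words in training.values() for word in words})


def likelihood(word, label, alpha=0.0):
    """Estimate P(word | label), with optional additive smoothing (alpha)."""
    counts = Counter(training[label])
    total = len(training[label])
    return (counts[word] + alpha) / (total + alpha * len(vocabulary))


# "prince" never appears in non-spam, so the raw estimate is zero and would
# zero out the whole product of likelihoods for that class.
unsmoothed = likelihood("prince", "non-spam")

# With alpha=1, every word in the vocabulary keeps a small non-zero probability.
smoothed = likelihood("prince", "non-spam", alpha=1.0)

print(unsmoothed, smoothed)
```

scikit-learn's MultinomialNB exposes a similar smoothing knob through its alpha parameter, which is one reason the problem rarely bites when using that library with default settings.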

The math and procedure behind naive Bayes are pretty straightforward: generate a frequency table (which will show how often each feature is seen with each class in the training data), create a table of likelihoods from it (how often each feature value occurs relative to the number of training samples in each class), then use those numbers as variables in the formula to calculate the probability of each class. As an example, let’s look at how we might decide if an e-mail is likely to be spam based on the existence of the word “prince”...

# assume a base probability of spam at 20%
# how many times does the word "prince"
# appear in spam and non-spam e-mails...
"prince"    |  Yes  |  No   |  Total
-------------------------------------
spam        |   6   |  12   |  18
non-spam    |   1   |  23   |  24
# what is the likelihood of the word "prince"
# in a spam or non-spam e-mail...
"prince"    |  Yes  |  No   |  Total
-------------------------------------
spam        | 0.33  | 0.67  |  18
non-spam    | 0.04  | 0.96  |  24
-------------------------------------
TOTAL       | 0.17  | 0.83  |  42
# now, plug those values into the formula...
# the 0.17 in the TOTAL row is the share of "prince" in this sample,
# where 18 of 42 e-mails (43%) are spam; to stay consistent with the
# assumed 20% base rate, weight each likelihood by its prior instead
P("prince") = (0.33 x 0.2) + (0.04 x 0.8) = ~0.10
P(spam|"prince") = (0.33 x 0.2) / 0.10
P(spam|"prince") = ~66.7% (using the unrounded 6/18 and 1/24)
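The same calculation can be sketched in a few lines of Python. The counts are copied from the frequency table above and the 20% prior is the assumption stated there; the `posterior` function and its parameter names are hypothetical, chosen only for this illustration. Note that the sketch derives the evidence term P("prince") from the 20% prior (via the law of total probability) rather than from the sample totals, so the prior and the evidence stay consistent with each other; under that assumption the posterior comes out at roughly two thirds.

```python
# Counts copied from the frequency table: does the e-mail contain "prince"?
counts = {
    "spam": {"yes": 6, "no": 12},
    "non-spam": {"yes": 1, "no": 23},
}

# Base rates: 20% spam is the stated assumption; 80% non-spam follows from it.
priors = {"spam": 0.2, "non-spam": 0.8}


def posterior(counts, priors, target="spam", feature="yes"):
    """Return P(target | feature) using Bayes' theorem."""
    # Likelihood P(feature | class), read off the frequency table.
    likelihoods = {
        label: row[feature] / sum(row.values()) for label, row in counts.items()
    }
    # Evidence P(feature), weighting each likelihood by its class prior.
    evidence = sum(likelihoods[label] * priors[label] for label in counts)
    return likelihoods[target] * priors[target] / evidence


p_spam = posterior(counts, priors)
print(f"P(spam | 'prince') = {p_spam:.1%}")
```

Changing the prior (say, to the 18/42 spam share of the sample itself) is a one-line edit, which makes it easy to see how strongly the assumed base rate drives the answer.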

Naive Bayes models are known as “eager” offline learners, which means they are trained before use, and build a generalized, feature-based model of the world up front rather than consulting the training data at prediction time. They are somewhat resistant to noisy data (which is why they are well liked for extremely large data sets), and are therefore used on problems that are at risk of such noise: recommendation engines, sentiment analysis, text classification, etc. If you have a lot of data, and a discrete set of possible answers, and you want to make predictions quickly in real time, this might be a great learner for you.

Let’s take a look at some sample code: using just two files (sentiment-neg.txt and sentiment-pos.txt, each read one sentence per line), scikit-learn, and a little Python, you can create a very simple sentiment classification model (positive or negative).

import re
from random import shuffle

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

sentences = []

# get all negative sentences...
# (errors='ignore' skips undecodable bytes; non-ASCII is stripped below anyway)
with open("sentiment-neg.txt", 'r', errors='ignore') as file:
    lines = file.readlines()
    for line in lines:
        line = re.sub(r'[^\x00-\x7F]', '', line)
        sentences.append([
            line.strip(),
            'neg'
        ])

# get all positive sentences...
with open("sentiment-pos.txt", 'r', errors='ignore') as file:
    lines = file.readlines()
    for line in lines:
        line = re.sub(r'[^\x00-\x7F]', '', line)
        sentences.append([
            line.strip(),
            'pos'
        ])

# shuffle, then hold out the first 25% of sentences for testing...
shuffle(sentences)
test_threshold = int(len(sentences) * .25)
tests = sentences[:test_threshold]
sentences = sentences[test_threshold:]

# turn each sentence into a vector of word counts; the vocabulary
# is learned from the training sentences only...
vectorizer = CountVectorizer(stop_words='english')
training_set = vectorizer.fit_transform([r[0] for r in sentences])
test_set = vectorizer.transform([r[0] for r in tests])

# train the classifier and predict the held-out sentences...
nb = MultinomialNB()
nb.fit(training_set, [r[1] for r in sentences])
predictions = nb.predict(test_set)

# show the first ten predictions next to their sentences...
for i, prediction in enumerate(predictions[:10]):
    print(prediction + ': ' + tests[i][0])
# ----------------
# OUTPUT
# ----------------
# neg: unfortunately, heartbreak hospital wants to convey...
# neg: the film feels formulaic, its plot and pacing typical...
# pos: there is a kind of attentive concern that hoffman brings...
# pos: one fantastic (and educational) documentary.
# neg: the movie is a negligible work of manipulation...
# pos: my thoughts were focused on the characters...
# pos: warm and exotic.
# neg: has all the right elements but completely fails...
# neg: ringu is a disaster of a story, full of holes...
# pos: cedar takes a very open-minded approach to this...

This particular classifier uses word counts as feature vectors in order to determine sentiment classification, but you might decide to test more complex features (perhaps n-grams, first or last words, only verbs, etc.). Those decisions I will leave to the engineer; experiment and test!
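As one example of such an experiment, the sketch below adds bigrams (pairs of adjacent words) to the word counts by setting CountVectorizer's ngram_range parameter. The four sentences and their labels are made up purely for illustration; on real data you would compare results against the plain word-count model before deciding whether the extra features help.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy sentences, made up for illustration only.
texts = [
    "not good at all",
    "really not good",
    "very good movie",
    "good and warm",
]
labels = ["neg", "neg", "pos", "pos"]

# ngram_range=(1, 2) counts single words and adjacent word pairs,
# so a phrase like "not good" becomes a feature in its own right.
vectorizer = CountVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(features, labels)

# Inspect which two-word features the vectorizer learned.
bigrams = sorted(term for term in vectorizer.vocabulary_ if " " in term)
prediction = model.predict(vectorizer.transform(["not good"]))[0]

print(bigrams)
print(prediction)
```

The word "good" alone appears in both classes here, so the bigram is what lets the model tie "not good" to the negative examples.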


Next time, we’ll take a look at Markov chains!