A comprehensive Naive Bayes Tutorial using scikit-learn

Awantik Das
8 min read · Sep 25, 2018


Agenda

  1. Introduction to Bayes’ Theorem
  2. Naive Bayes Classifier
  3. Gaussian Naive Bayes
  4. Multinomial Naive Bayes
  5. Bernoulli Naive Bayes
  6. Naive Bayes for out-of-core learning

Introduction to Naive Bayes

  • The Naive Bayes classifier is based on Bayes’ theorem and is particularly suited to high-dimensional data.
  • It’s simple & often outperforms many sophisticated methods
  • Rather than attempting to calculate the joint probability of all attribute values P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so it is calculated as P(d1|h) * P(d2|h) * P(d3|h)
  • This assumption is very strong & rarely true in real situations, yet Naive Bayes still works quite well

Class Probabilities

  • For bi-class classification, P(Class 1) = Count(Class 1) / ( Count(Class 1) + Count(Class 2) )

Conditional Probabilities

  • Frequency of each attribute value for each class
  • Consider a dataset with attribute weather ( values — sunny & rainy ) and target sports ( values — chess & tennis )
  • P(weather=sunny|target=tennis) = Count ( weather=sunny & target=tennis ) / Count ( target=tennis )
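To make the counting concrete, here is a minimal sketch on a made-up toy version of the weather/sports dataset (the six rows are invented purely for illustration):

import pandas as pd

# Made-up toy rows for the weather/sports example above
toy = pd.DataFrame({'weather': ['sunny','sunny','rainy','sunny','rainy','rainy'],
                    'sports':  ['tennis','tennis','chess','tennis','chess','tennis']})

# Class probability: P(target=tennis) = Count(tennis) / Count(all rows)
p_tennis = (toy.sports == 'tennis').mean()                                        # 4/6 ≈ 0.67

# Conditional probability: P(weather=sunny | target=tennis)
p_sunny_given_tennis = (toy[toy.sports == 'tennis'].weather == 'sunny').mean()    # 3/4 = 0.75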

Naive Bayes’ Classifier

  • Formula : Prediction = argmax over h of P(features|h) . P(h)
  • Let’s predict for a new data point (weather=sunny)
  • Possibility of tennis = P(weather=sunny|target=tennis) . P(target=tennis)
  • Possibility of chess = P(weather=sunny|target=chess) . P(target=chess)
  • We choose the class with the higher value
  • Normalize the values to bring them to a scale of 0 to 1

More features

  • If we add one more feature, like skill ( values — low, moderate, high )
  • Our probability becomes P(weather=sunny|target=tennis) . P(skill=moderate|target=tennis) . P(target=tennis) — see the sketch below
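Putting the pieces together, a small sketch of the prediction for (weather=sunny, skill=moderate). All probability values here are invented for illustration, not derived from the toy table above:

# Made-up conditional probabilities and priors for illustration
likelihood = {'tennis': 0.75 * 0.50,   # P(sunny|tennis) * P(moderate|tennis)
              'chess' : 0.20 * 0.40}   # P(sunny|chess)  * P(moderate|chess)
prior = {'tennis': 0.67, 'chess': 0.33}

score = {c: likelihood[c] * prior[c] for c in prior}

# Normalize so the scores sum to 1 (the 0-to-1 scale mentioned above)
total = sum(score.values())
posterior = {c: s / total for c, s in score.items()}
print(max(posterior, key=posterior.get))   # 'tennis' (posterior ≈ 0.90)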

Gaussian Naive Bayes

  • The fundamental example above is for categorical data
  • We can use Naive Bayes for continuous data as well
  • The assumption is that each feature follows a Gaussian distribution
  • Let’s understand a bit about the Gaussian PDF, sketched below
  • Possibility of tennis = pdf(precipitation|class=tennis) . pdf(windy|class=tennis) . P(class=tennis)
  • Prior probabilities can be configured via the priors parameter. By default, they are estimated from the class frequencies in the training data
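A quick sketch of the Gaussian PDF idea. GaussianNB evaluates this density for every (feature, class) pair; the mean and variance values below are hypothetical, not learned from any data:

import numpy as np

def gaussian_pdf(x, mean, var):
    # Density of a normal distribution N(mean, var) at point x
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical per-class statistics for a precipitation feature
gaussian_pdf(2.0, mean=1.5, var=0.5)   # likelihood under class=tennis
gaussian_pdf(2.0, mean=4.0, var=1.0)   # likelihood under class=chess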

In [103]:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
import seaborn as sns; sns.set(color_codes=True)
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
%matplotlib inline

In [104]:

iris = load_iris()

In [105]:

df = pd.DataFrame(iris.data, columns=iris.feature_names)

In [106]:

from sklearn.naive_bayes import GaussianNB

In [107]:

gnb = GaussianNB()

In [108]:

gnb.fit(df,iris.target)

Out[108]:

GaussianNB(priors=None)

In [109]:

gnb.score(df,iris.target)

Out[109]:

0.96

Multinomial Naive Bayes

  • Suited for classification of data with discrete features ( count data )
  • Very useful in text processing
  • Each text is converted to a vector of word counts, as illustrated below
  • Cannot deal with negative feature values
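As a toy illustration of what “vector of word counts” means (the two mini-documents are invented; this is the same CountVectorizer used on the real reviews later):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['good dog food', 'dog hates this food food']   # invented mini-corpus
demo_cv = CountVectorizer()
counts = demo_cv.fit_transform(docs)

print(demo_cv.get_feature_names())   # ['dog', 'food', 'good', 'hates', 'this']
print(counts.toarray())              # [[1 1 1 0 0]
                                     #  [1 2 0 1 1]]  — each row is one document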

In [319]:

review_data = pd.read_csv('Reviews.csv')

In [320]:

review_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
Id                        568454 non-null int64
ProductId                 568454 non-null object
UserId                    568454 non-null object
ProfileName               568438 non-null object
HelpfulnessNumerator      568454 non-null int64
HelpfulnessDenominator    568454 non-null int64
Score                     568454 non-null int64
Time                      568454 non-null int64
Summary                   568427 non-null object
Text                      568454 non-null object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB

In [321]:

review_data.head()

Out[321]:

   | Id | ProductId  | UserId         | ProfileName                     | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time       | Summary               | Text
---|----|------------|----------------|---------------------------------|----------------------|------------------------|-------|------------|-----------------------|--------------------------------------------------
 0 | 1  | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian                      | 1                    | 1                      | 5     | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d…
 1 | 2  | B00813GRG4 | A1D87F6ZCVE5NK | dll pa                          | 0                    | 0                      | 1     | 1346976000 | Not as Advertised     | Product arrived labeled as Jumbo Salted Peanut…
 2 | 3  | B000LQOCH0 | ABXLMWJIXXAIN  | Natalia Corres “Natalia Corres” | 1                    | 1                      | 4     | 1219017600 | “Delight” says it all | This is a confection that has been around a fe…
 3 | 4  | B000UA0QIQ | A395BORC6FGVXV | Karl                            | 3                    | 3                      | 2     | 1307923200 | Cough Medicine        | If you are looking for the secret ingredient i…
 4 | 5  | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham “M. Wassir”   | 0                    | 0                      | 5     | 1350777600 | Great taffy           | Great taffy at a great price. There was a wid…

In [199]:

review_data = review_data[['Text','Score']]

In [200]:

review_data = review_data[review_data.Score != 3]

In [201]:

review_data['Sentiment'] = review_data.Score.map(lambda s:0 if s < 3 else 1)

In [202]:

review_data.drop('Score',axis=1,inplace=True)

In [203]:

review_data.head()

Out[203]:

   | Text                                             | Sentiment
---|--------------------------------------------------|----------
 0 | I have bought several of the Vitality canned d…  | 1
 1 | Product arrived labeled as Jumbo Salted Peanut…  | 0
 2 | This is a confection that has been around a fe…  | 1
 3 | If you are looking for the secret ingredient i…  | 0
 4 | Great taffy at a great price. There was a wid…   | 1

In [204]:

review_data.Sentiment.value_counts()

Out[204]:

1    443777
0     82037
Name: Sentiment, dtype: int64

In [205]:

review_data = review_data.sample(10000)

Remove punctuation

In [206]:

from nltk.tokenize import RegexpTokenizer

In [207]:

tokenizer = RegexpTokenizer(r'[A-Za-z]+')

In [208]:

review_data['Text'] = review_data.Text.map(lambda x:tokenizer.tokenize(x))

In [209]:

review_data.Text

Out[209]:

443278    [I, have, tried, many, tahini, types, Not, a, ...
566427    [To, be, fair, I, m, not, a, fan, of, boxed, M...
 31468    [So, it, s, Chai, If, you, don, t, know, Chai,...
 11772    [I, love, these, chopped, walnuts, I, put, the...
514100    [One, of, the, kitties, has, a, very, sensitiv...
173711    [These, are, one, of, the, best, dog, treats, ...
130529    [chips, cousin, teddy, eats, canidae, dog, foo...
385763    [I, do, love, these, products, but, why, so, e...
380255    [FIRST, Let, me, tell, you, that, I, am, the, ...
141634    [I, am, waiting, for, my, box, as, I, type, th...
 40785    [This, is, my, favorite, tea, flavor, it, s, h...
330596    [Caribou, Blend, is, my, favorite, of, all, th...
297516    [What, else, can, I, say, They, are, great, he...
 88718    [I, bought, dried, milks, Peaks, dry, Whole, M...
 73577    [We, converted, to, Rice, Dream, over, a, year...
499867    [My, cat, is, picky, He, also, gets, an, upset...
493453    [Cadbury, eggs, are, one, of, the, joys, of, E...
164829    [These, things, are, great, They, remind, me, ...
 34614    [When, you, start, one, bag, isn, t, too, long...
488787    [A, very, enjoyable, sweet, treat, love, the, ...
260951    [bye, bye, soda, hello, bai, jamaica, blueberr...
364099    [Got, a, case, of, this, for, with, free, ship...
447763    [was, very, disappointd, with, this, soup, at,...
207784    [It, was, cute, but, I, didn, t, realize, it, ...
 26295    [I, ve, purchased, several, Douwe, Egberts, co...
 80958    [yum, I, haven, t, had, these, for, years, the...
142354    [I, usually, have, trouble, with, the, acid, i...
528729    [Imagine, you, are, in, a, snow, globe, But, i...
502367    [It, seems, like, they, have, tried, to, impro...
 58315    [We, absolutely, love, this, coffee, It, s, a,...
                                ...
  2859    [I, bought, this, at, a, local, gas, station, ...
438998    [TOP, REASONS, NOT, TO, BUY, THIS, SAUCE, br, ...
  8582    [As, coffee, in, general, this, Wolfgang, vari...
428387    [If, this, were, just, a, good, tasting, bette...
387352    [This, stuff, is, amazing, I, first, tried, it...
186330    [I, bought, this, product, based, on, many, po...
462425    [Bottled, this, up, weeks, ago, went, to, open...
 95875    [I, have, a, year, old, who, was, born, allerg...
 46292    [I, purchased, this, and, couldn, t, be, more,...
343847    [Both, my, husband, and, I, love, the, taste, ...
189373    [Nutiva, Extra, Virgin, Coconut, Oil, tastes, ...
125000    [When, I, first, wrote, this, review, for, som...
190775    [I, bought, two, of, these, directly, from, Ae...
 83412    [The, subscribe, save, service, is, good, prod...
263121    [This, used, to, be, the, only, cat, food, my,...
132539    [You, get, a, lot, for, your, money, Has, all,...
351671    [The, sweet, and, tart, flavor, of, the, limes...
381370    [I, bought, this, product, and, love, the, fla...
544952    [I, am, an, avid, home, cook, and, purchased, ...
391933    [peach, flavor, could, be, stronger, I, use, i...
383855    [To, the, person, who, states, that, the, pric...
214966    [I, was, happy, to, be, able, to, get, this, a...
281938    [My, boyfriend, bought, this, for, me, as, a, ...
182924    [The, food, this, company, makes, is, the, clo...
416719    [My, cat, allergic, to, any, food, I, tried, t...
344986    [A, hot, chocolate, packet, was, the, first, K...
  3815    [Panda, All, Natural, Soft, Licorice, is, a, g...
215458    [I, ordered, packs, of, Izze, Fortified, All, ...
503869    [It, taste, like, a, warm, soft, chocolate, ch...
428210    [I, sent, this, to, my, sister, for, her, birt...
Name: Text, Length: 10000, dtype: object

Stemming

In [210]:

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [212]:

review_data['Text'] = review_data.Text.map(lambda l: [stemmer.stem(word) for word in l])

In [214]:

review_data.Text = review_data.Text.str.join(sep=' ')

Preprocessing

In [215]:

from sklearn.feature_extraction.text import CountVectorizer

In [216]:

cv = CountVectorizer(stop_words='english')

In [217]:

review_data_tf = cv.fit_transform(review_data.Text)

Splitting data into train & test

In [219]:

from sklearn.model_selection import train_test_split

trainX,testX,trainY,testY = train_test_split(review_data_tf,review_data.Sentiment)

Create Model

In [221]:

review_data.Sentiment.value_counts()

Out[221]:

1    8463
0    1537
Name: Sentiment, dtype: int64

  • The classes are imbalanced
  • There are two ways to handle this — resampling the data, or adjusting the algorithm (a resampling sketch follows below; the cells after it take the algorithmic route via class priors)
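As a sketch of the data route, the majority class can be randomly undersampled to match the minority class (random_state is arbitrary; the balanced frame is not used further here):

# Sketch: random undersampling of the majority class
pos = review_data[review_data.Sentiment == 1]
neg = review_data[review_data.Sentiment == 0]
balanced = pd.concat([pos.sample(len(neg), random_state=42), neg])
balanced.Sentiment.value_counts()   # both classes now have 1537 rows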

In [222]:

from sklearn.naive_bayes import MultinomialNB

In [282]:

mnb = MultinomialNB(class_prior=[.25,.75])

In [283]:

mnb.fit(trainX,trainY)

Out[283]:

MultinomialNB(alpha=1.0, class_prior=[0.25, 0.75], fit_prior=True)

In [284]:

mnb.class_prior

Out[284]:

[0.25, 0.75]

In [285]:

y_pred = mnb.predict(testX)

In [286]:

from sklearn.metrics import confusion_matrix

In [287]:

confusion_matrix(y_true=testY, y_pred=y_pred)

Out[287]:

array([[ 180,  197],
       [  90, 2033]], dtype=int64)

Bernoulli Naive Bayes

  • Like MultinomialNB, this classifier is suitable for discrete data.
  • The difference is that while MultinomialNB works with occurrence counts, BernoulliNB is designed for binary/boolean features.
  • If the data is not binary, BernoulliNB binarizes it internally (controlled by the binarize parameter), as sketched below
  • It can therefore deal with negative numbers
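A quick sketch of what binarize=0.0 amounts to — thresholding the features, not a re-implementation of BernoulliNB internals:

import numpy as np

X_demo = np.array([[ 0.5, -1.2,  0.0],
                   [ 2.3,  0.1, -0.7]])

# With binarize=0.0, BernoulliNB effectively sees x > 0 as 1 and everything else as 0
(X_demo > 0.0).astype(int)   # [[1 0 0]
                             #  [1 1 0]]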

In [84]:

from sklearn.datasets import make_classification

In [85]:

X, Y = make_classification(n_samples=500, n_features=2, n_informative=2, n_redundant=0)

In [86]:

plt.scatter(X[:,0],X[:,1],c=Y,s=10, cmap='viridis')

Out[86]:

<matplotlib.collections.PathCollection at 0x1bad88e2d30>

In [87]:

from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.model_selection import train_test_split

In [88]:

trainX,testX,trainY,testY = train_test_split(X,Y)

In [95]:

bnb = BernoulliNB(binarize=0.0)
mnb = MultinomialNB()

In [96]:

bnb.fit(trainX, trainY)
#mnb.fit(trainX, trainY)  # would fail: make_classification produces negative feature values

Out[96]:

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [97]:

bnb.score(testX,testY)

Out[97]:

0.952

In [98]:

#mnb.score(testX,testY)

In [99]:

h = .02
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

In [100]:

Z = bnb.predict(np.c_[xx.ravel(), yy.ravel()])

In [101]:

Z = Z.reshape(xx.shape)

In [102]:

plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
plt.scatter(X[:,0],X[:,1],c=Y,s=10)

Out[102]:

<matplotlib.collections.PathCollection at 0x1bad893ed68>

Out-of-core training

  • scikit-learn’s Naive Bayes estimators support the partial_fit method
  • For data that cannot fit into RAM, we can use partial_fit to train the model incrementally, following the pattern sketched below
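The general pattern, sketched here with toy random batches standing in for chunks read from disk (the real Reviews.csv loop follows below):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
rng = np.random.RandomState(0)

for _ in range(5):                                 # pretend each batch is a chunk read from disk
    X_batch = rng.randint(0, 10, size=(100, 20))   # toy count features
    y_batch = rng.randint(0, 2, size=100)
    # the full list of classes must be passed on the first call
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])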

In [290]:

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[A-Za-z]+')

In [291]:

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

  • HashingVectorizer is suited to large data, since it doesn’t maintain state — a quick illustration follows
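Because hashing maps each token to a fixed column index, there is no vocabulary to fit, and transform can be called directly on any chunk (toy strings here, invented for illustration):

from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)

# No fit step needed: every chunk is transformed independently yet consistently
hv.transform(['first chunk of reviews']).shape          # (1, 262144)
hv.transform(['a later chunk, no shared state']).shape  # (1, 262144)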

In [294]:

from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18,
alternate_sign=False)

In [316]:

review_data_chunks = pd.read_csv('Reviews.csv', chunksize=20000)

In [311]:

test = pd.read_csv('Reviews.csv').sample(10000)

In [313]:

test = test[['Text','Score']]
test = test[test.Score != 3]
test['Sentiment'] = test.Score.map(lambda s:0 if s < 3 else 1)
test.Text = test.Text.map(lambda x:tokenizer.tokenize(x))
test.Text = test.Text.map(lambda l: [stemmer.stem(word) for word in l])
test.Text = test.Text.str.join(sep=' ')
test_tf = vectorizer.transform(test.Text)

In [314]:

mnb = MultinomialNB(class_prior=[.22,.78])

  • Take a chunk of data at a time, fit the model on it, and gradually improve it

In [317]:

for idx,review_data in enumerate(review_data_chunks):
    print ('iter : ',idx)
    review_data = review_data[['Text','Score']]
    review_data = review_data[review_data.Score != 3]
    review_data['Sentiment'] = review_data.Score.map(lambda s:0 if s < 3 else 1)
    review_data.Text = review_data.Text.map(lambda x:tokenizer.tokenize(x))
    review_data.Text = review_data.Text.map(lambda l: [stemmer.stem(word) for word in l])
    review_data.Text = review_data.Text.str.join(sep=' ')
    text_tf = vectorizer.transform(review_data.Text)
    mnb.partial_fit(text_tf,review_data.Sentiment,classes=[0,1])
    y_pred = mnb.predict(test_tf)
    print (confusion_matrix(y_pred=y_pred, y_true=test.Sentiment))
iter :  0
[[   3 1434]
 [   1 7781]]
iter :  1
[[   4 1433]
 [   1 7781]]
iter :  2
[[   4 1433]
 [   1 7781]]
iter :  3
[[   5 1432]
 [   1 7781]]
iter :  4
[[   7 1430]
 [   1 7781]]
iter :  5
[[   7 1430]
 [   1 7781]]
iter :  6
[[   8 1429]
 [   1 7781]]
iter :  7
[[   9 1428]
 [   1 7781]]
iter :  8
[[   9 1428]
 [   1 7781]]
iter :  9
[[   9 1428]
 [   1 7781]]
iter :  10
[[   9 1428]
 [   1 7781]]
iter :  11
[[  10 1427]
 [   1 7781]]
iter :  12
[[  11 1426]
 [   1 7781]]
iter :  13
[[  13 1424]
 [   1 7781]]
iter :  14
[[  14 1423]
 [   1 7781]]
iter :  15
[[  15 1422]
 [   1 7781]]
iter :  16
[[  15 1422]
 [   1 7781]]
iter :  17
[[  17 1420]
 [   1 7781]]
iter :  18
[[  17 1420]
 [   1 7781]]
iter :  19
[[  18 1419]
 [   1 7781]]
iter :  20
[[  17 1420]
 [   1 7781]]
iter :  21
[[  18 1419]
 [   1 7781]]
iter :  22
[[  19 1418]
 [   1 7781]]
iter :  23
[[  20 1417]
 [   1 7781]]
iter :  24
[[  20 1417]
 [   1 7781]]
iter :  25
[[  23 1414]
 [   1 7781]]
iter :  26
[[  24 1413]
 [   1 7781]]
iter :  27
[[  24 1413]
 [   1 7781]]
iter :  28
[[  25 1412]
 [   1 7781]]
  • As we can see, the model keeps improving
  • The next step would be to use sampling techniques to further improve accuracy
