A comprehensive Naive Bayes Tutorial using scikit-learn
Agenda
- Introduction to Bayes’ Theorem
- Naive Bayes Classifier
- Gaussian Naive Bayes
- Multinomial Naive Bayes
- Bernoulli Naive Bayes
- Naive Bayes for out-of-core training
Introduction to Naive Bayes
- The Naive Bayes classifier is based on Bayes’ theorem and is particularly well suited to high-dimensional data.
- It is simple, and on some problems it performs as well as or better than far more sophisticated methods.
- Rather than attempting to estimate the joint probability of all attribute values, P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so the joint probability is calculated as P(d1|h) * P(d2|h) * P(d3|h).
- This assumption is very strong and rarely holds in real data, yet Naive Bayes still works quite well in practice.
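Written out, the simplification reads as follows. This is the standard statement of Bayes’ theorem with the naive independence assumption applied, using the d1, d2, d3 and h notation from the bullets above. The denominator P(d1, d2, d3) is the same for every class, so it can be dropped when the classes are only being compared.

```latex
P(h \mid d_1, d_2, d_3)
  = \frac{P(d_1, d_2, d_3 \mid h)\, P(h)}{P(d_1, d_2, d_3)}
  \;\propto\; P(d_1 \mid h)\, P(d_2 \mid h)\, P(d_3 \mid h)\, P(h)
```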
Class Probabilities
- For two-class classification, P(Class 1) = Count(Class 1) / (Count(Class 1) + Count(Class 2))
Conditional Probabilities
- Computed from the frequency of each attribute value within each class
- Consider a dataset with one attribute, weather (values: sunny, rainy), and a target, sport (values: chess, tennis)
- P(weather=sunny|target=tennis) = Count(weather=sunny & target=tennis) / Count(target=tennis)
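The counting behind both kinds of probability can be sketched in a few lines of plain Python. The toy rows below are invented for illustration and do not come from any real dataset.

```python
from collections import Counter

# Hypothetical toy data: (weather, sport) pairs, invented for illustration
rows = [
    ("sunny", "tennis"), ("sunny", "tennis"), ("sunny", "tennis"),
    ("rainy", "tennis"),
    ("sunny", "chess"),
    ("rainy", "chess"), ("rainy", "chess"), ("rainy", "chess"),
]

class_counts = Counter(sport for _, sport in rows)
pair_counts = Counter(rows)

# Class probability: P(class) = Count(class) / total number of rows
priors = {c: n / len(rows) for c, n in class_counts.items()}

# Conditional probability: P(weather | class) = Count(weather & class) / Count(class)
def conditional(weather, sport):
    return pair_counts[(weather, sport)] / class_counts[sport]

print(priors)
print(conditional("sunny", "tennis"))
```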
Naive Bayes Classifier
- Formula: Prediction = the class h that maximizes P(feature|h) . P(h)
- Let’s predict the class of a new data point (weather=sunny)
- Score for tennis = P(weather=sunny|target=tennis) . P(target=tennis)
- Score for chess = P(weather=sunny|target=chess) . P(target=chess)
- We choose the class with the higher score
- To turn the scores into probabilities between 0 and 1, divide each score by the sum of the two scores
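As a minimal sketch of the prediction and normalization steps, assume the probabilities have already been estimated by counting. The numbers below are hypothetical.

```python
# Hypothetical probabilities, e.g. estimated by counting in a training set
prior = {"tennis": 0.5, "chess": 0.5}
p_sunny_given = {"tennis": 0.75, "chess": 0.25}

# Unnormalized score for each class: P(weather=sunny | class) * P(class)
scores = {c: p_sunny_given[c] * prior[c] for c in prior}

# Normalize so the scores sum to 1 and can be read as probabilities
total = sum(scores.values())
posterior = {c: s / total for c, s in scores.items()}

# The prediction is the class with the largest (normalized) score
prediction = max(posterior, key=posterior.get)
print(posterior, prediction)
```

Normalizing does not change which class wins; it only makes the scores easier to interpret.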
More features
- Suppose we add another feature, such as skill (values: low, moderate, high)
- The score for tennis then becomes P(weather=sunny|target=tennis) . P(skill=moderate|target=tennis) . P(target=tennis)
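With several features, the per-feature conditionals are simply multiplied together. The sketch below uses made-up probabilities for a new sample with weather=sunny and skill=moderate.

```python
from math import prod

# Hypothetical per-class probabilities for the sample (weather=sunny, skill=moderate)
prior = {"tennis": 0.5, "chess": 0.5}
likelihoods = {
    "tennis": {"weather=sunny": 0.75, "skill=moderate": 0.4},
    "chess": {"weather=sunny": 0.25, "skill=moderate": 0.6},
}

# Naive assumption: multiply the per-feature conditionals, then the prior
scores = {c: prod(likelihoods[c].values()) * prior[c] for c in prior}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

With many features the product of small probabilities can underflow, which is why implementations usually add log-probabilities instead of multiplying.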
Gaussian Naive Bayes
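- The example above deals with categorical data
- Naive Bayes can be used for continuous data as well
- The assumption is that, within each class, every feature follows a Gaussian (normal) distribution
- The conditional probability of a feature value is then replaced by the Gaussian probability density function (PDF), evaluated with that feature’s per-class mean and variance
- Score for tennis = pdf(precipitation|class=tennis) . pdf(windy|class=tennis) . P(class=tennis)
- The prior probabilities can be configured through the priors parameter. By default (priors=None), scikit-learn estimates them from the class frequencies in the training data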
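To make the role of the Gaussian PDF concrete, here is a small sketch in plain Python. The per-class means and standard deviations are hypothetical; in practice they are estimated from the training data.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    coeff = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))

# Hypothetical per-class (mean, std) of a 'precipitation' feature
stats = {"tennis": (2.0, 1.0), "chess": (8.0, 2.0)}
prior = {"tennis": 0.5, "chess": 0.5}

x = 3.0  # new observation

# Score per class: density of x under that class's Gaussian, times the prior
scores = {c: gaussian_pdf(x, *stats[c]) * prior[c] for c in stats}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```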
In [103]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
import seaborn as sns; sns.set(color_codes=True)
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
%matplotlib inline
In [104]:
iris = load_iris()
In [105]:
df = pd.DataFrame(iris.data, columns=iris.feature_names)
In [106]:
from sklearn.naive_bayes import GaussianNB
In [107]:
gnb = GaussianNB()
In [108]:
gnb.fit(df,iris.target)
Out[108]:
GaussianNB(priors=None)
In [109]:
# Accuracy on the same rows the model was trained on, so this is an optimistic estimate
gnb.score(df,iris.target)
Out[109]:
0.96
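The 0.96 above is measured on the same rows the model was trained on. A fairer estimate holds some rows back for testing. The sketch below is one way to do that; the 30% test size and random_state=0 are arbitrary choices, not values from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows; stratify keeps the class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

gnb_holdout = GaussianNB().fit(X_train, y_train)
test_accuracy = gnb_holdout.score(X_test, y_test)
print(test_accuracy)
```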
Multinomial Naive Bayes
- Suited to classifying data with discrete features (count data)
- Very useful in text processing
- Each text is converted to a vector of word counts
- Cannot handle negative feature values
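As a quick illustration of what "a vector of word counts" means, the sketch below runs CountVectorizer on a made-up two-document corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good good food", "bad food"]  # made-up mini corpus

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Columns follow the alphabetically sorted vocabulary
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(counts.toarray())
```

Each row is one document and each column is one word, so every entry is a non-negative count, which is the kind of input MultinomialNB expects.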
In [319]:
review_data = pd.read_csv('Reviews.csv')
In [320]:
review_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
Id 568454 non-null int64
ProductId 568454 non-null object
UserId 568454 non-null object
ProfileName 568438 non-null object
HelpfulnessNumerator 568454 non-null int64
HelpfulnessDenominator 568454 non-null int64
Score 568454 non-null int64
Time 568454 non-null int64
Summary 568427 non-null object
Text 568454 non-null object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB
In [321]:
review_data.head()
Out[321]:
   Id  ProductId   UserId          ProfileName                      HelpfulnessNumerator  HelpfulnessDenominator  Score  Time        Summary                Text
0  1   B001E4KFG0  A3SGXH7AUHU8GW  delmartian                       1                     1                       5      1303862400  Good Quality Dog Food  I have bought several of the Vitality canned d…
1  2   B00813GRG4  A1D87F6ZCVE5NK  dll pa                           0                     0                       1      1346976000  Not as Advertised      Product arrived labeled as Jumbo Salted Peanut…
2  3   B000LQOCH0  ABXLMWJIXXAIN   Natalia Corres "Natalia Corres"  1                     1                       4      1219017600  "Delight" says it all  This is a confection that has been around a fe…
3  4   B000UA0QIQ  A395BORC6FGVXV  Karl                             3                     3                       2      1307923200  Cough Medicine         If you are looking for the secret ingredient i…
4  5   B006K2ZZ7K  A1UQRSCLF8GW1T  Michael D. Bigham "M. Wassir"    0                     0                       5      1350777600  Great taffy            Great taffy at a great price. There was a wid…
In [199]:
review_data = review_data[['Text','Score']]
In [200]:
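# Drop neutral 3-star reviews; .copy() avoids pandas' SettingWithCopyWarning when columns are modified later
review_data = review_data[review_data.Score != 3].copy()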
In [201]:
review_data['Sentiment'] = review_data.Score.map(lambda s:0 if s < 3 else 1)
In [202]:
review_data.drop('Score',axis=1,inplace=True)
In [203]:
review_data.head()
Out[203]:
   Text                                             Sentiment
0  I have bought several of the Vitality canned d…  1
1  Product arrived labeled as Jumbo Salted Peanut…  0
2  This is a confection that has been around a fe…  1
3  If you are looking for the secret ingredient i…  0
4  Great taffy at a great price. There was a wid…   1
In [204]:
review_data.Sentiment.value_counts()
Out[204]:
1 443777
0 82037
Name: Sentiment, dtype: int64
In [205]:
review_data = review_data.sample(10000)
Remove punctuation
In [206]:
from nltk.tokenize import RegexpTokenizer
In [207]:
tokenizer = RegexpTokenizer(r'[A-Za-z]+')
In [208]:
review_data['Text'] = review_data.Text.map(lambda x:tokenizer.tokenize(x))
In [209]:
review_data.Text
Out[209]:
443278 [I, have, tried, many, tahini, types, Not, a, ...
566427 [To, be, fair, I, m, not, a, fan, of, boxed, M...
31468 [So, it, s, Chai, If, you, don, t, know, Chai,...
11772 [I, love, these, chopped, walnuts, I, put, the...
514100 [One, of, the, kitties, has, a, very, sensitiv...
173711 [These, are, one, of, the, best, dog, treats, ...
130529 [chips, cousin, teddy, eats, canidae, dog, foo...
385763 [I, do, love, these, products, but, why, so, e...
380255 [FIRST, Let, me, tell, you, that, I, am, the, ...
141634 [I, am, waiting, for, my, box, as, I, type, th...
40785 [This, is, my, favorite, tea, flavor, it, s, h...
330596 [Caribou, Blend, is, my, favorite, of, all, th...
297516 [What, else, can, I, say, They, are, great, he...
88718 [I, bought, dried, milks, Peaks, dry, Whole, M...
73577 [We, converted, to, Rice, Dream, over, a, year...
499867 [My, cat, is, picky, He, also, gets, an, upset...
493453 [Cadbury, eggs, are, one, of, the, joys, of, E...
164829 [These, things, are, great, They, remind, me, ...
34614 [When, you, start, one, bag, isn, t, too, long...
488787 [A, very, enjoyable, sweet, treat, love, the, ...
260951 [bye, bye, soda, hello, bai, jamaica, blueberr...
364099 [Got, a, case, of, this, for, with, free, ship...
447763 [was, very, disappointd, with, this, soup, at,...
207784 [It, was, cute, but, I, didn, t, realize, it, ...
26295 [I, ve, purchased, several, Douwe, Egberts, co...
80958 [yum, I, haven, t, had, these, for, years, the...
142354 [I, usually, have, trouble, with, the, acid, i...
528729 [Imagine, you, are, in, a, snow, globe, But, i...
502367 [It, seems, like, they, have, tried, to, impro...
58315 [We, absolutely, love, this, coffee, It, s, a,...
...
2859 [I, bought, this, at, a, local, gas, station, ...
438998 [TOP, REASONS, NOT, TO, BUY, THIS, SAUCE, br, ...
8582 [As, coffee, in, general, this, Wolfgang, vari...
428387 [If, this, were, just, a, good, tasting, bette...
387352 [This, stuff, is, amazing, I, first, tried, it...
186330 [I, bought, this, product, based, on, many, po...
462425 [Bottled, this, up, weeks, ago, went, to, open...
95875 [I, have, a, year, old, who, was, born, allerg...
46292 [I, purchased, this, and, couldn, t, be, more,...
343847 [Both, my, husband, and, I, love, the, taste, ...
189373 [Nutiva, Extra, Virgin, Coconut, Oil, tastes, ...
125000 [When, I, first, wrote, this, review, for, som...
190775 [I, bought, two, of, these, directly, from, Ae...
83412 [The, subscribe, save, service, is, good, prod...
263121 [This, used, to, be, the, only, cat, food, my,...
132539 [You, get, a, lot, for, your, money, Has, all,...
351671 [The, sweet, and, tart, flavor, of, the, limes...
381370 [I, bought, this, product, and, love, the, fla...
544952 [I, am, an, avid, home, cook, and, purchased, ...
391933 [peach, flavor, could, be, stronger, I, use, i...
383855 [To, the, person, who, states, that, the, pric...
214966 [I, was, happy, to, be, able, to, get, this, a...
281938 [My, boyfriend, bought, this, for, me, as, a, ...
182924 [The, food, this, company, makes, is, the, clo...
416719 [My, cat, allergic, to, any, food, I, tried, t...
344986 [A, hot, chocolate, packet, was, the, first, K...
3815 [Panda, All, Natural, Soft, Licorice, is, a, g...
215458 [I, ordered, packs, of, Izze, Fortified, All, ...
503869 [It, taste, like, a, warm, soft, chocolate, ch...
428210 [I, sent, this, to, my, sister, for, her, birt...
Name: Text, Length: 10000, dtype: object
Stemming
In [210]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
In [212]:
review_data['Text'] = review_data.Text.map(lambda l: [stemmer.stem(word) for word in l])
In [214]:
review_data.Text = review_data.Text.str.join(sep=' ')
Preprocessing
In [215]:
from sklearn.feature_extraction.text import CountVectorizer
In [216]:
cv = CountVectorizer(stop_words='english')
In [217]:
review_data_tf = cv.fit_transform(review_data.Text)
Splitting data into train and test sets
In [219]:
from sklearn.model_selection import train_test_split
trainX,testX,trainY,testY = train_test_split(review_data_tf,review_data.Sentiment)
Create Model
In [221]:
review_data.Sentiment.value_counts()
Out[221]:
1 8463
0 1537
Name: Sentiment, dtype: int64
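- The classes are imbalanced: positive reviews far outnumber negative ones
- There are two ways to handle this: at the data level (resampling) or at the algorithm level (for example, setting the class priors)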
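The data-level option can be sketched with pandas by downsampling every class to the size of the smallest one. The frame below is made up, and random downsampling is only one of several possible resampling strategies.

```python
import pandas as pd

# Made-up imbalanced frame: 8 positive rows, 2 negative rows
df = pd.DataFrame({
    "Text": [f"review {i}" for i in range(10)],
    "Sentiment": [1] * 8 + [0] * 2,
})

# Size of the smallest class
minority_size = df.Sentiment.value_counts().min()

# Randomly keep only minority_size rows from each class
balanced = pd.concat([
    group.sample(minority_size, random_state=0)
    for _, group in df.groupby("Sentiment")
])

print(balanced.Sentiment.value_counts())
```

Downsampling throws data away, so on a large dataset it trades some information for balance. The notebook itself takes the algorithm-level route by setting class_prior.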
In [222]:
from sklearn.naive_bayes import MultinomialNB
In [282]:
mnb = MultinomialNB(class_prior=[.25,.75])
In [283]:
mnb.fit(trainX,trainY)
Out[283]:
MultinomialNB(alpha=1.0, class_prior=[0.25, 0.75], fit_prior=True)
In [284]:
mnb.class_prior
Out[284]:
[0.25, 0.75]
In [285]:
y_pred = mnb.predict(testX)
In [286]:
from sklearn.metrics import confusion_matrix
In [287]:
confusion_matrix(y_true=testY, y_pred=y_pred)
Out[287]:
array([[ 180,  197],
       [  90, 2033]], dtype=int64)
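To read this matrix, recall that scikit-learn puts the true classes on the rows and the predicted classes on the columns. The sketch below derives accuracy and per-class recall from the four numbers printed above.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
cm = [[180, 197],
      [90, 2033]]

tn, fp = cm[0]  # true class 0 (negative reviews)
fn, tp = cm[1]  # true class 1 (positive reviews)

accuracy = (tn + tp) / (tn + fp + fn + tp)
recall_negative = tn / (tn + fp)  # share of negative reviews that were caught
recall_positive = tp / (tp + fn)  # share of positive reviews that were caught

print(round(accuracy, 3), round(recall_negative, 3), round(recall_positive, 3))
```

Overall accuracy is high, but fewer than half of the negative reviews are identified, which is the usual symptom of class imbalance.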
Bernoulli Naive Bayes
- Like MultinomialNB, this classifier is suitable for discrete data.
- The difference is that MultinomialNB works with occurrence counts, while BernoulliNB is designed for binary/boolean features.
- If the data is not binary, it is binarized internally using the binarize threshold (0.0 by default): values above the threshold become 1, all others become 0
- Because of this thresholding, it can handle negative feature values
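The internal binarization can be checked by doing the thresholding by hand and comparing predictions. The data below is synthetic, and the labelling rule is invented purely for the demonstration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # made-up continuous features, some negative
y = (X[:, 0] > 0).astype(int)   # label depends on the sign of feature 0

# Let the model threshold at 0.0 internally
auto = BernoulliNB(binarize=0.0).fit(X, y)

# Same thing done by hand: values > 0 become 1, everything else 0
X_bin = (X > 0.0).astype(float)
manual = BernoulliNB(binarize=None).fit(X_bin, y)

# Both models should make identical predictions
same = np.array_equal(auto.predict(X), manual.predict(X_bin))
print(same)
```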
In [84]:
from sklearn.datasets import make_classification
In [85]:
X, Y = make_classification(n_samples=500, n_features=2, n_informative=2, n_redundant=0)
In [86]:
plt.scatter(X[:,0],X[:,1],c=Y,s=10, cmap='viridis')
Out[86]:
<matplotlib.collections.PathCollection at 0x1bad88e2d30>
In [87]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.model_selection import train_test_split
In [88]:
trainX,testX,trainY,testY = train_test_split(X,Y)
In [95]:
bnb = BernoulliNB(binarize=0.0)
mnb = MultinomialNB()
In [96]:
bnb.fit(trainX, trainY)
# mnb.fit(trainX, trainY)  # left disabled: MultinomialNB raises a ValueError on negative feature values
Out[96]:
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
In [97]:
bnb.score(testX,testY)
Out[97]:
0.952
In [98]:
#mnb.score(testX,testY)
In [99]:
h = .02
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
In [100]:
Z = bnb.predict(np.c_[xx.ravel(), yy.ravel()])
In [101]:
Z = Z.reshape(xx.shape)
In [102]:
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
plt.scatter(X[:,0],X[:,1],c=Y,s=10)
Out[102]:
<matplotlib.collections.PathCollection at 0x1bad893ed68>
Out-of-core training
- The Naive Bayes estimators in scikit-learn support the partial_fit method
- For data that does not fit in RAM, we can call partial_fit on one chunk at a time to train the model incrementally
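Before the real dataset, here is a minimal sketch of the partial_fit pattern on synthetic count data. The chunk generator is hypothetical and stands in for reading pieces of a large file from disk.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
model = MultinomialNB()

# Pretend each loop iteration loads one chunk of count data from disk
for _ in range(5):
    y_chunk = rng.integers(0, 2, size=100)
    # Made-up counts: class 1 uses feature 0 more, class 0 uses feature 1 more
    rates = np.where(y_chunk[:, None] == 1, [5.0, 1.0], [1.0, 5.0])
    X_chunk = rng.poisson(rates)
    # classes must be known from the first call, since a chunk may miss a class
    model.partial_fit(X_chunk, y_chunk, classes=[0, 1])

print(model.class_count_)  # total samples seen per class across all chunks
print(model.predict([[6, 0], [0, 6]]))
```

Only one chunk is held in memory at a time, while the model accumulates counts across all of them.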
In [290]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[A-Za-z]+')
In [291]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
- HashingVectorizer is suited to large data because it is stateless: it needs no fitted vocabulary, so each chunk can be transformed independently
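The stateless behaviour can be seen in a tiny sketch: transform works without any fit step. The small n_features and norm=None are chosen here only to make the result easy to inspect and are not the settings used in the notebook.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# n_features is kept tiny only for readability; norm=None keeps raw counts
hv = HashingVectorizer(n_features=2 ** 8, alternate_sign=False, norm=None)

# No fit step: transform works straight away, chunk by chunk
chunk_a = hv.transform(["great taffy great price"])
chunk_b = hv.transform(["great taffy"])

# Each row sums to the number of tokens in the document
print(chunk_a.shape, chunk_a.sum(), chunk_b.sum())
```

Because every word is mapped to a column by a hash function, the same word lands in the same column in every chunk, at the cost of occasional collisions between different words.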
In [294]:
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18,
                               alternate_sign=False)
In [316]:
review_data_chunks = pd.read_csv('Reviews.csv', chunksize=20000)
In [311]:
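# Note: this loads the whole file into memory, and the sampled rows also appear in the
# training chunks, so the evaluation below is optimistic; a separate held-out file would be cleaner
test = pd.read_csv('Reviews.csv').sample(10000)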
In [313]:
test = test[['Text','Score']]
test = test[test.Score != 3]
test['Sentiment'] = test.Score.map(lambda s:0 if s < 3 else 1)
test.Text = test.Text.map(lambda x:tokenizer.tokenize(x))
test.Text = test.Text.map(lambda l: [stemmer.stem(word) for word in l])
test.Text = test.Text.str.join(sep=' ')
test_tf = vectorizer.transform(test.Text)
In [314]:
mnb = MultinomialNB(class_prior=[.22,.78])
- Take one chunk of data at a time, fit the model on it, and evaluate on the test sample after every chunk
In [317]:
for idx,review_data in enumerate(review_data_chunks):
    print ('iter : ',idx)
    review_data = review_data[['Text','Score']]
    review_data = review_data[review_data.Score != 3]
    review_data['Sentiment'] = review_data.Score.map(lambda s:0 if s < 3 else 1)
    review_data.Text = review_data.Text.map(lambda x:tokenizer.tokenize(x))
    review_data.Text = review_data.Text.map(lambda l: [stemmer.stem(word) for word in l])
    review_data.Text = review_data.Text.str.join(sep=' ')
    text_tf = vectorizer.transform(review_data.Text)
    mnb.partial_fit(text_tf,review_data.Sentiment,classes=[0,1])
    y_pred = mnb.predict(test_tf)
    print (confusion_matrix(y_pred=y_pred, y_true=test.Sentiment))
iter : 0
[[ 3 1434]
[ 1 7781]]
iter : 1
[[ 4 1433]
[ 1 7781]]
iter : 2
[[ 4 1433]
[ 1 7781]]
iter : 3
[[ 5 1432]
[ 1 7781]]
iter : 4
[[ 7 1430]
[ 1 7781]]
iter : 5
[[ 7 1430]
[ 1 7781]]
iter : 6
[[ 8 1429]
[ 1 7781]]
iter : 7
[[ 9 1428]
[ 1 7781]]
iter : 8
[[ 9 1428]
[ 1 7781]]
iter : 9
[[ 9 1428]
[ 1 7781]]
iter : 10
[[ 9 1428]
[ 1 7781]]
iter : 11
[[ 10 1427]
[ 1 7781]]
iter : 12
[[ 11 1426]
[ 1 7781]]
iter : 13
[[ 13 1424]
[ 1 7781]]
iter : 14
[[ 14 1423]
[ 1 7781]]
iter : 15
[[ 15 1422]
[ 1 7781]]
iter : 16
[[ 15 1422]
[ 1 7781]]
iter : 17
[[ 17 1420]
[ 1 7781]]
iter : 18
[[ 17 1420]
[ 1 7781]]
iter : 19
[[ 18 1419]
[ 1 7781]]
iter : 20
[[ 17 1420]
[ 1 7781]]
iter : 21
[[ 18 1419]
[ 1 7781]]
iter : 22
[[ 19 1418]
[ 1 7781]]
iter : 23
[[ 20 1417]
[ 1 7781]]
iter : 24
[[ 20 1417]
[ 1 7781]]
iter : 25
[[ 23 1414]
[ 1 7781]]
iter : 26
[[ 24 1413]
[ 1 7781]]
iter : 27
[[ 24 1413]
[ 1 7781]]
iter : 28
[[ 25 1412]
[ 1 7781]]
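- As we can see, the model improves slowly with each chunk: correctly classified negative reviews rise from 3 to 25 out of 1,437, although almost every review is still predicted as positive
- The next step would be to use sampling techniques to improve accuracy further, especially on the negative class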