How the Naive Bayes classifier works: Part 2
Part 2 applies what we learned in Part 1. If you would like to know the basics of probability and Naive Bayes, I recommend going through Part 1 first.
In this post we will see how to use R’s Naive Bayes library to estimate the chances of a toss decision (bat or field) for a particular venue in the Indian Premier League (IPL). Yes, I’m a cricket fanatic…
Kaggle is a great destination for data analysts and machine learners to get their hands on real-world data and solve real-world problems. The IPL dataset can be found there too. The package contains 2 files; the one we are interested in is matches.csv.
What we are trying to classify here is the toss decision after a captain has won the toss: whether he will bat or field. This decision largely depends on the pitch condition and the time (whether it’s a day match or an evening one). matches.csv contains all the matches since the inception of the IPL across all the stadiums, along with the toss decision. This is our labelled training set. Based on this dataset, suppose we would like to find out
P(Bat | Venue = “A venue from the dataset”) and
P(Field | Venue = “A venue from the dataset”)
i.e. what is the probability that a captain will opt for batting/fielding given that the venue is XYZ?
Naive Bayes takes whichever of these two probabilities is higher as its classification decision.
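To make the rule concrete, here is a minimal base R sketch of how the two probabilities could be computed by hand with Bayes' rule. The data frame `toy`, its venue names ("Stadium A", "Stadium B") and its six rows are hypothetical and are not taken from matches.csv; they only illustrate the arithmetic.

```r
# Hypothetical toy data, NOT taken from matches.csv
toy <- data.frame(
  venue = c("Stadium A", "Stadium A", "Stadium A",
            "Stadium B", "Stadium B", "Stadium B"),
  class = c("bat", "bat", "field", "field", "field", "bat")
)

# Prior P(class): share of each toss decision over all matches
prior <- prop.table(table(toy$class))

# Likelihood P(venue | class): each row (one per class) sums to 1
likelihood <- prop.table(table(toy$class, toy$venue), margin = 1)

# Bayes' rule for one venue: the posterior is proportional to prior * likelihood
unnormalised <- prior * likelihood[, "Stadium A"]
posterior <- unnormalised / sum(unnormalised)
posterior

# The classification decision is the class with the higher posterior
decision <- names(which.max(posterior))
decision
```

With a single predictor such as the venue, this posterior is simply the share of bat/field decisions among the matches played at that venue.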
Below is the complete code for the IPL toss classifier in R:
# First install the package e1071: install.packages("e1071")
library(e1071)

iplmatches <- read.csv("Location of your matches.csv", header = TRUE, stringsAsFactors = FALSE)

# Keep only the required columns in the data frame
iplmatches <- iplmatches[, c("toss_winner", "venue", "toss_decision")]

# Store the columns as factors explicitly; since R 4.0, data.frame()
# no longer converts strings to factors by default
train <- data.frame(class = factor(iplmatches$toss_decision),
                    tw = factor(iplmatches$toss_winner),
                    venue = factor(iplmatches$venue))

# This will list in the console all the venues in the dataset
levels(train$venue)

# Train the model
classifier <- naiveBayes(class ~ venue, train)

# Venue for which we need to classify
test <- data.frame(venue = c("Wankhede Stadium"))

# Store the venue as a factor whose levels are all the venues in the training set
test$venue <- factor(test$venue,levels=c("Barabati Stadium","Brabourne Stadium","Buffalo Park","De Beers Diamond Oval","Dr DY Patil Sports Academy",
"Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium","Dubai International Cricket Stadium",
"Eden Gardens","Feroz Shah Kotla", "Green Park","Himachal Pradesh Cricket Association Stadium",
"Holkar Cricket Stadium","JSCA International Stadium Complex","Kingsmead","M Chinnaswamy Stadium",
"MA Chidambaram Stadium, Chepauk","Maharashtra Cricket Association Stadium", "Nehru Stadium",
"New Wanderers Stadium","Newlands","OUTsurance Oval","Punjab Cricket Association IS Bindra Stadium, Mohali",
"Rajiv Gandhi International Stadium, Uppal","Sardar Patel Stadium, Motera","Saurashtra Cricket Association Stadium",
"Sawai Mansingh Stadium","Shaheed Veer Narayan Singh International Stadium","Sharjah Cricket Stadium",
"Sheikh Zayed Stadium","St George's Park","Subrata Roy Sahara Stadium","SuperSport Park","Vidarbha Cricket Association Stadium, Jamtha",
"Wankhede Stadium"))
prediction <- predict(classifier, test, type = "raw")

# Shows the probabilities for bat and field
prediction
The result is shown in the console like this:
           bat     field
[1,] 0.6666667 0.3333333
So the probability of batting is around 67% when the venue is Wankhede Stadium, compared to ~33% for fielding.
So the Naive Bayes classifier predicts that the captain is likely to opt to bat when the venue is Wankhede Stadium.
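If only the decision itself is needed rather than the probabilities, `predict()` can be asked for the class label directly. The sketch below assumes the e1071 package is installed and uses a hypothetical toy data set (`train_toy`, with made-up venues "Stadium A" and "Stadium B") instead of matches.csv. It also takes the test factor's levels from the training data rather than typing them out, which is one possible way to keep the two in sync.

```r
library(e1071)

# Hypothetical toy data, NOT taken from matches.csv
train_toy <- data.frame(
  venue = factor(c("Stadium A", "Stadium A", "Stadium A",
                   "Stadium B", "Stadium B", "Stadium B")),
  class = factor(c("bat", "bat", "field", "field", "field", "bat"))
)

toy_classifier <- naiveBayes(class ~ venue, train_toy)

# Reuse the training levels so the test venue is encoded the same way
test_toy <- data.frame(
  venue = factor("Stadium A", levels = levels(train_toy$venue))
)

# type = "raw" returns the probabilities for each class
raw_toy <- predict(toy_classifier, test_toy, type = "raw")
raw_toy

# type = "class" (the default) returns the label with the higher probability
label_toy <- predict(toy_classifier, test_toy, type = "class")
label_toy
```

Here `label_toy` is a factor holding the predicted toss decision, so it can be compared directly with the actual decisions when checking the model.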