Guess The Continent — A Naive Bayes Classifier With Scikit-Learn

Alex Mitrani
Published in Analytics Vidhya
4 min read · Sep 20, 2019



Over the past two weeks I have been studying, and blogging about, Bayes’ Theorem and Naive Bayes Classifiers. This week I put that knowledge to the test and built a classifier that attempts to predict the continent a country is on from numerical data in The World Factbook.

Naive Bayes Classifiers assign a class to an observation by combining the likelihood of the observation with the prior probability of each class. First, segment the data by the target variable y and calculate the mean and standard deviation of each feature x within each subset. Then use these statistics to model the probability distribution of the data within each group: this is your likelihood. Next, calculate the frequency with which each value of the target variable y occurs: this is your prior probability for each class. Finally, you can classify an unknown observation by multiplying its likelihood under each class by that class’s prior and choosing the class with the highest product.

In Python you can compute each step with a few custom NumPy functions. I did this with a simple example, but it gets more complicated when you have many observations and more than two classes. Fortunately, Scikit-learn has built-in Naive Bayes methods.
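The steps above can be sketched by hand with NumPy. This is a minimal toy example with one feature and two hypothetical classes, not the Factbook data itself:

```python
import numpy as np

# Hypothetical toy data: one numeric feature, class labels 0 and 1.
X = np.array([1.0, 1.2, 0.9, 3.0, 3.2, 2.8])
y = np.array([0, 0, 0, 1, 1, 1])

def gaussian_pdf(x, mean, std):
    """Likelihood of x under a normal distribution."""
    return np.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * np.sqrt(2 * np.pi))

def predict(x):
    """Pick the class with the highest prior * likelihood."""
    scores = []
    for c in np.unique(y):
        subset = X[y == c]
        prior = len(subset) / len(X)  # frequency of the class in the data
        likelihood = gaussian_pdf(x, subset.mean(), subset.std())
        scores.append(prior * likelihood)
    return int(np.argmax(scores))

print(predict(1.1))  # near class 0's mean, so predicts 0
print(predict(3.1))  # near class 1's mean, so predicts 1
```

With more features, the likelihood becomes the product of the per-feature densities, which is exactly the "naive" independence assumption.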

The Data

The country data comes from Fernando Lasso’s Kaggle dataset. It contains country and territory data taken from The World Factbook between 1970 and 2017. There are 277 entries listed, each with 20 columns of data including the name. The first five rows are shown below:

The head of the data frame from the Kaggle dataset

The first step in this process is to clean the data. You may have noticed that decimals in this dataset are represented by the comma character, which is common outside the United States. Pandas will read these values as strings rather than floats, so you need to replace each comma with a decimal point before converting the values to floats. This can be done in one go with countries_edited['column_name'] = countries['column_name'].str.replace(',', '.').astype(float) (there are many ways to do this; this is one example).
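A small sketch of that cleaning step, using a made-up slice of the data (the column name mirrors the dataset’s style but the values here are hypothetical):

```python
import pandas as pd

# Hypothetical slice of the countries data, with commas as decimal marks.
countries = pd.DataFrame({"Pop. Density (per sq. mi.)": ["48,0", "124,6", "13,8"]})

countries_edited = countries.copy()
col = "Pop. Density (per sq. mi.)"
# Swap the comma for a period, then cast the strings to floats.
countries_edited[col] = countries[col].str.replace(",", ".").astype(float)

print(countries_edited[col].tolist())  # → [48.0, 124.6, 13.8]
```

As an alternative, pandas’ read_csv accepts a decimal=',' argument, which handles this at load time.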

Another cleaning task is to tidy up the target variable, which in my case is the Region column. For instance, I aggregated the regions by continent because the data is sparse and there is a large class imbalance among the original regions. You can do this with pandas’ .replace method as well.
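The aggregation can be done with a mapping dictionary. The region names below follow the dataset’s style, but this mapping is an illustrative subset, not the full one used in the project:

```python
import pandas as pd

# Hypothetical subset of the region-to-continent mapping.
region_to_continent = {
    "WESTERN EUROPE": "Europe",
    "EASTERN EUROPE": "Europe",
    "NORTHERN AFRICA": "Africa",
    "SUB-SAHARAN AFRICA": "Africa",
}

countries = pd.DataFrame({"Region": ["WESTERN EUROPE", "NORTHERN AFRICA"]})
# .replace maps each region string onto its continent label.
countries["Region"] = countries["Region"].replace(region_to_continent)

print(countries["Region"].tolist())  # → ['Europe', 'Africa']
```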

Once I was done with this step I deleted 19 entries that contained NaN values in numerous columns. Several of these were overseas territories or areas where data is not transparent. The remaining target variable counts look like this:
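Dropping those incomplete entries is a one-liner with pandas. Again, the data here is a hypothetical stand-in:

```python
import numpy as np
import pandas as pd

countries = pd.DataFrame({
    "Region": ["Europe", "Africa", "Asia"],
    "Literacy (%)": [99.0, np.nan, 95.0],
})

# Drop rows with missing values, as was done for the 19 incomplete entries.
countries = countries.dropna()
print(len(countries))  # → 2
```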

The final count for the number of countries per continent

Building The Classifier

To prepare the data frame for Scikit-learn’s Naive Bayes Classifier you must split the data in two: one data frame Y holding the target column and another data frame X holding all of the features. Then convert the Y values to numerical ids with the LabelEncoder() class (use its fit_transform() method on Y). There are six continents with countries on them, so the ids will be the integers 0 through 5. Then create training and test sets with the train_test_split() function.
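A sketch of these two steps, with random placeholder features standing in for the Factbook columns and three continents instead of six:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Placeholder features and continent labels for illustration.
X = np.random.rand(12, 3)
Y = ["Europe", "Africa", "Asia"] * 4

# Encode continent names as integers 0..n_classes-1.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(Y)
print(sorted(set(y_encoded)))  # → [0, 1, 2]

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.25, random_state=42)
print(len(X_train), len(X_test))  # → 9 3
```

The encoder also remembers the mapping, so encoder.inverse_transform() recovers the continent names from predicted ids.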

Because each feature is on a different scale, the means and standard deviations are not directly comparable. Population, for instance, spans a much larger range than the literacy rate. You need a scaler to bring the features to comparable magnitudes: the features data frame X can be transformed with a scaler such as StandardScaler().
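StandardScaler rescales each column to zero mean and unit standard deviation. A small sketch with made-up population and literacy values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales (population vs. literacy rate).
X = np.array([[1_000_000, 0.99],
              [50_000_000, 0.60],
              [8_000_000, 0.85]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0).round(6))  # each column now has mean ~0
print(X_scaled.std(axis=0))            # and unit standard deviation
```

In a real pipeline you would fit the scaler on the training split only and reuse it to transform the test split, to avoid leaking test statistics into training.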

Scikit-learn has the handy GaussianNB() classifier that can be fit to the training data. In Naive Bayes Classifiers the prior probability of a class is estimated as the percentage of training observations that belong to that class. Once you fit the classifier you can call the .class_prior_ attribute to view each class’s prior.
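Fitting the classifier and inspecting the priors looks like this, here with a toy training set of four class-0 and two class-1 samples:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy training set: 4 samples of class 0, 2 samples of class 1.
X_train = np.array([[1.0], [1.1], [0.9], [1.2], [3.0], [3.1]])
y_train = np.array([0, 0, 0, 0, 1, 1])

clf = GaussianNB()
clf.fit(X_train, y_train)

# class_prior_ holds each class's share of the training data.
print(clf.class_prior_)  # → [0.66666667 0.33333333]
```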

Finally we are ready to use the classifier. Call the classifier’s .predict() method on the held-out test features. Scikit-learn also offers several classification metrics, like accuracy: you can score the classifier with the accuracy_score(y_test, y_pred) function. The code is listed below and on my GitHub profile here.

The code for the continent Naive Bayes Classifier
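In case the embedded gist does not render, here is a minimal end-to-end sketch of the same pipeline. The features and labels below are synthetic stand-ins for the cleaned Factbook frame, so the printed score will not match the 0.66 reported for the real data:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Synthetic stand-in data: 3 "continents", 20 countries each, 4 features
# whose means shift by class so the problem is learnable.
X = rng.normal(size=(60, 4)) + np.repeat(np.arange(3), 20)[:, None]
Y = np.repeat(["Africa", "Asia", "Europe"], 20)

y = LabelEncoder().fit_transform(Y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

scaler = StandardScaler().fit(X_train)          # fit on training data only
clf = GaussianNB().fit(scaler.transform(X_train), y_train)

y_pred = clf.predict(scaler.transform(X_test))
acc = accuracy_score(y_test, y_pred)
print(acc)
```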

The accuracy score was only 0.66 for my classifier, which suggests that economic and demographic data alone are not enough to make solid predictions with the limited data that I had.

Conclusion

Although my Naive Bayes Classifier did not perform well, building it taught me the principles of this powerful classification model. Scikit-learn has the tools to simplify the implementation and scoring of these models, and the same approach extends to more advanced topics like NLP, which is what the past three blogs have been building towards.
