Coding Machine Learning Classifiers in 10 minutes with Python & Sklearn

Mohammad Samad
Jun 19, 2020 · 5 min read

Whether you are a beginner in the Machine Learning world or you have some know-how about it, this article will help you learn the practical coding side of ML. Generally Machine Learning has the following types:

  1. Supervised learning
  2. Unsupervised Learning
  3. Reinforcement Learning

In this article we will focus only on “Supervised Learning”, as it is often the starting point in machine learning. Broadly, supervised learning is a mathematical and statistical approach to mapping a set of inputs to an appropriate output, based on example data known as the training dataset. The dataset is always labeled when dealing with supervised learning algorithms. We will be implementing Classification with Naive Bayes and Logistic Regression (both supervised learning techniques) in Python using a machine learning library called “Scikit Learn”.

Prerequisites :
The only prior requirements are the following installations:
1. Python (3.x) — the version used in this tutorial is 3.7.1
2. Any good editor such as VS Code, or Jupyter Notebook
3. Optional — an Anaconda environment or a Python virtual-env

Installation :
The only installation we actually need is “Scikit Learn”. Open the command prompt on Windows, or a terminal on Ubuntu or macOS, and run the following command:

pip install -U scikit-learn

or if you are using anaconda then:

conda install -c intel scikit-learn

Very good! Now we are all set to actually start coding.


The code is in 5 steps:

Step 1: Import Sklearn & Dataset from Sklearn examples:

import sklearn
from sklearn.datasets import load_breast_cancer

We are using the breast cancer dataset, which comes built in with the sklearn library. It contains 569 samples with 30 numeric features each.

Step 2: Loading the Dataset & Extracting Features, Labels, Feature Names & Label Names:

data = load_breast_cancer()
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

In this step, we loaded the dataset into a variable called ‘data’ and then extracted the actual data (rows), the feature names, and the output or class label names. In this dataset the label names are malignant and benign, which are represented as 0 and 1 respectively.
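As a quick sanity check (not part of the original steps), you can inspect the shapes of what we just extracted to confirm the dataset loaded correctly:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
features = data['data']
labels = data['target']

# 569 samples (rows), 30 numeric features (columns) per sample
print(features.shape)               # -> (569, 30)
# One label per sample
print(labels.shape)                 # -> (569,)
# Label 0 corresponds to 'malignant', label 1 to 'benign'
print(list(data['target_names']))   # -> ['malignant', 'benign']
```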

Step 3: Splitting the Dataset into training and testing sets:

from sklearn.model_selection import train_test_split
train, test, train_labels, test_labels = train_test_split(features,
                                                          labels,
                                                          test_size=0.33,
                                                          random_state=42)

We first imported a function called train_test_split, a built-in function that comes with the sklearn library. It splits the dataset into training and testing sets, so that we can test the accuracy of our final trained model on data it has never seen. This code snippet does a 66.66% split for the training set and a 33.33% split for the testing set; this distribution can be altered by changing the test_size parameter. The function also takes the features and labels that we extracted in Step 2 as parameters. Lastly, random_state=42 seeds the random shuffling, so the same “random” split is reproduced every time the code runs (42 is an arbitrary choice and can be changed).
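To see the 66.66% / 33.33% split concretely, this small sketch prints the size of each side (the exact counts below follow from sklearn rounding the test fraction up):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
features, labels = data['data'], data['target']

# test_size=0.33 reserves a third of the 569 rows for evaluation;
# random_state=42 seeds the shuffle so the split is reproducible
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42)

print(len(train), len(test))  # -> 381 188
```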

Step 4: Making a Naive Bayes (NB) Classifier & train it:

from sklearn.naive_bayes import GaussianNB
GNB = GaussianNB()
model =, train_labels)

A Naive Bayes classifier is very good at binary classification and is one of the simplest to learn. This algorithm is available in sklearn under the name GaussianNB. We have initialized an object called GNB with GaussianNB. After the object is initialized, we train it on our training set. This is done by calling the .fit() method on our GNB object; the arguments it takes are the training data and the training labels that we obtained in Step 3.
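If you are curious what .fit() actually learned, GaussianNB stores one mean per (class, feature) pair plus a prior for each class. A small sketch (reusing the split from Step 3):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42)

model = GaussianNB().fit(train, train_labels)

# One fitted mean per (class, feature) pair: 2 classes x 30 features
print(model.theta_.shape)    # -> (2, 30)
# Class priors estimated from the training labels; they sum to 1
print(model.class_prior_)
```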

Step 5: Getting Predictions from the Model & calculating its accuracy:

preds = GNB.predict(test)

The .predict() method is used to predict the output variable or class variable on the testing set. The argument it needs is the testing set that we obtained in Step 3. The preds variable holds the prediction results on the testing set; it should be noted that the output is in 0s and 1s (0 for malignant, 1 for benign). The output of printing preds would be something similar to this:

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0
0 1 1]

Now it’s time to get the accuracy or score of this model:

from sklearn.metrics import accuracy_score
score_GNB = accuracy_score(test_labels, preds)
print(score_GNB)

We import another built-in method from the sklearn library. The accuracy_score method compares the predicted labels with the true labels of the testing set, so the arguments it needs are the testing set labels and the predicted labels. The score I obtained from this model was about 94%, which is very good. With these 5 steps, you have just implemented a machine learning classifier. The next steps are optional: if you want to train a Logistic Regression model, read ahead; if you are already happy with your work, you can stop here.
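Under the hood, accuracy_score is simply the fraction of predictions that match the true labels. This optional sketch (reusing the split from Step 3) verifies that by computing the accuracy by hand:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42)

preds = GaussianNB().fit(train, train_labels).predict(test)

# accuracy = (number of correct predictions) / (number of test samples)
manual = np.mean(preds == test_labels)
print(manual, accuracy_score(test_labels, preds))  # the two values agree
```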

Step 6: (Optional) Making a Logistic Regression Model and training it:

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(solver='lbfgs', max_iter=20000)
log_model =, train_labels)

This is very similar to what we did above with the GNB classifier. The LogisticRegression() initialization takes a few arguments to work properly; without them you may get convergence warnings. For the sake of simplicity, just accept that we gave it these 2 parameters. We will have a detailed separate article on Logistic Regression and its theory as well.

Step 7: (Optional) Getting Predictions from Logistic Regression & accuracy:

log_preds = log_model.predict(test)
print(accuracy_score(test_labels, log_preds))

After predicting, we compute the accuracy of the Logistic Regression model in the same way, and obtain approximately 97%. This is an improvement on our previous result. It is always worth trying different kinds of models, so that we can pick the most accurate one for use or experimentation.

Thank you very much for reading this article, kindly give your feedback and I hope you learnt something here.

The Startup

Get smarter at building your thing. Join The Startup’s +799K followers.

Mohammad Samad

Written by

I am an enthusiastic software engineer, who enjoys working with Machine Learning algorithms and playing around with the latest tech trends.