Introduction to Data Classification with sklearn

Machine learning is a rapidly advancing field and a big part of the new age of computing, so today we will be discussing one of its core tasks: classification.

WHAT IS CLASSIFICATION?

Data classification can be seen as sorting data into categories and identifying which category new data belongs to. For example, a group of people can be treated as a data set and classified into two categories based on their features; the categories in this case are the genders “Male” and “Female”, which are called labels. New data (more people) can then be identified by a classification model that was trained to categorize people based on similar features. This process is known as data classification.

LET’S BUILD ONE.

We will be building a classifier in Python that can differentiate an iPhone 6 from an iPhone 5 using our defined features and an SVM (support vector machine). You can read up on the algorithm at https://en.wikipedia.org/wiki/Support_vector_machine. We will be going through the various algorithms in another post.

SKLEARN

Sklearn (scikit-learn) is a Python library of simple and efficient tools for data mining and data analysis. It provides ready-made implementations of machine learning algorithms such as k-nearest neighbors, decision trees, and SVM. You can check out the project page at http://scikit-learn.org/stable/

INSTALLING SKLEARN

Requirements:

  • Python (>= 2.6 or >= 3.3),
  • NumPy (>= 1.6.1),
  • SciPy (>= 0.9).

If you already have a working installation of NumPy and SciPy, the easiest way to install sklearn is using pip:

pip install -U scikit-learn
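A quick way to confirm that the installation worked is to import the package from Python and print its version. This is only a sanity check and not one of the tutorial's own steps:

```python
# Sanity check: this import fails if scikit-learn is not installed.
import sklearn

# Print the installed version string.
print(sklearn.__version__)
```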

USING SKLEARN

From our favorite text editor we create a file called classifier.py. You can name it anything with the “.py” extension except sklearn.py, since a file with that name would shadow the library and break the import.

Next we import svm from sklearn:

importing sklearn
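The import described here can be written in a single line. A minimal sketch:

```python
# Pull in scikit-learn's support vector machine module.
from sklearn import svm
```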

Features

Features can be seen as the attributes that make up an object. Together they form the data that describes an entity or object, in this case an iPhone.

data set
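As a sketch, a data set like this can be written as a list of rows, one row per phone. The feature choice (height, width, edges) and every value below are illustrative placeholders assumed for this example. They are not the original data set and not official specifications:

```python
# Hypothetical feature rows: [height_mm, width_mm, edges].
# All values are made-up placeholders for illustration only.
phones = [
    [138, 67, "covered"],
    [137, 66, "covered"],
    [139, 68, "covered"],
    [124, 59, "chamfered"],
    [123, 58, "chamfered"],
    [125, 60, "chamfered"],
]

# Show each row so the structure is easy to inspect.
for row in phones:
    print(row)
```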

Using the Data with sklearn

Note: since sklearn expects numerical feature values, we will replace the edge types with numbers: “covered” will be 1 and “chamfered” will be 0.

updated data set

Now using the updated data with sklearn
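One way to apply that replacement in code is a small lookup table. The rows, the variable names, and the EDGE_CODES mapping are assumptions made for this sketch; only the 1/0 encoding itself comes from the note in the text:

```python
# Hypothetical rows: [height_mm, width_mm, edges] (placeholder values).
phones = [
    [138, 67, "covered"],
    [137, 66, "covered"],
    [139, 68, "covered"],
    [124, 59, "chamfered"],
    [123, 58, "chamfered"],
    [125, 60, "chamfered"],
]

# Encoding from the text: "covered" -> 1, "chamfered" -> 0.
EDGE_CODES = {"covered": 1, "chamfered": 0}

# Replace the edge string in each row with its numeric code.
features = [[height, width, EDGE_CODES[edge]] for height, width, edge in phones]
print(features)
```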

Labels

Labels are identities given to a set of data (features). They can be anything, ranging from names to IDs. In this tutorial, iPhone 6 and iPhone 5 are the labels given to the above data.

NOTE: to stay consistent with the numerical features, we will represent an iPhone 6 as 1 and an iPhone 5 as 0. (sklearn classifiers can also take string labels, but numbers keep this example simple.)

Declaring our labels

Labels
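In code, the labels are simply a list with one entry per feature row, in the same order as the rows. The sketch below assumes six hypothetical rows (three of each phone), which is why it has six entries:

```python
# One label per feature row, in row order.
# Encoding from the text: 1 = iPhone 6, 0 = iPhone 5.
# Six entries because this sketch assumes six placeholder rows.
labels = [1, 1, 1, 0, 0, 0]
print(labels)
```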

Initializing our classifier

classifier
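A minimal sketch of this step, assuming scikit-learn's SVC class. The linear kernel is a choice made for this sketch so the tiny example stays easy to reason about; the tutorial's own call may have used different settings:

```python
from sklearn import svm

# Create an untrained support vector classifier.
# kernel="linear" is an assumption of this sketch, not a requirement.
clf = svm.SVC(kernel="linear")
```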

Fitting our data into the classifier
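Fitting means handing the classifier the feature rows together with their labels so it can learn a boundary between the two categories. The sketch below uses hypothetical placeholder rows and an assumed linear kernel:

```python
from sklearn import svm

# Hypothetical encoded rows: [height_mm, width_mm, edges] (placeholders).
features = [
    [138, 67, 1],
    [137, 66, 1],
    [139, 68, 1],
    [124, 59, 0],
    [123, 58, 0],
    [125, 60, 0],
]
# 1 = iPhone 6, 0 = iPhone 5, one label per row.
labels = [1, 1, 1, 0, 0, 0]

clf = svm.SVC(kernel="linear")  # linear kernel assumed for this sketch

# fit() learns a decision boundary from the rows and their labels.
clf.fit(features, labels)

# After fitting, the classifier knows which labels it has seen.
print(clf.classes_)
```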

Time to test our model :)

I guess this is the part we have all been waiting for: now we are going to test our model on some new data and see how it performs.

If you notice, this is a new data set which I picked at random. What our model is expected to do is predict which category (label) this data falls under, based on what it learned from the previous data.
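Here is how that prediction step could look. The training rows and the new sample are hypothetical placeholders, and the linear kernel is an assumption of this sketch:

```python
from sklearn import svm

# Hypothetical encoded training rows: [height_mm, width_mm, edges].
features = [
    [138, 67, 1],
    [137, 66, 1],
    [139, 68, 1],
    [124, 59, 0],
    [123, 58, 0],
    [125, 60, 0],
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = iPhone 6, 0 = iPhone 5

clf = svm.SVC(kernel="linear")  # linear kernel assumed for this sketch
clf.fit(features, labels)

# A new, unseen sample (hypothetical values, not in the training rows).
new_phone = [[136, 66, 1]]

# predict() takes a list of rows and returns one label per row.
prediction = clf.predict(new_phone)
print(prediction)
```

Note that predict() expects a list of rows even for a single sample, which is why new_phone is wrapped in an extra pair of brackets.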

LET’S RUN:

From our terminal, we run the script with python from the working directory, using whatever name you gave the project file.

Result

Before running the command, take a guess at what you think our model will return.

It returned the value 1, which as we recall was the label for an iPhone 6. Our model classified the new data based on similarities between the new data and the previous data.

Let’s bring it all together
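The whole flow, from raw rows to a prediction, can be sketched in one short script. The data values, the helper names (encode, train), and the linear kernel are all assumptions made for illustration; only the 1/0 encodings for edges and labels come from the text:

```python
from sklearn import svm

# Encoding from the text: "covered" -> 1, "chamfered" -> 0.
EDGE_CODES = {"covered": 1, "chamfered": 0}

# Hypothetical rows: [height_mm, width_mm, edges] (placeholder values).
PHONES = [
    [138, 67, "covered"],
    [137, 66, "covered"],
    [139, 68, "covered"],
    [124, 59, "chamfered"],
    [123, 58, "chamfered"],
    [125, 60, "chamfered"],
]
# 1 = iPhone 6, 0 = iPhone 5, one label per row.
LABELS = [1, 1, 1, 0, 0, 0]


def encode(row):
    """Turn [height, width, edge_name] into an all-numeric row."""
    height, width, edge = row
    return [height, width, EDGE_CODES[edge]]


def train():
    """Fit a linear-kernel SVC (an assumed setting) on the placeholder data."""
    clf = svm.SVC(kernel="linear")
    clf.fit([encode(row) for row in PHONES], LABELS)
    return clf


clf = train()

# Hypothetical unseen sample, encoded the same way as the training rows.
new_phone = [136, 66, "covered"]
result = clf.predict([encode(new_phone)])[0]

# Translate the numeric label back into a readable name.
print("iPhone 6" if result == 1 else "iPhone 5")
```

Keeping the encoding in one helper means new samples are converted exactly the same way as the training rows, which avoids mismatches between the two.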

With these few lines of code you can start building your very own machine learning models to classify different objects.

In the next article we will be working with larger data sets of over a thousand samples, and we will see other algorithms and methods sklearn offers out of the box.