A 15-minute Primer to Machine Learning

Aswin Mohan
8 min read · Dec 18, 2016

No one needs a Ph.D. to teach a machine to learn, just a heart to teach.

Hey there, future ML engineer who just stared at a girl with photoshopped circuits on her skin.

Welcome to this short introduction to Machine Learning, where we will work from top to bottom on a classic ML problem and learn something along the way.

What is Machine Learning?

Machine Learning is all the buzz now: ML here, ML there. Everyone seems to be using it: Google, Amazon, Tesla, Medium. Yeah, you get the point.

Even Google search says everyone is interested in Machine Learning.

Machine Learning is a method of data analysis where “computers are given the ability to learn without being explicitly programmed” (Arthur Samuel, 1959). The machine is made to look at large sets of training data, and it figures out on its own how everything relates to each other.

Let’s say your dad owns a big cucumber farm somewhere in Japan. You have a lot of cucumbers ripe for harvest, and boy, you are pumped. After a day of work you have harvested a gazillion cucumbers. But there is a catch: not all cucumbers are the same, and you have to sort them by hand into 9 different classes.

Well, being a programmer, you think about easing the pressure on your mother, who has to sort them out by hand. You are now left with two options.

  • Hard-code an image recognition algorithm to classify each cucumber into the different classes, which could span thousands of lines plus another 2000 to handle edge cases, and still be shi**y, OR
  • Train a Machine Learning model by showing it thousands of images of cucumbers, each labelled with the class it belongs to, and then use this model for future predictions.

I don’t know about you, but someone had a dad with a cucumber farm in Japan and the exact same problem, and he went with the Machine Learning model, which is still sorting out the cucumbers. His name is Makoto and you can read about him here.

Ok, I’m in

Good to hear that, let’s get started on some models.

Things you should know

Remember the cheesy caption that said a heart to teach? Sadly, that is not the only requirement.

  • Beginner-level Python 3 and intermediate-level programming experience
  • Maths up to the level of knowing that mean and median are two different things
  • That’s all, really. Let’s get started.

Tools of Trade

The reason ML is so much simpler now than it was a decade ago rests solely on the fact that there are some awesome libraries that do the heavy lifting for us.

  • NumPy : Powerful fast arrays
  • SciPy : Python based Ecosystem for Scientific Computing
  • Matplotlib : Plot Graphs without a Sweat
  • Pandas : High Performance Easy to use Dataframes
  • Scikit-Learn : The meat of the bunch, with the algorithms and everything else

Step 0: Setting Things Up

Yeah, we are already here, so let’s start up the console and set up everything.

Run these commands to install all our dependencies:

pip install numpy
pip install scipy
pip install matplotlib
pip install pandas
pip install scikit-learn

Or set up Anaconda; you can read about this here.

That’s it, let’s get crunching.

Step 1.1: Get the Data and Observe

Machine Learning is all about data; he who has the most is the best (remember free photo storage from Google?).

For this tutorial we are going to use a classic dataset, more like the hello-world of Machine Learning: the Iris dataset.

The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other. (Wikipedia)

The dataset is just a table containing the sepal width and length and the petal width and length in cm, along with the species, out of three, to which the flower belongs. It has 150 samples. The dataset is embedded below.

https://www.youtube.com/watch?v=RSKhj2BZQBg

Ok, time to learn some terminology.

  • Each species has its own characteristic petal and sepal lengths and widths, so which species a given sample belongs to depends on these parameters. These parameters are called Features.
  • For each set of features there is a corresponding species name, which we call the Label.

P.S.: Real-world data won’t be this sweet and you would have to deal with a lot of holes, but since we are just starting out, let’s go with a prebuilt dataset.

Step 1.2: Let’s really get the Data

Stretch your fingers everyone, we’re gonna write some code.

Since Iris is such a widely used dataset, it is built into Scikit-Learn.

Run this, and if any errors occur, check that everything is set up correctly.
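The snippet that originally sat here did not survive, so here is a minimal sketch of what loading the data could look like. It assumes scikit-learn's `load_iris` helper and the variable name `iris`; the author's own code may have differed.

```python
# Minimal sketch: load the Iris dataset that ships with scikit-learn.
from sklearn.datasets import load_iris

# load_iris() returns an object bundling the measurements, labels and names.
iris = load_iris()

# If this prints without an error, the setup from Step 0 is working.
print(type(iris))
```

If the import fails, the usual culprit is a missing or half-installed scikit-learn.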

Now let’s explore the dataset.
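A sketch of the kind of code that produces the output that follows, assuming the dataset was loaded with scikit-learn's `load_iris` into a variable named `iris` (the name is an assumption):

```python
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.feature_names)  # names of the four measured features
print(iris.target)         # integer label for each of the 150 samples
print(iris.target_names)   # species name behind each integer label
```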

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
['setosa' 'versicolor' 'virginica']

So the data target has three values, 0, 1 and 2, which correspond to ['setosa' 'versicolor' 'virginica'].

When working with data it’s always good to know its dimensions.
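Both the features and the labels are NumPy arrays, so their `shape` attribute gives the dimensions. A short sketch, again assuming the data lives in a variable called `iris`:

```python
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)    # (rows, columns) of the feature table
print(iris.target.shape)  # one label per row
```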

(150, 4)
(150,)

This means that the data is a two-dimensional array of 150 rows x 4 columns, and the target is a one-dimensional array of 150 entries which holds the label for the corresponding features.

Finally, let’s separate the Iris dataset into features and labels.
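A minimal sketch of that split, using the conventional names `X` for the features and `y` for the labels:

```python
from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data    # features: one row of four measurements per flower
y = iris.target  # labels: one integer (0, 1 or 2) per flower

# Every row of features must have exactly one label.
print(len(X) == len(y))
```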

Step 2: The Learning Part of Machine Learning

So now we have the data tucked away neatly in X and y, and it’s time for the exciting part.

Before the Exciting Part, aka Theory

Machine Learning falls mainly into three categories.

Supervised Learning

In Supervised Learning the model is trained on features that correspond to a specific label. That is, for a set of inputs there is a definite output. The model then works out the relation between the input variables and the output variable.

For the Iris dataset, each set of features comes with a number that stands for the species (0, 1, 2), and during training the model figures out the relation between the input features and the output number. It doesn’t care whether the numbers stand for the species of Iris flowers or the number of teeth; it just finds the relation between them.

In the simplest (linear) case, for given inputs x, y, z (three features) and output o, it finds a, b and c. So during prediction you just have to plug in the new x, y, z and o will come out happily.

ax + by + cz = o
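To make the equation concrete, here is a small sketch that recovers a, b and c from examples using least squares. The numbers are made up for illustration, and least squares is just one way a model could find the coefficients; it is not what the classifier later in this post does.

```python
import numpy as np

# Hypothetical training data: each row holds one sample's (x, y, z) inputs.
inputs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [2.0, 1.0, 3.0],
])

# The "hidden" relation used to generate the outputs: o = 2x - y + 0.5z.
true_coeffs = np.array([2.0, -1.0, 0.5])
outputs = inputs @ true_coeffs

# "Training": find a, b, c that best explain the outputs.
coeffs, *_ = np.linalg.lstsq(inputs, outputs, rcond=None)
a, b, c = coeffs

# "Prediction": plug new x, y, z into ax + by + cz.
new_sample = np.array([3.0, 2.0, 1.0])
prediction = new_sample @ coeffs
print(a, b, c, prediction)
```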

It can be further divided into

  • Classification : Where the algorithm figures out which species the given features correspond to
  • Regression : Where the algorithm predicts a real value, given some features, e.g. predicting stock prices from previous prices

Unsupervised Learning

When the model is trained using only the input data, with no corresponding output data, there are mainly two things it can do:

Clustering : You group together similar items to find patterns in the given data

Association : You find specific relations between different groups of data within the same dataset
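As a quick taste of clustering, the sketch below groups the Iris measurements without ever looking at the species labels. It uses scikit-learn's `KMeans`, which is one clustering algorithm among many; choosing 3 clusters is an assumption made because we happen to know there are three species.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Only the features are used; the labels are deliberately ignored.
X = load_iris().data

# Ask for 3 groups; random_state fixes the run so results are repeatable.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Each flower now carries a cluster number (0, 1 or 2) found from the data alone.
print(cluster_ids[:10])
```

Note that the cluster numbers are arbitrary: cluster 0 need not line up with species 0.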

Reinforcement Learning

Ever think about how babies learn to walk? No, you only think of yourself.

In reinforcement learning there is a clearly defined goal for the machine, and the machine takes a step which it thinks will bring it closer to accomplishing its goal. Then, through its feedback loop, it determines whether that step really had a positive impact. If it did, the step is iterated upon.
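A toy sketch of that feedback loop, using an epsilon-greedy strategy on a made-up problem: three possible actions, each paying off with a hidden probability. The probabilities, the exploration rate and the number of rounds are all assumptions picked for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical environment: three actions with hidden success probabilities.
success_probability = [0.2, 0.5, 0.8]

estimates = [0.0, 0.0, 0.0]  # the agent's current guess of each action's value
pulls = [0, 0, 0]            # how often each action has been tried
epsilon = 0.1                # fraction of steps spent exploring at random

for _ in range(2000):
    if random.random() < epsilon:
        action = random.randrange(3)  # explore: try something random
    else:
        action = max(range(3), key=lambda a: estimates[a])  # exploit the best guess

    # Feedback: the environment says whether the step paid off.
    reward = 1.0 if random.random() < success_probability[action] else 0.0

    # Nudge the estimate for this action towards the observed reward (running mean).
    pulls[action] += 1
    estimates[action] += (reward - estimates[action]) / pulls[action]

best_action = max(range(3), key=lambda a: estimates[a])
print(best_action, [round(v, 2) for v in estimates])
```

Steps that pay off get repeated more often, which is the "iterated upon" part of the description.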

So …

Considering our Problem

  • We want to find out which species a flower with given inputs belongs to, and we have labelled examples to learn from, so it is a Supervised Learning problem.
  • We don’t want to predict a real value; we only need to classify the given input into one of the three species, hence we are looking at a Classification problem.

So for our Iris set we are going to use a simple yet powerful algorithm called the KNeighborsClassifier.

A Little about KNeighborsClassifier

https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/531px-KnnClassification.svg.png

Consider the squares and the triangles, which represent two species from the Iris dataset plotted on a graph where the X and Y axes correspond to two different features. Consider a circle, which corresponds to the test features, plotted on the same graph (obviously).

K-nearest neighbours works on the principle of Euclidean distance. The distance from the circle to every other point is calculated. Then the distances are sorted in ascending order. A parameter n_neighbors is passed to the algorithm. It takes the first n_neighbors points from the sorted distances and classifies the test data as a member of the group that holds the majority among them.
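The steps above can be sketched in a few lines of plain Python. This is an illustration of the idea, not scikit-learn's actual implementation; the function name `knn_predict` and the sample points are made up.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_predict(train_points, train_labels, query, n_neighbors=3):
    """Classify `query` by majority vote among its nearest training points."""
    # 1. Distance from the query (the circle) to every training point.
    distances = [
        (dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # 2. Sort the distances in ascending order.
    distances.sort(key=lambda pair: pair[0])
    # 3. Keep the labels of the first n_neighbors points.
    nearest_labels = [label for _, label in distances[:n_neighbors]]
    # 4. The most common label among them wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2D points: squares in one corner, triangles in the other.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (5.0, 5.0), (5.5, 4.5), (6.0, 5.5)]
labels = ["square", "square", "square", "triangle", "triangle", "triangle"]

circle = (2.0, 2.0)
prediction = knn_predict(points, labels, circle, n_neighbors=3)
print(prediction)
```

An odd n_neighbors is a common choice with two classes, since it avoids tied votes.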

Let’s Classify the Data

Classification of the data is a multistage process which involves

  • Import the algorithm that you intend to use
  • Instantiate the algorithm
  • Train the algorithm
  • Predict using the algorithm

Here’s the code
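The original snippet did not survive here, so the following is a sketch of the four steps rather than the author's exact code. The value `n_neighbors=5` and the sample measurements are assumptions; the sample was picked to look like a typical setosa so that the prediction matches the output shown next.

```python
# 1. Import the algorithm (and the dataset).
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# 2. Instantiate the algorithm (n_neighbors=5 is an assumed setting).
knn = KNeighborsClassifier(n_neighbors=5)

# 3. Train the algorithm on the features and labels.
knn.fit(X, y)

# 4. Predict the species of a new flower.
# Hypothetical sample: sepal length, sepal width, petal length, petal width in cm.
sample = [[5.0, 3.4, 1.5, 0.2]]
prediction = knn.predict(sample)
print(prediction)
```

To see the species name instead of the number, look the prediction up with `iris.target_names[prediction]`.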

[0]

Hurray, you have just taught a machine to understand the difference between setosa and virginica. You should feel proud.

So that’s all guys

Or is it?

By now I think you have a pretty good basic gist of how Machine Learning is done. We have only thought about scratching the surface.

If Machine Learning is the thing for you, welcome aboard; else there are other pretty sexy things going around.

P.S.: Don’t kill me for mistakes, just point them out, and be sure to hit that Love button.

Follow me on Twitter: @aswinmohanme and Happy Coding everyone.

Aswin Mohan

There are some things in the world that should be left alone, Code is not one of them, neither is Design, nor is writing but grammar is one of those.