A Weekend With sklearn

Exploring machine learning

Saad Elbeleidy

So, this started as a weekend project, but it was so exciting that it ended up taking a lot longer. There’s so much to check out with sklearn and machine learning. This post gives you a summary that will let you learn a lot about sklearn and machine learning in less than a weekend. First we’ll cover some basic concepts, then we’ll go over some examples.

Basic Concepts

Let’s say I give you a bag of items and I want you to tell me what color each item is. The bag has 100 items, and I’ll let you take out 20 or so. You pick an item from the bag; it’s a small cube. You take it out and the cube is red. You pick another item; this time it’s a sphere. You take it out and the sphere is blue. You repeat this for the 20 or so items and find that everything you’ve selected is either a red cube or a blue sphere.

After you’ve gone through the 20 items, I ask you to pick one more, but this time you have to tell me what color it is before you take it out. If it’s a cube, you’ll say it’s probably red, and if it’s a sphere, you’ll say it’s probably blue. This process is called classification: given some features of an item, you predict another, discrete feature. A discrete feature is one with a finite (limited) set of possibilities. Given shape, you could predict color based on your experience.

There’s another concept that’s similar to classification, but instead of predicting discrete values it predicts continuous ones. Continuous features are those with infinite possibilities. So instead of a bag of items, I give you some information about items, such as height and width. Let’s say we do 100 items again and I let you look at 20. The first item has a height of 10cm and a width of 20cm. The next has a height of 15cm and a width of 30cm. Over many items, you start to notice a trend: the width is double the height. If I tell you the next item’s height is 17.4cm, you can calculate its width: 2 × 17.4cm = 34.8cm. Predicting continuous values like this is called regression.

Quick recap — given attributes/properties/features of an item, predicting a discrete value for the item is called classification and predicting a continuous one is called regression.

Classification and regression are both types of supervised learning: I give you the bag of items, and you get to look at them and see the feature you’re predicting (color or width). Then I give you a new item and ask you to predict its color (classification) or width (regression).

Now let’s cover a different example. I give you a bag of items and tell you something like, there are 2 different types of items in this bag. You can take a look at 20 of them and see everything you need to know about them. Then I’ll let you pick another item, and you have to tell me which type it is. We don’t know what to call each type; we can call them type 1 and 2, type A and B, anything we want. The point is, we don’t know what the types really mean. All we know is that these n items are similar to each other and those other m items are similar to each other. One way to identify the types would be to pick up items and notice, for example, their texture or size. If you notice that a lot of objects have rough edges, that could be one type; round-edged objects could be another.

I could also have said that I don’t know how many types are in the bag, and you would have to come up with both the number of groups and how to group them. This concept of grouping items is called clustering, and each group is called a cluster. As described here, this is a form of unsupervised learning: I give you a bag of items, and you have to look at them and determine how they could be grouped without me defining a grouping factor or label.

There’s one more example to cover. I give you a bag of 100 items and ask you to pick out some of them. You can put your hand in the bag and feel all the items before choosing one to take out. For each item you take out, I will either give you or take away some money, and your goal is to get as much money from me as you can. You pick an item and notice that it’s a sphere; you take it out of the bag, and I give you 4 quarters. You pick out another item; it’s a cube, and I take a quarter away from you. Losing a quarter on the cube probably made you think, “Next time I hold a cube I’ll leave it in the bag. If I pick a sphere, I know I’m getting 4 quarters, so I should take it out of the bag.” This concept is called reinforcement learning: based on your actions, you receive a reward or penalty, which you then use to decide your future actions.

The general idea of learning is to create a model from your data that you can then use to make predictions. In other words, you want to gain an understanding of underlying concepts that define how items function and interact based on the information you have. In terms of the above examples a model could be defined as follows:

  1. Supervised learning classification example — Cubes are red, spheres are blue.
  2. Supervised learning regression example — Width of an object is double its height.
  3. Unsupervised clustering example — Rough edged objects are the same and rounded edged objects are the same.
  4. Reinforcement learning example — Spheres are good, cubes are bad; pick the next item accordingly.

Now that’s all great, but what does all this mean and how can we use it? Well, let’s look at some methods that implement these concepts and some code in sklearn.

Note: Reinforcement Learning will not be covered in detail further in this post but it’s important to understand conceptually. Reinforcement Learning isn’t currently supported in sklearn.

sklearn

scikit-learn (or sklearn) is one of the most popular machine learning libraries in Python. In addition to some awesome preprocessing and analysis tools, one of the core parts of sklearn is training a model and using it to predict. This is done in three parts:

  1. Define the model you want to use
  2. Train the model on your data
  3. Predict new values based on training

That’s pretty much the core of it. Now, what does that look like in code?

# Start with any necessary imports
from sklearn.someModule import SomeModel
# 1. Define the model you want to use
magic = SomeModel(parameter_one=1, parameter_two=2)  # set any model-specific parameters
# 2. Train the model
magic.fit(my_properties, my_labels)  # Supervised learning
# OR
magic.fit(my_data)  # Unsupervised learning
# 3. Predict new values
predictions = magic.predict(my_unknowns)

The first thing you might ask is: what is this SomeModel thing? Well, there are a ton of models to pick from, and each model has different parameters that you can customize. We’ll cover a few models here:

Models

There are many models that we could cover for each type of learning problem. For the purpose of this post, I will explain just one method for each example I’ve provided above. You can find many more models on sklearn. The models I will cover are:

  1. Decision Trees
  2. Linear Regression
  3. K-Means Clustering

Decision Trees

In this example, we will discuss Decision Trees as applied to Supervised Classification. Here’s an example of a basic decision tree that you’re probably more familiar with:

xkcd 518, “Flow Charts”: https://xkcd.com/518/

The idea behind decision trees is to follow a flow chart, except that the “action” at the end is deciding on a class for the item being classified.

It’s pretty simple to follow a decision tree, but you’re probably wondering how to create one. The goal is to have the shortest decision tree possible. This can be done by splitting the tree based on the most informative feature/property.

This concept is called splitting on information gain. To simplify, assume each item has 3 different attributes: shape, size, and weight. You can split the items on a specific value of a feature. Some examples:

  1. Is the item a cube?
  2. Is the item large?
  3. Is the item over 100g?

Now, how do we compare these candidate splits? The goal is to split on the feature that creates the purest subsets. When we split on shape, we know that all cubes will be red and all spheres will be blue, so each subset contains only one color. For the other features, we don’t really know what would happen: we could get some reds that are large and some that are small, so those subsets would be less pure.
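
If you’d like to see what “purity” looks like in numbers, here’s a minimal sketch (not sklearn’s actual internals, just an illustration) that computes the entropy of a group of labels. The lower the entropy, the purer the group; splitting on shape produces two groups with entropy 0, which is why it wins.

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a group of labels: 0 for a pure group, 1 for an even 50/50 mix
    total = len(labels)
    probabilities = [count / total for count in Counter(labels).values()]
    return sum(-p * log2(p) for p in probabilities if p < 1)

print(entropy(["red", "red", "red", "red"]))    # 0 -> perfectly pure
print(entropy(["red", "blue", "red", "blue"]))  # 1.0 -> evenly mixed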

Now that’s how the model is built, but you don’t actually have to do much to get sklearn to create the model. You just use the DecisionTreeClassifier.

Here’s what you’d have to do:

# Start with any necessary imports
from sklearn.tree import DecisionTreeClassifier
# 1. Define the model you want to use
magic = DecisionTreeClassifier()
# 2. Train the model
magic.fit(my_properties, my_colors)  # Supervised learning
# 3. Predict new values
predictions = magic.predict(my_unknowns)
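
To make that concrete, here’s a made-up, runnable version of the bag example. The data is invented for illustration, with shape encoded as a single numeric feature (0 for cube, 1 for sphere):

from sklearn.tree import DecisionTreeClassifier

my_properties = [[0], [1], [0], [1]]        # shapes we pulled from the bag
my_colors = ["red", "blue", "red", "blue"]  # the colors we observed
magic = DecisionTreeClassifier()
magic.fit(my_properties, my_colors)
print(magic.predict([[0]]))  # a new cube -> ['red']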

Linear Regression

Linear regression is probably the model you’re most familiar with. In algebra, we often talk about functions of x: we can define one variable, y, in terms of another variable, x, using that function. What linear regression does is try to recover that function given some data of both x and y values. A common way of doing this is to find the line that minimizes the (squared vertical) distance to each of the real points, known as least squares.

In our example, we had widths and heights, and we were trying to determine the width of an object given its height. Applying linear regression in sklearn is quite simple: you just use LinearRegression. Here’s how:

# Start with any necessary imports
from sklearn.linear_model import LinearRegression
# 1. Define the model you want to use
magic = LinearRegression()
# 2. Train the model
magic.fit(my_heights, my_widths)  # Supervised learning
# 3. Predict new values
predictions = magic.predict(my_unknown_heights)
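
One sklearn detail worth knowing: fit expects the features as a 2D array (one row per item, one column per feature), so a plain list of heights needs reshaping. Here’s a made-up example using the numbers from earlier:

import numpy as np
from sklearn.linear_model import LinearRegression

my_heights = np.array([10, 15, 20, 25]).reshape(-1, 1)  # 2D: one column of heights
my_widths = np.array([20, 30, 40, 50])                  # width = 2 * height
magic = LinearRegression()
magic.fit(my_heights, my_widths)
print(magic.predict(np.array([[17.4]])))  # -> [34.8]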

K-Means Clustering

K-Means Clustering is the unsupervised learning clustering method that we will cover. The idea behind k-means clustering is as follows:

  1. You first tell it how many clusters there are (k).
  2. It then selects k starting center points, often at random. These center points don’t have to be actual data points; they can be any points within the range of the data.
  3. For each point in your data, it calculates which center point is closest and assigns the point to that center.
  4. For each center point, it looks at all the points assigned to it and calculates their centroid. The centroid becomes the new center point.
  5. It goes back to step 3 and repeats until the cluster assignments stop changing, as in the sketch below.
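
Here’s a compact sketch of that loop in plain NumPy, just to make the steps concrete (sklearn’s implementation is more careful about initialization and convergence; this kmeans function is purely illustrative):

import numpy as np

def kmeans(points, k, iterations=100):
    # Step 2: pick k random starting centers within the range of the data
    rng = np.random.default_rng(0)
    centers = rng.uniform(points.min(axis=0), points.max(axis=0),
                          size=(k, points.shape[1]))
    for _ in range(iterations):
        # Step 3: assign each point to its nearest center
        distances = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        assignments = distances.argmin(axis=1)
        # Step 4: move each center to the centroid of its assigned points
        new_centers = np.array([points[assignments == i].mean(axis=0)
                                if (assignments == i).any() else centers[i]
                                for i in range(k)])
        # Step 5: stop once the centers (and so the clusters) stop changing
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assignments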

There are some great interactive visualizations online of how k-means clustering works; they’re worth checking out.

In terms of applying this in sklearn, we just use KMeans. Here’s how:

# Start with any necessary imports
from sklearn.cluster import KMeans
# 1. Define the model you want to use
magic = KMeans(n_clusters=2)
# 2. Train the model
magic.fit(my_data)  # Unsupervised learning
magic.cluster_centers_  # This gives you the center of each cluster
# 3. Predict new values
predictions = magic.predict(my_unknowns)
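
As a made-up end-to-end run, here are two obvious blobs of 2D points clustered with k = 2 (note that which blob gets id 0 or 1 is arbitrary):

import numpy as np
from sklearn.cluster import KMeans

my_data = np.array([[1, 1], [1.5, 2], [1, 0.5],
                    [8, 8], [8.5, 9], [9, 8]])  # two invented blobs
magic = KMeans(n_clusters=2, n_init=10)
magic.fit(my_data)
print(magic.cluster_centers_)             # one center near each blob
print(magic.predict([[0.5, 1], [9, 9]]))  # e.g. [0 1]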

Okay, well that’s great. There are all these models that we can use, but how do we know which one’s the best one? Well, we’ll need to explore two concepts, cross validation and scoring.

Cross Validation

Given some data, you want to train your model to learn from it. But what’s even more important is knowing how well the model is doing, so you can figure out how to tune it, or even whether you should be using that model at all. A common technique for this is to split up your data into what is called the training set and the test set. The idea is that you use a portion of your data to train your model, then run that model on the test set and see how well you did. (Strictly speaking, a single split like this is a holdout validation; cross-validation repeats the split several times, but the idea is the same.)

The way this is commonly done in sklearn is by using the train_test_split function. Based on your input, this function will return training and testing subsets of what you provided. Here are a couple examples:

from sklearn.model_selection import train_test_split
# Supervised learning split
properties_training, properties_testing, labels_training, labels_testing = train_test_split(properties, labels)
# Unsupervised learning split
data_training, data_test = train_test_split(data)

Based on the current documentation, the default split is 0.25 for testing. What this means is that it will return 75% of your data for training and 25% of it for testing.
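If you want a different split, you can pass test_size yourself (and random_state if you want the split to be reproducible); both are real train_test_split parameters:

from sklearn.model_selection import train_test_split

# Keep 20% for testing instead of the default 25%
properties_training, properties_testing, labels_training, labels_testing = train_test_split(
    properties, labels, test_size=0.2, random_state=42)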

So now, this is what you want to do:

For Supervised Learning:

# Split the data
properties_training, properties_testing, labels_training, labels_testing = train_test_split(properties, labels)
# 1. Define the model you want to use
magic = SomeModel(parameter_one=1, parameter_two=2)
# 2. Train the model
magic.fit(properties_training, labels_training)
# 3. Predict new values
predictions = magic.predict(properties_testing)
# Compare predictions and labels_testing

For Unsupervised Learning:

# Split the data
data_training, data_test = train_test_split(data)
# 1. Define the model you want to use
magic = SomeModel(parameter_one=1, parameter_two=2)
# 2. Train the model
magic.fit(data_training)
# 3. Predict new values
predictions = magic.predict(data_test)
# Compare predictions and clusters

That sounds great, but how do we compare predictions to our test data or clusters? Well, that’s the next part: scoring.

Scoring

Scoring means getting a quantitative measurement that helps you determine how good a model is and lets you compare models. There are many ways to score a model; in fact, there is a whole section in sklearn on different metrics you can use. Pick a metric from the list and get started! The idea behind most metrics is the same: they take true values and predictions and return a score. Here’s an example:

score = SomeMetric(labels_testing, predictions)

This makes it easy to change the metric you use to compare models, and to use multiple metrics. To compare models, just substitute in the predictions from each model and compare the returned scores.
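
For example, accuracy fits the classification example and mean squared error fits the regression example; both live in sklearn.metrics (the regression variable names here are just illustrative):

from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: the fraction of predictions that match the true labels
score = accuracy_score(labels_testing, predictions)
# Regression: the average squared difference between true and predicted widths
score = mean_squared_error(widths_testing, width_predictions)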

For clustering, there are many ways to score your results; you can find them in sklearn’s documentation. Often, the easiest way to quickly check whether your model is good enough is to plot the data and the predictions.
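
Continuing the made-up K-Means run from earlier, a quick matplotlib scatter plot does the trick for 2D data:

import matplotlib.pyplot as plt

# Color each point by its assigned cluster and mark the centers with an X
plt.scatter(my_data[:, 0], my_data[:, 1], c=magic.labels_)
plt.scatter(magic.cluster_centers_[:, 0], magic.cluster_centers_[:, 1],
            marker="x", s=200)
plt.show()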

Notebook

Now that you hopefully understand this at a conceptual level, there’s a companion Jupyter Notebook that goes through these examples. You can see its outputs or download it from the original gist on GitHub to run it locally.

If you want to run it yourself, check out notebooks.azure.com.

Conclusion

That’s it for now. So far we’ve covered the following:

  1. Different types of learning
  2. One method each for supervised classification, supervised regression, and unsupervised clustering
  3. How to use sklearn to train a model to make predictions
  4. Splitting data into training and testing subsets
  5. Scoring our model

I hope this gave you a good introduction to machine learning and how to work with it using sklearn.

References

Udacity’s Machine Learning Engineer Nanodegree — check it out, it’s awesome!

scikit-learn API documentation

I hope you enjoyed this post and would appreciate your feedback. Let me know what you think!
