Data Odyssey: Practical Machine Learning

Simeng Han
5 min read · Sep 7, 2019

Artwork by Brendan Hyde

Disclaimer: This workshop is for educational purposes only. No prototype or outcome of any type is intended for commercial use.


Machine learning is an interdisciplinary subject where computer science and statistics intersect.
In today's workshop, we will focus on the practical aspect of machine learning, i.e., coding. In most programs, we give an algorithm an input and it gives us an output.
For a machine learning algorithm, however, we first feed in a lot of data and let the algorithm determine for itself how it should respond to that data. This is the process of determining the parameters of the machine learning model, usually called training.

In supervised machine learning, we feed both the inputs and their labels into the model, and it learns to predict the output for new inputs. Think of supervised learning as learning with a teacher who tells you the right answers.

In unsupervised machine learning, we feed only the inputs, and the model learns to produce its output (for example, a grouping of the data) based solely on them. Think of unsupervised learning as learning without a teacher. Not all real-world data come with labels, hence the need for unsupervised learning.

The second workshop introduces two machine learning algorithms to demonstrate how the field can be used in real-world scenarios.
They are logistic regression, a supervised method for solving classification problems, and k-means clustering, an unsupervised method that groups data points into clusters according to how similar they are.

We will use scikit-learn, a Python package built for implementing machine learning algorithms.
Logistic Regression with scikit-learn
K-Means with scikit-learn

Google Colaboratory

Copy this notebook to your own drive

Go to this link to download the data to be used in this workshop and upload it to Google Colaboratory.

Odyssey Begins

  1. Supervised Odyssey: Supervised Classification
  2. Unsupervised Odyssey: Unsupervised Clustering
  3. End of journey

Packing up: Environment Setup

Import the module for the logistic regression algorithm from sklearn, the plotting package from matplotlib, and the data loading package numpy.
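A minimal sketch of the imports this step describes. The aliases `np` and `plt` are the usual conventions, assumed here rather than taken from the original notebook.

```python
import numpy as np                                   # loading and handling the data
import matplotlib.pyplot as plt                      # plotting
from sklearn.linear_model import LogisticRegression  # the classifier used later on
```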

Use numpy to load the txt file as a data object
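As a rough illustration, loading could look like the sketch below. The file name, the comma delimiter, and the column layout (two exam scores followed by a 0/1 label) are assumptions, and the three rows are made up; an in-memory text buffer stands in for the real file so the snippet runs on its own.

```python
import io
import numpy as np

# Stand-in for the workshop's txt file: three made-up rows of
# "exam score 1, exam score 2, label" (1 = admitted, 0 = rejected).
fake_file = io.StringIO("35.0,78.0,0\n60.0,86.0,1\n79.0,75.0,1\n")

# With the real file you would pass its path instead, for example
# np.loadtxt("data.txt", delimiter=",")  (hypothetical file name).
data = np.loadtxt(fake_file, delimiter=",")

X = data[:, :2]   # features: the two scores
y = data[:, 2]    # target: the label column
```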

Inspect more details
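One way to inspect the loaded array, sketched on a small made-up array that stands in for the real data (the column layout is an assumption):

```python
import numpy as np

# Made-up stand-in for the loaded data: [score 1, score 2, label]
data = np.array([[35.0, 78.0, 0.0],
                 [60.0, 86.0, 1.0],
                 [79.0, 75.0, 1.0],
                 [45.0, 56.0, 0.0]])

print(data.shape)   # (number of rows, number of columns)
print(data.dtype)   # element type
print(data[:3])     # the first three rows

# How many examples does each class have?
labels, counts = np.unique(data[:, 2], return_counts=True)
for label, count in zip(labels, counts):
    print(f"label {int(label)}: {count} rows")
```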

Plot all data
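A possible scatter plot of the data, with one marker per class. The scores and axis labels are illustrative assumptions. In Colab the figure is rendered at the end of the cell; call `plt.show()` when running this as a script.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up stand-in for the loaded data: [score 1, score 2, label]
data = np.array([[35.0, 78.0, 0.0],
                 [60.0, 86.0, 1.0],
                 [79.0, 75.0, 1.0],
                 [45.0, 56.0, 0.0]])
X, y = data[:, :2], data[:, 2]
admitted = y == 1   # boolean mask selecting the positive class

fig, ax = plt.subplots()
ax.scatter(X[admitted, 0], X[admitted, 1], marker="+", label="admitted")
ax.scatter(X[~admitted, 0], X[~admitted, 1], marker="o", label="rejected")
ax.set_xlabel("exam 1 score")
ax.set_ylabel("exam 2 score")
ax.legend()
```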

Logistic regression is used when the dependent variable (target) is categorical, i.e., we want to find the class to which each sample belongs. For example, to classify spam emails, we determine whether an email belongs to the spam class or the normal class.

Algorithm intuition (online demo)

Coding with sklearn

The sigmoid function adds non-linearity to the model: it maps any real number to a value between 0 and 1.
z is the input to the sigmoid function: the dot product of the input X and the weights w.
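The sigmoid can be written in a couple of lines of NumPy. The weights and input below are hypothetical values chosen so that z comes out as exactly 0.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and a single input, for illustration only
w = np.array([0.5, -0.25])
x = np.array([2.0, 4.0])

z = np.dot(x, w)     # 2.0 * 0.5 + 4.0 * (-0.25) = 0.0
print(sigmoid(z))    # 0.5
```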

Logistic regression predictive function
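A sketch of the predictive function, assuming the common convention of predicting class 1 when the sigmoid output reaches 0.5. The intercept `b`, the weights, and the inputs are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    """Return 1 where the estimated probability reaches the threshold, else 0."""
    probabilities = sigmoid(X @ w + b)
    return (probabilities >= threshold).astype(int)

# Hypothetical parameters: the boundary is where score 1 + score 2 = 100
w = np.array([1.0, 1.0])
b = -100.0

X = np.array([[30.0, 40.0],    # z = -30, probability close to 0
              [80.0, 70.0]])   # z = +50, probability close to 1
print(predict(X, w, b))        # [0 1]
```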

To conduct logistic regression with sklearn, we first create a LogisticRegression object
Then we fit the model to the data
The intercept_ and coef_ attributes hold the learned model parameters (the bias and the weights).
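These steps might look as follows. The eight training rows are made up, and `max_iter=1000` is an assumption added so the solver has room to converge on unscaled scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up exam scores (two per student) and admission labels
X = np.array([[30, 40], [35, 50], [45, 35], [50, 45],
              [60, 80], [75, 65], [80, 90], [90, 70]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression(max_iter=1000)   # create the model object
model.fit(X, y)                             # learn the parameters from the data

print(model.intercept_)   # bias term, shape (1,)
print(model.coef_)        # one weight per feature, shape (1, 2)
```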

After obtaining the parameters, let’s visualize the result by plotting the decision boundary.
Students whose scores lie above the decision boundary are predicted to be admitted, while students below it are predicted to be rejected.
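With two features, the decision boundary is the line where intercept + w1 * x1 + w2 * x2 = 0, so it can be drawn by solving for x2. The sketch below uses made-up scores and illustrative axis labels. In Colab the figure is rendered at the end of the cell; call `plt.show()` when running this as a script.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Made-up exam scores and admission labels
X = np.array([[30, 40], [35, 50], [45, 35], [50, 45],
              [60, 80], [75, 65], [80, 90], [90, 70]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
model = LogisticRegression(max_iter=1000).fit(X, y)

b = model.intercept_[0]
w1, w2 = model.coef_[0]

# Two points are enough to draw a straight line
x1_line = np.array([X[:, 0].min(), X[:, 0].max()])
x2_line = -(b + w1 * x1_line) / w2

fig, ax = plt.subplots()
ax.scatter(X[y == 1, 0], X[y == 1, 1], marker="+", label="admitted")
ax.scatter(X[y == 0, 0], X[y == 0, 1], marker="o", label="rejected")
ax.plot(x1_line, x2_line, label="decision boundary")
ax.set_xlabel("exam 1 score")
ax.set_ylabel("exam 2 score")
ax.legend()
```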

Now let’s use our trained logistic regression model to predict if a student will be accepted or rejected.
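Prediction on unseen students could be sketched like this. Both the training rows and the two new students are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up exam scores and admission labels
X = np.array([[30, 40], [35, 50], [45, 35], [50, 45],
              [60, 80], [75, 65], [80, 90], [90, 70]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
model = LogisticRegression(max_iter=1000).fit(X, y)

# Two hypothetical new students
new_students = np.array([[40.0, 45.0],
                         [75.0, 75.0]])

predictions = model.predict(new_students)                # hard 0/1 decisions
probabilities = model.predict_proba(new_students)[:, 1]  # estimated P(admitted)

print(predictions)
print(probabilities)
```

`predict` returns the class labels, while `predict_proba` exposes the sigmoid output behind each decision.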

Import the image reading module from matplotlib and the K-Means module from sklearn

Read the image

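A color image is stored as a three-dimensional array: a two-dimensional grid of pixels, each holding an RGB value.
700 is the number of rows (the image height in pixels).
1000 is the number of columns (the image width in pixels).
3 is the number of color channels: R, G, and B.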
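A sketch of reading the image and checking its shape. The file name is hypothetical, so a random array of the same 700 x 1000 x 3 shape stands in for the picture to keep the snippet runnable on its own.

```python
import numpy as np
import matplotlib.image as mpimg   # matplotlib's image reading module

# With the real picture you would call, for example:
#     image = mpimg.imread("image.jpg")   # hypothetical file name
# Here a random array of the same shape stands in for it.
rng = np.random.default_rng(0)
image = rng.random((700, 1000, 3))   # values in [0, 1)

rows, cols, channels = image.shape
print(image.shape)   # (700, 1000, 3)
```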


K-means is one of the most popular unsupervised clustering algorithms.
The “K” in K-means is the number of clusters.
“Means” refers to the means, or centroids, of the clusters: the average of the points assigned to each one.

Algorithm intuition (online demo)
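For readers who prefer code to a demo, the intuition can be sketched in plain NumPy: repeatedly assign each point to its nearest centroid, then move each centroid to the mean of its points. This is a teaching sketch, not scikit-learn's implementation; the function name, its parameters, and the six sample points are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=10, seed=0):
    """Plain k-means: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, labels

# Two well-separated groups of made-up points
points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```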

Coding with sklearn

Reshape the image into a 2-dimensional array with one row per pixel and one column per color channel.
To run the KMeans algorithm, we first create a scikit-learn KMeans object with the number of clusters set to 20, which is the number of colors we want in the compressed image. We then fit the model to the data and use the centroids to compress the image: each pixel is replaced by the centroid of the cluster it belongs to.
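A sketch of the compression step under a few assumptions: a small random array stands in for the real image so the example runs quickly, and `n_init=10` and `random_state=0` are choices made here for reproducibility, not settings taken from the workshop.

```python
import numpy as np
from sklearn.cluster import KMeans

# Small random stand-in for the image (the real one is 700 x 1000 x 3)
rng = np.random.default_rng(0)
image = rng.random((40, 60, 3))

# One row per pixel, one column per color channel
X = image.reshape(-1, 3)

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0)
kmeans.fit(X)

# Replace every pixel by the centroid of its cluster
X_recovered = kmeans.cluster_centers_[kmeans.labels_]

print(X.shape, X_recovered.shape)
```

After this step the image contains at most 20 distinct colors, one per cluster.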

Reshape X_recovered to have the same dimensions as the original image.
Now we can plot the original and the compressed image side by side.
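The reshape and the side-by-side plot could be sketched as below. As before, a small random array stands in for the real image, and the KMeans settings other than the 20 clusters are assumptions. In Colab the figure is rendered at the end of the cell; call `plt.show()` when running this as a script.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Small random stand-in for the image, compressed to 20 colors
rng = np.random.default_rng(0)
image = rng.random((40, 60, 3))
X = image.reshape(-1, 3)
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
X_recovered = kmeans.cluster_centers_[kmeans.labels_]

# Back to rows x columns x channels
compressed = X_recovered.reshape(image.shape)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].imshow(image)
axes[0].set_title("Original")
axes[1].imshow(compressed)
axes[1].set_title("Compressed (20 colors)")
for ax in axes:
    ax.axis("off")
```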

Congratulations on completing the Machine Learning Odyssey!
In this workshop we have learned how to use machine learning algorithms to solve some simple real-world problems.
In the next workshop, which is also the last of the NTUOSS Data Science workshop series, we will cover deep learning, a subfield of machine learning that is even more interesting!

An approachable book if you want to learn more: A Course in Machine Learning

