Writing Your First Machine Learning Model from Scratch

Tanishmalekar
Mozilla Firefox Club VIT Vellore
7 min read · Jul 13, 2020

This is a high level introduction to building your first machine learning program from scratch.

What is Machine Learning?

Well, imagine that you’ve obtained a dataset containing characteristics of thousands of human cell samples extracted from patients who were believed to be at risk of developing cancer. Analysis of the original data showed that many of the characteristics differed significantly between benign and malignant samples. You can use the values of these cell characteristics in samples from other patients to give an early indication of whether a new sample might be benign or malignant.

  1. First, you clean your data and select a proper algorithm for building a prediction model.
  2. Then you train the model so that it learns the patterns of benign and malignant cells within the data.
  3. Once the model has been trained by going through the data iteratively, it can be used to predict whether a new or unknown cell sample is benign or malignant with rather high accuracy (a minimal code sketch of these steps appears below).

This is machine learning!
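As a taste of what these three steps look like in code, here is a minimal sketch using scikit-learn. The built-in breast cancer dataset, the decision-tree model, and the train/test split are illustrative choices of mine, not something prescribed above.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load and prepare the data: features of cell samples, labeled benign/malignant
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# 2. Train a model so it learns the patterns in the labeled samples
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# 3. Predict the class of new, unseen samples and check the accuracy
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))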

Machine learning is the subfield of computer science that gives “computers the ability to learn without being explicitly programmed.”

Machine Learning in Daily Life

So, machine learning algorithms, inspired by the human learning process, iteratively learn from data and allow computers to find hidden insights. These models help us with a variety of tasks, such as object recognition, summarization, and recommendation. Machine learning influences society in many ways. Here are some real-life examples.

  1. First, how do you think Netflix and Amazon recommend videos, movies, and TV shows to their users? They use machine learning to produce suggestions that you might enjoy!
  2. How do you think banks make a decision when approving a loan application? They use machine learning to predict the probability of default for each applicant, and then approve or refuse the loan application based on that probability.
  3. Telecommunication companies use their customers’ demographic data to segment them, or to predict whether they will leave the company in the coming month.

There are many other applications of machine learning that we see every day in our daily life, such as chatbots, logging into our phones using face recognition, and even computer games. Each of these uses different machine learning techniques and algorithms.

Python and Machine Learning

Python is a popular and powerful general-purpose programming language that has emerged as the preferred language among data scientists. You can write your machine learning algorithms in Python, and it works very well. The popular libraries are:

NumPy, pandas, SciPy, Matplotlib, and scikit-learn

Supervised Algorithms vs. Unsupervised Algorithms

To supervise means to observe and direct the execution of a task, project, or activity. In supervised learning, we will be supervising a machine learning model so that it learns to separate classes of data, for example producing the regions that distinguish benign from malignant samples.

So, how do we supervise a machine learning model? We do this by teaching the model: we load it with knowledge so that it can predict future instances. We teach the model by training it with data from a labeled dataset, just like the cancer dataset described in the introduction. There are two types of supervised learning techniques: classification and regression.

  1. Classification is the process of predicting a discrete class label, or category.
  2. Regression is the process of predicting a continuous value, as opposed to the categorical value predicted in classification (see the small sketch below).
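To make the distinction concrete, here is a tiny, hedged sketch; the toy data and the two scikit-learn estimators are illustrative choices of mine. A classifier returns a discrete label, while a regressor returns a continuous number.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])        # a single feature
labels = np.array([0, 0, 0, 1, 1, 1])               # discrete class labels
values = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.2])   # continuous targets

clf = LogisticRegression().fit(X, labels)            # classification
reg = LinearRegression().fit(X, values)              # regression

print(clf.predict([[5.5]]))   # a discrete class label, here 1
print(reg.predict([[5.5]]))   # a continuous value, roughly 5.6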

Unsupervised learning is exactly what it sounds like: we do not supervise the model, but let it work on its own to discover information that may not be visible to the human eye. In other words, the unsupervised algorithm trains on the dataset and draws conclusions from unlabeled data. Generally speaking, unsupervised learning involves more difficult algorithms than supervised learning, since we know little to nothing about the data or the outcomes to expect.

Dimensionality reduction, density estimation, market basket analysis, and clustering are the most widely used unsupervised machine learning techniques.

Clustering is considered one of the most popular unsupervised machine learning techniques. It is used for grouping data points or objects that are somehow similar. Cluster analysis has many applications in different domains, such as helping an individual organize or group his or her favorite types of music. Generally speaking, though, clustering is used mostly for discovering structure, summarization, and anomaly detection.
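As a small illustration of clustering, here is a hedged sketch using k-means from scikit-learn; the synthetic blob data and the choice of three clusters are assumptions made only for this example.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# generate unlabeled data: 300 points scattered around 3 hidden centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# the algorithm is given no labels; it discovers the groups on its own
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.labels_[:10])        # cluster assignment of the first 10 points
print(kmeans.cluster_centers_)    # coordinates of the discovered cluster centers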

So, to recap, the biggest difference between supervised and unsupervised learning is that supervised learning deals with labeled data while unsupervised learning deals with unlabeled data.

Implementing your first Machine Learning model from scratch

As we saw above, there are two types of machine learning algorithms, supervised and unsupervised.

Supervised algorithms are further divided into regression and classification. In this section, we will implement the Simple Linear Regression algorithm, which, as the name suggests, is a type of regression algorithm.

There are two ways to implement a machine learning model. The first way is to use libraries such as scikit-learn, which have most of the code required for machine learning pre-written. The second is to write the code ourselves from scratch. You will rarely see a professional data scientist write everything from scratch, because it is inefficient and time-consuming; almost everyone uses libraries such as scikit-learn. But I would suggest that beginners first try to write the code themselves before moving on to the libraries. This gives a clear understanding of how the model actually works, which makes you a better machine learning engineer.
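For comparison, here is roughly what the "library" way looks like with scikit-learn's LinearRegression, applied to the same small dataset used in the from-scratch code later in this article; treat it as a sketch rather than a prescribed solution.

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(-1, 1)  # features must be 2-D
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

model = LinearRegression().fit(x, y)
print("intercept (b_0):", model.intercept_)   # should match the from-scratch b_0
print("slope (b_1):", model.coef_[0])         # should match the from-scratch b_1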

So, in this article I will show you how to implement Simple Linear Regression from scratch in Python.

Simple Linear Regression

Simple linear regression is an approach for predicting a response using a single feature.

It is assumed that the two variables are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x).

Let us consider a dataset where we have a value of the response y for every feature x:

x: 0  1  2  3  4  5  6  7  8  9
y: 1  3  2  5  7  8  8  9  10  12

For generality, we define:

x as the feature vector, i.e. x = [x_1, x_2, …, x_n],

y as the response vector, i.e. y = [y_1, y_2, …, y_n]

for n observations (in the above example, n = 10).

A scatter plot of the above dataset shows the response rising roughly linearly with the feature (the plot is produced by the code at the end of this article).

Now, the task is to find the line which best fits the above scatter plot, so that we can predict the response for any new feature value (i.e., a value of x not present in the dataset).

This line is called the regression line.

The equation of the regression line is represented as:

h(x_i) = b_0 + b_1·x_i

Here,

  • h(x_i) represents the predicted response value for the ith observation.
  • b_0 and b_1 are regression coefficients and represent the y-intercept and the slope of the regression line respectively.

To create our model, we must “learn” or estimate the values of regression coefficients b_0 and b_1. And once we’ve estimated these coefficients, we can use the model to predict responses!

In this article, we are going to use the Least Squares technique.

Now consider:

e_i = y_i − h(x_i) = y_i − (b_0 + b_1·x_i)

Here, e_i is the residual error in the ith observation.
So, our aim is to minimize the total residual error.

We define the squared error or cost function J as:

J(b_0, b_1) = (1 / 2n) · Σ e_i²   (summing over i = 1, …, n; the constant factor does not affect the minimizing values)

and our task is to find the values of b_0 and b_1 for which J(b_0, b_1) is minimum!

Without going into the mathematical details, we present the result here:

b_1 = SS_xy / SS_xx

b_0 = ȳ − b_1·x̄

where x̄ and ȳ are the means of the x and y vectors, SS_xy is the sum of cross-deviations of y and x:

SS_xy = Σ (x_i − x̄)(y_i − ȳ) = Σ x_i·y_i − n·x̄·ȳ

and SS_xx is the sum of squared deviations of x:

SS_xx = Σ (x_i − x̄)² = Σ x_i² − n·x̄²
Given below is the Python implementation of the above technique on our small dataset:

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

The output of the above piece of code is (coefficients shown approximately):

Estimated coefficients:
b_0 = 1.2364
b_1 = 1.1697

And the graph obtained shows the data points as a scatter plot together with the fitted regression line drawn through them.
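With these coefficients in hand, predicting the response for a new feature value takes a single line. This small addition is mine; it assumes the estimate_coef function defined above, and x = 10 is an arbitrary value not present in the dataset.

import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

b_0, b_1 = estimate_coef(x, y)   # coefficients from the code above
x_new = 10                       # a value of x not present in the dataset
y_new = b_0 + b_1 * x_new        # h(x_new) = b_0 + b_1 * x_new
print(y_new)                     # approximately 12.93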

Conclusion

And with that, you have successfully implemented your first machine learning model from scratch.
