Demystifying the Mystical: My Foray into the World of AI

Week 5: Introduction to Machine Learning

Mubbysani
ai6-ilorin
8 min read · Feb 13, 2020


Who says AI is not demanding and daunting? Fine, it is. And the purpose of AI Saturdays across the globe is to demystify it and make it easy. I once wrote an article on how AI started and how it evolved in the contemporary world to become a prerequisite for many things, but that only covers the grand framework: a general statement about the ability of computers to mimic human intelligence and learn. AI has a subset that is its core, and no one learns AI without learning that subset, hence week 5 of AI Saturdays Ilorin served as the medium that linked the geeks to ML. ML? Mechanical Layers? Mathematical Logarithms? No dude, no! ML: Machine Learning. Tom Mitchell put the meaning of machine learning this way:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

But to a layman, that sounds somewhat abstruse and incomprehensible. In simpler language, machine learning is the scientific study of algorithms and statistical models that machines use to perform a specific task without explicit instructions, relying on patterns and inference instead. ML uses training data to make predictions or decisions without being explicitly programmed to perform the task (this correlates with the definition given by Arthur Samuel). Take a spam filter, for instance: classifying emails is the task T, a pile of emails already labelled spam or not spam is the experience E, and the fraction it classifies correctly is the performance P. In even simpler terms, machine learning is the task of teaching machines to learn from a set of instructions known as algorithms. But what is an algorithm? Read on.

An algorithm is like the soul that gives machine learning life. It is a finite sequence of well-defined, computer-implementable instructions, typically used to solve a class of problems or to perform a computation. An algorithm describes for a computer how to do something and can be expressed in many forms, for example, plain English. But computers don't understand the Queen's tongue, so an algorithm written in English is useless until it is converted into a language computers understand.

Based on the tasks machine learning performs, it can be classified into three categories, namely: supervised learning, unsupervised learning and reinforcement learning.

Supervised learning, as the name implies, is the task of learning to map an input to an already specified output. It is called supervised learning because the process of an algorithm learning from the training data set can be likened to a trainer or instructor supervising the learning process. There is always an input X and a corresponding output Y, and all the algorithm does is learn how to map X to Y.

Supervised learning tasks can further be grouped into regression and classification tasks. There are also algorithms popular for supervised learning, such as linear regression for regression problems, random forest for classification and regression problems, and support vector machines for classification problems.
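To make that concrete, here is a minimal sketch of those three algorithms, assuming scikit-learn is installed and using its built-in toy datasets (this is just an illustration, not code from the class):

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Linear regression on a regression problem (predicting a number).
X_reg, y_reg = load_diabetes(return_X_y=True)
print(LinearRegression().fit(X_reg, y_reg).score(X_reg, y_reg))

# Random forest and support vector machine on a classification problem
# (predicting a category).
X_clf, y_clf = load_iris(return_X_y=True)
print(RandomForestClassifier().fit(X_clf, y_clf).score(X_clf, y_clf))
print(SVC().fit(X_clf, y_clf).score(X_clf, y_clf))
```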

Life, they say, is a gradual process, and so is learning AI, so we decided to take it a step at a time by starting with linear regression.

What is Linear Regression?

Regression models a target prediction value based on independent variables. It is mostly used for finding the relationship between variables and for making predictions. Linear regression is an algorithm that performs a regression task: it predicts a dependent variable value (y) based on a given independent variable (x). This regression technique finds a linear relationship between x (input) and y (output), hence the name linear regression.

Linear Regression

In the example below, the training set of a regression problem is shown. The training set is of housing prices based on the size of each house in square feet. Each size is the input variable X (also known as a feature), with a corresponding output variable or target Y. The task of the algorithm is to find the linear relationship between the two variables and make predictions based on it.

Training set for a regression problem
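Since the plot itself may not render here, a tiny hypothetical version of such a training set can be written out directly (the sizes and prices are made-up numbers, purely for illustration):

```python
import numpy as np

# Hypothetical training set: house sizes (input X, in square feet)
# and their prices (target Y, in thousands of dollars).
sizes = np.array([1000, 1500, 2000, 2500], dtype=float)
prices = np.array([150, 200, 260, 300], dtype=float)
```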

In the first image above, a line is drawn through the data on the graph. When training the model, it fits the best line to predict the value of y for a given value of x. The model gets the best regression fit line by finding the best θ0 and θ1 values: θ0 is the intercept, while θ1 is the slope with respect to X.

Once the best θ0 and θ1 values are found, we can get the best fit line, so when we finally use our model for prediction, it will predict the value of y for an input value of x. To express the relationship between the two thetas and the input so as to get the best line of fit, a hypothesis is required. Hypothesis? You know that a hypothesis is a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation. But in machine learning, a hypothesis is simply a function. Take it this way: it is a candidate model that approximates a target function for mapping examples of inputs to outputs. A learning problem is realizable if the hypothesis space contains the true function; without a correct hypothesis (h), the learning problem is a sham.

A hypothesis for the training set
representing h
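For univariate linear regression the hypothesis is just the straight-line function h(x) = θ0 + θ1x. A minimal sketch of it in Python (the numbers are made up for illustration):

```python
def hypothesis(x, theta0, theta1):
    """Predict y for an input x using the current parameters."""
    return theta0 + theta1 * x

# Hypothetical example: theta0 = 50, theta1 = 0.1, a 1500 sq-ft house.
print(hypothesis(1500, 50.0, 0.1))  # -> 200.0
```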

After an efficient hypothesis has been put in place, it becomes necessary to update the values of θ0 and θ1 to reach the values that minimize the error between the predicted value of Y and the true value. Hence, the cost function.

In machine learning, cost functions are used to estimate how badly models are performing. A cost function is simply a measure of how wrong the model is in terms of its ability to estimate the relationship between X and y. This is typically expressed as a difference or distance between the predicted value and the actual value. There is no single way of estimating the cost; it depends on the type of cost function being used. The estimate is compared against the ground truth, the known values of y. Cost functions come in different types and can be written mathematically. The one used in this particular example is called the Mean Squared Error.

Mathematical denotation of Cost function
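As a sketch, the Mean Squared Error can be computed like this, assuming NumPy (halving the squared error is a common convention that simplifies the derivative later on):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Mean squared error between h(x) and the true values y."""
    predictions = theta0 + theta1 * x   # h(x) for every training example
    errors = predictions - y            # how far off each prediction is
    return np.mean(errors ** 2) / 2     # the 1/2 factor is a common convention

# With theta0 = theta1 = 0 on the hypothetical housing numbers above,
# the cost is large because every prediction is 0.
x = np.array([1000, 1500, 2000, 2500], dtype=float)
y = np.array([150, 200, 260, 300], dtype=float)
print(compute_cost(x, y, 0.0, 0.0))
```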

Optimization

Optimization refers to the task of minimizing/maximizing an objective function f(x) parameterized by x.

Optimization algorithms (in the case of minimization) have one of the following goals:

  • Find the global minimum of the objective function. This is feasible if the objective function is convex, i.e. any local minimum is a global minimum.
  • Find the lowest possible value of the objective function within its neighbourhood. That's usually the goal if the objective function is not convex, as is the case in most deep learning problems.

Gradient descent is the most common optimization algorithm in machine learning and deep learning. Models learn by minimizing a cost function, and the cost function is minimized using gradient descent. It enables a model to learn the gradient, or direction, that the model should follow in order to reduce errors (the differences between the actual y and the predicted y).

gradient descent
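A minimal sketch of the gradient descent update for the two parameters, assuming the mean-squared-error cost above (α is the learning rate discussed next):

```python
import numpy as np

def gradient_descent(x, y, theta0, theta1, alpha, iterations):
    """Repeatedly move theta0 and theta1 against the gradient of the cost."""
    m = len(x)
    for _ in range(iterations):
        errors = (theta0 + theta1 * x) - y   # h(x) - y for every example
        grad0 = np.sum(errors) / m           # derivative w.r.t. theta0
        grad1 = np.sum(errors * x) / m       # derivative w.r.t. theta1
        theta0 -= alpha * grad0              # simultaneous update,
        theta1 -= alpha * grad1              # scaled by the learning rate
    return theta0, theta1
```

Note that with raw house sizes in the thousands, α has to be very small (or the feature scaled) for the updates not to overshoot; the end-to-end sketch near the end of this post scales the feature for exactly that reason.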

What is the difference between Parameter and Hyperparameter?

We have been working with parameters right from the beginning: θ0 and θ1 are parameters, not foreign to the model in question. A parameter is a configuration variable that is internal to the model and whose value can be estimated from data. In fact, the parameters are required by the model when making predictions. On the other hand, there is the hyperparameter, a foreigner. Unlike a parameter, a model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from the data. Hyperparameters are often used in processes to help estimate model parameters. The learning rate α is a perfect example of a hyperparameter.

The learning rate is one such hyperparameter; it defines the size of the adjustment made to the weights of our network with respect to the loss gradient. In simple language, we can think of it as how quickly our network replaces the concepts it has learned up until now with new ones. The diagram below demonstrates the different scenarios one can fall into when deciding on the learning rate.

Learning rate
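A toy illustration (made-up numbers) of how the learning rate scales each update: too small and the parameter barely moves, too large and it can overshoot the minimum entirely:

```python
gradient = 4.0   # pretend slope of the cost at the current theta
theta = 10.0

for alpha in (0.001, 0.1, 5.0):   # small, moderate and large learning rates
    step = alpha * gradient
    print(f"alpha={alpha}: theta moves from {theta} to {theta - step}")
```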

But the computer doesn't understand what α and θ are. In the real sense, the hypothesis, the cost function and all the other formulae shown above are nothing but gibberish to the computer; it doesn't understand any of it. "Are you kidding?" I am not kidding. One does not simply open the Python terminal and feed it h(x) = θ0 + θ1x. No, it's not done. The computer will be like "are you kidding?" But all the mathematical jargon shown above can be spoken to the machine in the language it understands, i.e. the language of code. The images below demonstrate what it looks like to build a model with lines of code, providing the necessary parameters and functions.

hypothesis and cost function
gradient descent represented with lines of code
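In case those images are not available here, below is a minimal end-to-end sketch that ties the hypothesis, the cost function and gradient descent together on the hypothetical housing numbers used earlier (the feature is scaled so a simple learning rate behaves well; this is an illustration, not the exact code from the class):

```python
import numpy as np

# Hypothetical training set: house sizes (sq ft) and prices (in $1000s).
x = np.array([1000, 1500, 2000, 2500], dtype=float)
y = np.array([150, 200, 260, 300], dtype=float)

# Scale the feature so a single learning rate works well for both thetas.
x_scaled = (x - x.mean()) / x.std()

theta0, theta1 = 0.0, 0.0   # parameters, learned from the data
alpha = 0.1                 # learning rate, a hyperparameter we choose
m = len(x_scaled)

for _ in range(1000):
    predictions = theta0 + theta1 * x_scaled          # hypothesis h(x)
    errors = predictions - y
    cost = np.mean(errors ** 2) / 2                   # mean squared error
    theta0 -= alpha * np.sum(errors) / m              # gradient descent
    theta1 -= alpha * np.sum(errors * x_scaled) / m   # updates

print(f"theta0={theta0:.1f}, theta1={theta1:.1f}, final cost={cost:.2f}")

# Predict the price of a hypothetical 1800 sq-ft house.
new_size = (1800 - x.mean()) / x.std()
print(f"predicted price: {theta0 + theta1 * new_size:.1f} (in $1000s)")
```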

It is rare to see real-world data with just one feature. Because of this, linear regression comes in two flavours: a linear regression task with one feature is called univariate linear regression, while one with more than one feature is called multivariate linear regression. The scope of this write-up is univariate linear regression.
