Introduction to Machine Learning

Vaani Rawatt
5 min read · Apr 22, 2020

image via: https://www.wordstream.com/blog/ws/2017/07/28/machine-learning-applications

Machine learning uses computers to predict unknown object attributes through the recognition of patterns in data. Machine Learning algorithms can broadly be divided into the following three subtypes:

  1. Supervised Learning: Supervised learning trains models on pre-labeled data.
    For example: predicting an unknown coin’s value from its weight, given data on the weights and values of known coins.
  2. Unsupervised Learning: Unsupervised learning works on unlabeled data, finding patterns and hidden structure in it. It groups similar data points together, and those groups can then help in making predictions.
    For example: grouping cricketers into batsmen and bowlers based on their runs-wickets graphs.
  3. Reinforcement Learning: Reinforcement learning is reward-based learning that works on a system of feedback.
    For example: a chatbot determining which responses are appropriate based on user reviews.

» The steps for executing a machine learning model are as follows:

i) Defining the Objective:

  • Classification: assigning an input to one of a set of discrete categories, such as a true/false result
  • Regression: predicting a numeric quantity
  • Anomaly Detection: finding anomalies in data
  • Clustering: discovering structure in unexplored data

ii) Collecting Data

iii) Preparing Data

iv) Selecting an Algorithm

v) Training the Model

vi) Testing the Model

vii) Making Predictions

viii) Deploying the Model
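
As a rough illustration of how these steps fit together, here is a minimal sketch in plain Python. The coin data, the `split`, `train` and `predict` helpers, and the midpoint-threshold "algorithm" are all assumptions made up for this example; a real project would normally use a proper library and far more data.

```python
import random

# Collecting data: hypothetical (weight_in_grams, label) pairs.
# Label 0 = light coin, label 1 = heavy coin.
data = [(2.0, 0), (2.2, 0), (2.5, 0), (2.7, 0), (3.0, 0),
        (5.0, 1), (5.3, 1), (5.6, 1), (5.8, 1), (6.1, 1)]

def split(rows, test_fraction=0.2, seed=0):
    """Preparing data: shuffle a copy, then hold out a test set."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def train(rows):
    """Training: put a threshold midway between the two class means."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def predict(threshold, x):
    """Making predictions: heavier than the threshold means class 1."""
    return 1 if x > threshold else 0

train_rows, test_rows = split(data)
threshold = train(train_rows)

# Testing: fraction of held-out rows that are classified correctly.
correct = sum(predict(threshold, x) == y for x, y in test_rows)
accuracy = correct / len(test_rows)

prediction = predict(threshold, 5.5)
print(threshold, accuracy, prediction)
```

The point of the held-out test set is that accuracy is measured on rows the model never saw during training.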

Basic Machine Learning Algorithms

I. Linear Regression

Linear Regression is a statistical model used to estimate the relationship between a continuous dependent variable and one or more independent variables. It predicts an unknown value of the dependent variable from a set of known pairs of independent and dependent variable values.
‣ Independent Variable: a variable whose value is not affected by the other variables and which is used to explain or manipulate the dependent variable
‣ Dependent Variable: a variable whose value changes when the values of the independent variables change

image via: https://en.wikipedia.org/wiki/Linear_regression

Here we examine two questions:
⁕ Which variables are significant predictors of the outcome variable?
⁕ How well does the regression line fit the data, so that predictions are as accurate as possible?

» Math Behind Model

› Once we get a linear equation in X and Y, we use it to plot the regression line
› The values of Y on this regression line are the predicted values of Y
› The distances between the actual and predicted values are known as residuals or errors
› The best-fit line is the one with the smallest sum of squares of these errors (also written as the sum of e squared), which is what the regression line provides
Minimizing the sum of squared errors is the most common approach, but there are other criteria.

E.g. sum of absolute errors. Root mean square error is also commonly reported, although minimizing it gives the same line as minimizing the sum of squared errors.
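
To make the least-squares idea concrete, here is a small sketch for a single independent variable. It uses the standard closed-form solution for the slope and intercept; the function names `fit_line` and `sum_squared_errors` and the sample data are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

def sum_squared_errors(xs, ys, m, c):
    """Sum of squared residuals (actual y minus predicted y)."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

# Hypothetical sample data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, c = fit_line(xs, ys)
sse = sum_squared_errors(xs, ys, m, c)
print(m, c, sse)

# Any other line gives a larger sum of squared errors on the same data.
print(sum_squared_errors(xs, ys, 0.5, 2.2) > sse)
```

For this data the fitted line has a slope of about 0.6 and an intercept of about 2.2, and nudging the slope away from that value increases the error.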

Multiple Linear Regression

Y = m1*x1 + m2*x2 + m3*x3 + … + mn*xn + c
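
The equation above can be evaluated directly once the coefficients are known. The sketch below only covers the prediction step, not the fitting; `predict_multiple` and the example numbers are hypothetical.

```python
def predict_multiple(coefficients, intercept, features):
    """Evaluate Y = m1*x1 + m2*x2 + ... + c for one row of features."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return sum(m * x for m, x in zip(coefficients, features)) + intercept

# Hypothetical coefficients m1..m3, intercept c, and one input row x1..x3.
y = predict_multiple([2.0, -1.0, 0.5], 3.0, [1.0, 4.0, 2.0])
print(y)  # 2*1 - 1*4 + 0.5*2 + 3
```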

II. Logistic Regression

Logistic Regression is an algorithm used for binary classification, i.e. it solves discrete Yes/No problems.
The output of the sigmoid function, a value between 0 and 1, is compared with a chosen threshold: if it is smaller than the threshold the model gives one result, and if it is larger it gives the other.

image via: https://www.javatpoint.com/logistic-regression-in-machine-learning

»Math Behind Model

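We calculate the odds of the event’s success, where p is the probability of success:

odds = p / (1 - p)

Now to calculate the equation of the sigmoid function, we model the log of the odds as a linear function of x and solve for p:

log(p / (1 - p)) = m*x + c
p = 1 / (1 + e^-(m*x + c))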
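
The relationship between odds, the sigmoid function and the threshold can be sketched as follows. This assumes a single input x with hypothetical slope `m` and intercept `c`, and a default threshold of 0.5; none of these values come from a fitted model.

```python
import math

def odds(p):
    """Odds of success for a probability p (0 <= p < 1)."""
    return p / (1 - p)

def sigmoid(z):
    """Map any real z to a probability between 0 and 1."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    # Equivalent form that avoids overflow for very negative z.
    e = math.exp(z)
    return e / (1 + e)

def classify(x, m, c, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    p = sigmoid(m * x + c)
    return 1 if p >= threshold else 0

print(odds(0.8))                      # roughly 4, i.e. 4-to-1 odds
print(sigmoid(0))                     # 0.5
print(math.log(odds(sigmoid(1.5))))   # roughly 1.5: log-odds undo the sigmoid
print(classify(3.0, 1.0, -2.0), classify(1.0, 1.0, -2.0))
```

The third line shows why the two formulas belong together: taking the log of the odds of a sigmoid output returns the original linear value.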

III. K Means Clustering

K means clustering is an unsupervised machine learning algorithm that divides objects into clusters such that each object belongs to exactly one cluster.
(K here is the number of clusters we form, and we have to choose an optimum value for it.)

image via: https://aws.amazon.com/blogs/machine-learning/k-means-clustering-with-amazon-sagemaker/

Distance Measure: The distance measure determines how similar two elements are and influences the shape of the clusters. Different problems may call for different distance measures. The common ones are:
› Euclidean distance measure: the straight-line distance between two points
› Manhattan distance measure: the sum of the horizontal and vertical components, i.e. the distance between two points measured along axes at right angles
› Squared Euclidean distance measure: the square of the Euclidean distance between two points
› Cosine distance measure: based on the angle between the two vectors
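
The four measures can be written out in a few lines each. This is a sketch using points given as equal-length tuples; the function names are illustrative, and cosine distance is taken here as 1 minus the cosine of the angle, which is one common convention (it is undefined for a zero vector).

```python
import math

def euclidean(a, b):
    """Straight-line distance between points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    """Euclidean distance squared (skips the square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    """Sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cos(angle between a and b); 0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

p, q = (0, 0), (3, 4)
print(euclidean(p, q), squared_euclidean(p, q), manhattan(p, q))
print(cosine_distance((1, 0), (0, 1)))  # perpendicular vectors
print(cosine_distance((1, 2), (2, 4)))  # same direction, close to 0
```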

‣ Steps for execution:
i) K initial centroids are chosen randomly
ii) The distance measure is used to find the centroid closest to each data point, and each data point is assigned to that centroid’s cluster
iii) The actual centroid (mean) of each cluster is computed, and the centroid is moved to that position
iv) The data points are reassigned to clusters with respect to the new centroids
v) The repositioning process is repeated until the centroids stop moving, which gives the final clusters
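
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: it assumes squared Euclidean distance, picks the initial centroids from the data points themselves, and the `k_means` function, its parameters and the sample points are all hypothetical.

```python
import random

def k_means(points, k, max_iterations=100, seed=0):
    """Cluster points (tuples of numbers) into k groups."""
    rng = random.Random(seed)
    # Start from k randomly chosen data points as centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iterations):
        # Assign every point to the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c))
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

On well-separated data like this the result does not depend much on the random start, but in general k-means can settle on different clusters for different initial centroids, so it is often run several times.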

IV. K Nearest Neighbors Algorithm

KNN is a supervised machine learning algorithm that stores all available cases and classifies new cases based on a similarity measure.

image via: https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn

‣ Working:
1) Find the K nearest neighbors of the new case using a distance measure
2) Classify the new case as whatever the majority of those neighbors are classified as
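
These two steps translate almost directly into code. The sketch below assumes Euclidean distance and labeled training data stored as (features, label) pairs; `knn_classify` and the sample points are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(training, new_point, k):
    """Label new_point by majority vote among its k nearest training cases."""
    def distance(features):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(features, new_point)))

    # Step 1: sort the stored cases by distance and keep the k nearest.
    nearest = sorted(training, key=lambda item: distance(item[0]))[:k]
    # Step 2: take the most common label among those neighbors.
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical labeled data: two features per case, classes "A" and "B".
training = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
            ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B")]

print(knn_classify(training, (2.0, 2.0), k=3))
print(knn_classify(training, (6.0, 7.0), k=3))
```

There is no separate training step here: all the work happens at prediction time, which is why KNN becomes slow on large datasets.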

⁕ How do we choose K?
The KNN algorithm is based on feature similarity. Choosing the right value of k is a process called parameter tuning. As a rule of thumb, we choose k close to Sqrt(n), where n is the total number of data points, and reduce it by 1 if that value is even.
(Odd values of k are preferred to avoid ties between the two classes of data)
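
One way to read this rule of thumb is "take a value near the square root of n and make it odd". The helper below is a sketch under that assumption; `choose_k` is a hypothetical name, and in practice k is usually also checked against held-out data.

```python
import math

def choose_k(n):
    """Pick an odd k close to sqrt(n) for n data points."""
    k = max(1, round(math.sqrt(n)))
    if k % 2 == 0:
        k -= 1  # step down to the nearest odd value to avoid tied votes
    return k

print(choose_k(25))   # sqrt is 5, already odd
print(choose_k(100))  # sqrt is 10, so step down to 9
```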

When to use KNN?
a) when the data is labeled
b) when the data is noise-free
c) when the data is small
