
Last Minute Revision: Part I - Machine Learning & Statistics

Prasad Patil


Here’s a ready reference for a quick run-through of some key concepts in Machine Learning and Statistics. This 7-minute read is curated especially for readers who are minutes away from appearing before an interview panel, and for those who want to touch base with everything in a short time.

In this article I have attempted to cover short summaries and a very high-level overview of a few important Machine Learning algorithms and their basic intuition, the confusion matrix, and a few very common statistical terms.

Machine Learning Algorithms

Linear Regression

  • Assuming a linear relationship between the independent feature (X) and the target (y), it predicts a real, continuous target by finding the best-fitting line.
  • Representation:
y = m*X + c

The values of m (slope) and c (intercept) are estimated such that the cost function is minimized.

  • The values of the coefficients (m & c) are determined using methods such as ordinary least squares, gradient descent, and the normal equation, as sketched below.
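A minimal sketch of estimating m and c via least squares with NumPy (the toy data and variable names are my own illustration):

```python
import numpy as np

# Toy data: y is roughly 2*X + 1 plus noise (illustrative, not from the article)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)
y = 2 * X + 1 + rng.normal(0, 0.5, size=50)

# Least squares on the design matrix A = [1, X] recovers [c, m]
A = np.column_stack([np.ones_like(X), X])
c, m = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"slope m ~ {m:.2f}, intercept c ~ {c:.2f}")
```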

Logistic Regression

  • It is used to estimate discrete values in classification problems.
  • It predicts the probability of occurrence of a class in the target variable, hence the output ranges from 0 to 1.

Representation:

y = 1/(1 + e^(-z))

  • It is the equation of a sigmoid curve, where z is the linear score (e.g., m*X + c).
  • Cost function for a binary classification problem:

-y*log(hθ(x)) - (1-y)*log(1 - hθ(x))

  • Performance is measured using the confusion matrix and AUC-ROC.
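As a quick illustration, here are the sigmoid and the binary cross-entropy cost in plain NumPy (the scores and labels are made up for the example):

```python
import numpy as np

def sigmoid(z):
    """Map any real score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_prob):
    """Average of -y*log(h) - (1-y)*log(1-h) over all samples."""
    eps = 1e-12  # guard against log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return np.mean(-y_true * np.log(y_prob) - (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
z = np.array([2.0, -1.5, 0.3, 3.0])   # linear scores, e.g. m*X + c
print(log_loss(y_true, sigmoid(z)))   # small loss -> confident, correct predictions
```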

Decision Tree

  • Applicable to both classification & regression problems.
  • The entire population is divided into two or more homogeneous sets.
  • This is done using various split criteria such as Gini, chi-square, and entropy.

Gini Index:

1 - [P² + (1 - P)²]

Chi-square:

√((Actual - Expected)² / Expected)

Entropy:

-p*log(p) - q*log(q)

Predictions are made by walking down the splits of the tree from the root node until arriving at a leaf node.

Output:

for classification: the class value at that leaf node

for regression: the average of the samples at that terminal node
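For intuition, here is a small sketch of the two impurity measures for a binary node, where p is the proportion of one class (the values below are my own toy numbers):

```python
import numpy as np

def gini(p):
    """Gini index for a binary node: 1 - [p^2 + (1-p)^2]."""
    return 1 - (p**2 + (1 - p)**2)

def entropy(p):
    """Entropy for a binary node: -p*log2(p) - q*log2(q), q = 1 - p."""
    if p in (0, 1):
        return 0.0  # a pure node has zero entropy by convention
    q = 1 - p
    return -p * np.log2(p) - q * np.log2(q)

print(gini(0.5), entropy(0.5))  # most impure 50/50 node: 0.5 and 1.0
print(gini(0.9), entropy(0.9))  # mostly one class: much lower impurity
```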

Random Forest

  • Bootstrap aggregation (bagging) of decision trees: a random subset of the training data (sampled with replacement) is selected to train each tree.
  • The model randomly restricts the variables that may be used at the splits of each tree.
  • Hence, the trees grown are dissimilar, but they still retain predictive power.
  • Random forest is an ensemble technique where multiple models are trained and their predictions are combined, either by majority voting or by averaging.
  • It is a black-box model & more of an engineering hack than an algorithm.

Adaptive Boosting

  • An ensemble technique where predictions from multiple weak learners (models) are combined to form one strong learner.
  • The algorithm works by assigning more weight to misclassified instances than to correctly classified ones.

The weight of each classifier in the final vote is

α = (1/2) * ln((1 - ε) / ε)

where

α = weight for the classifier

ε = minimum misclassification error for the model
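A sketch of one boosting round under these definitions, using toy labels of my own; the update shown is the standard exponential re-weighting, not code from the article:

```python
import numpy as np

def classifier_weight(eps):
    """AdaBoost classifier weight: alpha = 0.5 * ln((1 - eps) / eps)."""
    return 0.5 * np.log((1 - eps) / eps)

w = np.full(5, 1 / 5)                    # start with uniform sample weights
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])    # one mistake in this round
eps = np.sum(w[y_true != y_pred])        # weighted misclassification error
alpha = classifier_weight(eps)
w = w * np.exp(-alpha * y_true * y_pred) # up-weight the misclassified sample
w /= w.sum()                             # renormalise to a distribution
print(alpha, w)                          # the mistaken sample now carries more weight
```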

Gradient Boosting

  • Like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor.
  • However, instead of tweaking the instance weights at every iteration like AdaBoost, it tries to fit the new predictor to the residual errors made by the previous predictor.
  • Gradient boosting machines are generally slow to train because the models must be fit sequentially; hence they are not very scalable.
  • Remedy: eXtreme Gradient Boosting (XGBoost), designed for speed & performance.
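The residual-fitting idea can be sketched in a few lines with shallow scikit-learn trees (the data, depth, and learning rate here are illustrative choices, not canonical settings):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# Each new tree is fit to the residuals of the ensemble built so far
pred = np.zeros_like(y)
learning_rate = 0.1
for _ in range(100):
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)

print(np.mean((y - pred) ** 2))  # training MSE shrinks as trees are added
```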

K-means Clustering

  • STEP 1: Choose the number K of clusters
  • STEP 2: Select K points at random as the centroids (not necessarily from your dataset)
  • STEP 3: Assign each data point to the closest centroid; that forms K clusters
  • STEP 4: Compute and place the new centroid of each cluster
  • STEP 5: Reassign each data point to the new closest centroid. If any reassignment took place, go to STEP 4; otherwise stop

The optimal K value is obtained from the elbow method (based on within-cluster variation) and silhouette analysis, as sketched below.
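A minimal elbow-method sketch with scikit-learn, assuming synthetic blob data purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Elbow method: track within-cluster sum of squares (inertia) against K
# and pick the K where the curve bends
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # inertia drops sharply until K=4, then flattens
```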

Hierarchical Clustering

  • Unlike K-means, there is no need to specify the number of clusters beforehand.
  • Types: agglomerative (bottom-up) clustering & divisive (top-down) clustering
  • In agglomerative clustering, every data point starts as a single cluster (a singleton); based on similarity, clusters are merged step by step until together they form a single cluster of all instances.
  • In divisive clustering, exactly the reverse process takes place.
  • The distance or similarity between two clusters is determined by single, complete, or average linkage (see the sketch below).
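A small sketch of agglomerative clustering with SciPy, using made-up blob data; the method argument selects the linkage criterion mentioned above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])

# Merge the closest clusters step by step; 'single', 'complete',
# or 'average' picks the linkage criterion
Z = linkage(X, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)  # the two blobs fall into separate clusters
```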

Confusion Matrix

It is an N*N matrix, where N is the number of classes being predicted. In our example we have 2 classes (Positive & Negative), so it is a 2*2 matrix.

We have data on 165 patients, and the target variable is whether a patient has cancer (+ve class) or not (-ve class).

Let’s decipher the terminologies in the confusion matrix one by one…

TN (True Negative):

Our model predicted that a patient does not have cancer, and in reality they did not. (50 cases / 165)

TP (True Positive):

Our model predicted that a patient has cancer, and in reality they actually did. (100 cases / 165)

FP (False Positive, aka Type I error):

Our model predicted that a patient has cancer, but in reality they did not. (10 cases / 165)

FN (False Negative, aka Type II error):

Our model predicted that a patient does not have cancer, but in reality they did. (5 cases / 165)

Accuracy :

Total correct predictions out of total predictions made.

Mathematically,

(TP + TN) / (TP + TN + FP + FN)

From example,

(100 + 50) / (100 + 50 + 10 + 5)

= 150/165

= 0.91

Recall :

It is often termed sensitivity, aka the accuracy of the positive class.

In reality, out of 165 patients, 105 were diagnosed with cancer.

So recall basically asks: “Out of 105 cancer patients, how many was our model able to pick correctly?”

Mathematically,

(TP) / (TP + FN)

From example,

(100)/(100+5)

= (100)/(105)

= 0.95

Precision :

Precision asks: “Out of all predicted positives (cases where the model said the patient has cancer), how many patients actually had cancer in real life?”

Mathematically,

(TP) / (TP + FP)

From example,

(100)/(100+10)

= (100)/(110)

= 0.91

Specificity :

It is often termed the accuracy of the negative class: actual negative cases correctly identified out of all actual negatives.

Out of 165 patients, 60 did not have cancer, so specificity asks: out of those 60 non-cancer patients, how many did our model pick correctly?

Mathematically ,

TN / (TN + FP)

50 / (50+10)

= 50 / 60

= 0.83
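All four metrics from the worked example can be verified in a few lines; the counts below are the ones used throughout this section:

```python
# Counts from the 165-patient example above
TP, TN, FP, FN = 100, 50, 10, 5

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # 150/165
recall      = TP / (TP + FN)                   # 100/105 (sensitivity)
precision   = TP / (TP + FP)                   # 100/110
specificity = TN / (TN + FP)                   # 50/60

print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"precision={precision:.2f} specificity={specificity:.2f}")
# accuracy=0.91 recall=0.95 precision=0.91 specificity=0.83
```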

STATISTICS

Correlation

range = [-1,1]

Intuition: let’s say we have 2 features in a data set, namely x & y. x and y are correlated when y moves as x moves. If the direction of their movement is the same, they are positively correlated; otherwise, negatively.

Note that positive correlation does not mean good correlation, and negative correlation does not mean bad correlation. The polarity simply indicates the direction of movement.

Correlation can be measured using Pearson correlation coefficient(r).

r = 1 means a perfect positive linear relationship: every increase in x is matched by a proportional increase in y.

r = -1 means a perfect negative linear relationship: every increase in x is matched by a proportional decrease in y.
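A quick check of both cases with NumPy (the toy series are my own; note that the slopes differ from 1, yet r is still ±1 because the relationships are perfectly linear):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y_up = 3 * x + 2          # moves with x    -> r = 1
y_down = -0.5 * x + 10    # moves against x -> r = -1

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r(x, y)
print(np.corrcoef(x, y_up)[0, 1])    #  1.0
print(np.corrcoef(x, y_down)[0, 1])  # -1.0
```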

Permutation

A selection of objects in which the order of the objects matters.

Combination

The number of ways to choose r objects from a set of n objects, where order does not matter.
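Both counts are available in Python’s standard library (Python 3.8+), shown here with an illustrative n and r:

```python
import math

n, r = 5, 2

# Permutations: order matters -> n! / (n - r)!
print(math.perm(n, r))  # 20 ordered pairs from 5 objects

# Combinations: order ignored -> n! / (r! * (n - r)!)
print(math.comb(n, r))  # 10 unordered pairs
```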

Central Limit Theorem

Irrespective of the underlying distribution of the population, the sampling distribution of the sample mean approaches a normal distribution as the sample size grows.

  1. The mean of the sampling distribution of sample means (x̄) is the same as the population mean (μ).
  2. The standard deviation of the sampling distribution is the population standard deviation divided by the square root of the sample size (σ/√n), as the simulation below illustrates.
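A small simulation sketch, assuming an exponential (heavily skewed) population purely for illustration, confirms both points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: exponential, nothing like a normal distribution
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples of size n and record each sample mean
n = 50
sample_means = np.array([rng.choice(population, n).mean() for _ in range(2000)])

print(population.mean(), sample_means.mean())             # both ~2.0  (point 1)
print(population.std() / np.sqrt(n), sample_means.std())  # both ~0.28 (point 2)
```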

Hypothesis Testing

H0 : Null Hypothesis

H1 : Alternate Hypothesis

H0 is assumed to be true first and tested later. Based on the findings, it is either retained or rejected.

Only H1 can ever be positively accepted; if the evidence is insufficient, we say that due to insufficient data we retain H0, but we cannot say that we are rejecting H1 or that H1 is incorrect.

Z-Test

Assumptions

  1. Sample selected randomly
  2. Observations are independent
  3. The population standard deviation is known or the sample contains at least 30 observations
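A sketch of a one-sample z-test under these assumptions; the numbers (μ₀, σ, n, and the sample mean) are hypothetical:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: H0 says the population mean is 100, with known sigma = 15
mu0, sigma, n = 100, 15, 36
sample_mean = 106

# z statistic: how many standard errors the sample mean sits from mu0
z = (sample_mean - mu0) / (sigma / np.sqrt(n))
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-tailed

print(z, p_value)  # z = 2.4, p ~ 0.016 -> reject H0 at the 5% level
```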

Hope you find it useful and that it helps relax some of those brain muscles in jittery situations. Stay tuned for the next article! #HappyLearning
