Machine Learning

Arif Zainurrohman
Published in Nerd For Tech
13 min read · Mar 7, 2021

What is Machine Learning?

“The ability of a machine to perform certain tasks done by a human, without being explicitly programmed to do those tasks.”

Traditional Computing vs Machine Learning


Machine Learning or No Machine Learning?

Traditional programming refers to a manually written program that takes input data and runs on a computer to produce the output. In machine learning, by contrast, the input data and the corresponding outputs are fed to an algorithm, which learns a model that acts as the program. The learned model can then yield insights and predict outcomes for new data.

Machine Learning

  • Predicting house prices
  • Sentiment analysis
  • Segmenting customers based on buying behavior
  • Credit scoring
  • Spam filtering

No Machine Learning

  • Get the average height of students in a class
  • No-reply email

Machine Learning Implementations

Benefit
Machine Learning Algorithms

General Steps


Data Cleansing

Data cleansing, or data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting that dirty or coarse data. Typical issues to check for are listed below; a minimal pandas sketch follows the list.

  • Duplicate Dataset
  • Missing Data
  • Outliers
  • Data Type
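
As an illustration, here is a minimal pandas sketch of these four checks. The DataFrame, the column names, and the thresholds are all hypothetical examples, not a prescription.

```python
import pandas as pd

# Hypothetical raw data containing the four typical issues
df = pd.DataFrame({
    "age":    [25, 25, 40, None, 120, 31],
    "income": ["50000", "50000", "64000", "58000", "61000", "59000"],
})

# Duplicate dataset: drop exact duplicate rows
df = df.drop_duplicates()

# Missing data: impute the median (or drop rows with dropna())
df["age"] = df["age"].fillna(df["age"].median())

# Outliers: remove values outside 1.5 * IQR
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["age"] >= q1 - 1.5 * iqr) & (df["age"] <= q3 + 1.5 * iqr)]

# Data type: cast the numeric column that was stored as text
df["income"] = df["income"].astype(float)

print(df)
```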

Feature Engineering

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered applied machine learning itself. Common steps include the following; a short sketch of some of them is shown after the list.

  • Create new variables
  • Convert all variables to numerical form
  • Scale variables (depending on the algorithm)
  • Variable reduction
  • Handle an imbalanced target class
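
A minimal pandas/scikit-learn sketch of the first three steps (the columns and data are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "city":   ["Jakarta", "Bandung", "Jakarta", "Surabaya"],
    "income": [50_000, 42_000, 61_000, 55_000],
    "debt":   [10_000, 9_000, 30_000, 12_000],
})

# Create a new variable from existing ones
df["debt_to_income"] = df["debt"] / df["income"]

# Convert a categorical variable to numerical form (one-hot encoding)
df = pd.get_dummies(df, columns=["city"])

# Scale numerical variables (needed by distance-based algorithms such as
# k-NN, k-means, and SVM; tree-based models usually do not need it)
num_cols = ["income", "debt", "debt_to_income"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

print(df.head())
```

Variable reduction (for example with PCA) and imbalanced target classes are covered later in the article.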

Data Profiling

Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data.

Load data
Describe
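
In pandas, the two steps above (load data, describe) might look like this; the file name is hypothetical:

```python
import pandas as pd

# Load data from a hypothetical CSV file
df = pd.read_csv("house_prices.csv")

# Describe: quick statistical summary of the dataset
print(df.shape)
print(df.dtypes)
print(df.describe())
```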

Data Exploration

Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems.

Missing Values Checking and Imputation

Anomaly/Outlier Detection

Correlation Heatmap
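
A minimal sketch of these three exploration steps, assuming the DataFrame `df` was loaded as in the profiling step above; the plotting choices are only examples:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Missing values check and a simple median imputation
print(df.isnull().sum())
df = df.fillna(df.median(numeric_only=True))

# Anomaly/outlier detection with a box plot of the numeric columns
df.select_dtypes("number").plot(kind="box", figsize=(12, 4))
plt.show()

# Correlation heatmap of the numeric columns
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
```

The remaining steps (feature engineering, modelling, evaluation) are illustrated in the sections that follow.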

Feature Engineering

Modelling

Evaluation

Supervised Learning

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.

In supervised learning, each example is a pair consisting of an input object (typically a vector) and the desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way. This statistical quality of an algorithm is measured through the so-called generalization error.

Implementations of Supervised Learning:

  • Regression — the model predicts outputs that are real-valued variables (numbers that can have decimals).
  • Classification — the model assigns its inputs to one of a set of predefined classes.

Both are illustrated in the short sketch below.

Classification vs Regression

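To make the distinction concrete, here is a minimal scikit-learn sketch on synthetic data: the same inputs, but a real-valued target for regression and a discrete label for classification.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))

# Regression: the target is a real number (e.g. a house price)
y_price = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_price)
print("predicted value:", reg.predict(X[:1]))

# Classification: the target is a class label (e.g. spam / not spam)
y_label = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_label)
print("predicted class:", clf.predict(X[:1]))
```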

Regression

Regression is a statistical method used in finance, investing, and other disciplines that attempt to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).

Building a regression model means looking for a relationship between one or more independent variables, or predictors (X), and the dependent variable, or response (Y).


Simple Linear Regression

Simple linear regression is a type of regression analysis where the number of independent variables is one and there is a linear relationship between the independent (x) and dependent (y) variables.
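
A minimal sketch fitting y = a + b·x to made-up data with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50).reshape(-1, 1)   # single independent variable
y = 2.5 * x.ravel() + 1.0 + rng.normal(scale=1.0, size=50)

model = LinearRegression().fit(x, y)
print("intercept:", model.intercept_)   # estimate of 1.0
print("slope:", model.coef_[0])         # estimate of 2.5
```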

Multiple Linear Regression

As a predictive analysis, multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more independent variables. The independent variables can be continuous or categorical.

Ordinary Least Squares

In statistics, ordinary least squares (OLS) is a type of linear least-squares method for estimating the unknown parameters in a linear regression model.

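OLS has a closed-form solution, β̂ = (XᵀX)⁻¹ Xᵀ y. A small NumPy sketch on synthetic data; the same idea covers multiple linear regression, since X can hold several predictor columns:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n),          # intercept column
                     rng.normal(size=n),  # predictor 1
                     rng.normal(size=n)]) # predictor 2
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# Ordinary least squares via the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to [1.0, 2.0, -3.0]
```

In practice a library routine (e.g. sklearn's LinearRegression or statsmodels' OLS) does this fit for you; the normal equations are shown only to make the estimator explicit.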

Linearity

Linearity requires little explanation. After all, if you have chosen to do linear regression, you are assuming that the underlying data exhibits linear relationships, specifically the following linear relationship:

y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε

Normal Distribution for Error

  • What normality is telling you is that most of the prediction errors from your model are zero or close to zero, and large errors are much less frequent than small errors.
  • If the residual errors of the regression are not N(0, σ²), then statistical tests of significance that depend on the errors having an N(0, σ²) distribution simply stop working.

Homoscedasticity

The variance σ² of the residual errors should be constant. In particular, σ² should not be a function of the response variable y, and thereby indirectly of the explanatory variables X.

Autocorrelation

Linear regression analysis also requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other.

Multicollinearity

Multicollinearity is a statistical concept where independent variables in a model are correlated. Multicollinearity among independent variables will result in less reliable statistical inferences.

  • Besides fulfilling all of the linear regression assumptions above, a multiple regression model has an additional assumption: the absence of multicollinearity.
  • Multicollinearity refers to a situation in which more than two explanatory variables in a multiple regression model are highly linearly related; the variance inflation factor (VIF) sketch below is one common way to detect it.
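
A hedged sketch of the VIF check using statsmodels, on synthetic data; a commonly cited rule of thumb is that a VIF above roughly 5 to 10 signals a problem:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 * 0.95 + rng.normal(scale=0.1, size=300)   # highly correlated with x1
x3 = rng.normal(size=300)                          # independent predictor
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# One VIF value per explanatory variable
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # x1 and x2 get large VIFs, x3 stays near 1
```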

Classification concept

  • Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.
  • Classification is the task of assigning objects to one of several predefined categories.
  • Classification is a form of data analysis that extracts models describing important data classes.
  • Such models, called classifiers, predict categorical (discrete, unordered) class labels.
  • Classification has numerous applications, including fraud detection, customer churn prediction, loan approval prediction, and sales prediction.

Classification Algorithms

Several Classification Algorithms in Machine Learning

  • Logistic Regression.
  • Naive Bayes Classifier.
  • K-Nearest Neighbors.
  • Decision Tree.
  • Random Forest.
  • Support Vector Machines.
  • etc.

Decision Tree Concept

Decision tree learning is one of the predictive modeling approaches used in statistics, data mining, and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves).


Decision Tree Modelling


Decision Tree Advantages & Disadvantages


CART

Here is the approach for most decision tree algorithms at their simplest. The tree will be constructed in a top-down approach as follows:

  • Start at the root node with all training instances
  • Select an attribute on the basis of a splitting criterion (Gain Ratio or another impurity metric)
  • Partition the instances according to the selected attribute, and recurse on each partition
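
A minimal scikit-learn sketch of a CART-style tree; scikit-learn's DecisionTreeClassifier uses an optimized version of CART, and the iris dataset is used only as a convenient example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="gini" (default) or "entropy" selects the splitting impurity metric
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned top-down splits
print(export_text(tree))
```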

Train — Test Split

In a dataset, a training set is used to build the model, while a test (or validation) set is used to validate the model that was built. Data points in the training set are excluded from the test (validation) set.
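
A minimal sketch of the split with scikit-learn; the 80/20 proportion is only an example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% for training, 20% held out for testing (stratified by class)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), "training rows,", len(X_test), "test rows")
```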

Overfitting

Overfitting is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably”. An overfitted model is a statistical model that contains more parameters than can be justified by the data.


K — Fold Cross Validation

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter, k, that refers to the number of groups that a given data sample is to be split into.
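
A minimal sketch with k = 5 using scikit-learn's cross_val_score; the model and dataset are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# k = 5: the data is split into 5 folds; each fold is used once as the test set
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```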

Ensemble Learning

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.


Bagging

Bagging, also known as bootstrap aggregating, is the process in which multiple models of the same learning algorithm are trained on bootstrapped samples of the original dataset. Then, as in the random forest example below, a vote is taken over all of the models' outputs.

Boosting

Boosting is a variation of bagging in which the individual models are built sequentially, each one iterating on the previous one. Specifically, data points that are misclassified by the previous model are emphasized in the following model. This is done to improve the overall accuracy of the ensemble.
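
A hedged sketch of both ideas with scikit-learn, using BaggingClassifier for bagging and AdaBoostClassifier for boosting; the data is synthetic and the parameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Bagging: many trees trained independently on bootstrapped samples, then voting
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: models built sequentially, each focusing on previously misclassified points
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```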

Random Forest

Random forests are an ensemble learning technique that builds off of decision trees. Random forests involve creating multiple decision trees using bootstrapped datasets of the original data. The model then selects the mode (the majority) of all of the predictions of each decision tree. What’s the point of this? By relying on a “majority wins” model, it reduces the risk of error from an individual tree.

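A minimal sketch; the number of trees and the synthetic dataset are only examples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each grown on a bootstrapped sample with a random subset of features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```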

Advantages and Disadvantages of Random Forest

Imbalanced Dataset

Imbalanced datasets are a special case of classification problems where the class distribution is not uniform among the classes. Typically, they are composed of two classes: the majority (negative) class and the minority (positive) class.

Undersampling — Oversampling

A widely adopted and perhaps the most straightforward method for dealing with highly imbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling).


SMOTE

Synthetic Minority Oversampling Technique (SMOTE) is a statistical technique for increasing the number of cases in your dataset in a balanced way. The module works by generating new instances from existing minority cases that you supply as input. This implementation of SMOTE does not change the number of majority cases.
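
A hedged sketch of under-sampling, over-sampling, and SMOTE using the imbalanced-learn package (pip install imbalanced-learn); all three samplers expose the same fit_resample interface:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic imbalanced data: roughly 95% negative class, 5% positive class
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

for name, sampler in [
    ("undersampling", RandomUnderSampler(random_state=0)),
    ("oversampling", RandomOverSampler(random_state=0)),
    ("SMOTE", SMOTE(random_state=0)),
]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```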


Unsupervised Learning

Unsupervised learning (UL) is a type of algorithm that learns patterns from untagged data. The hope is that, through mimicry, the machine is forced to build a compact internal representation of its world. In contrast to supervised learning (SL), where the data is tagged by a human, e.g. as “car” or “fish”, UL exhibits self-organization that captures patterns as neuronal predilections or probability densities.

Some of the most common algorithms used in unsupervised learning include: (1) Clustering, (2) Anomaly detection, (3) Neural Networks, and (4) Approaches for learning latent variable models.

Two Basic Unsupervised Learning Techniques

Clustering

The goal is to find homogeneous subgroups within the data; the grouping is based on distance between observations.

Examples: k-means clustering and hierarchical clustering.

Dimensionality Reduction

The goal is to identify patterns in the features of the data. This is often used to facilitate visualization of the data, as well as a pre-processing method prior to supervised learning.

Examples: PCA.

K-Means Clustering

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. Better Euclidean solutions can be found using, for instance, k-medians and k-medoids.
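
A minimal scikit-learn sketch; k = 3 and the blob data are arbitrary choices:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # the 3 centroids
print(labels[:10])               # cluster assignment of the first 10 points
```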


Principal Component Analysis (PCA)

Principal component analysis (PCA) is a statistical technique that converts a set of possibly correlated variables into a smaller set of uncorrelated variables called principal components. The components are ordered so that the first few retain most of the variation present in the original data. PCA is therefore widely used for dimensionality reduction, for visualizing high-dimensional data, and as a pre-processing step before supervised learning.
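
A minimal sketch reducing the 4-dimensional iris features to 2 principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)   # share of variance kept by each component
print(X_2d[:5])
```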


Numerical Representation for Words

Count Vector: count the word occurrences in every document/sentence.

Problems with Count Vector

  1. The resulting vectors are not unique: different texts can map to the same vector (see the cat and hat example in the sketch below).
  2. What happens when you have very many documents? The vector size grows linearly with the vocabulary.
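
A minimal sketch with scikit-learn's CountVectorizer; the two sentences are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the hat", "the hat sat on the cat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(counts.toarray())   # one row of word counts per document
```

The two sentences produce identical count vectors even though they say different things, because word order is lost; this illustrates the first problem above.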

TF-IDF Vector

tf–idf, TF*IDF, or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
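
A short sketch of the same idea with scikit-learn's TfidfVectorizer; the toy documents are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the hat",
    "the dog ate the cat and the hat",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# Words that appear in every document (like "the") get a lower idf than
# words that appear in only one document (like "dog")
print(tfidf.toarray().round(2))
```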


Word Embedding

Word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning.


Word2Vec (CBOW and Skip-gram)

Word2vec is a technique for natural language processing. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.
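
A hedged sketch with gensim (pip install gensim); the toy corpus is far too small for meaningful vectors and is only meant to show the API. Parameter names follow gensim 4.x; sg=0 selects CBOW and sg=1 selects skip-gram.

```python
from gensim.models import Word2Vec

# Tiny tokenized corpus (a real corpus would contain millions of sentences)
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=0 -> CBOW (predict a word from its context), sg=1 -> skip-gram
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)

print(model.wv["cat"][:5])            # first 5 dimensions of the "cat" vector
print(model.wv.most_similar("cat"))   # nearest words in the embedding space
```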


ML Lifecycle


ML Deployment


References

Deal with Missing Attributes, Outliers, and Duplicates: Perform an Initial Data Analysis — OpenClassrooms

Class Imbalance: Handling Imbalanced Data Using Python — Analytics Vidhya

Data cleansing — Wikipedia

Feature engineering — Wikipedia

Data profiling — Wikipedia

Data exploration — Wikipedia

Supervised learning — Wikipedia

Linear Regression — Shubhang Agrawal, Analytics Vidhya (Medium)

An Insight to Linear Regression in Machine Learning — Muskan Trisal, Analytics Vidhya (Medium)

Ordinary least squares — Wikipedia

Assumptions of Linear Regression, and How to Test Them Using Python — Sachin Date, Towards Data Science

Assumptions of Linear Regression — Statistics Solutions

Multicollinearity Definition — Investopedia

Decision tree learning — Wikipedia

Classification and Regression Trees (CART) Algorithm — OpenGenus

What Is the Difference Between Training and Test Dataset? — Sajid Lhessani, Analytics Vidhya (Medium)

Overfitting — HandWiki

A Gentle Introduction to k-fold Cross-Validation — Machine Learning Mastery
