Machine Learning
What is Machine Learning?
“The ability of a machine to perform certain tasks normally done by a human without being explicitly programmed to do those tasks.”
Traditional Computing vs Machine Learning
Machine Learning or No Machine Learning?
Traditional programming refers to any manually written program that takes input data and runs on a computer to produce output. In machine learning, which underpins what is sometimes called augmented analytics, the input data and the corresponding outputs are fed to an algorithm, which creates the program (a model). That model yields insights that can be used to predict future outcomes.
Machine Learning
- Predicting house price
- Sentiment analysis
- Segmenting customers based on buying behavior
- Credit scoring
- Spam filtering
No Machine Learning
- Get the average height of students in a class
- No-reply email
Machine Learning Implementations
General Steps
Data Cleansing
Data cleansing, or data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records in a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting them. Common issues to check:
- Duplicate Data
- Missing Data
- Outliers
- Data Type
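The four checks above can be sketched in a few lines of plain Python. This is only an illustration: the records, the field layout (student id, height as text), and the plausible-height range are all hypothetical, and median imputation is just one of several reasonable choices.

```python
from statistics import median

# Hypothetical raw records: (student_id, height in cm stored as text)
raw_records = [
    ("s1", "160"),
    ("s2", "172"),
    ("s2", "172"),   # duplicate record
    ("s3", None),    # missing value
    ("s4", "165"),
    ("s5", "1680"),  # implausible value, probably a data-entry error
    ("s6", "158"),
]


def clean(records, low=50.0, high=250.0):
    """Deduplicate, fix types, drop outliers, and impute missing heights."""
    # 1. Duplicates: keep the first occurrence of each record, in order.
    deduped = list(dict.fromkeys(records))
    # 2. Data type: convert text to float, keeping None for missing values.
    typed = [(rid, float(h) if h is not None else None) for rid, h in deduped]
    # 3. Outliers: treat values outside the assumed plausible range as missing.
    ranged = [
        (rid, h if h is not None and low <= h <= high else None)
        for rid, h in typed
    ]
    # 4. Missing data: impute with the median of the remaining valid values.
    fill = median(h for _, h in ranged if h is not None)
    return [(rid, h if h is not None else fill) for rid, h in ranged]


cleaned = clean(raw_records)
```

Whether an outlier should be removed, capped, or kept depends on the domain; here it is simply treated as missing and imputed.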
Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.
- Create a new variable
- Convert all variables to numerical
- Scale variables (depending on the algorithm)
- Variable reduction
- Imbalanced target class
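As a rough sketch of the first three items, the snippet below creates a new variable, one-hot encodes a categorical variable, and min-max scales a numeric one. The house data and the helper names (`one_hot`, `min_max_scale`) are hypothetical and written from scratch for illustration.

```python
def one_hot(values):
    """Encode a categorical variable as 0/1 columns, one per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def min_max_scale(values):
    """Rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical house data
area = [50, 75, 100]    # square meters
rooms = [2, 3, 4]
city = ["A", "B", "A"]

area_per_room = [a / r for a, r in zip(area, rooms)]  # create a new variable
city_encoded, city_categories = one_hot(city)         # convert to numerical
area_scaled = min_max_scale(area)                     # scale the variable
```

Scaling matters mainly for distance-based or gradient-based algorithms; tree-based models are largely insensitive to it.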
Data Profiling
Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data.
Data Exploration
Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems.
Missing Values Checking and Imputation
Anomaly/Outlier Detection
Correlation Heatmap
Feature Engineering
Modelling
Evaluation
Supervised Learning
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object (typically a vector) and the desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way. This statistical quality of an algorithm is measured through the so-called generalization error.
Implementation of Supervised Learning:
- Regression: the model predicts real-valued outputs (numbers that can have decimals).
- Classification: the model assigns its inputs to classes.
Classification vs Regression
Regression
Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).
A regression model is built by looking for a relationship between one or more independent variables or predictors (X) and the dependent variable or response (Y).
Simple Linear Regression
Simple linear regression is a type of regression analysis where there is a single independent variable and a linear relationship between the independent (x) and dependent (y) variables.
Multiple Linear Regression
As a predictive analysis, multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more independent variables. The independent variables can be continuous or categorical.
Ordinary Least Squares
In statistics, ordinary least squares (OLS) is a type of linear least-squares method for estimating the unknown parameters in a linear regression model.
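For the simple (one-variable) case, the OLS estimates have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A minimal sketch, using a hypothetical helper name `ols_simple` and made-up data that lie exactly on y = 1 + 2x:

```python
def ols_simple(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data generated from y = 1 + 2x with no noise
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
intercept, slope = ols_simple(x, y)
```

With noisy real data the fitted line no longer passes through every point, and the residuals are what the assumptions in the following sections are about.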
Linearity
Linearity requires little explanation. If you have chosen linear regression, you are assuming that the underlying data exhibits a linear relationship of the form y = β0 + β1x1 + ... + βkxk + ε, where the β terms are the coefficients and ε is the error term.
Normal Distribution for Error
- What normality tells you is that most of the prediction errors from your model are zero or close to zero, and that large errors are much less frequent than small errors.
- If the residual errors of the regression are not N(0, σ²), then statistical tests of significance that depend on the errors having an N(0, σ²) distribution are no longer valid.
Homoscedasticity
The variance σ² of the errors should be constant. In particular, σ² should not be a function of the response variable y, and thereby indirectly of the explanatory variables X.
Autocorrelation
Linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other.
Multicollinearity
Multicollinearity is a statistical concept where independent variables in a model are correlated with one another. Multicollinearity among independent variables results in less reliable statistical inferences.
- Besides having to fulfill all the linear regression assumptions above, a multiple regression model has an additional assumption: the absence of multicollinearity.
- Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.
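One common way to check for multicollinearity is to look at the correlation between predictors and at the variance inflation factor (VIF). The sketch below covers only the two-predictor case, where the VIF reduces to 1 / (1 - r²) with r the Pearson correlation; the data and function names are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / sqrt(saa * sbb)


def vif_two_predictors(a, b):
    """VIF of either predictor when the model has exactly two predictors."""
    r = pearson(a, b)
    if abs(r) == 1.0:
        return float("inf")  # perfectly collinear predictors
    return 1.0 / (1.0 - r * r)


# Hypothetical predictors
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
r = pearson(x1, x2)
vif = vif_two_predictors(x1, x2)
```

A VIF close to 1 indicates little collinearity; the cutoff above which it is considered a problem is a rule of thumb and varies between sources.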
Classification concept
- Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.
- Classification is the task of assigning objects to one of several predefined categories.
- Classification is a form of data analysis that extracts models describing important data classes.
- Such models, called classifiers, predict categorical (discrete, unordered) class labels.
- Classification has numerous applications, including fraud detection, customer churn prediction, loan approval prediction, and sales prediction.
Classification Algorithms
Several Classification Algorithms in Machine Learning
- Logistic Regression.
- Naive Bayes Classifier.
- K-Nearest Neighbors.
- Decision Tree.
- Random Forest.
- Support Vector Machines.
- Etc
Decision Tree Concept
Decision tree learning is one of the predictive modeling approaches used in statistics, data mining, and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves).
Decision Tree Modelling
Decision Tree Advantages & Disadvantages
CART
Here is the approach for most decision tree algorithms at their simplest. The tree will be constructed in a top-down approach as follows:
- Start at the root node with all training instances
- Select an attribute on the basis of a splitting criterion (Gini impurity in CART, Gain Ratio, or another impurity metric)
- Partition instances according to selected attribute recursively
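The top-down procedure above can be sketched as a small recursive function. This is a simplified illustration rather than a full CART implementation: it uses Gini impurity, handles only numeric features with "less than or equal" thresholds, and stops at a hypothetical depth limit. The toy data (age, owns_home) and all function names are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature, threshold, weighted Gini) of the best binary split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [lab for r, lab in zip(rows, labels) if r[f] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Start with all instances, pick a split, and partition recursively."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority  # leaf node
    split = best_split(rows, labels)
    if split is None:
        return majority  # no split separates the instances
    f, t, _ = split
    left = [(r, lab) for r, lab in zip(rows, labels) if r[f] <= t]
    right = [(r, lab) for r, lab in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [lab for _, lab in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [lab for _, lab in right],
                            depth + 1, max_depth),
    }


def predict(tree, row):
    """Follow the branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical training data: [age, owns_home] -> buys product
rows = [[25, 0], [30, 1], [45, 0], [50, 1]]
labels = ["no", "no", "yes", "yes"]
tree = build_tree(rows, labels)
```

On this toy data the tree needs a single split on age, after which both child nodes are pure.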
Train-Test Split
In a dataset, the training set is used to build a model, while the test (or validation) set is used to validate the model that was built. Data points in the training set are excluded from the test (validation) set.
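A minimal sketch of such a split, assuming a random shuffle with a fixed seed and a hypothetical 80/20 ratio. The function name `split_train_test` is made up for this example.

```python
import random


def split_train_test(data, test_ratio=0.2, seed=0):
    """Shuffle the indices, then hold out a fraction of the data for testing."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(data) * test_ratio))
    test = [data[i] for i in indices[:n_test]]
    train = [data[i] for i in indices[n_test:]]
    return train, test


data = list(range(10))  # stand-in for ten data points
train, test = split_train_test(data)
```

Because every index lands in exactly one of the two lists, no training point can leak into the test set.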
Overfitting
Overfitting is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably”. An overfitted model is a statistical model that contains more parameters than can be justified by the data.
K-Fold Cross-Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter, k, that refers to the number of groups the data sample is split into. Each group is held out once as the test set while the model is trained on the remaining k - 1 groups, and the k scores are then summarized, typically by averaging.
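The splitting step can be sketched as follows. The function `k_fold_indices` is a hypothetical helper that only produces index lists; it does not shuffle, so in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=3)
```

Each fold's test indices would be used to score a model trained on that fold's training indices, and the k scores averaged.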
Ensemble Learning
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.
Bagging
Bagging, also known as bootstrap aggregating, is the process in which multiple models of the same learning algorithm are trained with bootstrapped samples of the original dataset. Then, as in the random forest described later, a vote is taken on all of the models’ outputs.
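The two ingredients, bootstrap sampling and voting, can be sketched as below. The base learner here is a deliberately simple 1-nearest-neighbor rule on one-dimensional data, chosen only to keep the example short; the data and function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) points from data with replacement."""
    return [rng.choice(data) for _ in data]


def nearest_neighbor_model(sample):
    """Base learner: predict the label of the closest training point."""
    def predict(x):
        return min(sample, key=lambda point: abs(point[0] - x))[1]
    return predict


def majority_vote(predictions):
    """Return the most common prediction."""
    return Counter(predictions).most_common(1)[0][0]


def bagging_fit(data, n_models=25, seed=0):
    """Train one base learner per bootstrapped sample."""
    rng = random.Random(seed)
    return [
        nearest_neighbor_model(bootstrap_sample(data, rng))
        for _ in range(n_models)
    ]


def bagging_predict(models, x):
    """Aggregate the individual predictions by voting."""
    return majority_vote([model(x) for model in models])


# Hypothetical labeled points: (feature value, class label)
data = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
models = bagging_fit(data)
```

For regression, the vote would be replaced by an average of the individual predictions.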
Boosting
Boosting is an ensemble technique, related to bagging, in which the individual models are built sequentially, each one iterating on the previous one. Specifically, any data points that are misclassified by the previous model are emphasized in the following model. This is done to improve the overall accuracy of the ensemble.
Random Forest
Random forests are an ensemble learning technique that builds off of decision trees. Random forests involve creating multiple decision trees using bootstrapped datasets of the original data. The model then selects the mode (the majority) of all of the predictions of each decision tree. What’s the point of this? By relying on a “majority wins” model, it reduces the risk of error from an individual tree.
Advantages and Disadvantages of Random Forest
Imbalanced Dataset
Imbalanced datasets are a special case for classification problems where the class distribution is not uniform among the classes. Typically, they are composed of two classes: the majority (negative) class and the minority (positive) class.
Undersampling — Oversampling
A widely adopted and perhaps the most straightforward method for dealing with highly imbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling).
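Both directions of resampling can be sketched with random selection. The functions below are hypothetical helpers: oversampling duplicates randomly chosen minority rows until all classes match the largest class, and undersampling keeps a random subset of each class the size of the smallest class.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, lab in zip(rows, labels) if lab == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, lab in zip(rows, labels) if lab == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical imbalanced data: 8 negative rows, 2 positive rows
rows = [[i] for i in range(10)]
labels = ["neg"] * 8 + ["pos"] * 2
over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
```

Resampling should be applied to the training set only, so that the test set keeps the original class distribution.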
SMOTE
Synthetic Minority Oversampling Technique (SMOTE) is a statistical technique for increasing the number of minority-class cases in your dataset in a balanced way. It works by generating new synthetic instances from the existing minority cases. SMOTE does not change the number of majority cases.
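The core idea of SMOTE is interpolation: pick a minority point, pick one of its k nearest minority neighbors, and create a new point somewhere on the line segment between them. The sketch below shows only that idea, with hypothetical data and parameter choices; it is not a replacement for a tested library implementation.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # k nearest minority neighbors of the base point (squared Euclidean)
        neighbors = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbor))
        )
    return synthetic


# Hypothetical minority-class points (majority points are left untouched)
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote(minority, n_new=5)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region the minority class already occupies.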
Unsupervised Learning
Unsupervised learning (UL) is a type of algorithm that learns patterns from untagged data. The hope is that, through mimicry, the machine is forced to build a compact internal representation of its world. In contrast to supervised learning (SL), where data is tagged by a human (e.g. as “car” or “fish”), UL exhibits self-organization that captures patterns as neuronal predilections or probability densities.
Some of the most common algorithms used in unsupervised learning include: (1) Clustering, (2) Anomaly detection, (3) Neural Networks, and (4) Approaches for learning latent variable models.
Two Basic Unsupervised Learning Techniques
Clustering
The goal is to find homogeneous subgroups within the data; the grouping is based on distance between observations.
Examples: k-means clustering and hierarchical clustering.
Dimensionality Reduction
The goal is to identify patterns in the features of the data. This is often used to facilitate visualization of the data, as well as a pre-processing method prior to supervised learning.
Examples: PCA.
K-Means Clustering
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or cluster centroid), which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
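The usual iterative procedure (often called Lloyd's algorithm) alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its points. A minimal sketch, assuming the initial centroids are supplied by the caller and using hypothetical two-dimensional data:

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centroids, n_iter=10):
    """Return (centroids, clusters) after at most n_iter update rounds."""
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(
                range(len(centroids)),
                key=lambda i: sq_dist(p, centroids[i]),
            )
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            mean_point(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data with two visible groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, centroids=[(1, 1), (9, 8)])
```

The result depends on the initial centroids, which is why practical implementations usually run several random initializations and keep the best one.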
Principal Component Analysis (PCA)
Principal component analysis (PCA) is a statistical method for investigating the dominant modes of variation in a dataset. The data are represented in an orthonormal basis formed by the eigenvectors of the covariance matrix, ordered by the amount of variance each one explains. PCA represents the data in the most parsimonious way, in the sense that, for a fixed number of components, the principal components explain more variation than any other orthonormal basis of the same size. Functional principal component analysis (FPCA) extends the same idea to functional data: a random function is represented in the eigenbasis of the Hilbert space L² that consists of the eigenfunctions of the autocovariance operator. FPCA can be applied for representing random functions, or in functional regression and classification.
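For ordinary PCA on tabular data, the first principal component is the eigenvector of the covariance matrix with the largest eigenvalue. The sketch below finds it with power iteration in plain Python; this is only one of several ways to compute it, and the function name and data are hypothetical.

```python
from math import sqrt


def first_principal_component(rows, n_iter=100):
    """Return (unit direction, explained variance) of the first component."""
    n = len(rows)
    d = len(rows[0])
    # Center the data.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumes the start vector is not orthogonal to the leading eigenvector.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return v, variance


# Hypothetical data lying exactly on the line y = 2x
rows = [(1, 2), (2, 4), (3, 6), (4, 8)]
direction, variance = first_principal_component(rows)
```

Here all of the variation lies along the direction (1, 2), so a single component describes the data completely.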
Numerical Representation for Words
Count Vector: count the occurrences of each word in every document or sentence.
Problems with Count Vector
- Non-unique resulting vectors (see cat and hat).
- What happens when you have many documents? The vector size grows linearly with the vocabulary.
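A count vector can be sketched with a shared vocabulary and a word counter. The two sentences below are my own illustration of one way non-unique vectors can arise (word order is discarded); they are not taken from the original figure, and the helper names are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of all distinct lowercase words across the documents."""
    return sorted({w for doc in documents for w in doc.lower().split()})


def count_vector(document, vocabulary):
    """Number of occurrences of each vocabulary word in the document."""
    counts = Counter(document.lower().split())
    return [counts[w] for w in vocabulary]


# Illustrative sentences that differ only in word order
docs = ["the cat sat on the hat", "the hat sat on the cat"]
vocab = build_vocabulary(docs)
vectors = [count_vector(d, vocab) for d in docs]
```

Both sentences map to the same vector, and every new distinct word in the corpus adds one more dimension.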
TF-IDF Vector
tf–idf, TF*IDF, or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
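Several tf-idf variants exist. The sketch below assumes one simple form: term frequency is the word count divided by the document length, and inverse document frequency is log(N / df), where N is the number of documents and df the number of documents containing the word. The documents and the function name are hypothetical.

```python
from math import log


def tf_idf(documents):
    """Return one {word: weight} dictionary per document."""
    tokenized = [d.lower().split() for d in documents]
    n_docs = len(tokenized)
    vocab = {w for doc in tokenized for w in doc}
    # Document frequency: in how many documents does each word appear?
    df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}
    weights = []
    for doc in tokenized:
        weights.append({
            w: (doc.count(w) / len(doc)) * log(n_docs / df[w])
            for w in set(doc)
        })
    return weights


docs = ["the cat sat", "the dog barked"]
weights = tf_idf(docs)
```

A word such as "the" that appears in every document gets a weight of zero, while words specific to one document get a positive weight.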
Word Embedding
Word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning.
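Closeness in the vector space is commonly measured with cosine similarity. The two-dimensional vectors below are hand-made toy values chosen purely for illustration; real embeddings are learned from text and usually have hundreds of dimensions.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not learned from any corpus
embeddings = {
    "king": [0.8, 0.6],
    "queen": [0.6, 0.8],
    "apple": [-0.6, 0.8],
}

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])
```

With these toy values, "king" is much closer to "queen" than to "apple", which is the kind of relationship a trained embedding is expected to capture.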
Word2Vec (CBOW and Skip-gram)
Word2vec is a technique for natural language processing. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.
ML Lifecycle
ML Deployment