# Recap of Intro to Machine Learning Training

Basic machine learning concepts are useful for any analyst! I followed a popular Intro to ML course on Udemy and compiled my high-level notes from the training below.

# A few basic definitions

- **Artificial Intelligence** — The idea that a computer can complete tasks historically thought to be doable only by a human, such as speech recognition
- **Machine Learning** — A subset of Artificial Intelligence; a method by which computers build data models by “learning”
- **Mean** — Expected value, or average
- **Variance** — Measures how far a set of numbers is spread out from its average value
- **Covariance** — A measure of the joint variability of two random variables
- **Confusion Matrix** — A table in which predictions are represented in columns and actual status is represented by rows
- **Accuracy** — Number of correctly predicted data points out of all data points
- **Precision** — Number of true positives divided by the number of elements *labeled* as belonging to the positive class (includes false positives)
- **Recall** — Number of true positives divided by the total number of elements that *actually* belong to the positive class (includes false negatives)
- **Overfitting** — Model corresponds too closely to the training dataset and doesn’t fit new data points well; model has too many parameters
- **Underfitting** — Model cannot capture the underlying data well; model has too few parameters
- **Bias (model error)** — Error from erroneous assumptions in the learning algorithm
- **Variance (model error)** — Error from sensitivity to small changes in the training set
- **Bias-variance trade-off** — It is generally impossible to both capture the relationships within the training dataset closely (low bias) and generalize well to unseen data (low variance)
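To make the accuracy/precision/recall definitions concrete, here is a minimal sketch with a made-up confusion matrix (all numbers are illustrative):

```python
import numpy as np

# Confusion matrix layout as defined above: rows = actual, columns = predicted
#                 predicted 0   predicted 1
# actual 0            TN            FP
# actual 1            FN            TP
cm = np.array([[50, 10],
               [5, 35]])
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()   # correct predictions over all predictions
precision = tp / (tp + fp)        # of everything labeled positive, how much was right
recall = tp / (tp + fn)           # of everything actually positive, how much was found

print(round(accuracy, 2), round(precision, 2), round(recall, 2))
```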

**The main types of learning**

- **Supervised learning** — We have a dataset with inputs and their correct outputs
- **Unsupervised learning** — We have a dataset with inputs but no outputs
- **Reinforcement learning** — We don’t have a dataset; an agent learns from rewards and penalties it receives while interacting with an environment

**Linear Regression**

- Supervised
- **Simple linear regression** — A model with a single explanatory variable
- **Multiple linear regression** — A model with several explanatory variables
- Needs a labeled dataset
- Uses MSE (mean squared error) to optimize the model
- Uses R² to measure how well the model fits the data
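As a quick sketch of these ideas, an ordinary least-squares fit on made-up data that roughly follows y = 2x + 1, with MSE and R² computed by hand:

```python
import numpy as np

# Hypothetical data: y ≈ 2x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Fit y = a*x + b by ordinary least squares
a, b = np.polyfit(x, y, 1)
pred = a * x + b

mse = np.mean((y - pred) ** 2)          # mean squared error (the quantity minimized)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                # coefficient of determination

print(round(a, 2), round(b, 2), round(r2, 3))
```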

**Logistic Regression**

- Supervised
- Usually we use this method for binary classification (e.g. Y/N, health/sick, etc.)
- Assigns probabilities to given outcomes, so the output is a probability that the given input belongs to a certain class
- Uses MLE to optimize model
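A minimal from-scratch sketch of the idea: a sigmoid turns a weighted sum into a probability, and gradient ascent on the log-likelihood (MLE) fits the weights. The 1-D data here is made up for illustration:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-D binary data: class 1 tends to have larger x
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

w, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):                  # gradient ascent on the log-likelihood
    p = sigmoid(w * x + b)
    w += lr * np.mean((y - p) * x)     # dL/dw
    b += lr * np.mean(y - p)           # dL/db

# The output is the probability that an input belongs to class 1
print(sigmoid(w * 2.0 + b), sigmoid(w * -2.0 + b))
```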

**Cross Validation**

- A way to observe how well our model will work with new data. We fit the model on the “training” dataset, then run it on the “testing” dataset

**K-fold Cross Validation** — Helps to avoid underfitting and overfitting

- Split the data into *k* folds (e.g. *k* = 10) and run *k* separate learning experiments: *k*−1 folds for training and 1 fold for testing. Average the results of the *k* experiments
- Advantage — All observations are used for both training and validation, and each observation is used for validation exactly once
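The splitting scheme above can be sketched with plain index arithmetic (no model involved, just the fold bookkeeping):

```python
import numpy as np

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; each fold serves as the test set once."""
    idx = np.arange(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# Every observation appears in exactly one test fold
counts = np.zeros(10, dtype=int)
for train, test in k_fold_indices(10, 5):
    counts[test] += 1
print(counts.tolist())
```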

**Naive Bayes Classifier**

- Supervised
- A probabilistic machine learning model that’s used for classification tasks
- It is able to make good predictions even when the training data is relatively small
- An assumption is that every pair of features is independent
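The independence assumption means each feature can be modeled separately per class. A tiny from-scratch Gaussian variant (class and data are illustrative, not from the course):

```python
import numpy as np

class TinyGaussianNB:
    """Each feature is modeled as an independent Gaussian within each class."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = {c: X[y == c].mean(axis=0) for c in self.classes}
        self.var = {c: X[y == c].var(axis=0) + 1e-9 for c in self.classes}
        self.prior = {c: np.mean(y == c) for c in self.classes}
        return self

    def predict(self, X):
        def log_post(c):
            # log P(c) + sum over features of log N(x | mu, var)
            return (np.log(self.prior[c])
                    - 0.5 * np.sum(np.log(2 * np.pi * self.var[c])
                                   + (X - self.mu[c]) ** 2 / self.var[c], axis=1))
        scores = np.stack([log_post(c) for c in self.classes], axis=1)
        return self.classes[np.argmax(scores, axis=1)]

# Two tiny, well-separated classes
X = np.array([[1.0, 1.2], [0.9, 1.1], [3.0, 3.2], [3.1, 2.9]])
y = np.array([0, 0, 1, 1])
model = TinyGaussianNB().fit(X, y)
preds = model.predict(np.array([[1.0, 1.0], [3.0, 3.0]]))
```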

**Decision Trees**

- A type of supervised learning approach — mostly used in classification problems but it can be used for regression as well
- Works for both categorical and continuous inputs
- **Root node** — Represents the entire dataset/population, which then gets divided into several subsets; corresponds to the best predictor
- **Decision node** — A node that the algorithm splits into sub-nodes based on a given feature in the dataset
- **Leaf node** — A node with no children; holds an output value
- **Splitting** — Increasing the size of the tree by splitting a node into two or more sub-nodes
- **Pruning** — Reducing the size of the tree by removing nodes
- Decision tree classifier accuracy depends heavily on how you split the data; there are different algorithms (e.g. Gini impurity, information gain) to decide how to best split it
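One common splitting criterion is Gini impurity; a split is good when it reduces the weighted impurity of the children. A small sketch on made-up labels:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(labels, left_mask):
    """Impurity decrease from splitting labels into left/right groups."""
    left, right = labels[left_mask], labels[~left_mask]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

y = np.array([0, 0, 0, 1, 1, 1])
# A split that separates the classes perfectly vs. one that tells us nothing
perfect = split_gain(y, np.array([True, True, True, False, False, False]))
useless = split_gain(y, np.array([True, False, False, True, False, False]))
print(round(perfect, 2), round(useless, 2))
```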

**Advantages of decision trees**

- Easy to understand and interpret
- It is one of the best approaches to identify the most significant variables and the relationship between the variables
- No need for data preprocessing, e.g. decision trees are not influenced by outliers, no need for dummy variables, etc.
- Can use numerical variables as well as categorical variables

**Disadvantages of decision trees**

- Tend to overfit, but this can be mitigated by pruning
- Decision trees can be unstable because small variations in data might result in a completely different tree

**Random Forest Classifier**

- Supervised
- A set of decision trees from a randomly selected subset of the training set
- Better than bagging — this algorithm decorrelates the individual decision trees that are constructed
- Similar to bagging — we keep constructing decision trees on the training data but on every split in the tree, a random selection of features/predictors is chosen from the full feature set
- If one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the decision trees — so the trees become correlated
- Huge advantage — at some point the variance stops decreasing no matter how many more trees we add to our random forest, and adding more trees does not cause overfitting
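A simplified sketch of the bootstrap-plus-random-features idea, using depth-1 "stumps" instead of full trees and drawing the feature subset once per tree (a real random forest re-samples the subset at every split). All data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_stump(X, y, feats):
    """Best single-feature threshold rule among a random subset of features."""
    best = (0.0, 0, 0.0, False)
    for f in feats:
        for t in X[:, f]:
            pred = (X[:, f] > t).astype(int)
            for flipped in (False, True):
                p = 1 - pred if flipped else pred
                acc = np.mean(p == y)
                if acc > best[0]:
                    best = (acc, f, t, flipped)
    return best[1:]

def stump_predict(stump, X):
    f, t, flipped = stump
    pred = (X[:, f] > t).astype(int)
    return 1 - pred if flipped else pred

def random_forest(X, y, n_trees=25):
    n, d = X.shape
    m = max(1, int(np.sqrt(d)))             # sqrt(d) features considered per tree
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)    # bootstrap sample (with replacement)
        feats = rng.choice(d, size=m, replace=False)
        forest.append(fit_stump(X[idx], y[idx], feats))
    return forest

def forest_predict(forest, X):
    votes = np.stack([stump_predict(s, X) for s in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)   # majority vote

# Hypothetical data: the label depends only on feature 0
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)
forest = random_forest(X, y)
accuracy = np.mean(forest_predict(forest, X) == y)
```

Even though roughly half of the stumps never see the informative feature, the majority vote still recovers the signal.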

**Boosting**

- The decision trees are grown sequentially, so each tree is grown using information from previously grown trees; each new tree tries to correct the errors of the ensemble so far
- These trees are not independent

# Bagging (Bootstrap Aggregation)

- A weak learner is just a bit better than a random guess, but combining weak learners can be an extremely powerful classifier
- We can reduce the variance by averaging a set of observations
- We do not have several training sets, but we can take repeated samples from the single data set and construct trees and average all the predictions. Each tree is constructed independently
- Problem with pruning — variance decreases but some bias is introduced; with bagging we can reduce the variance without extra bias
- Problem with bagging — The constructed trees are highly correlated since they will tend to share similar splits
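The variance-reduction claim can be seen with a toy experiment: a single bootstrap estimate varies more than an average of several bootstrap estimates (the "learner" here is just a mean, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A single dataset of 200 observations
data = rng.normal(loc=5.0, scale=2.0, size=200)

def bootstrap_means(data, B):
    """B estimates, each from a resample of the data with replacement."""
    n = len(data)
    return np.array([rng.choice(data, size=n, replace=True).mean() for _ in range(B)])

single = bootstrap_means(data, 500)                            # individual learners
bagged = np.array([bootstrap_means(data, 25).mean() for _ in range(500)])  # averages of 25

print(single.std(), bagged.std())   # the averaged ("bagged") estimates vary less
```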

**Boosting vs Bagging**

- Both combine N learners and yield more stable models

**Boosting**

- The samples are weighted, so some of them will occur more often in a new dataset
- Reduces bias but is prone to overfitting
- Often preferred when predictive performance is the priority

**Bagging**

- Every item has the same probability of appearing in a new dataset
- Reduces variance and helps against overfitting

**Principal component analysis**

- Unsupervised
- Reduces the dimensionality of datasets by finding linear combinations of features/variables that are mutually uncorrelated
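A minimal sketch of PCA via the eigendecomposition of the covariance matrix, on synthetic 2-D data where nearly all the variance lies along one direction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated data: the second feature is ~2x the first, plus small noise
x = rng.normal(size=500)
X = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=500)])

Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)           # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

# Project onto the top principal component (last eigenvector)
pc1 = Xc @ eigvecs[:, -1]
explained = eigvals[-1] / eigvals.sum()  # share of variance the first component keeps
print(round(explained, 4))
```

The projections onto different principal components are mutually uncorrelated, which is the defining property mentioned above.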

**K-Means Clustering**

- Unsupervised
- Automatically divides the data into clusters of similar items and does this without having been told what the groups should look like ahead of time
- K-means clustering aims to partition *n* observations into *k* clusters in which each observation belongs to the cluster with the nearest mean
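The assign-then-recompute loop (Lloyd's algorithm) can be sketched in a few lines; the two blobs below are synthetic and well separated:

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=50):
    """Basic Lloyd's algorithm: assign each point to the nearest mean, recompute means."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)          # nearest-mean assignment
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j]    # keep old center if a cluster empties
                            for j in range(k)])
    return labels, centers

# Two hypothetical well-separated blobs of 50 points each
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
labels, centers = kmeans(X, k=2)
```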

**Finding the *k* parameter**

- Sometimes we know how many clusters we want to construct. If we don’t, a rule of thumb is *k* ≈ √(*n*/2), where *n* is the number of elements in the dataset

**Advantages of K-means clustering**

- Relies on simple principles to identify clusters
- Flexible
- Efficient

**Disadvantages of K-means clustering**

- Not so sophisticated
- Because it uses an element of random chance, it is not guaranteed to find the optimal set of clusters
- We have to specify *k*

**DBSCAN clustering (Density Based Spatial Clustering of Applications with Noise)**

- Unsupervised
- Density-based, so given a set of points in some space, it groups together points that are closely packed together
- Outperforms k-means clustering in some settings because it is faster and we don’t have to define *k* ahead of time
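A compact from-scratch sketch of the density idea: core points (with at least `min_pts` neighbors within `eps`) seed clusters that grow through density-reachable points, and everything unreachable stays noise. The data is synthetic:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: label -1 means noise, other labels are cluster ids."""
    n = len(X)
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]   # dense ("core") points
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster            # start a new cluster from an unvisited core point
        frontier = list(neighbors[i])
        while frontier:                # expand through density-reachable points
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if core[j]:
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, size=(20, 2)),
               rng.normal(4.0, 0.2, size=(20, 2)),
               [[10.0, 10.0]]])        # one far-away outlier
labels = dbscan(X, eps=0.8, min_pts=4)
```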

**Advantages of DBSCAN**

- Finds non-linearly separable clusters (arbitrarily shaped clusters)
- We do not have to specify the number of clusters (*k*) we want to find
- Very robust to outliers and noise
- The result does not depend on the starting conditions

**Disadvantages of DBSCAN**

- Not entirely deterministic
- Border points that are reachable from more than one cluster can be part of either cluster depending on the order the data is processed
- Relies heavily on a distance measure — Euclidean-measure. In higher dimensions, it is very hard to find a good value for epsilon
- If the data and scale are not well understood, then choosing a meaningful distance threshold epsilon can be difficult
- Cannot handle varying densities

**Hierarchical Clustering**

- We build a tree-like structure out of the data which “contains” a clustering for every possible value of *k*
- In data mining, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters
- Can take a “bottom-up” (agglomerative) approach so that each observation starts in its own cluster and merges with others as it moves up the hierarchy, or a “top-down” (divisive) approach so that each observation belongs to one cluster and splits as it moves down the hierarchy
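The bottom-up (agglomerative) approach can be sketched directly: start with singleton clusters and repeatedly merge the closest pair. Single linkage (minimum pointwise distance) is used here; the four points are illustrative:

```python
import numpy as np

def agglomerative(X, k):
    """Bottom-up single-linkage clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    while len(clusters) > k:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between closest members of the two clusters
                d = dists[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)   # merge the two closest clusters
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
result = sorted(sorted(c) for c in agglomerative(X, 2))
print(result)
```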

**Neural Networks**

Computers can solve many problems, but some tasks, such as facial recognition, cannot be defined with an exact mathematical algorithm even though they are easy for humans

- Can be supervised or unsupervised
- Inspired by biological neural networks
- We represent each neuron with a node; it is basically a directed/undirected graph
- Each edge has a weight
- These neural networks are capable of learning by changing the weights of their connections
- Neurons only fire when input is larger than a given threshold. Firing doesn’t get bigger as the stimulus increases, it either fires or doesn’t fire

**Artificial Neural Networks (ANN)** — The general family of neural network models; deep learning models are a subset of ANNs

**Deep Neural Networks (DNN)** — Has multiple hidden layers; used for regression and classification

**Convolutional Neural Networks (CNN)** — For computer vision; can process pixel data, used for images, e.g. self-driving cars

**Recurrent Neural Networks (RNN)** — Connections can form cycles, so data can flow in any direction; used to process sequences of data, e.g. stock market series or language modeling

**Input layer** — Provides information from the outside world to the network; no computation is performed

**Hidden layer** — Performs calculations and transfers information to the output layer; there can be zero or multiple hidden layers

**Output layer** — Produces the result to the outside world

**Example in humans** — Input: see a handwritten character; Output: recognize the character

**Model**

- Each neuron receives inputs along weighted edges
- It sums up the weighted inputs
- An activation function applied to that sum gives the output
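The weighted-sum-plus-activation model above fits in a few lines. This sketch uses a hard threshold activation and hand-picked weights that happen to implement a logical AND (the weights are illustrative):

```python
import numpy as np

def step(z):
    """Threshold activation: fire (1) only if the input exceeds 0."""
    return (z > 0).astype(int)

def neuron(x, w, b):
    # Weighted sum of the inputs, then the activation function
    return step(np.dot(w, x) + b)

# Hypothetical weights implementing AND of two binary inputs
w = np.array([1.0, 1.0])
b = -1.5
inputs = [np.array(p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print([int(neuron(x, w, b)) for x in inputs])  # → [0, 0, 0, 1]
```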

**Perceptrons **— Neurons in the neural network

- A small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip. That flip may then cause the behavior of the rest of the network to completely change in some very complicated way

**Sigmoid neuron** — Very similar to perceptrons

- But a small change in the edge weights causes only a small change in the output, i.e. no flips
- The output is a smooth value between 0 and 1 instead of a hard 0 or 1

**Feedforward neural network**

- Information goes from the input layer to the output layer in one direction, so there are no “cycles” or “loops” back and forth between layers
- Every node is connected to every node in the next layer

**The learning algorithm**

- Initialize the edge weights at random
- Calculate the error — we have some training data and some results
- Calculate the changes of the edge weights and update the weights. This is the “backpropagation” process
- The algorithm terminates when the network error is small

**Backpropagation**

- An algorithm that changes the edge weight for feedforward neural networks
- **Learning rate** — Defines how fast the algorithm learns. If it’s too high, it converges fast but may be inaccurate because it can miss the global optimum. If it’s too low, the algorithm is slower but more accurate
- **Momentum** — Defines how much we rely on the previous change; simply adds a fraction of the previous weight update to the current one. High momentum helps to increase the speed of convergence, but it can overshoot the minimum. Low momentum cannot avoid local optima and slows down training
- We update the edge weights starting from the output layer and work backwards
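The learning-rate and momentum update rules can be seen on a toy 1-D loss instead of a full network (this is only the weight-update step, not backpropagation itself):

```python
# Gradient descent on the toy loss f(w) = (w - 3)^2, whose minimum is at w = 3
def train(lr, momentum, steps):
    w, v = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)            # dL/dw
        v = momentum * v - lr * grad      # carry a fraction of the previous update
        w += v                            # apply the update
    return w

slow = train(lr=0.01, momentum=0.0, steps=50)   # plain gradient descent
fast = train(lr=0.01, momentum=0.9, steps=50)   # same rate, with momentum
print(slow, fast)
```

With this small learning rate, the momentum run ends much closer to the minimum after the same number of steps, illustrating the speed-up described above.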