Stories by Anuja Nagpal on Medium

Principal Component Analysis- Intro

Anuja Nagpal — Tue, 21 Nov 2017 17:53:31 GMT

Variable Reduction Technique

Too many variables? Should you be using all possible variables to build a model?

To handle the “curse of dimensionality” and avoid issues like over-fitting in high-dimensional space, methods like Principal Component Analysis (PCA) are used.

PCA is a method used to reduce the number of variables in your data by extracting the important ones from a large pool. It reduces the dimension of your data while aiming to retain as much information as possible. In other words, this method combines highly correlated variables into a smaller set of artificial variables, called “principal components”, that account for most of the variance in the data.

Let’s dive in to understand how PCA is implemented behind the scenes.

Start by normalizing the predictors by subtracting the mean of each predictor from its data points. If the original predictors are on different scales, also divide each one by its standard deviation, because otherwise the predictor with the largest scale dominates the variance. The result will look like table 2, where each predictor has a mean of zero.

Normalized Data

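Next, calculate the covariance matrix for the data, which measures how each pair of predictors moves together. Covariance is always measured between two predictors, so if you have 3-dimensional data (x, x1, x2), measure the covariance between x and x1, x and x2, and x1 and x2. For reference, the sample covariance of two predictors x and y is cov(x, y) = Σ (x_i − x̄)(y_i − ȳ) / (n − 1), where x̄ and ȳ are the means of x and y and n is the number of observations.

In our case, the covariance matrix would look like this: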

Covariance Matrix

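Now, calculate the eigenvalues and eigenvectors of the above matrix. This helps in finding underlying patterns in the data: each eigenvector gives the direction of a principal component, and its eigenvalue gives the variance of the data along that direction. In our case they would be approximately: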

Eigen Value and Vector

We are almost there :). Perform reorientation. To express the data on the new axes, multiply the normalized data by the eigenvectors, which give the directions of the new axes. Note that you can choose to leave out the eigenvector with the smaller eigenvalue or use both. To decide how many components to keep, a common rule of thumb is to keep the smallest set that accounts for 95% or more of the variance.
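To make the steps above concrete, here is a minimal sketch in plain Python for 2-dimensional data. The function names (`pca_2d`, `components_for`) and the sample points are my own placeholders, not the data from the tables in this post, and the closed-form eigenvalue shortcut only works for a 2x2 matrix. In practice you would use a library routine.

```python
import math

def pca_2d(points):
    """PCA for 2-D data: center, covariance, eigen-decomposition, projection."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] (divides by n - 1).
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    eigenvalues = [half_trace + spread, half_trace - spread]

    if abs(sxy) > 1e-12:
        # (sxy, lam - sxx) solves (C - lam * I) v = 0; scale it to unit length.
        vectors = []
        for lam in eigenvalues:
            vx, vy = sxy, lam - sxx
            norm = math.hypot(vx, vy)
            vectors.append((vx / norm, vy / norm))
    else:
        # Already uncorrelated: the original axes are the principal axes.
        vectors = [(1.0, 0.0), (0.0, 1.0)] if sxx >= syy else [(0.0, 1.0), (1.0, 0.0)]

    # Reorientation: project the centered data onto each eigenvector.
    scores = [tuple(x * vx + y * vy for vx, vy in vectors) for x, y in centered]
    # Share of total variance per component (assumes the data is not constant).
    explained = [lam / sum(eigenvalues) for lam in eigenvalues]
    return eigenvalues, vectors, scores, explained

def components_for(explained, target=0.95):
    """Smallest number of leading components whose variance share reaches target."""
    total = 0.0
    for count, share in enumerate(explained, start=1):
        total += share
        if total >= target:
            return count
    return len(explained)

# Hypothetical sample data with two correlated predictors.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
eigenvalues, vectors, scores, explained = pca_2d(points)
print("eigenvalues:", eigenvalues)
print("variance explained:", explained)
print("components to keep:", components_for(explained))
```

The first column of `scores` is the data expressed along the first principal component. If it alone reaches the 95% threshold, you can drop the second column.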

Finally, the scores calculated in the above step can be plotted and fed into the predictive model. The plots give us a sense of how close or highly correlated two variables are. Plotting the original data on the X and Y axes doesn’t tell us much about how points are related to each other, so instead we plot the transformed data (using the eigenvectors), which brings out patterns and shows the relationships between points.

End Note: It is easy to confuse PCA with Factor Analysis, but there is a conceptual difference between these two methods. I will go into the details of Factor Analysis and how it differs from PCA in my next post. Stay tuned.

Principal Component Analysis- Intro was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Machine Learning Quick Reference Card

Anuja Nagpal — Fri, 03 Nov 2017 16:48:17 GMT

Supervised and Unsupervised Machine Learning

In my last few posts, I discussed a few algorithms that I use frequently at work. However, the most important question is when to use these algorithms. There is no one right way of doing things. You try different algorithms and see which one works best in your case by comparing models.

However, having a quick reference card helps narrow down the options and make the decision. That also means that this reference card makes some generalizations and simplifications, but it points you in the right direction.

You can download this reference card here.

Hope you enjoyed this post. Stay tuned for upcoming posts to learn more about machine learning basics.

Happy Learning!


Clustering — Unsupervised Learning

Anuja Nagpal — Thu, 02 Nov 2017 21:18:11 GMT

Machine Learning

What is Clustering?

“Clustering” is the process of grouping similar entities together. The goal of this unsupervised machine learning technique is to find similarities among data points and group similar data points together.

Why use Clustering?

Grouping similar entities together helps profile the attributes of different groups. In other words, it gives us insight into the underlying patterns of different groups. There are many applications of grouping unlabeled data. For example, you can identify different groups or segments of customers and market to each group in a different way to maximize revenue. Another example is grouping together documents that belong to similar topics.
Clustering is also used to reduce the dimensionality of the data when you are dealing with a copious number of variables.

How do Clustering algorithms work?

There are many algorithms developed to implement this technique, but for this post, let’s stick to the two most popular and widely used algorithms in machine learning:

1. K-means Clustering

2. Hierarchical Clustering

K-means Clustering

1. It starts with K as the input, which is how many clusters you want to find. Place K centroids at random locations in your space.
2. Using the Euclidean distance between data points and centroids, assign each data point to the cluster whose centroid is closest to it.
3. Recalculate each cluster center as the mean of the data points assigned to it.
4. Repeat steps 2 and 3 until no further changes occur.
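The loop above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function name `k_means`, the toy points, and the choice to pick K random data points as the starting centroids (one common way to "place centroids at random") are not from a specific library.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    # Step 1: use k randomly chosen data points as starting centroids (assumption).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop when no assignment changed since the last pass.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```

Because the starting centroids are random, different seeds can give different cluster labels (and, on harder data, different clusters), which is why K-means is usually run several times.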

Now, you might be wondering how to decide the value of K in the first step.

One method, called the “Elbow” method, can be used to decide on an optimal number of clusters. Here you would run K-means clustering on a range of K values and plot the “percentage of variance explained” on the Y-axis and “K” on the X-axis.

In the picture below, you will notice that adding more clusters after 3 doesn't model the data much better. The first few clusters add a lot of information, but at some point the marginal gain starts dropping. That point, the “elbow” of the curve, is the K to pick.

Elbow Method
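If you want to compute the numbers behind such a plot yourself, here is a rough sketch. To keep it short it clusters 1-dimensional values, and it starts the centroids at evenly spaced positions in the sorted data instead of at random, so the result is repeatable. Both choices, the function names, and the toy values are my own assumptions. "Variance explained" is taken as 1 minus the within-cluster sum of squares divided by the total sum of squares.

```python
def k_means_1d(values, k, max_iter=100):
    """Tiny 1-D k-means with deterministic, evenly spaced starting centroids."""
    ordered = sorted(values)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = [list(values)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return centroids, clusters

def within_cluster_ss(clusters):
    """Sum of squared distances from each value to its own cluster mean."""
    total = 0.0
    for cluster in clusters:
        if cluster:
            center = sum(cluster) / len(cluster)
            total += sum((v - center) ** 2 for v in cluster)
    return total

def variance_explained(values, max_k):
    """Map each K in 1..max_k to the fraction of variance explained."""
    total_ss = within_cluster_ss([values])
    return {k: 1.0 - within_cluster_ss(k_means_1d(values, k)[1]) / total_ss
            for k in range(1, max_k + 1)}

# Hypothetical data with three obvious groups.
values = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1, 9.0, 9.1, 8.8, 9.2]
explained = variance_explained(values, max_k=6)
for k, share in explained.items():
    print(k, round(share, 4))
```

On data like this the curve rises steeply up to K = 3 and is nearly flat afterwards, which is the elbow.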

Hierarchical Clustering

Unlike K-means clustering, hierarchical clustering starts by assigning every data point to its own cluster. As the name suggests, it builds a hierarchy: in each step, it finds the two nearest clusters and merges them into one.

1. Assign each data point to its own cluster.

2. Find the closest pair of clusters using Euclidean distance and merge them into a single cluster.

3. Recalculate the distances between clusters and keep merging the two nearest ones until all items are in a single cluster.
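Here is a small sketch of these three steps in plain Python. The steps above don't say how to measure the distance between two clusters that hold more than one point, so I am assuming single linkage (the distance between their two closest members). The function name `agglomerative`, the `target_clusters` parameter for stopping early, and the sample points are placeholders of mine.

```python
import math

def agglomerative(points, target_clusters=1):
    """Bottom-up clustering; returns the final clusters and the merge history."""
    clusters = [[p] for p in points]  # step 1: every point is its own cluster
    merges = []
    while len(clusters) > target_clusters:
        # Step 2: find the closest pair of clusters (single linkage, assumption).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, clusters[i], clusters[j]))
        # Step 3: merge the pair and repeat with the remaining clusters.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical sample points.
points = [(1.0, 1.0), (1.5, 1.5), (5.0, 5.0), (5.5, 5.2), (9.0, 9.0)]
clusters, merges = agglomerative(points, target_clusters=3)
print(clusters)

# Running down to a single cluster gives the full hierarchy; the merge
# distances are the heights you would see in a dendrogram.
_, full_history = agglomerative(points)
for distance, left, right in full_history:
    print(round(distance, 3), left, right)
```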

In this technique, you can decide the optimal number of clusters from the dendrogram: find the longest vertical stretch that a horizontal line can cut without crossing a merge point. The number of vertical lines that the horizontal line cuts is the number of clusters.

Dendrogram

Things to remember when using clustering algorithms:

• Standardize variables so that all are on the same scale. This is important when calculating distances.
• Treat data for outliers before forming clusters, as they can distort the distances between data points.
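As a quick illustration of the first point, here is one way to standardize a variable (z-scores) using only Python's standard library. The variable names and numbers are made up for the example.

```python
import statistics

def standardize(column):
    """Rescale a list of numbers to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(column)
    spread = statistics.stdev(column)  # sample standard deviation
    return [(value - mean) / spread for value in column]

# Hypothetical variables on very different scales.
incomes = [32000.0, 45000.0, 58000.0, 120000.0]
ages = [23.0, 35.0, 41.0, 62.0]

scaled_incomes = standardize(incomes)
scaled_ages = standardize(ages)
print(scaled_incomes)
print(scaled_ages)
```

Without this step, a Euclidean distance between two customers would be driven almost entirely by income, simply because its numbers are bigger.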



Decision Tree Ensembles- Bagging and Boosting

Anuja Nagpal — Tue, 17 Oct 2017 20:35:36 GMT

Random Forest and Gradient Boosting

We all use the decision tree technique on a daily basis to plan our lives; we just don’t give a fancy name to that decision-making process.

Businesses use supervised machine learning techniques like decision trees to make better decisions and more profit. Decision trees have been around for a long time and are known to suffer from bias and variance: you will have a large bias with simple trees and a large variance with complex trees.

Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree. The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner.

Let’s talk about a couple of techniques for building ensembles of decision trees:

1. Bagging

2. Boosting

Bagging (Bootstrap Aggregation) is used when our goal is to reduce the variance of a decision tree. The idea is to create several subsets of data from the training sample, chosen randomly with replacement. Each subset is then used to train its own decision tree. As a result, we end up with an ensemble of different models. The average of the predictions from all the trees is used, which is more robust than a single decision tree.
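To see the mechanics, here is a minimal bagging sketch for regression. To keep it short, each "tree" is only a one-split stump, which is my simplification (real bagging usually grows deeper trees), and all names (`fit_stump`, `bagging`, `bagged_predict`) and the toy data are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree: returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # candidate splits: x < threshold
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def bagging(xs, ys, n_trees=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def bagged_predict(stumps, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)

# Hypothetical data: a noisy step from about 1 to about 5.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1.1, 0.9, 1.0, 1.2, 0.8, 5.1, 4.9, 5.0, 5.2, 4.8]
model = bagging(xs, ys)
print(bagged_predict(model, 1.0), bagged_predict(model, 8.0))
```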

Random Forest is an extension of bagging. It takes one extra step: in addition to taking a random subset of the data, it also takes a random selection of features rather than using all features to grow the trees. When you have many random trees, it’s called a Random Forest 😊

Let’s look at the steps taken to implement Random forest:

1. Suppose there are N observations and M features in the training data set. First, a sample of the training data set is taken randomly with replacement.

2. A subset of the M features is selected randomly, and whichever feature gives the best split is used to split the node. This is repeated at each node.

3. The tree is grown to its largest possible size.

4. The above steps are repeated, and the prediction is based on the aggregation of predictions from n trees.
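The four steps can be sketched as a toy classifier in plain Python. This is an illustration only: the function names, the Gini impurity split criterion, majority voting as the aggregation, and the sample data are all my assumptions, and a real implementation has many more safeguards.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def grow_tree(rows, labels, n_features, rng):
    """Grow an unpruned tree; each split looks at a random subset of features."""
    if len(set(labels)) == 1:
        return labels[0]  # pure leaf
    features = rng.sample(range(len(rows[0])), n_features)  # step 2
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows})[1:]:
            left = [i for i, r in enumerate(rows) if r[f] < threshold]
            right = [i for i, r in enumerate(rows) if r[f] >= threshold]
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, threshold, left, right)
    if best is None:  # chosen features cannot split these rows: majority leaf
        return Counter(labels).most_common(1)[0][0]
    _, f, threshold, left, right = best
    # Step 3: keep splitting until the leaves are pure.
    return (f, threshold,
            grow_tree([rows[i] for i in left], [labels[i] for i in left],
                      n_features, rng),
            grow_tree([rows[i] for i in right], [labels[i] for i in right],
                      n_features, rng))

def predict_tree(tree, row):
    while isinstance(tree, tuple):  # internal nodes are tuples, leaves are labels
        f, threshold, left, right = tree
        tree = left if row[f] < threshold else right
    return tree

def random_forest(rows, labels, n_trees=15, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):  # step 4: repeat for every tree
        idx = [rng.randrange(len(rows)) for _ in rows]  # step 1: bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], n_features, rng))
    return forest

def predict_forest(forest, row):
    """Aggregate the trees by majority vote."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two features, two classes.
rows = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2),
        (6.0, 7.0), (6.5, 6.8), (7.0, 7.5), (6.2, 7.2)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
forest = random_forest(rows, labels)
print(predict_forest(forest, (0.5, 1.0)), predict_forest(forest, (7.5, 8.0)))
```

For regression you would average the tree outputs instead of taking a vote.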

Advantages of using Random Forest technique:

Handles higher dimensionality data very well.
Handles missing values and maintains accuracy for missing data.

Disadvantages of using Random Forest technique:

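Since the final prediction is the mean of the predictions from the individual trees, it won’t give precise continuous values for a regression model. For example, it can never predict a value outside the range of targets seen in training.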

Boosting is another ensemble technique to create a collection of predictors. In this technique, learners are trained sequentially, with early learners fitting simple models to the data and then analyzing the data for errors. In other words, we fit consecutive trees (each on a random sample), and at every step the goal is to reduce the net error from the prior tree.

When an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. Combining the whole set at the end converts weak learners into a better-performing model.
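One well-known way to do this re-weighting is the AdaBoost update. The paragraph above doesn't name a specific algorithm, so treat the sketch below as one possible instance. The function name `reweight` and the example weights are mine, and the formula assumes the weighted error is strictly between 0 and 1.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style round: raise weights of misclassified points.

    weights -- current weights, summing to 1
    correct -- True where the current learner classified the point correctly
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)  # the learner's vote strength
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha

# Five points with equal weight; the learner got the third one wrong.
weights = [0.2, 0.2, 0.2, 0.2, 0.2]
correct = [True, True, False, True, True]
new_weights, alpha = reweight(weights, correct)
print(new_weights, alpha)
```

After the update the misclassified point carries much more weight than before, so the next learner pays more attention to it. The `alpha` values are what you would use to weight each learner's vote when combining the whole set at the end.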

Gradient Boosting is an extension of the boosting method.

Gradient Boosting = Gradient Descent + Boosting.

It uses the gradient descent algorithm, which can optimize any differentiable loss function. An ensemble of trees is built one by one, and the individual trees are summed sequentially. Each new tree tries to recover the loss (for squared error, the difference between actual and predicted values).
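Here is a bare-bones sketch of that idea for squared-error loss, where the negative gradient is simply the residual (actual minus predicted). As before, each tree is only a one-split stump to keep the code short. The `learning_rate` parameter, the function names, and the data are my own assumptions for illustration.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree: returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # candidate splits: x < threshold
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: predict the mean everywhere
        mean = sum(targets) / len(targets)
        return (xs[0], mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # start from a constant prediction
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Residuals = negative gradient of 0.5 * (y - prediction) ** 2.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps

def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

# Hypothetical data: a curve that a single stump cannot fit.
xs = [float(x) for x in range(10)]
ys = [x * x for x in xs]
model = gradient_boost(xs, ys)
mse_base = mean_squared_error(ys, [model[0]] * len(ys))
mse_boosted = mean_squared_error(ys, [boosted_predict(model, x) for x in xs])
print(mse_base, mse_boosted)
```

The number of rounds and the learning rate are exactly the kind of hyper-parameters that need tuning: too many rounds on noisy data and the ensemble starts fitting the noise.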

Advantages of using Gradient Boosting technique:

Supports different loss functions.
Works well with interactions.

Disadvantages of using Gradient Boosting technique:

Prone to over-fitting.
Requires careful tuning of different hyper-parameters.


L1 and L2 Regularization Methods

Anuja Nagpal — Fri, 13 Oct 2017 16:08:37 GMT

Machine Learning

In my last post, I covered the introduction to Regularization in supervised learning models. In this post, let’s go over some of the widely used regularization techniques and the key difference between them.

To create a less complex (parsimonious) model when you have a large number of features in your dataset, some of the regularization techniques used to address over-fitting and feature selection are:

1. L1 Regularization

2. L2 Regularization

A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression.

The key difference between these two is the penalty term.

Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function. Here the highlighted part represents the L2 regularization element.

Cost function
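For reference, the ridge cost function is commonly written as below. The notation is my own choice: n observations, p features, coefficients β_j, and regularization strength λ. The second sum is the L2 penalty.

```latex
\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\,\beta_j \Big)^2 \;+\; \lambda \sum_{j=1}^{p} \beta_j^2
```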

Here, if lambda is zero, we get back OLS (ordinary least squares). However, if lambda is very large, the penalty carries too much weight and leads to under-fitting. So how lambda is chosen matters. Chosen well, this technique works very well to avoid over-fitting.

Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function.

Cost function
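For reference, the lasso cost function is commonly written as below. The notation is my own choice: n observations, p features, coefficients β_j, and regularization strength λ. The second sum is the L1 penalty.

```latex
\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\,\beta_j \Big)^2 \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```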

Again, if lambda is zero, we get back OLS, whereas a very large value will make the coefficients zero, so the model will under-fit.

The key difference between these techniques is that Lasso shrinks the coefficients of less important features all the way to zero, thus removing some features altogether. So, it works well for feature selection when we have a huge number of features.
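You can see this difference with a single feature and no intercept, where both penalized problems have a closed-form answer. This is a simplified sketch under my own assumptions: the data is already centered, the loss is the plain sum of squared errors plus the penalty, and the function names and numbers are made up.

```python
def soft_threshold(value, amount):
    """Shrink value toward zero by amount, stopping at exactly zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def ridge_coefficient(xs, ys, lam):
    """Minimizes sum((y - b*x)^2) + lam * b^2 for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_coefficient(xs, ys, lam):
    """Minimizes sum((y - b*x)^2) + lam * |b| for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam / 2) / sxx

# Hypothetical centered data with a slope close to 1.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-1.9, -1.1, 0.1, 0.9, 2.0]

for lam in [0.0, 5.0, 25.0]:
    print(lam, ridge_coefficient(xs, ys, lam), lasso_coefficient(xs, ys, lam))
```

With lambda at zero, both give the OLS slope. As lambda grows, the ridge coefficient keeps shrinking but never reaches zero, while the lasso coefficient becomes exactly zero once lambda is large enough.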

Traditional methods like cross-validation and stepwise regression, used to handle over-fitting and perform feature selection, work well with a small set of features, but regularization techniques are a great alternative when we are dealing with a large set of features.


Over-fitting and Regularization

Anuja Nagpal — Wed, 11 Oct 2017 17:42:43 GMT

Machine Learning

In supervised machine learning, models are trained on a subset of the data, known as training data. The goal is to learn to predict the target of each example from its features, so that the model also performs well on data it has not seen.

Now, overfitting happens when the model learns the noise as well as the signal in the training data, and so doesn’t perform well on new data it wasn’t trained on. In the example below, you can see underfitting in the first few steps and overfitting in the last few.

There are a few ways you can avoid overfitting your model on training data, like cross-validation, sampling, reducing the number of features, pruning, regularization, etc.

Regularization basically adds a penalty as model complexity increases. The regularization parameter (lambda) penalizes all the parameters except the intercept, so that the model generalizes from the data and doesn’t overfit.

Regularization in cost function
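One common way to write such a regularized cost function for linear regression is shown below. The notation is an assumption on my part: m training examples, n features, hypothesis h_θ, and parameters θ_j. Note that the penalty sum starts at j = 1, so the intercept θ_0 is not penalized.

```latex
J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]
```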

In the above gif, as the complexity increases, regularization adds a penalty for the higher-order terms. This decreases the importance given to those terms and pulls the model towards a less complex equation.

Stay tuned for the next post, which will cover the different types of regularization techniques.
