Machine Learning Techniques Primer

ML techniques to get started on any problem

Vimarsh Karbhari
Acing AI
May 26, 2020


ML is a vast discipline, but it always helps to have the basic algorithms at hand. These basic ML techniques are the starting point of most analyses in the data science world.

Similarly, for any scenario-based problem in an interview, it is an easy mistake to start with a complex ML algorithm. Most interviewees jump to whatever technique the problem superficially resembles; they may start with neural networks or a combination of different ML algorithms. Always start with linear or logistic regression if possible. This gives you the most basic benchmark performance for the solution. Approach the question like a programming interview, where you start with a benchmark and proceed to a more optimized solution. This ML techniques primer is a list of techniques and algorithms to help start the analysis of a problem in a real-life project or an interview.

Photo by Ian Noble on Unsplash

Supervised learning

Regression and classification are the two main subcategories of supervised learning. Both are predictive methods, but regression produces a numerical output, while classification predicts the category that a new observation falls into. The output is often binary, but you can also build models for more than two categories; this variation is known as multi-class classification.

  • Linear regression: With linear regression, you can predict an output variable using one or more input variables. The simplest form of the regression equation with one dependent and one independent variable is defined by the formula y = c + b*x, where y = estimated dependent variable score, c = constant, b = regression coefficient, and x = score on the independent variable.
  • Logistic regression: Despite the name, logistic regression is a classification algorithm; more specifically, it performs a class probability estimation task. The linear equation produces the log-odds (a number that ranges from -∞ to +∞) of a new observation being a member of a particular class, and the logistic (sigmoid) function converts those log-odds into a probability between 0 and 1. Linear regression is used for continuous targets, while logistic regression is used for binary targets: the sigmoid curve squeezes the model's output into the range 0 to 1, and a threshold (commonly 0.5) turns that probability into a class label.
  • Support vector machine (SVM): An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. The idea of SVM is simple: the algorithm creates a line or a hyperplane that separates the data into classes.
  • Decision tree: Decision trees are a transparent way to separate observations and place them into subgroups. A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements. Classification and regression trees (CART) is a well-known version of a decision tree that can be used for classification or regression. The number of partitions is typically chosen by the algorithm rather than by hand, with the aim of preventing underfitting or overfitting the model. CART is useful in situations where “black box” algorithms may be frowned upon because they cannot be explained and interested parties need to see the entire process behind a decision.
  • Random forest: Simply put, a random forest is a group of decision trees that all have the same response variable, but are each trained on a slightly different random sample of the observations and predictor variables. For classification, the output of a random forest model is calculated by taking a “vote” of the predicted classification for each tree and having the forest output the majority opinion; for regression, the trees' predictions are averaged.
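
To make the first two bullets concrete, here is a minimal sketch in plain Python. It fits y = c + b*x by ordinary least squares and then shows how the logistic (sigmoid) function turns log-odds into a probability. The function names, the 0.5 threshold, and the toy data are assumptions made up for illustration, and the logistic part uses coefficients that are already known rather than fitting them.

```python
import math


def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates of c and b in y = c + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope b = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    c = mean_y - b * mean_x
    return c, b


def sigmoid(log_odds):
    """Map log-odds in (-inf, inf) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def predict_class(x, c, b, threshold=0.5):
    """Logistic-regression-style prediction with already-known coefficients."""
    probability = sigmoid(c + b * x)
    return (1 if probability >= threshold else 0), probability


# Made-up data that lies exactly on y = 1 + 2x
c, b = fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(c, b)  # 1.0 2.0

# Log-odds of 0 correspond to a probability of 0.5
print(sigmoid(0))  # 0.5
```

In practice you would likely reach for a library implementation; the point here is only that both baselines are small enough to reason about end to end.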

Unsupervised methods

  • Clustering: Clustering refers to machine learning techniques that are used to find natural groupings among observations in a dataset. Clustering is also known as unsupervised segmentation.
  • K-means clustering: K-means clustering is a machine learning algorithm that forms groups of observations around geometric centers called centroids. The “k” refers to the number of clusters, which is determined by the individual conducting the analysis based on domain knowledge. This type of clustering is often used in marketing and market research as an approach to uncover similarity among customers or to uncover a previously unknown segment.
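
The k-means idea above can be sketched in a few lines of plain Python. This is one common formulation (often called Lloyd's algorithm), not the only way to do it; the function name, the random initialization from existing observations, the iteration cap, and the toy points are all assumptions for illustration.

```python
import math
import random


def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct observations
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels


# Made-up 2-D observations forming two visibly separate groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Note that k is passed in by the caller, which mirrors the point above: the analyst, not the algorithm, decides how many clusters to look for.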

Dimensionality reduction

Dimensionality reduction algorithms reduce the number of variables in a data set by grouping similar or correlated attributes.

  • Principal Component Analysis (PCA): PCA is a dimension-reduction technique that reduces the number of variables in a dataset by combining highly correlated variables into a smaller set of new, uncorrelated variables called principal components. Because PCA is sensitive to scale, the variables are usually standardized first. Its purpose is to distill the dataset down to a new set of variables that can still explain most of its variability.
  • K-Nearest Neighbor (KNN): Although it is listed here, KNN is a supervised method rather than a dimensionality reduction technique. Nearest-neighbor reasoning can be used for classification or regression, depending on the target variable. It compares the distance (often Euclidean) between a new observation and those already in a dataset. The “k” is the number of neighbors to compare and is usually tuned, for example with cross-validation, to minimize the chance of overfitting or underfitting the data. In a classification scenario, the new observation is assigned to the class held by the majority of its k nearest neighbors. For this reason, k is often an odd number to prevent ties in binary problems. For a regression model, the average of the target attribute of the neighbors predicts the value for the new observation.
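
The nearest-neighbor reasoning in the KNN bullet is short enough to sketch directly. The version below is a brute-force illustration in plain Python with Euclidean distance; the function names and the toy data are assumptions, and a real project would usually rely on a library implementation.

```python
import math
from collections import Counter


def nearest_neighbors(train, query, k):
    """Return the k (features, target) pairs closest to query by Euclidean distance."""
    return sorted(train, key=lambda row: math.dist(row[0], query))[:k]


def knn_classify(train, query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    votes = Counter(target for _, target in nearest_neighbors(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Regression: average of the k nearest neighbors' numeric targets."""
    neighbors = nearest_neighbors(train, query, k)
    return sum(target for _, target in neighbors) / len(neighbors)


# Made-up labeled observations: (features, class)
labeled = [((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
           ((8.0, 8.0), "b"), ((9.0, 8.5), "b")]
print(knn_classify(labeled, (1.2, 1.4), k=3))  # a

# Made-up numeric targets: (features, value)
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
print(knn_regress(numeric, (2.1,), k=3))  # 20.0
```

Using an odd k, as in the example, avoids tied votes when there are only two classes.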

Recommendations

It is important to choose the right technique for the problem at hand. This primer provides a list of techniques, but it is by no means exhaustive. The goal is to offer a starting point, rather than a be-all and end-all list, for an interview problem or a real-life data science problem.
