Acing AI

Acing AI provides analysis of AI companies and ways to venture into them.

What are the most important machine learning algorithms?

5 min read · May 19, 2020


Most companies want to test both knowledge of ML algorithms and their applications in some part of their interviews. Here are the five most commonly used and most frequently asked-about ML algorithms.

In addition to the algorithms, we have tried to provide the libraries that support them, for easy use in an interview or application context, as well as some sample datasets and project ideas that could serve as building blocks to showcase expertise.

Linear regression

Linear regression is a method in which you predict an output variable using one or more input variables. In its simplest form it is represented as a line: y = bx + c. Linear regression is used for continuous targets, while logistic regression is used for binary targets: the sigmoid curve in the logistic model squashes the output into a probability between 0 and 1.

Relevant Python libraries: Scikit-learn, Matplotlib, Pandas, NumPy, PyTorch

Relevant Project Dataset: The Boston Housing Dataset is one of the most commonly used resources for learning to model using linear regression. Using this dataset, one can predict the median value of a home in the Boston area based on different attributes, including crime rate per town, student/teacher ratio per town, the number of rooms in the house and nitric oxides concentration (parts per 10 million).
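As a quick illustration, here is a minimal scikit-learn sketch of fitting a linear regression. Since the Boston dataset has been removed from recent scikit-learn releases, synthetic data with a single "rooms"-like feature stands in for it:

```python
# Minimal linear regression sketch (synthetic stand-in for a housing dataset).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(3, 9, size=(500, 1))              # e.g. average rooms per home
y = 5.0 * X[:, 0] - 10.0 + rng.normal(0, 2, 500)  # median value plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print(f"fitted line: y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}")
print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")
```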

Decision trees

Decision trees are a transparent way to separate observations and place them into subgroups. A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

Classification and regression trees (CART) is a well-known version of a decision tree that can be used for classification or regression. The number of partitions is typically tuned (for example, by limiting tree depth) to prevent underfitting or overfitting the model. CART is useful in situations where “black box” algorithms may be frowned upon for their lack of explainability, because interested parties need to see the entire process behind a decision.

Relevant Python libraries: Scikit-learn, Pandas
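As a quick illustration, here is a minimal CART-style classification sketch with scikit-learn. The iris dataset is just a stand-in, and max_depth is the main knob for balancing under- and overfitting:

```python
# Minimal CART-style decision tree sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

# max_depth limits the number of partitions to avoid overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(f"test accuracy: {tree.score(X_test, y_test):.3f}")

# The fitted tree is fully transparent -- every split can be printed.
print(export_text(tree, feature_names=data.feature_names))
```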

K-means clustering

K-means clustering is a method that forms groups of observations around geometric centers called centroids. The “k” refers to the number of clusters, which is determined by the team conducting the analysis. Clustering is often used as a market-segmentation approach to uncover similarities among customers or to discover an entirely new segment altogether. Scikit-learn’s tutorials are a good starting point for developing clustering models in Python.


Relevant Python libraries: Pandas, NumPy, Scikit-learn, Matplotlib

Relevant Projects: This algorithm is commonly used in marketing to uncover new segments and develop ways to target them based on their shared characteristics.
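To make this concrete, here is a minimal k-means sketch with scikit-learn, where synthetic “blobs” stand in for customer features:

```python
# Minimal k-means sketch; synthetic blobs stand in for customer data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# k (n_clusters) is chosen by the analyst, e.g. via the elbow method.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("centroids:\n", kmeans.cluster_centers_)
print("first 10 cluster assignments:", kmeans.labels_[:10])
```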

K-nearest neighbors (k-NN)

Nearest-neighbor reasoning can be used for classification or prediction depending on the variables involved. It compares the distance (often Euclidean) between a new observation and those already in the dataset. The “k” is the number of neighbors to compare against and is usually chosen via cross-validation to minimize the chance of overfitting or underfitting the data.

In a classification scenario, the class that holds the majority among the k nearest neighbors determines the class of the new observation. For this reason, k is often an odd number, to prevent ties. In a prediction (regression) scenario, the average of the neighbors’ target attribute becomes the predicted value for the new observation.


Relevant Python libraries: NumPy, Scikit-learn

Relevant tutorials: Developing KNN models through Scikit-learn’s documentation
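As a quick sketch, here is k-NN classification in scikit-learn, with k selected by cross-validation (scikit-learn’s default Minkowski metric with p=2 is the Euclidean distance):

```python
# Minimal k-NN sketch; k is picked by cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in (1, 3, 5, 7):  # odd values of k help prevent ties
    knn = KNeighborsClassifier(n_neighbors=k)  # default metric is Euclidean
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"k={k}: mean cross-validation accuracy = {score:.3f}")
```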

Principal component analysis (PCA)

PCA is a dimension-reduction technique used to reduce the number of variables in a dataset by combining variables that are highly correlated (and measured on the same scale) into a smaller set of components. Its purpose is to distill the dataset down to a new set of variables that can still explain most of its variability.

A common application of PCA is aiding in the interpretation of surveys that have a large number of questions or attributes. For example, global surveys about culture, behavior or well-being are often broken down into principal components that are easy to explain in a final report. Scikit-learn’s tutorials provide basic instruction on how to do PCA in Python.

Relevant Python libraries: NumPy, Scikit-learn, Keras

Relevant dataset: In the Oxford Internet Survey, researchers found that their 14 survey questions could be distilled down to four independent factors.
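As a quick illustration, here is a minimal PCA sketch with scikit-learn; the iris dataset stands in for survey responses, and standardizing first matters when variables are on different scales:

```python
# Minimal PCA sketch: reduce 4 features to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # put variables on the same scale

pca = PCA(n_components=2).fit(X_std)
print("explained variance ratio:", pca.explained_variance_ratio_)

X_reduced = pca.transform(X_std)  # 4 original variables -> 2 components
print("reduced shape:", X_reduced.shape)
```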

How do I choose the models for my application?

The model you choose for machine learning depends greatly on the question you are trying to answer in the interview or with your application. Important factors to consider include the type of data being analyzed (categorical, numerical, or a mixture of both).

Companies usually apply one of the above algorithms to conduct complex analysis (predicting, forecasting, finding patterns, classifying, etc.), automate workflows, or perform preliminary analysis.

References:

K-means clustering video: YouTube

Tutorials: Scikit-learn’s tutorials

Blog: Algorithmia

Subscribe to our Acing AI newsletter. I promise not to spam, and it’s FREE!

Thanks for reading! 😊 If you enjoyed it, test how many times you can hit 👏 in 5 seconds. It’s great cardio for your fingers AND will help other people see the story.

Written by Vimarsh Karbhari
Engineering Manager | Founder of Acing AI