What are the most important machine learning algorithms?
ML algorithms that are useful in any data science interview.
Most companies test both the knowledge and the application of ML algorithms at some point in their interviews. Here are the five most commonly used and most frequently asked-about ML algorithms.
In addition to the algorithms, we list Python libraries that support them for easy use in an interview or application context, along with sample datasets and project ideas that can serve as building blocks to showcase expertise.
Linear regression
Linear regression is a method in which you predict an output variable using one or more input variables. This is represented in the form of a line: y = bx + c. Linear regression is used for continuous targets, while logistic regression is used for binary targets: the sigmoid curve in the logistic model squashes the predicted output to a probability between 0 and 1, which is then thresholded into a class.
Relevant Python libraries: Scikit-learn, Matplotlib, Pandas, NumPy, PyTorch
Relevant Project Dataset: The Boston Housing Dataset is one of the most commonly used resources for learning to model using linear regression. Using this dataset, one can predict the median value of a home in the Boston area based on different attributes, including crime rate per town, student/teacher ratio per town, the number of rooms in the house and nitric oxides concentration (parts per 10 million).
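Note that the Boston Housing dataset was removed from scikit-learn (as of version 1.2) over ethical concerns, so a minimal sketch of the workflow is shown here on synthetic housing-style data instead; the feature names and coefficients are illustrative assumptions, not real estimates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a housing-style task:
# price depends linearly on the number of rooms, plus noise.
rng = np.random.default_rng(0)
rooms = rng.uniform(3, 9, size=(200, 1))
price = 50 + 25 * rooms[:, 0] + rng.normal(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    rooms, price, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

print(model.coef_[0], model.intercept_)  # recovered slope b and intercept c from y = bx + c
print(model.score(X_test, y_test))       # R^2 on held-out data
```

The same `fit`/`predict`/`score` pattern carries over directly to a real dataset such as scikit-learn's California Housing.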
Decision trees
Decision trees are a transparent way to separate observations and place them into subgroups. A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
Classification and regression trees (CART) is a well-known version of a decision tree that can be used for classification or regression. The algorithm chooses the split points automatically, while parameters such as maximum depth or pruning are tuned to prevent underfitting or overfitting the model. CART is useful in situations where “black box” algorithms may be frowned upon due to their lack of explainability, because interested parties need to see the entire process behind a decision.
Relevant Python libraries: Scikit-learn, Pandas
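A minimal sketch of the transparency point above, using scikit-learn's bundled Iris dataset: a shallow tree is fit (depth capped to limit overfitting) and its full decision path is printed as plain if/else rules.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Cap the depth so the tree stays small and readable.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Print the entire tree as human-readable rules.
rules = export_text(
    clf,
    feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"],
)
print(rules)
```

Every prediction can be traced by walking the printed rules from the root, which is exactly the property that makes trees attractive when decisions must be explained.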
K-means clustering
K-means clustering is a method that forms groups of observations around geometric centers called centroids. The “k” refers to the number of clusters, which is determined by the team conducting the analysis. Clustering is often used as a market segmentation approach to uncover similarity among customers or uncover an entirely new segment altogether. Scikit-learn’s tutorials help to learn more about developing clustering models in Python.
Relevant Python libraries: Pandas, NumPy, Scikit-learn, Matplotlib
Relevant Projects: This algorithm is commonly used in marketing to uncover new segments and develop ways to target them based on their shared characteristics.
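As a sketch of the segmentation idea, the example below clusters synthetic two-feature "customer" data (the features and cluster count are illustrative assumptions) with scikit-learn's KMeans.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for customer data, e.g. (annual spend, visit frequency).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k = 3 is chosen by the analyst, as described above.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
km.fit(X)

print(km.cluster_centers_)   # the geometric centers (centroids) of each segment
print(km.labels_[:10])       # segment assignment for the first 10 "customers"
```

In practice, k is often picked by comparing the inertia (within-cluster sum of squares) across several candidate values, the so-called elbow method.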
K-nearest neighbors (k-NN)
Nearest-neighbor reasoning can be used for classification or prediction depending on the variables involved. It is a comparison of distance (often Euclidean) between a new observation and those already in a dataset. The “k” is the number of neighbors to compare and is usually chosen by the practitioner, often via cross-validation, to balance overfitting and underfitting.
In a classification scenario, the class held by the majority of the k nearest neighbors determines the class of the new observation. For this reason, k is often an odd number, to prevent ties. For a prediction (regression) model, the average of the neighbors’ target attribute predicts the value for the new observation.
Relevant Python libraries: NumPy, Scikit-learn
Relevant tutorials: Developing KNN models through Scikit-learn’s documentation
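A minimal sketch of choosing k via cross-validation, as described above, using scikit-learn's grid search on the Iris dataset (the candidate values of k are an illustrative assumption):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Search over odd values of k (odd to help prevent ties),
# scored by 5-fold cross-validation.
grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
)
grid.fit(X, y)

print(grid.best_params_)   # the value of k that balanced over- and underfitting
print(grid.best_score_)    # mean cross-validated accuracy at that k
```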
Principal component analysis (PCA)
PCA is a dimension-reduction technique used to reduce the number of variables in a dataset by combining highly correlated variables (typically standardized to the same scale first) into a smaller set of components. Its purpose is to distill the dataset down to a new set of variables that can still explain most of its variability.
A common application of PCA is aiding in the interpretation of surveys that have a large number of questions or attributes. For example, global surveys about culture, behavior or well-being are often broken down into principal components that are easy to explain in a final report. Scikit-learn’s tutorials provide basic instruction for how to do PCA in Python.
Relevant Python libraries: NumPy, Scikit-learn, Keras
Relevant dataset: In the Oxford Internet Survey, researchers found that their 14 survey questions could be distilled down to four independent factors.
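The Oxford survey itself is not publicly bundled with scikit-learn, so the sketch below simulates the same situation: 14 survey questions generated as noisy mixtures of 4 underlying factors (all numbers here are illustrative assumptions), then recovered with PCA after standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 100 simulated respondents driven by 4 latent factors.
factors = rng.normal(size=(100, 4))
# 14 question responses = noisy linear mixtures of those factors.
mixing = rng.normal(size=(4, 14))
answers = factors @ mixing + 0.1 * rng.normal(size=(100, 14))

# Standardize so each question is on the same scale, then fit PCA.
X = StandardScaler().fit_transform(answers)
pca = PCA(n_components=4)
pca.fit(X)

# Fraction of total variance the 4 components retain.
print(pca.explained_variance_ratio_.sum())
```

Because only 4 factors generated the data, 4 components capture nearly all of the variance, mirroring the 14-questions-to-4-factors finding cited above.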
How do I choose the models for my application?
The model you choose for machine learning depends greatly on the question you are trying to answer in the interview or your application. Important factors to consider include the type of data being analyzed (categorical, numerical, or a mixture of both).
Using one of the algorithms above, companies typically conduct analysis (prediction, forecasting, pattern discovery, classification, etc.) to automate workflows or to perform preliminary exploration.
References:
K-Means Clustering Video: Youtube
Tutorials: Scikit-learn’s tutorials
Blog: Algorithmia
Subscribe to our Acing AI newsletter, I promise not to spam and it’s FREE!
Thanks for reading! 😊 If you enjoyed it, test how many times you can hit 👏 in 5 seconds. It’s great cardio for your fingers AND will help other people see the story.