Common ML models

Jessic Akushey
5 min read · Jun 30, 2023


Machine learning algorithms fall into four broad categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. This article focuses on the first two: supervised and unsupervised learning algorithms.

  1. Supervised Learning Algorithms: Supervised learning involves training models on labeled data, where input features and their corresponding target values are known. These algorithms learn the mapping between input features and target values, allowing them to make predictions on unseen data. Key supervised learning algorithms include regression and classification models.
  2. Unsupervised Learning Algorithms: Unsupervised learning focuses on finding patterns and structures in unlabeled data. These algorithms aim to uncover hidden insights, discover relationships, or group similar instances without any predefined target variable. Common unsupervised learning algorithms include clustering and dimensionality reduction techniques. The sketch after this list shows how the difference between the two settings surfaces in code.
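The practical difference shows up directly in the training API: a supervised estimator is fit on features and labels, while an unsupervised one is fit on features alone. Here is a minimal sketch using scikit-learn; the library choice and the toy data are my own illustration, not something the article prescribes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Four 2-D points; the labels y exist only in the supervised setting
X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)                         # supervised: learn X -> y
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # unsupervised: structure in X alone

print(clf.predict([[1.5, 1.5]]))  # predicted class for a new point
print(km.labels_)                 # cluster assignments found without labels
```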

Supervised Learning Algorithms

  1. Linear Regression: Linear regression is a simple yet powerful algorithm for predicting a continuous numerical value from input features. It assumes a linear relationship between the input features and the target variable: the model estimates a coefficient for each feature, and a prediction is a weighted sum of the features (plus an intercept). Linear regression is widely used in tasks such as sales forecasting, stock market analysis, and housing price prediction. (Minimal scikit-learn sketches of the models in this list follow below.)
  2. Logistic Regression: Logistic regression is a popular algorithm for binary classification (and, via extensions such as one-vs-rest or softmax, multiclass problems). It models the relationship between the input features and the probability that an instance belongs to a particular class, using a logistic (sigmoid) function to map a weighted sum of the features to a value between 0 and 1. A threshold on that probability (commonly 0.5) then turns it into a class label. It is commonly used in sentiment analysis, spam detection, and disease diagnosis.
  3. Decision Trees: Decision trees are versatile algorithms that can be used for both classification and regression tasks. They build a tree-like model by making a sequence of decisions based on the input features: each internal node tests a feature, and each leaf node holds a class label or a predicted value. Decision trees handle both categorical and numerical features with little preprocessing and are easy to interpret. They are suitable for tasks such as customer segmentation, credit scoring, or predicting patient outcomes.
  4. Random Forest: Random Forest is an ensemble method that combines many decision trees to improve predictive accuracy. It reduces overfitting by aggregating the predictions of trees that are deliberately decorrelated: each tree is trained on a bootstrap sample of the data (random sampling with replacement), and only a random subset of features is considered at each split. Random Forest is effective for tasks such as fraud detection, recommendation systems, or medical diagnosis.
  5. Support Vector Machines (SVM): Support Vector Machines are powerful algorithms used for both classification and regression. They create a hyperplane or set of hyperplanes to separate instances of different classes. SVM aims to maximize the margin (distance) between the hyperplane and the nearest instances of each class. It can handle linear and nonlinear decision boundaries using kernel functions. SVM is suitable for tasks such as image classification, text categorization, or gene expression analysis.
  6. Naive Bayes: Naive Bayes is a probabilistic algorithm based on Bayes’ theorem with an assumption of feature independence given the class. It calculates the probability of an instance belonging to a particular class and assigns the class label with the highest probability. Naive Bayes is computationally efficient and performs well on text classification tasks. It is widely used in spam filtering, sentiment analysis, or document classification.
  7. Gradient Boosting: Gradient Boosting is an ensemble method that combines weak learners (usually shallow decision trees) sequentially, where each new model corrects the errors made by the models before it. Concretely, each new tree is fit to the residual errors, the negative gradient of the loss, left by the current ensemble. Gradient Boosting is known for its high predictive accuracy and is effective for tasks such as click-through rate prediction, ranking, or anomaly detection. (A sketch of both ensemble methods follows the single-model sketch below.)
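To make the single-model algorithms above concrete, here is a minimal scikit-learn sketch of linear regression, logistic regression, a decision tree, an SVM, and naive Bayes. The datasets and hyperparameters are illustrative choices on my part, not recommendations.

```python
from sklearn.datasets import load_breast_cancer, make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Regression: predict a continuous target
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
print("Linear regression R^2:", LinearRegression().fit(Xr, yr).score(Xr, yr))

# Classification: predict a binary label
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "Decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "Gaussian naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: test accuracy = {model.score(X_te, y_te):.3f}")
```

Scaling is wrapped into a pipeline for the models that are sensitive to feature magnitudes (logistic regression and the SVM); the tree and naive Bayes work on the raw features.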

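Random Forest and Gradient Boosting fit the same scikit-learn API; a minimal sketch of both, with typical illustrative parameter values, is below.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging-style ensemble: many decorrelated trees, predictions averaged/voted
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("Random forest accuracy:", rf.score(X_te, y_te))

# Boosting: trees added one at a time, each fit to the current ensemble's errors
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, random_state=0)
gb.fit(X_tr, y_tr)
print("Gradient boosting accuracy:", gb.score(X_te, y_te))
```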
Unsupervised Learning Algorithms

  1. K-means Clustering: K-means clustering is an algorithm that groups data points into k clusters based on their similarity. It minimizes the within-cluster variance: the sum of squared distances between each data point and the centroid of its assigned cluster. The algorithm iteratively assigns data points to the nearest cluster centroid and recomputes the centroids until the assignments stop changing. K-means clustering is useful for tasks such as customer segmentation, image compression, or document clustering. (A code sketch of the clustering and dimensionality reduction models in this list follows below.)
  2. Hierarchical Clustering: Hierarchical clustering builds a hierarchy of clusters by successively merging or splitting them based on their similarity. It does not require specifying the number of clusters in advance. The result is often visualized as a dendrogram, which shows the merging process and allows selecting the desired number of clusters. Hierarchical clustering is suitable for tasks such as taxonomy creation, gene expression analysis, or social network analysis.
  3. Principal Component Analysis (PCA): PCA is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation while retaining as much of the variance as possible. It does this by finding the principal components: linear combinations of the original features that capture the directions of maximum variance in the data. PCA is often used for data visualization, noise reduction, and feature extraction, and is applied in tasks such as face recognition, stock market analysis, or bioinformatics.
  4. DBSCAN: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups data points into clusters based on density. It is effective in identifying clusters of arbitrary shape and can handle noisy data. DBSCAN defines dense regions as clusters and separates sparse regions as noise. It is useful for tasks such as anomaly detection, identifying hotspots in spatial data, or image segmentation.
  5. Association Rule Learning: Association rule learning discovers interesting relationships or patterns in data. It identifies associations between items based on their co-occurrence and scores them with measures such as support, confidence, and lift. Association rules are commonly used in market basket analysis (e.g., “If a customer buys product A, they are likely to buy product B”), recommendation systems, or web clickstream analysis. (A small sketch appears after the clustering example below.)
  6. t-SNE: t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique used for visualizing high-dimensional data in two or three dimensions. It emphasizes preserving local structure, so nearby points stay nearby in the embedding, which makes it well suited to visualizing clusters; global distances in the embedding, by contrast, should be interpreted with caution. t-SNE is often applied in tasks such as visualizing word embeddings, analyzing gene expression data, or exploring high-dimensional data in general.
  7. Autoencoders: Autoencoders are neural network-based models used for unsupervised learning and dimensionality reduction. They consist of an encoder that compresses the input into a low-dimensional code and a decoder that reconstructs the input from that code. By learning to compress and reconstruct the data, autoencoders capture its underlying structure and extract meaningful features. They are effective for tasks such as anomaly detection, image denoising, or learning representations in generative models. (A short sketch follows below.)
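Here is a minimal scikit-learn sketch of the clustering and dimensionality reduction models above. The toy data, the number of clusters, and DBSCAN's eps/min_samples are illustrative values I chose for this data; in practice they are tuned per dataset.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Toy data: 300 points in 6 dimensions drawn from 3 Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, cluster_std=0.8, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # centroid-based
hc = AgglomerativeClustering(n_clusters=3).fit(X)            # bottom-up merging
db = DBSCAN(eps=2.0, min_samples=5).fit(X)                   # density-based; label -1 = noise

pca = PCA(n_components=2).fit(X)  # project onto the two directions of maximum variance
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)  # 2-D embedding

print("K-means cluster sizes:", np.bincount(km.labels_))
print("DBSCAN noise points:", int(np.sum(db.labels_ == -1)))
print("PCA explained variance ratio:", pca.explained_variance_ratio_)
print("t-SNE embedding shape:", emb.shape)
```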

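Association rule learning and autoencoders live outside scikit-learn's core estimators, so the sketches below lean on third-party libraries: mlxtend for the apriori/association-rules step and TensorFlow/Keras for the autoencoder. Both the tiny basket and the 8-unit code size are illustrative assumptions on my part.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot "market basket": rows are transactions, columns are items
basket = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 1, 1], [1, 0, 1]],
    columns=["bread", "butter", "milk"],
).astype(bool)

itemsets = apriori(basket, min_support=0.4, use_colnames=True)  # frequent itemsets
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```

```python
import numpy as np
import tensorflow as tf
from sklearn.datasets import load_digits
from sklearn.preprocessing import MinMaxScaler

# 8x8 digit images, flattened to 64 features and scaled to [0, 1]
X, _ = load_digits(return_X_y=True)
X = MinMaxScaler().fit_transform(X)

inputs = tf.keras.Input(shape=(64,))
code = tf.keras.layers.Dense(8, activation="relu")(inputs)       # encoder: 64 -> 8
outputs = tf.keras.layers.Dense(64, activation="sigmoid")(code)  # decoder: 8 -> 64
autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=64, verbose=0)       # target = the input itself

# Per-sample reconstruction error; unusually large values can flag anomalies
recon = autoencoder.predict(X, verbose=0)
print("Mean reconstruction error:", np.mean((X - recon) ** 2))
```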
These are just some of the commonly used machine learning algorithms within the supervised and unsupervised learning categories. Each algorithm has its strengths and weaknesses, and the choice of algorithm depends on the specific problem, data characteristics, and desired outcome. It is important to experiment with different algorithms and evaluate their performance to determine the most suitable approach for a given task.
