Understanding Baseline Models in Machine Learning

Importance, Strategies, and Application to Imbalanced Classes

May 17, 2023

Introduction to Baseline Models

A baseline model is a simple model that produces predictions with minimal effort. It serves as a starting point for analysis, allowing us to assess whether more complex models and additional features actually improve performance. There are two common approaches to creating a baseline model for classification.

Majority class classifier: The most frequent class in the data is predicted for all observations. For instance, if 80% of observations belong to class A and 20% to class B in a binary classification problem, the baseline model predicts class A for every instance.

Random classifier: Class labels are assigned at random according to the class distribution in the data. In the binary scenario above, each observation is labeled class A with probability 0.8 and class B with probability 0.2. The random classifier is particularly useful when no specific guidance or knowledge is available to make informed predictions.
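To make the two approaches concrete, here is a minimal sketch using NumPy only. The 1,000 labels with an 80/20 split are hypothetical and simply mirror the example above. It also computes the accuracy a random classifier can expect on such data, which is the sum of the squared class proportions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels mirroring the example: 80% class "A", 20% class "B"
y_true = np.array(["A"] * 800 + ["B"] * 200)

# Majority class classifier: always predict the most frequent label
labels, counts = np.unique(y_true, return_counts=True)
majority_label = labels[np.argmax(counts)]
majority_pred = np.full(y_true.shape, majority_label)

# Random classifier: draw each label with its observed class frequency
class_probs = counts / counts.sum()
random_pred = rng.choice(labels, size=y_true.shape[0], p=class_probs)

majority_acc = np.mean(majority_pred == y_true)
random_acc = np.mean(random_pred == y_true)

# Expected accuracy of the random classifier: sum of squared class proportions
expected_random_acc = np.sum(class_probs ** 2)

print("Majority class accuracy:", majority_acc)
print("Random classifier accuracy:", random_acc)
print("Expected random accuracy:", expected_random_acc)
```

On an 80/20 split the majority class classifier scores 0.8, while the random classifier is expected to score about 0.8² + 0.2² = 0.68, so the majority baseline is the harder one to beat on accuracy.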

Importance of Baseline Models in Machine Learning

Baseline models serve as a reference point in machine learning tasks and offer several benefits. Here are the key reasons for using baseline models:

Performance comparison: Baseline models provide a basis for comparing the performance of more advanced models. They help determine if the complexity of a model translates into improved performance. If a model fails to outperform the baseline, it suggests issues with the approach or data.
Minimum performance requirement: Baseline models set a minimum performance requirement for any useful model. If a complex model cannot surpass the baseline’s performance, it may not be worthwhile to implement it practically.
Decision-making: Baseline models aid in resource allocation, model selection, and further model improvement. If the baseline model already achieves satisfactory performance, additional time and resources may not be necessary for building more complex models.
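The comparison described above can be sketched as follows. The choice of scikit-learn's built-in breast cancer dataset and of a scaled logistic regression as the "more advanced" model are assumptions made purely for illustration; any classifier could take its place.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Candidate model (illustrative choice): scaled logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))

print("Baseline accuracy:", baseline_acc)
print("Model accuracy:", model_acc)
print("Improvement over baseline:", model_acc - baseline_acc)
```

If the improvement is close to zero or negative, that is a signal to revisit the features, the data, or the modeling approach before investing in further complexity.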

Baseline Models for Imbalanced Classes

A baseline model, such as a dummy classifier, is useful for exposing the effect of imbalanced classes because it provides a comparison point. It allows us to assess the performance of more advanced models in the context of imbalanced data.

Imbalanced classes often lead to the majority class dominating predictions, resulting in high accuracy but poor identification of the minority class. A baseline model helps establish the expected performance level using a random or simplistic approach. For example, if a new kind of class emerges in your data pipeline, it can go unnoticed because it is vastly outnumbered by the existing classes.

A baseline model shows how much accuracy can be obtained simply by predicting the majority class, and it serves as a starting point for evaluating complex models. If an advanced model fails to outperform the baseline, it suggests ineffective handling of the imbalanced class issue.

A dummy classifier with the ‘most_frequent’ strategy makes class imbalance visible in binary classification. It predicts the most frequent class for every instance, so its accuracy equals the majority class's share of the data, while the minority class is never identified.

When evaluating the baseline classifier’s performance on imbalanced classes, relying solely on accuracy can be misleading, because a high score may simply reflect the size of the majority class. We will delve deeper into this topic in the next article.
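As a quick illustration of why accuracy alone can mislead, the sketch below fits a ‘most_frequent’ dummy classifier on hypothetical labels with a 95/5 split and reports accuracy alongside minority-class recall and balanced accuracy. The data is synthetic and the features are placeholders, since DummyClassifier ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced labels: 950 negatives (0) and 50 positives (1)
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((len(y), 1))  # placeholder features; DummyClassifier ignores them

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

acc = accuracy_score(y, y_pred)
minority_recall = recall_score(y, y_pred, pos_label=1)
balanced_acc = balanced_accuracy_score(y, y_pred)

print("Accuracy:", acc)                        # 0.95
print("Minority-class recall:", minority_recall)  # 0.0
print("Balanced accuracy:", balanced_acc)      # 0.5
```

The classifier reaches 95% accuracy without finding a single minority-class instance, which is why metrics such as recall or balanced accuracy are worth reporting next to accuracy.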

Strategies in Dummy Classifier: Exploring Different Approaches

The scikit-learn library’s DummyClassifier class offers various strategies for generating predictions. These strategies are designed to create simple baseline models for comparison with more advanced models. Here are some commonly used strategies:

“stratified”: This strategy draws class labels at random according to the class distribution in the training set. Its predictions therefore follow roughly the same class distribution as the training data, making it a useful random baseline for imbalanced classes.
“most_frequent”: This strategy always predicts the most frequent class in the training set. It is suitable for imbalanced datasets where the majority class dominates the distribution. It provides a baseline performance level based on the most common class, without considering input features.
“uniform”: This strategy assigns class labels randomly and uniformly, without considering the class distribution in the training data. It is useful when there is no specific pattern or information to guide the predictions.
“constant”: This strategy always predicts a fixed class label, specified through the constant parameter. It creates a baseline model that consistently predicts a particular class, for example the minority class, which helps in evaluating the impact of class imbalance and comparing model performance against a fixed prediction.

By selecting an appropriate strategy, you can create a baseline classifier that reflects different aspects of the data, such as class distribution, majority class dominance, randomness, or a fixed prediction.
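The four strategies above can be compared side by side with a short sketch. The 90/10 labels are hypothetical, and the positive_rate dictionary is just an illustrative helper that records how often each strategy predicts class 1.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels: 900 of class 0 and 100 of class 1
y = np.array([0] * 900 + [1] * 100)
X = np.zeros((len(y), 1))  # placeholder features; DummyClassifier ignores them

# Extra keyword arguments needed by each strategy
configs = {
    "most_frequent": {},
    "stratified": {"random_state": 0},
    "uniform": {"random_state": 0},
    "constant": {"constant": 1},  # always predict the minority class
}

positive_rate = {}
for strategy, kwargs in configs.items():
    clf = DummyClassifier(strategy=strategy, **kwargs).fit(X, y)
    # Fraction of predictions that are class 1
    positive_rate[strategy] = clf.predict(X).mean()

for strategy, rate in positive_rate.items():
    print(f"{strategy:>13}: predicts class 1 for {rate:.1%} of samples")
```

With these labels, ‘most_frequent’ never predicts class 1, ‘constant’ always does, ‘stratified’ predicts it roughly 10% of the time, and ‘uniform’ roughly 50% of the time.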

A sample baseline classifier for the Breast Cancer dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a random baseline classifier that follows the training class distribution
dummy_clf = DummyClassifier(strategy='stratified', random_state=42)

# Fit the baseline classifier on the training data
dummy_clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = dummy_clf.predict(X_test)

# Calculate accuracy and other metrics
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Print the results
print("Baseline Classifier Accuracy:", accuracy)
print("Classification Report:")
print(report)

You can check the entire code on GitHub: Baseline Classifier