A Machine Learning Pipeline with Scikit-Learn

Thomas Le Montagner
14 min read · Mar 3, 2023


A complete Machine Learning Pipeline with Scikit-Learn for ML enthusiasts

Credit: LoggaWiggler from Pixabay

Table of Contents

This is quite a long and extensive post, so take your time and come back to it as many times as necessary. Here is the outline of this post so you can navigate it more easily:

I. Introduction

II. Data Preprocessing

III. Feature Engineering

IV. Model Selection and Hyperparameter Tuning

V. Model Evaluation

VI. Visualizing Results

VII. Conclusion

I. Introduction

Welcome to this tutorial on machine learning pipelines with Scikit-Learn! As someone who’s interested in data science, you know that machine learning is one of the most powerful tools available for making sense of large amounts of data. But if you’re just starting out, you might be wondering where to begin. That’s where Scikit-Learn comes in.

Scikit-Learn is a Python library that provides a simple and efficient way to build machine learning models. One of its key features is the ability to create machine learning pipelines. A machine learning pipeline is a set of processing steps that transform raw data into a final model that can be used to make predictions on new data.

In this tutorial, we’ll walk you through the steps for building an end-to-end machine learning pipeline using Scikit-Learn. We’ll cover everything from data preprocessing and feature engineering to model selection, hyperparameter tuning, and evaluation. By the end of this tutorial, you’ll have a solid understanding of how to build machine learning pipelines and how Scikit-Learn can make this process easier and more efficient. So, let’s get started!

II. Data Preprocessing

Before we dive into building a machine learning pipeline, we need to make sure our data is clean and well-prepared. This is where data preprocessing comes in.

A. Loading the data

The first step in data preprocessing is loading the data into our Python environment. Depending on the type of data you’re working with, this might involve reading in a CSV file, connecting to a database, or scraping data from the web.
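
For example, reading a CSV file is a single pandas call (a minimal sketch; data.csv is a hypothetical filename standing in for your own data):

import pandas as pd

# Hypothetical file; replace with the path to your own data
df = pd.read_csv('data.csv')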

For this tutorial, we’ll be using the iris dataset, which is included in Scikit-Learn. The iris dataset contains measurements of the sepal length, sepal width, petal length, and petal width for three different species of iris flowers (setosa, versicolor, and virginica).

To load the iris dataset into our Python environment, we can use the following code:

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

X = pd.DataFrame(data=iris.data, columns=iris.feature_names)
y = pd.DataFrame(data=iris.target, columns=['target'])

This code loads the iris dataset and splits it into the input features (stored in X) and the target variable (stored in y).

B. Splitting the data into training and testing sets

Once we’ve loaded the data, we need to split it into training and testing sets. The training set is used to train our machine learning model, while the testing set is used to evaluate its performance. This split lets us detect overfitting, which occurs when a model performs well on the training set but poorly on new data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# .ravel will convert that array shape to (n, ) (i.e. flatten it)
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

Here, we’re using a test size of 0.2, which means that 20% of the data will be used for testing, and 80% will be used for training. We’re also setting the random state to 42, which ensures that we get the same split every time we run the code.

Now that we have our data split into training and testing sets, we can move on to the remaining preprocessing steps.

C. Importance of data preprocessing

Data preprocessing is a critical step in any machine learning pipeline. The goal of data preprocessing is to prepare the data for machine learning by handling missing values, scaling the features, and transforming the data in other ways that make it easier for our machine learning algorithms to learn patterns and make accurate predictions.

Without proper data preprocessing, our machine learning model might struggle to perform well and might not generalize to new data. Here are some common data preprocessing steps:

  1. Handling missing values: When dealing with real-world data, it’s common to encounter missing values. We need to decide how to handle them: drop the affected rows, impute them with a value, or use a more sophisticated method such as K-nearest neighbors (KNN) imputation (a short imputation sketch follows this list). Fortunately, the iris dataset doesn’t have any missing values, so we don’t need to worry about this step.
  2. Scaling and normalizing features: Scaling the features ensures that they are all on the same scale, which is important for many machine learning algorithms. Common scaling techniques include standardization and min-max scaling. The sepal and petal measurements in the iris dataset are all in centimeters, but they have different ranges: sepal length runs from 4.3 to 7.9, while petal width runs from 0.1 to 2.5. Therefore, we might want to scale the features to have zero mean and unit variance using Scikit-Learn’s StandardScaler:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

  3. Encoding categorical variables: If we have categorical variables in our dataset, we need to encode them numerically before our machine learning models can use them. Common encoding techniques include one-hot encoding and label encoding. The target variable in the iris dataset is categorical, with three species: setosa, versicolor, and virginica. Note that load_iris already returns the target encoded as the integers 0, 1, and 2 (the names live in iris.target_names), so the LabelEncoder call below is a no-op on this data; it’s shown for the common case where labels arrive as strings:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

  4. Handling outliers: Outliers are extreme values that can skew our data and hurt the performance of our machine learning models. We need to decide how to handle them, either by removing them or by using a more robust model. The iris dataset doesn’t have any obvious outliers, so we don’t need to worry about this step.
  5. Feature selection: If we have a large number of features, we might want to keep only the most important ones. This reduces the dimensionality of the data and can improve the performance of our machine learning models. The iris dataset has only four features, so feature selection isn’t strictly necessary here. With more features, we could use dimensionality-reduction techniques like principal component analysis (PCA) or feature-importance rankings to keep the most informative ones.
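
Here is the promised imputation sketch. It is a minimal, hypothetical example: the tiny X_missing array below is made up for illustration, since iris itself has no missing values.

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical feature matrix with a missing entry (iris has none)
X_missing = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])

# Replace missing entries with the column mean
X_mean = SimpleImputer(strategy='mean').fit_transform(X_missing)

# Replace missing entries using the nearest neighbors' values
X_knn = KNNImputer(n_neighbors=2).fit_transform(X_missing)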

By taking care of these data preprocessing steps, we can ensure that our data is clean and well-prepared for machine learning. In the next section, we’ll dive into feature engineering, which is the process of creating new features from the existing ones.

III. Feature Engineering

A. Introduction to feature engineering

Feature engineering is the process of creating new features from the existing ones. The goal of feature engineering is to create features that are more informative and relevant to our machine learning task. By creating these new features, we can provide our machine learning models with more information that might help them make better predictions.

B. Examples of feature engineering techniques

With the iris dataset, we only have four features: sepal length, sepal width, petal length, and petal width. However, we can create new features that might be more useful for our machine learning models. Here are some examples:

  1. Petal area: We can calculate the petal area by multiplying the petal length and width. This might be a more informative feature than the individual petal length and width measurements.
  2. Sepal ratio: We can calculate the ratio of sepal length to sepal width. This might be a useful feature because it captures the shape of the sepal.
  3. Petal length ratio: We can calculate the ratio of petal length to sepal length. This might be a useful feature because it captures the relative size of the petal.

C. Implementing feature engineering using Scikit-Learn

We can implement these feature engineering techniques directly with pandas. Here’s an example of how to create the petal area feature:

X_train['petal_area'] = X_train['petal length (cm)'] * X_train['petal width (cm)']
X_test['petal_area'] = X_test['petal length (cm)'] * X_test['petal width (cm)']

Similarly, we can create the sepal ratio and petal length ratio features:

X_train['sepal_ratio'] = X_train['sepal length (cm)'] / X_train['sepal width (cm)']
X_test['sepal_ratio'] = X_test['sepal length (cm)'] / X_test['sepal width (cm)']

X_train['petal_length_ratio'] = X_train['petal length (cm)'] / X_train['petal width (cm)']
X_test['petal_length_ratio'] = X_test['petal length (cm)'] / X_test['petal width (cm)']

D. Creating a pipeline for feature engineering

We can also fold such processing steps into our machine learning pipeline. Here’s an example of a pipeline that chains scaling, PCA-based feature extraction, and a machine learning model:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Define the pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('logreg', LogisticRegression())
])

# Fit the pipeline
pipe.fit(X_train, y_train)

# Evaluate the pipeline
score = pipe.score(X_test, y_test)
print(f"Pipeline accuracy: {score}")

# Result:
# Pipeline accuracy: 0.9
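
The pipeline above relies on PCA for feature extraction rather than the hand-crafted features from section III.C. If we wanted those custom features computed inside the pipeline as well, one option (a sketch, assuming the iris column names used earlier) is to wrap the feature-creation code in a FunctionTransformer:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression

def add_ratio_features(df):
    # Work on a copy so the original DataFrame is left untouched
    df = df.copy()
    df['petal_area'] = df['petal length (cm)'] * df['petal width (cm)']
    df['sepal_ratio'] = df['sepal length (cm)'] / df['sepal width (cm)']
    return df

feat_pipe = Pipeline([
    ('features', FunctionTransformer(add_ratio_features)),
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000))
])

feat_pipe.fit(X_train, y_train)
print(f"Accuracy with engineered features: {feat_pipe.score(X_test, y_test):.3f}")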

By including feature engineering as part of our pipeline, we can automate the process of creating new features and provide our machine learning models with the most relevant information. In the next section, we’ll cover model selection, which involves choosing the best machine learning algorithm for our task.

IV. Model Selection and Hyperparameter Tuning

A. Explanation of model selection and hyperparameter tuning

Model selection is the process of choosing the best machine learning algorithm for our task. There are many algorithms available in Scikit-Learn, each with its own strengths and weaknesses. The goal of model selection is to find the algorithm that performs the best on our data.

Hyperparameter tuning is the process of choosing the best values for the hyperparameters of our machine learning algorithm. Hyperparameters are parameters that are set before training our algorithm, such as the learning rate, regularization strength, and the number of hidden layers in a neural network. Choosing the best values for these hyperparameters can have a significant impact on the performance of our algorithm.

B. Examples of model selection techniques

There are many model selection techniques available in Scikit-Learn, each with its own advantages and disadvantages. Here are some common techniques:

  1. Cross-validation: Cross-validation involves splitting the training data into several folds and evaluating our algorithm on each fold in turn. This gives us a more reliable estimate of how the algorithm will perform on new data (a one-line cross_val_score sketch follows this list).
  2. Grid search: Grid search involves creating a grid of hyperparameter values, and evaluating our algorithm on each combination of hyperparameters. This can help us find the best combination of hyperparameters for our algorithm.
  3. Random search: Random search involves randomly sampling hyperparameter values from a distribution, and evaluating our algorithm on each sample. This can be faster than grid search and can sometimes find better hyperparameters.
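
For instance, cross-validation is a one-liner with cross_val_score — a minimal sketch on the iris training data, using logistic regression as a stand-in estimator:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Evaluate the model on 5 folds of the training data
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")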

C. Examples of hyperparameter tuning techniques

Hyperparameter tuning is the process of selecting the set of hyperparameters that maximizes the model’s performance on held-out validation data. Hyperparameters are not learned from the data but are set before training the model. Here are some common techniques for hyperparameter tuning:

  1. Grid search: As mentioned earlier, grid search involves creating a grid of hyperparameter values and evaluating our algorithm on each combination of hyperparameters. This can be an exhaustive search and may not be feasible for large hyperparameter spaces.
  2. Random search: Random search involves randomly sampling hyperparameter values from a distribution, and evaluating our algorithm on each sample. This can be more efficient than grid search, especially for large hyperparameter spaces.
  3. Bayesian optimization: Bayesian optimization is a more advanced technique that uses Bayesian statistics to optimize hyperparameters. It models the performance of our algorithm as a function of the hyperparameters and chooses the next hyperparameter values to evaluate based on the previous results.

Scikit-Learn provides a variety of tools for hyperparameter tuning, including GridSearchCV and RandomizedSearchCV for grid search and random search, respectively; BayesSearchCV, from the companion scikit-optimize package, covers Bayesian optimization. These tools can be used in combination with cross-validation to select the optimal set of hyperparameters.
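
As a quick illustration of random search — a minimal sketch with RandomizedSearchCV, sampling n_estimators from a distribution for a random forest (the grid search equivalent appears in the next subsection):

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Sample hyperparameter values from distributions instead of a fixed grid
param_dist = {'n_estimators': randint(10, 200),
              'max_depth': [None, 5, 10, 20]}

random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                   param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train, y_train)
print(f"Best hyperparameters: {random_search.best_params_}")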

D. Implementing model selection and hyperparameter tuning using Scikit-Learn

We can easily implement these techniques using Scikit-Learn. Here’s an example of how to perform cross-validation and hyperparameter tuning using grid search:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the hyperparameters to search
params = {'n_estimators': [10, 50, 100],
          'max_depth': [None, 5, 10]}

# Define the estimator to use
estimator = RandomForestClassifier()

# Define the grid search object
grid_search = GridSearchCV(estimator, params, cv=5)

# Fit the grid search object to the data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and score
print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")

# Results:
# Best hyperparameters: {'max_depth': 5, 'n_estimators': 10}
# Best score: 0.9583333333333334

This code performs a grid search over the n_estimators and max_depth hyperparameters for a random forest classifier. The cv parameter specifies the number of folds for cross-validation.

E. Creating a pipeline for model selection and hyperparameter tuning

We can include model selection and hyperparameter tuning as part of our machine learning pipeline. Here’s an example of how to create a pipeline using Scikit-Learn.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Create a pipeline with preprocessing, feature reduction, and model
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('svm', SVC())
])

# Define the hyperparameters to tune
param_grid = {
    'svm__C': [0.1, 1, 10],
    'svm__kernel': ['linear', 'poly', 'rbf'],
    'svm__gamma': ['scale', 'auto']
}

# Use GridSearchCV to find the best hyperparameters
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# Evaluate the performance of the best model on the test set
score = grid.score(X_test, y_test)
print("Test score:", score)

# Result:
# Test score: 0.9

In this example, we create a pipeline that scales the data, reduces the dimensionality using PCA, and fits a support vector machine classifier. We then use GridSearchCV to find the best hyperparameters for the classifier. Finally, we evaluate the performance of the best model on the test set.

By using a pipeline, we can easily apply multiple preprocessing and feature engineering techniques, try different models, and perform hyperparameter tuning without repeating code. This can save time and reduce errors in the machine learning workflow.

V. Model Evaluation

A. Importance of model evaluation

Model evaluation is a crucial step in the machine learning pipeline. It helps us to assess the performance of our model on unseen data and determine whether it generalizes well to new examples. Without proper evaluation, we may end up with a model that performs well on the training data but poorly on new data.

B. Examples of evaluation metrics

There are many evaluation metrics that we can use to measure the performance of our model. Here are a few examples:

  • Accuracy: The percentage of correctly classified examples.
  • Precision: The percentage of true positives among the total predicted positives.
  • Recall: The percentage of true positives among the total actual positives.
  • F1-score: The harmonic mean of precision and recall.
  • AUC-ROC: The area under the receiver operating characteristic (ROC) curve.

The choice of evaluation metric depends on the problem we are trying to solve and the type of data we have.

C. Implementing model evaluation using Scikit-Learn

Scikit-Learn provides a wide range of functions for evaluating machine learning models. Here’s an example of how to use the accuracy_score, precision_score, recall_score, f1_score, and roc_auc_score functions in Scikit-Learn to evaluate the performance of a classifier.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.svm import SVC

# Train a support vector machine classifier
# (probability=True enables predict_proba, which AUC-ROC needs)
clf = SVC(probability=True, random_state=42)
clf.fit(X_train, y_train)

# Use the classifier to make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the performance of the classifier
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, average=None)
rec = recall_score(y_test, y_pred, average=None)
f1 = f1_score(y_test, y_pred, average=None)
# AUC-ROC for a multiclass problem requires class probabilities
auc = roc_auc_score(y_test, clf.predict_proba(X_test), multi_class='ovr')

print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1-score:", f1)
print("AUC-ROC:", auc)

In this example, we train a support vector machine classifier and use it to make predictions on the test set. We then use the Scikit-Learn functions to calculate the accuracy, precision, recall, F1-score, and AUC-ROC of the predictions.
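
Scikit-Learn can also bundle the per-class precision, recall, and F1 numbers into a single summary with classification_report — a convenient complement to the individual metric calls above:

from sklearn.metrics import classification_report

# Per-class precision, recall, F1, and support in one table
print(classification_report(y_test, y_pred, target_names=iris.target_names))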

By evaluating our model using various metrics, we can gain a better understanding of its strengths and weaknesses and make informed decisions about how to improve it.

VI. Visualizing Results

A. Introduction to visualizing results

After we have built and evaluated our machine learning model, it’s important to communicate the results to stakeholders or even ourselves in a way that is understandable and visually appealing. Visualizing the results can provide insights that are difficult to see from just looking at numerical metrics alone.

B. Examples of visualization techniques

There are various visualization techniques that can be used to showcase the results of a machine learning model. Some examples include:

  • Confusion matrix: A confusion matrix is a table that is used to evaluate the performance of a classifier. It shows the number of true positives, true negatives, false positives, and false negatives.
  • ROC curve: A ROC (Receiver Operating Characteristic) curve is a plot that shows the performance of a binary classifier as the threshold value is varied. It shows the trade-off between the true positive rate and the false positive rate.
  • Feature importance plot: A feature importance plot can be used to visualize the importance of each feature in a machine learning model. It can help us identify the most important features and remove the ones that are not contributing much to the model’s performance.
  • Learning curve: A learning curve is a plot that shows the training and validation accuracy of a model as the number of training examples is varied. It can help us identify if the model is underfitting or overfitting.

C. Implementing visualization using Scikit-Learn

Scikit-Learn provides various tools for visualizing machine learning results. For example, the ConfusionMatrixDisplay class can be used to plot a confusion matrix, and RocCurveDisplay.from_predictions() can be used to plot a ROC curve for a binary classifier.

Here’s an example of how to use ConfusionMatrixDisplay:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from matplotlib import pyplot as plt

# Create the confusion matrix (clf.classes_ holds the labels 0, 1, 2)
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)

# Create the display of the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=clf.classes_)

disp.plot()
plt.show()

This will plot the confusion matrix for the test set.
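
The learning curve mentioned above can be computed with Scikit-Learn’s learning_curve helper — a minimal sketch, using logistic regression as a stand-in estimator:

import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Training and cross-validated scores for increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_train, y_train,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

plt.plot(train_sizes, train_scores.mean(axis=1), marker='o', label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), marker='o', label='Validation score')
plt.xlabel('Training set size')
plt.ylabel('Accuracy')
plt.legend()
plt.show()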

In summary, visualizing the results of a machine learning model can help us better understand its performance and communicate it to stakeholders effectively. Scikit-Learn provides various tools for visualizing machine learning results that can be easily implemented in Python.

VII. Conclusion

In this tutorial, we walked through the process of building an end-to-end machine learning pipeline using Scikit-Learn. We started with loading and preprocessing data, then moved on to feature engineering, model selection and hyperparameter tuning, model evaluation, and visualizing results.

While the steps we covered may seem overwhelming, breaking them down into manageable pieces and using a pipeline approach can make the process much smoother. By taking the time to carefully preprocess and engineer our data, choose the right model and tune its hyperparameters, and evaluate our model’s performance, we can build machine learning models that perform well on real-world data.

Remember, while Scikit-Learn provides a lot of powerful tools, it’s still up to us as data scientists to use these tools effectively and make the best decisions for our particular problem. Hopefully, this tutorial has given you a good starting point for building your own machine learning pipelines and has inspired you to continue learning and exploring the exciting field of machine learning.
