Streamlining Your Machine Learning Workflow with scikit-learn Pipelines

Kendall McNeil
9 min read · Oct 13, 2023


Machine learning projects can quickly become complex, with multiple data preprocessing steps, feature engineering, model selection, and evaluation. Keeping everything organized and error-free is key to building a strong model, but it can be a real challenge. That is where pipelines come in.

A pipeline in scikit-learn is like having a personal choreographer for your machine learning workflow. Let’s name him Pip. Imagine your data, models, and processes as performers on a stage, and Pip is the director, making sure everything flows seamlessly and harmoniously.

Pip knows each step by heart, from data preprocessing to model training, and evaluation. Just like a conductor guiding an orchestra, Pip ensures that every component plays its part at the right moment, creating a beautiful symphony of data science. Pip makes your code dance to a well-orchestrated tune, making it easier for you to create, manage, and optimize your machine learning projects.

In this guide, we’ll introduce you to Pip, your new companion in the world of scikit-learn pipelines. You’ll learn how to harness Pip’s power to simplify your workflow, reduce errors, and streamline your code. Get ready to see your machine learning projects come to life with Pip as your trusty choreographer.

I. Understanding the Data

The first step in creating a pipeline in scikit-learn is to gain a deep understanding of the data. To make Pip tangible, we will use a recent machine learning model I built to predict individuals who will not get an H1N1 (colloquially named “swine flu”) vaccine (full GitHub page linked here).

Let’s dive in. First, let’s run a comprehensive chunk of code for imports.
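A minimal version of that import block, covering everything used in the rest of this guide, might look like this:

```python
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, RocCurveDisplay, ConfusionMatrixDisplay)
```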

After importing the data into a dataframe under the alias “df,” the first thing we can do to get a sense of the data types, or “dtypes,” is to run df.info(). This df contains a mixture of integers, floats, and object types. Lots of numbers!
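For example (the CSV file name here is just a placeholder for the training features file):

```python
# Load the training features into a dataframe
df = pd.read_csv("training_set_features.csv")

# Inspect column names, non-null counts, and dtypes
df.info()
```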

Or are they? How is a column like “h1n1_concern” a float data type? This is where really understanding the data becomes crucial for building durable pipelines.

II. Organizing the Data

This initial step is the heavy lifting of our machine learning workflow; after it, we let scikit-learn pipelines do the rest!

The comprehensive data dictionary provides descriptions for each column. There we can see the data contains:

· 16 binary columns (no is encoded as 0; yes as 1)

· 8 scaled columns (6 are a scale of 0–5; the other two are 0–2 and 0–3)

· 9 categorical columns (age group, education group, income group, etc.)

· 2 integer columns (number of adults and children in the home from 0–3)

Each column will need a respective “subpipe” based on the cleaning and feature engineering it requires. Therefore, let’s begin by creating lists of our columns before we perform our “train_test_split” (a sketch follows the list below).

a. First, we have some binary columns that will need to be one-hot-encoded because they are entered as objects (“Male,” “Female,” “Married,” “Not Married,” etc.).

b. Next, we have some columns that only need null values taken care of. We can pass SimpleImputer() to fill in null values according to a strategy we will specify later. For now, let’s call this list “simple_impute_only.” It includes the scaled columns and the yes/no questions already encoded as 0 or 1.

If the columns on a scale were not encoded as numbers already, we would use OrdinalEncoder() in a separate list to transform them. For example, if the entries for “h1n1_concern” were text labels instead of 0 through 3, that would be an opportunity to use OrdinalEncoder(), which applies when an ordinal relationship exists in the data.

c. Last, we have the categorical columns. Some of these columns, such as “income_poverty” may appear to be numeric data, but when we look closely at the data, we see that these are split into income categories, such as <$75,000. Additionally, we will make “household_adults” and “household_children” categorical given that we cannot have 2.5 children and the column caps out at 3.
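Here is a minimal sketch of those lists and the split. The exact column names come from the data dictionary, so the ones below are only illustrative, and the labels file name and target column are my assumptions:

```python
# Illustrative column lists -- consult the data dictionary for the full set
binary_objects = ["sex", "marital_status", "rent_or_own"]            # object dtypes to one-hot encode
simple_impute_only = ["h1n1_concern", "h1n1_knowledge",
                      "doctor_recc_h1n1", "behavioral_face_mask"]     # already 0/1 or on a scale
categorical_columns = ["age_group", "education", "income_poverty",
                       "household_adults", "household_children"]      # treated as categories

# Labels live in a separate file in this dataset (file and column names assumed)
y = pd.read_csv("training_set_labels.csv")["h1n1_vaccine"]
X = df[binary_objects + simple_impute_only + categorical_columns]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
```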

The big takeaway here is that when we initially looked at df.info(), it appeared on the surface that all we had was numeric data. However, when we dig deeper, we see that none of these columns should actually be treated as numeric data at all.

III. Subpipes Set Up (full documentation linked here)

Now that we have gained a deep understanding of the data and divvied the columns into respective lists, we can build our subpipes based on the data cleaning and feature engineering each list requires. Each subpipe takes a list of steps, where each step pairs a name with an instantiated transformer and any necessary specifications.

As you will see, we use SimpleImputer() to fill nulls in every subpipe. It is an awesome tool for handling null values but should, of course, be used with care. If we had numeric data, we could use other strategies such as mean or median, depending on the data; the default strategy is mean.
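A sketch of the three subpipes, assuming a most_frequent imputation strategy (the original may specify something different):

```python
# Object-typed binary columns: fill nulls, then one-hot encode
binary_subpipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

# Columns that only need nulls filled (already 0/1 or on a numeric scale)
simple_impute_subpipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
])

# Categorical columns: fill nulls, then one-hot encode
categorical_subpipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
```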

We then set up our ColumnTransformer(), which takes a name we give it, our subpipe, and our list of columns for each data type we have specified: categorical, binary, and simple impute only. We will also specify “remainder=’passthrough’,” which tells the ColumnTransformer to keep any columns not listed rather than drop them (the default behavior). It is not strictly necessary here because we have included all of our columns, but it is an important parameter to know.
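Wiring those subpipes to their column lists might look like this:

```python
ct = ColumnTransformer(transformers=[
    ("binary", binary_subpipe, binary_objects),
    ("simple_impute", simple_impute_subpipe, simple_impute_only),
    ("categorical", categorical_subpipe, categorical_columns),
], remainder="passthrough")   # keep (rather than drop) any columns not listed above
```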

IV. Pipelines Set Up

Now is where all the fun begins! We can begin training our models, starting of course with our dummy or baseline model. All we need to do is create a dummy pipeline by naming and calling our column transformer, then adding a DummyClassifier(). Then we fit the model to our training data and check out its performance.
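A minimal sketch of that baseline pipeline (the step names are my own):

```python
dummy_pipe = Pipeline(steps=[
    ("ct", ct),
    ("dummy", DummyClassifier(strategy="most_frequent")),
])

dummy_pipe.fit(X_train, y_train)
print("Baseline accuracy:", dummy_pipe.score(X_test, y_test))
```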

For our baseline model, the dummy classifier will simply select the most frequent class label. As is the objective of a dummy model, our future models must perform better.

Then we can build our first simple model using LogisticRegression() with only a few lines of code! First create the pipeline, fit the model to the training data, and print training and test set scores.
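For example, a sketch of the logistic regression pipeline (max_iter and random_state are my choices, not necessarily the original’s):

```python
logreg_pipe = Pipeline(steps=[
    ("ct", ct),
    ("logreg", LogisticRegression(max_iter=1000, random_state=42)),
])

logreg_pipe.fit(X_train, y_train)
print("Train accuracy:", logreg_pipe.score(X_train, y_train))
print("Test accuracy: ", logreg_pipe.score(X_test, y_test))
```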

We can then plot an ROC curve, generate predictions, and print our accuracy, recall, precision, and f1 score to obtain a more holistic understanding of the model performance. Surprisingly strong for our first simple model!
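One way to produce that evaluation with scikit-learn’s built-in display helpers:

```python
# ROC curve straight from the fitted pipeline
RocCurveDisplay.from_estimator(logreg_pipe, X_test, y_test)

# Hold-out predictions and the standard classification metrics
y_pred = logreg_pipe.predict(X_test)
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("F1 score: ", f1_score(y_test, y_pred))
```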

We see that the recall rate is particularly high, which is great for this business question, where we want to minimize false negatives and identify as many individuals unlikely to get a vaccine as possible.

V. GridSearch (full documentation linked here)

Another tool worth mentioning is grid search, implemented in scikit-learn as GridSearchCV. Tuning hyperparameters can make a surprising difference in machine learning, and GridSearchCV helps us identify the most effective hyperparameters for our model.

We can specify several hyperparameters we want the model to test in a dictionary under the alias “params.” GridSearchCV will then do all the work of testing every combination of hyperparameters and reporting the strongest one. A grid search can only perform as well as the hyperparameters fed to it, so it is particularly important to give the dictionary some thought.

To run a GridSearchCV, you pass the model pipeline and params. The cv argument sets the cross-validation splitting strategy; the default is 5 folds, included here as a demonstration, and it can be worth adjusting. The verbose argument controls how much progress information is printed: verbose=1 reports the number of folds and candidate fits, while higher values also print the computation time for each fold.
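A sketch of the search, with an illustrative (not the original) parameter grid; the “logreg__” prefix targets the named step inside the pipeline:

```python
params = {
    "logreg__C": [0.01, 0.1, 1, 10],
    "logreg__solver": ["lbfgs", "liblinear"],
}

gs = GridSearchCV(logreg_pipe, param_grid=params, cv=5, verbose=1)
gs.fit(X_train, y_train)
```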

Depending on how many hyperparameters you specify, a grid search can take anywhere from a few minutes to hours. So go ahead, grab a snack, and sit back while scikit-learn does all the heavy lifting. If you are coding along, this grid search takes about 25 minutes.

To see the results, you can check the “best_params_” and “best_estimator_” attributes.
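Continuing the sketch above:

```python
print(gs.best_params_)      # the winning hyperparameter combination
print(gs.best_estimator_)   # the full pipeline refit with those hyperparameters
```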

You can see this just barely improved our model in this example, but did improve it, nonetheless.

We can then look at the ROC Curve and the model’s accuracy, recall, precision, and f1 score. Let’s throw a confusion matrix in there too for a visualization of these scores!
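Continuing the sketch, the tuned pipeline from the search can be evaluated the same way, with ConfusionMatrixDisplay providing the visualization:

```python
best_pipe = gs.best_estimator_

RocCurveDisplay.from_estimator(best_pipe, X_test, y_test)
ConfusionMatrixDisplay.from_estimator(best_pipe, X_test, y_test)

y_pred = best_pipe.predict(X_test)
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("F1 score: ", f1_score(y_test, y_pred))
```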

As you can see, a strongly formed pipeline can have great results for your machine learning model and allows you to test out countless models with only a few lines of code, while keeping your workflow pristine.

VI. Numeric Columns

As mentioned earlier, this example did not include any continuous numeric data, such as height or weight. To build a pipeline for numerical columns, we would simply add a list of “numerical_columns” and then create a “numerical_subpipe” that can also be passed into our ColumnTransformer. The transformers in the numeric subpipe would likely include a SimpleImputer(), specifying a strategy of mean or median, and a StandardScaler() to scale all the data using a z-score. You can of course pass in any additional transformers that your data may require.
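A sketch of that addition, with hypothetical column names and a median imputation strategy:

```python
numerical_columns = ["height_cm", "weight_kg"]   # hypothetical column names

numerical_subpipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

ct = ColumnTransformer(transformers=[
    ("binary", binary_subpipe, binary_objects),
    ("simple_impute", simple_impute_subpipe, simple_impute_only),
    ("categorical", categorical_subpipe, categorical_columns),
    ("numeric", numerical_subpipe, numerical_columns),
], remainder="passthrough")
```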

VII. Conclusion

In the world of machine learning, organization, efficiency, and accuracy are paramount. A scikit-learn pipeline, or as I affectionately call it, Pip, is the unsung hero that brings order and harmony to the often complex symphony of data science.

By understanding your data, creating subpipes, and building a structured pipeline, you’ve equipped yourself with the tools to streamline your machine learning projects. Pipelines eliminate the need for manual intervention at each stage, making your code cleaner and more efficient.

So, as you embark on your journey, remember that Pip, while perhaps intimidating at first, is your ally, simplifying your workflow, reducing errors, and helping you achieve better model performance. With the power of scikit-learn pipelines at your fingertips, you can confidently tackle the most complex machine learning tasks and enjoy the fruits of your labor during GridSearches.

Happy coding and may your models shine brightly on the stage of data science!
