Master Machine Learning: 4 Classification Models Made Simple

A Beginner’s Guide to Building Models in 15 Practical Steps

@panData
Towards Data Science
56 min read · Dec 14, 2024


Overview

I will now present a template for the process of building machine learning models.

In this project, the goal is to create a model capable of predicting the need for maintenance in industrial machines. We’ll use data from IoT sensors (Internet of Things).

The approach divides the Machine Learning project into 15 distinct stages, which I will outline for you. These stages include the key techniques, main strategies, and how to tackle each of them effectively.

As a demonstration, we will work with fictitious data in this example.

As we progress, we will build a comprehensive project from start to finish, covering everything from problem definition to deploying the functional model.

Tools Used in the Project

Every machine learning project is inherently a data science project, but not every data science project involves machine learning. When working with ML, you are engaging in a specific subset of data science.

This project focuses on examining the machine learning process in detail. Larger data science projects may include tasks such as metric calculations, dashboard creation, data visualizations, or storytelling, which may or may not involve machine learning.

Here, the goal is to explore the steps required to build a complete machine learning model, starting from business problem definition to deployment.

While many steps, such as problem definition and data understanding, are common to most data science projects, others — like cross-validation and model selection — are exclusive to machine learning.

Let’s proceed to the notebook and begin by installing and loading the necessary tools and packages. Start by installing the watermark package:

# Install the `watermark` to record the versions of other packages
!pip install -q -U watermark

And next, we will install the XGBoost package:

# Install the `xgboost` package, used for gradient boosting algorithms
!pip install -q xgboost

This is one of the algorithms we will use in this project. In fact, I will create at least three versions of the model using different algorithms. Specifically, I will work with:

  1. Logistic Regression
  2. Naive Bayes, a probabilistic algorithm
  3. XGBoost, from which we will import and use the XGBClassifier.

# 1. Imports

# For object serialization
import pickle

# Scikit-learn library
import sklearn as sk

# For DataFrame manipulation
import pandas as pd

# For numerical operations
import numpy as np

# For statistical visualizations
import seaborn as sns

# For plotting graphs
import matplotlib.pyplot as plt

# For machine learning models
import xgboost as xgb

# For XGBClassifier
from xgboost import XGBClassifier

# For logistic regression
from sklearn.linear_model import LogisticRegression

# For Naive Bayes classification
from sklearn.naive_bayes import GaussianNB

# For data scaling
from sklearn.preprocessing import StandardScaler

# For cross-validation and hyperparameter tuning
from sklearn.model_selection import cross_val_score, GridSearchCV

# For model evaluation metrics
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, roc_curve

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Display plots in Jupyter/Colab
%matplotlib inline

XGBoost is typically not included in the Anaconda Python distribution, so it must be installed separately before use.

In this project, the following tools will also be utilized:

  • pickle: To save both the model and the scaler to disk for future use.
  • sklearn: To leverage various algorithms, utility functions, and to compute evaluation metrics.
  • numpy and pandas: Widely used libraries for efficient data manipulation.
  • matplotlib: For creating insightful graphical visualizations.

These tools collectively encompass everything necessary to work effectively on machine learning tasks.

After installing and loading the necessary packages, activate the watermark package to ensure version tracking:

# Activate the watermark package
%reload_ext watermark
%watermark -a "YourName"
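
If you also want to log the versions of the packages you import, the watermark extension provides flags for that. A minimal sketch, assuming the extension is loaded as above and a recent version of watermark:

# Record the Python version and the versions of the imported packages
%watermark -v
%watermark --iversions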

These are the essential tools we will use throughout this project. Together, we will embark on a journey to build a machine learning model, providing you with a template to apply in similar contexts.

1. Defining and Understanding the Problem

1.1 The Importance of Knowing Your Destination

Imagine waking up in the morning, heading to the nearest airport, approaching an airline counter, and asking the attendant, “I’d like a ticket for the next available flight, please.” What’s the first thing they’ll ask? Where are you going? That’s how it works, right? To know which path to follow, you need to know your destination.

This is the first step in any machine learning or data science project: you must clearly define the problem to be solved. Only then can you outline the path, choose the tools, select the metrics, execute the procedure, interpret the results, and finally deliver the outcome.

Yes, I know it seems obvious, but sometimes the obvious needs to be stated. There are many people who start without a clear direction, thinking, “I’ll figure it out along the way.” But that’s not the best approach, especially in data science. You must define the objective first.

1.2 The Objective of This Project

In this project, our objective is clear: to predict whether an industrial machine needs maintenance using 178 sensor readings from IoT devices. That's the sole focus of our machine learning model. It won't do anything else — it's designed to solve this specific problem.

1.3 Why Problem Definition Matters

For everything else — choosing algorithms, designing data cleaning strategies, creating visualizations, interpreting results, deploying the solution — everything depends on this first step.

When you’re starting a project or faced with a business scenario, ask yourself:

  • What problem are we trying to solve?
  • What is the objective?

Not all business stakeholders will have clear answers. Sometimes, they might not even know exactly what they need. That’s where you come in. Your job is to understand the problem and propose a solution.

1.4 Final Thoughts

The first step in any project is to clearly and deeply understand the business problem. If it’s ambiguous, you’re starting on the wrong foot, and issues will arise later — possibly as soon as the data preparation phase.

Remember, machine learning models are specialized tools for solving specific tasks. Organizations can build as many models as needed, with each model tailored to a unique problem. Defining the problem properly is the foundation of success.

2. Understanding the Data

2.1 The Importance of Data in Machine Learning

Machine learning relies entirely on data — data is our raw material. If your company doesn’t have accessible data, it’s seriously behind and needs to act quickly.

“Hey, let’s get to work, start collecting data immediately, build a pipeline, assemble a Data Science team!”

In today’s world, data drives everything. Without it, constructing solutions based on data science and machine learning isn’t even feasible.

Machine learning involves using an algorithm — a set of mathematical operations — and training it on data to create a model.

Algorithms are widely available, some dating back to the 1980s, and we still use them today. But for these algorithms to work, what do we need? Data.

2.2 Ensuring Data Availability

The second step in any machine learning project is understanding your data:

  • Are the data available? If not, how do you make them available?
  • Do you have a process for data collection or extraction?
  • Where are the data stored? Locally or in a cloud-based data lakehouse?
  • Is there a Data Engineer managing extraction, storage, and processing?
  • Are the data catalogued with proper metadata?
  • Have you addressed data security and privacy regulations?

These are basic questions, but they are integral to almost every company.

2.3 Understanding the Data Source

Before you even write Python code, you can begin by examining the data dictionary, reading documentation, or discussing with the business team. Understand the data source and its structure.

In our case, we’ll work with historical data collected from IoT sensors in industrial machines. Each row in the dataset contains 178 sensor readings (columns). So, imagine each machine having 178 sensors, each generating a value. These readings were compiled into a dataset with:

  • 179 columns (178 sensor readings + 1 status column).
  • 11,500 rows, representing historical data for various machines.

The final column indicates the status of the machine: whether it needed maintenance or not.

Reminder: The data are fictitious, used here for learning, experimentation, and proof of concept. For real-world predictions or production deployment, you would need to work with real, historical data.

2.4 The Role of Data in Training the Model

To train the algorithm, we need historical data that tells us:

  • Which machines required maintenance based on sensor readings.
  • Whether a mathematical relationship exists in the data.

Machine learning doesn’t create patterns or relationships — it detects them if they exist. If patterns exist, we’ll know by evaluating the model’s metrics and performance.

2.5 Practical Business Applications

If the model identifies a pattern, it can predict whether a machine needs maintenance based on new IoT sensor readings. This capability is extremely valuable for industrial companies:

  • It reduces machine downtime.
  • Allows companies to plan maintenance more effectively:
  • Procure parts or hire specialized labor in advance.
  • Avoid simultaneous maintenance for multiple machines.

This approach optimizes operations, benefiting industries significantly.

While this example focuses on industrial settings, the same principles apply across sectors. Machine learning can solve problems in virtually any market or field — provided you have the raw material: data.

3. Loading the Data

3.1 Introduction to Data Loading

The next step is to bring the raw material — the data — into your working environment. This phase involves exploration, preparation, preprocessing, and eventually training the Machine Learning model.

Although seemingly simple, it encompasses several considerations:

  • Where will the data come from?
  • You might load data from CSV or TXT files, directly from Excel spreadsheets, or even connect to a database or Data Lake.
  • The idea is to fetch the data from its source, bring it into your environment, and start working with it.

3.2 Experimentation with a Data Sample

When starting a machine learning project, the initial work is typically experimental. This means:

  • You don’t need to work with the entire dataset right away.
  • A smaller sample of the data can suffice for initial exploration and validation of hypotheses.

Why Work with a Sample?

  • There’s no guarantee that the data contains patterns suitable for building a model.
  • Just having data doesn’t ensure a mathematical relationship or pattern that enables a model’s creation.

At this stage, we’re still in the theoretical domain:

  • We have a problem.
  • We have data.
  • Can we create a Machine Learning model from this data?

The only way to answer that question is by experimenting. Using a data sample for this purpose is both practical and efficient.

3.3 Validating the Theory

Begin with a data sample to test your theory:

  1. Determine if you can build a model.
  2. Assess its performance with the sample data.
  3. If successful, proceed to the final version using a larger dataset.

Some companies skip this step and start directly with the full dataset. While this is possible, it may lead to wasted time and computational resources. Validating the theory first is often a more strategic approach.

3.4 Example: Loading Data from a CSV File

To load a sample dataset, you might use a CSV file format as follows:

#2. Loading the dataset.
df = pd.read_csv("dataset.csv")

You check the shape:

# 3. Checking the dataset's shape
df.shape

# (11500, 179)

Take a look at the data:

#4. Viewing sample records.
df.head()

Then, you begin exploring the data. Following that, you prepare and preprocess it, train several versions of the model with different algorithms, analyze the metrics, and only then will you be able to determine whether these data can be used to build a model.

Is the advice clear? This is a crucial step.


Observe that we have all the predictor variables, numbered, along with TARGET_VARIABLE. Notice that these variables do not have descriptive names; instead, they are coded from X1 to X178. Why? Each column represents an IoT sensor, and the values correspond to measurements.

You don’t necessarily need a description for each variable. Each variable is a reading from an IoT sensor. We have 178 sensors, each providing a reading that might represent temperature, humidity, machine speed, or any other metric depending on the information emitted by the sensor.

What if I want to study the relationship between the industrial machine’s temperature and the need for maintenance? Sure, that’s possible. But that would be a separate project.

What if I also want to study the relationship between the machine’s temperature and its operational speed? That’s another possibility. Excellent — now you have another project idea.

Remember the objective! Be cautious about this because it will happen in practice. Trust me; this comes from experience.

This project is specific — not a Holy Grail. It focuses on IoT sensor readings to predict whether a machine requires maintenance. If you want to study temperature, speed, or any other characteristic, open another project.

4. Exploratory Analysis & Target Variable Definition

Once you load the data, that’s when the real work begins. Notice that we have the variables X1 to X178, and then we have the target variable labeled TARGET_VARIABLE. Of course, it won’t always come with that name. You'll need to identify which one is the target variable.

The target variable is determined based on the business problem definition, which is why Step 1 is so important. What’s the goal here? To predict whether a machine needs maintenance.

So, we’ve gathered historical data, right? Do we have this information — whether the machine required maintenance in the past?

Yes, I have this information.

  • 0: The machine didn’t need maintenance.
  • 1: The machine needed maintenance.

Great! Now I’ll use this as the output variable, and all the others as input variables. The model will be trained to understand this relationship. If the model succeeds, we’ll have a good accuracy, as shown by the metric.

Once trained, the model will take new input data and predict the output.

This format is widely used in machine learning, especially in classification problems:

  • 0: Typically represents the negative class.
  • 1: Represents the positive class.

It’s important to clarify that this doesn’t carry any judgment of “good” or “bad.” This nomenclature is standard in data analysis:

  • 1 (positive class): Indicates the event occurred (in this case, the machine needed maintenance).
  • 0 (negative class): Indicates the event didn’t occur.

Now, let’s move forward and look at a statistical summary.

#5. Generating statistical summary.
df.describe()

Observe that I have a statistical summary for each variable. For all of them, I can see:

  • Count
  • Mean
  • Standard deviation
  • Minimum value
  • First, second (median), and third quartiles
  • Maximum value

At the end, there’s the target variable. However, statistical summaries for the target variable don’t make much sense. Why? Python simply detects that it’s a numerical value and computes the statistics.

But practically speaking, calculating the mean, for instance, is irrelevant. This is because the target variable is categorical, even though it’s numerically represented as 0 and 1.
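
Since the target is categorical, a simple class count is more informative than the mean. A minimal sketch, not part of the original numbered commands:

# Count the records in each class of the target variable
df["TARGET_VARIABLE"].value_counts()

# Same count as a proportion of the dataset
df["TARGET_VARIABLE"].value_counts(normalize=True)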

Next, let’s check the number of columns:

#6. Printing the number of columns.
print("Number of columns:", len(df.columns))

# Number of columns: 179

So, we have a total of 179 columns. The data exploration phase is underway.
Next step: Are there any missing values?

#7. Checking for missing values.
df.isna().sum().sum()

# 0

Let’s sum the missing values to verify their presence. If any are found, they must be addressed. In each project, I handle different situations: some projects include datasets with missing values, while others do not. This variety helps explore multiple aspects of machine learning projects.

In this specific case, we have 179 columns, where:

  • 178 columns represent input data (predictor variables).
  • 1 column is the target variable (output variable).

Now, we’ll transform this dataset into a supervised learning problem by providing the model with both input and output data.

Another critical aspect in classification problems is analyzing the prevalence of the positive class, which is the proportion of samples with the feature we aim to predict.

In this scenario:

  • (1): Machines that required maintenance (event occurrence).
  • (0): Machines that did not require maintenance (event non-occurrence).

The prevalence is calculated as the number of positive samples divided by the total number of samples:

prevalence = (number of records with label 1) / (total number of records)

For example, if the prevalence rate is 0.2 (20%), this indicates that 20% of the machines in the sample required maintenance.

Let’s calculate this prevalence and proceed with the next steps.

#8. Function to calculate the prevalence of the positive class (label = 1).
def calculate_prevalence(y_actual):
    return sum(y_actual) / len(y_actual)

This function is simply the formula I just described, expressed in code.

#9. Printing the prevalence of the positive class.
print("Prevalence of the positive class: %.3f" % calculate_prevalence(df["TARGET_VARIABLE"].values))

# Prevalence of the positive class: 0.200

And now I present to you the prevalence of the positive class. What does this mean? In our dataset, 20% of the records represent the positive class. Consequently, 80% represent the negative class.

In other words, the dataset is imbalanced.

Business Perspective

Is this an issue from a business perspective? No, it’s merely a characteristic.

In our case, based on the sample data, only 20% of the machines required maintenance. That’s actually good news — most machines did not require it.

Machine Learning Perspective

What happens if we show the model more examples of the negative class than the positive class? The same thing that happens with humans: we learn more about what we are exposed to the most.

For the model, the same applies: it will learn more from the class with more examples. This can cause issues when training.

Key Insight

Business Impact: Imbalance in the dataset is not a business problem but rather a characteristic of the data.

Machine Learning Impact: A dataset imbalance can lead to a biased model that favors the majority class during learning, which will reflect in its performance after training.

Next Steps

Identifying imbalance is crucial because it will require adjustments during the data preparation and model training phases.

While class imbalance is significant in classification tasks, it doesn’t affect regression problems in the same way, as there’s no need to calculate prevalence.

5. Data Cleaning

This is the stage where you’ll address potential issues such as:

  • Missing values
  • Duplicate rows
  • Outliers
  • Strange characters in specific columns
  • Unnecessary columns that don’t contribute to the dataset

Data cleaning is context-dependent, as there’s no one-size-fits-all formula for this step.

Cleaning in Our Example

  • In our case, there are no missing values, which saves us some work.
  • We’ll check for duplicate columns and rows.
  • If no issues are found, we’ll move forward.

In other datasets, you might encounter a significant amount of missing records. You’ll need to apply appropriate techniques to process and clean these before continuing with your workflow.

Importance of Data Cleaning

Cleaning data is a crucial and common task in nearly all Data Science and Machine Learning projects because raw data is rarely ready for use.

In this project, I’ve simplified the dataset to focus on the template and demonstrate the full end-to-end process.

Next Step

Let’s start by preparing the dataset, selecting only the data of interest:

#10. Preparing the dataset with only the relevant data.
collist = df.columns.tolist()
cols_input = collist[0:178]
df = df[cols_input + ["TARGET_VARIABLE"]]

#11. Viewing the first few records of the prepared dataset.
df.head()

Here we have the dataset, focusing on the columns of interest: from x1 to x178, along with the target variable TARGET_VARIABLE.

Let’s check for duplicate columns. What does it mean to have duplicate columns? This can happen when, for example, two columns contain identical data.

Such issues can arise from errors during data collection or extraction. For instance, a mistake in retrieving data from the database might result in two identical columns appearing together. If such duplicates exist, one of them must be removed to ensure data integrity.

The same goes for duplicate rows. If there are any repeated rows in the dataset, they must also be removed to avoid skewing the analysis or creating bias in the model. Now, let’s check for these issues.

#12. Checking for duplicate columns in the input data.
dup_cols = set([x for x in cols_input if cols_input.count(x) > 1])
print(dup_cols)
assert len(dup_cols) == 0, "There are duplicate columns."

# set()

The assert statement in Python is used to verify whether a given condition is true or false. If the condition is false, it raises an AssertionError and prints the specified message for debugging.

In this specific case, since we have no duplicate columns, the dataset is clean. This check ensures our input dataset integrity. Now, let’s perform the same verification on the final dataset to ensure everything is in order.

#13. Checking for duplicate columns in the final dataset.
cols_df = list(df.columns)
dup_cols = set([x for x in cols_df if cols_df.count(x) > 1])
print(dup_cols)
assert len(dup_cols) == 0, "There are duplicate columns."

We have no issues with duplicate column names or duplicate columns containing data in our dataset. This means that such problems are not present in our data.
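
The paragraphs above also mention duplicate rows, while the numbered commands only check columns. Here is a minimal sketch of how that row check could look with pandas; it is an illustration, not one of the original notebook commands:

# Count fully duplicated rows in the dataset
num_dup_rows = df.duplicated().sum()
print("Duplicate rows:", num_dup_rows)

# If any are found, keep only the first occurrence and rebuild the index
if num_dup_rows > 0:
    df = df.drop_duplicates().reset_index(drop=True)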

However, it’s important to note that data cleaning can easily take up 30–50% of the total time in a machine learning project.

Not all cleaning tasks will necessarily fall under the responsibilities of a Data Scientist.

In some cases, a Data Engineer may already have established a pipeline that handles basic cleaning tasks such as addressing missing values, removing duplicate records, and other preprocessing steps.

The extent of this depends on the maturity level of the company.

  • For companies new to data science, it’s unlikely they’ll have a robust cleaning pipeline in place. This means the task falls to you.
  • In more mature organizations, existing pipelines may already be implemented and monitored, significantly simplifying your role.

Even in such cases, it’s critical to understand these processes thoroughly. You might need to create, modify, or validate them depending on the specific requirements of your project.

6. Splitting into Training, Validation, and Testing

To build a robust machine learning model, we need to split our dataset into at least three portions: training, validation, and testing.

This ensures the model’s performance is evaluated on unseen data, mirroring real-world scenarios.

6.1 Why Do We Split the Data?

We cannot evaluate the model using the same data it was trained on.

Think of it like school: during classes, you practiced math problems to learn the concepts, but the exam had different problems to test your understanding. The same principle applies here.

  • Training Data: Used to train the algorithm, transforming it into a model.
  • Validation Data: Typically used to test the model during training, often for hyperparameter tuning or early stopping.
  • Testing Data: Used to evaluate the final model’s performance on unseen data.

It’s worth noting that smaller models sometimes omit the validation phase and directly split data into training and testing. However, for larger models or complex scenarios, validation is essential.

6.2 How to Choose the Split Proportions?

The choice of proportions depends on the size of your dataset.

Examples of Common Splits:

  • 70%, 15%, 15%: Training, Validation, Testing (common in balanced datasets).
  • 80%, 10%, 10%: Prioritizes training data for larger datasets.
  • 50%, 25%, 25%: Suitable for exploratory analysis.
  • 98%, 1%, 1%: Useful for very large datasets (e.g., billions of records).

Example Analysis:

  • 100 Records Dataset: Using a 98%, 1%, 1% split would allocate just 1 record each for validation and testing — clearly insufficient.
  • 1 Billion Records Dataset: A 98%, 1%, 1% split still leaves 10 million records each for validation and testing, which is more than enough.
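
As a side note, the same 70/15/15 proportions can be produced with scikit-learn's train_test_split applied twice, stratified on the target so that class prevalence is preserved in every subset. This is only an alternative sketch; the notebook itself uses pandas sampling, shown in the next steps:

from sklearn.model_selection import train_test_split

# First split: 70% training, 30% held out, stratified on the target variable
df_train, df_valid_test = train_test_split(
    df, test_size=0.30, stratify=df["TARGET_VARIABLE"], random_state=64)

# Second split: divide the held-out 30% equally into validation and test (15% each)
df_valid, df_test = train_test_split(
    df_valid_test, test_size=0.50, stratify=df_valid_test["TARGET_VARIABLE"], random_state=64)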

6.3 Random Sampling and Shuffling

When splitting the dataset:

Random Sampling: Ensures that the samples are diverse and representative of the dataset. Without it, you risk introducing bias if, for instance, consecutive records are assigned to training or testing.

Avoid Shuffling in Time Series: If the data involves time (temporal trends), preserve the chronological order. For other cases, random shuffling helps prevent overfitting or bias.

For our project, we’ll perform random sampling to create the training, validation, and testing datasets. Since time isn’t a factor here, shuffling will ensure a balanced split across all samples.

#15. Generating random samples from the data.
df = df.sample(n=len(df))

Next, we will adjust the dataset indices:

#16. Resetting the dataset indices.
df = df.reset_index(drop=True)

And now, I will prepare the index for splitting:

#17. Generating an index for the split.
df_valid_test = df.sample(frac=0.3)
print("Size of validation/test split: %.1f" % (len(df_valid_test) / len(df)))

# Size of validation/test split: 0.3

Observe that I will extract 0.3, or 30%, of the data from my original sample.

I will randomly select 30% and place it in df_valid_test.

Now, I will proceed with a 70-15-15 split: 70% for training, 15% for validation, and 15% for testing.

#18. Performing a 70/15/15 split.

# Test data
df_test = df_valid_test.sample(frac=0.5)

# Validation data
df_valid = df_valid_test.drop(df_test.index)

# Training data
df_train = df.drop(df_valid_test.index)

Notice that I’ve already created a sample containing 30% of the data, correct? From this 30%, I’ll take half. Half of 30% is 15%, and I’ll allocate this to the test set. Then, I’ll drop what I’ve already placed in the test set. So, where does the other half go? To validation.

Everything else, the remaining 70%, will go to training.

This is a 70–15–15 splitting strategy, but executed in reverse.
First, I divided the data into 30%. Then, I split this 30% into two parts: 15% for testing and 15% for validation. Next, I returned to the original dataset, dropping the portion already used for testing. What remains — 70% — goes into training.

That’s it! The samples are successfully created. With this splitting method, I managed to maintain class prevalence across each subset.

#9. Printing the prevalence of the positive class.
print("Prevalence of the positive class: %.3f" % calculate_prevalence(
df["TARGET_VARIABLE"].values))

# Prevalence of the positive class: 0.200

We previously calculated (referencing step #9) that 20% of the records belong to the positive class. This means 20% prevalence in the dataset.

It’s important to ensure this prevalence is carried over to each subset during the division process.

Soon, I’ll explain how to perform class balancing, but for now, let’s calculate and confirm whether the prevalence was maintained across our subsets:

#19. Checking the prevalence in each subset.
print("Test (n = %d): %.3f" % (len(df_test), calculate_prevalence(df_test.TARGET_VARIABLE.values)))
print("Validation (n = %d): %.3f" % (len(df_valid), calculate_prevalence(df_valid.TARGET_VARIABLE.values)))
print("Train (n = %d): %.3f" % (len(df_train), calculate_prevalence(df_train.TARGET_VARIABLE.values)))

The prevalence doesn’t have to be exactly the same but should be close.

So, 20% prevalence in the test set, 20% in validation, and 20% in training. That means I managed to reflect the data pattern across all three subsets.

#20. Verifying that all samples are accounted for.
print('All samples (n = %d)' % len(df))
assert len(df) == (len(df_test) + len(df_valid) + len(df_train)), 'Something went wrong'

# -----> All samples (n = 11500)

All the samples total 11,500 rows, which matches the original dataset — nothing was left out.

I divided all the records properly. However, the class imbalance is an issue for the machine learning algorithm. Let’s address this…

7. Class Balancing

We’ve completed step 6: dividing the data into train, validation, and test sets. This step is necessary for any machine learning project.

You cannot test the model using the same data it was trained on. While you can calculate metrics from the training data, these are training metrics. To truly evaluate whether the model performs well, you must use a separate dataset — either validation, test, or both.

In our case, we ensured that the prevalence of the data was maintained across all samples to ensure that patterns are well distributed.

Now, we move to step 7, which is specific to classification problems. This step is unnecessary in regression tasks.

Class balancing serves a critical purpose: it ensures that the data presented to the model is balanced between the classes, enabling it to learn equitably from both.

Recall the prevalence we calculated for each subset: roughly 20% positive class in training, validation, and test.

If I provide training data with this prevalence to the Machine Learning model, what do you think will happen?

Currently, 20% of the data belongs to the positive class, while 80% belongs to the negative class. This reflects the original dataset and is perfectly fine from a business perspective.

However, the model will learn much more about the negative class than the positive class. Why? Because the 0 class has significantly more examples.

But this imbalance is unacceptable for our purposes. If left unaddressed, the model will become skewed, favoring the negative class. We need a model that performs equally well for both classes — positive and negative.

Why Balance Only the Training Data?

Class balancing is applied exclusively to training data. The reason is straightforward: this step is designed to aid the model during the learning phase.

The validation and test datasets are used after the training is complete. At that point, it doesn’t matter if these datasets are unbalanced because the model’s learning process is already finished.

Balancing provides the model with a “push” during training to ensure that it learns effectively from both the positive and negative classes.

#21. Creating an index for positive class samples.
rows_pos = df_train.TARGET_VARIABLE == 1

First, observe that I’ll create an index based on the target variable with the value 1, which represents the positive class.

I’ll then separate the positive and negative values:

#22. Defining positive and negative class values from the index.
df_train_pos = df_train.loc[rows_pos]
df_train_neg = df_train.loc[~rows_pos]

In other words, I’ll split the records into df_train_pos and df_train_neg, which represent the positive class and the negative class, respectively.

#23. Determining the minimum value between positive and negative class samples.
n = np.min([len(df_train_pos), len(df_train_neg)])

Next, I’ll take the minimum value in n.

#24. Obtaining random samples for the balanced training dataset.
df_train_final = pd.concat([df_train_pos.sample(n=n, random_state=64),
                            df_train_neg.sample(n=n, random_state=64)],
                           axis=0,
                           ignore_index=True)

I’ll obtain random values for the training dataset df_train_final.

This step ensures balance.

Notice that I'm using the sample method once again. Why? To ensure the balancing process remains random.

This is crucial to avoid introducing any forced patterns into the data. A random approach is standard in ML and Data Science, except when working with time series.

For now, I’ll proceed with random sampling, and here’s how we perform it:

#25. Sampling and resetting the index for the final training dataset.
df_train_final = df_train_final.sample(n=len(df_train_final), random_state=64).reset_index(drop=True)

Let’s now check the balance:

#26. Printing the class balance in the training dataset.
print('Training Balance (n = %d): %.3f' % (len(df_train_final),
calculate_prevalence(df_train_final.TARGET_VARIABLE.values)))

# -----> Training Balance (n = 3186): 0.500

See that I have a 50/50 balance. It’s not mandatory to have exactly 50/50. You could have distributions like 45/55 or 48/52. It’s not a strict requirement.

In this case, what did we do? We simply sampled examples from one class to balance with the other. Notice that we reduced the size of the training data. Earlier, what was the size?


About 8,050 rows in the training set, right? Here, we reduced it to 3,186 rows. Why? Because we applied a strategy called undersampling.

What is undersampling?

It involves reducing the size of the majority class. In our case, the majority class is the negative class (0).

By removing records from this majority class, we lose data. This reduction is intentional and forms the basis of the undersampling approach.

Oversampling

Oversampling does the opposite: it increases the data volume for the minority class. This often involves creating synthetic data.

Pros and Cons:

Undersampling:

  • Advantage: You work exclusively with real data, avoiding synthetic additions.
  • Disadvantage: You lose some data.

Oversampling:

  • Advantage: You don’t lose data; instead, you increase the dataset size.
  • Disadvantage: Synthetic data can introduce biases or tendencies in the model.

Which strategy is better?

There isn’t a universal answer — it depends on the data and context:

  1. Large datasets: Undersampling works well.
  2. Small datasets: Oversampling is often better since preserving data volume is critical.
  3. Deep Learning models: These algorithms require significant amounts of data. In this case, oversampling is almost always the go-to option.

I’ve added a brief summary highlighting the differences between undersampling and oversampling.

For this example, we used undersampling.
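
For completeness, here is what the oversampling route could look like. The sketch below uses SMOTE from the imbalanced-learn package, which is not part of this project and would need to be installed separately (pip install imbalanced-learn); it is shown only to illustrate the alternative strategy:

from imblearn.over_sampling import SMOTE

# Predictors and target from the (still imbalanced) training set
X = df_train[cols_input].values
y = df_train["TARGET_VARIABLE"].values

# Generate synthetic minority-class samples until both classes have the same size
smote = SMOTE(random_state=64)
X_balanced, y_balanced = smote.fit_resample(X, y)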

Let’s save everything we’ve done so far to disk:

#27. Saving all datasets to disk in CSV format.
df_train.to_csv('train_data.csv', index=False)
df_train_final.to_csv('train_data_balanced.csv', index=False)
df_valid.to_csv('validation_data.csv', index=False)
df_test.to_csv('test_data.csv', index=False)

This is a good strategy, and here’s why: after all the significant work we’ve done — preparing the data, dividing it into subsets, balancing classes, and so on — it’s important to solidify our progress.

What should you do now?

#28. Saving the input data (predictor columns) for later use.
pickle.dump(cols_input, open('cols_input.sav', 'wb'))

We will save everything to disk and create a dump of the column names.

This will generate a file containing only the column names, which will make it easier later when loading the data or even new data.
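
Later, in the deployment phase or in another notebook, the same list of column names can be read back from disk. A minimal sketch, assuming the cols_input.sav file created above:

# Load the list of predictor column names saved earlier
cols_input = pickle.load(open('cols_input.sav', 'rb'))

# Expected: 178 column names
print(len(cols_input))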

This strategy can also help in future projects. Then, we will create the X and Y matrices:

#29. Defining the feature matrices (X).
X_train = df_train_final[cols_input].values
X_valid = df_valid[cols_input].values

#30. Defining the target vectors (Y).
y_train = df_train_final['TARGET_VARIABLE'].values
y_valid = df_valid['TARGET_VARIABLE'].values

At this stage, these matrices are practically the final step before training the model.

So, let’s convert the data into matrices, starting with X and then Y. After that, we will print their shapes:

#31. Printing the shapes of training and validation datasets.
print('Shape of training data:', X_train.shape, y_train.shape)
print('Shape of validation data:', X_valid.shape, y_valid.shape)

And then, we have the format of the training data for you:

#32. Displaying the training feature matrix.
X_train

Up to this point, we are working with the raw original data format. But at any moment, did I change the original data format? No.

I made adjustments, moved data here and there, removed some records, but the data remains in its original format, with the same scale, for instance.

No changes have been made so far. However, now it’s time to make the change — precisely, the standardization.

8. Standardization

A typical machine learning project involves around 15 steps, and I will cover them all in this project.

We are now reaching approximately the halfway point with step number 8. Up to this point, we’ve done a tremendous amount of work, made numerous decisions, and there’s still much more ahead.

A professional-level machine learning project is a task that requires significant effort and is considered a high-level activity.

So, why do we need standardization, which is a data preprocessing strategy? The reason lies in the fact that the data are in different scales.

This impacts various machine learning algorithms. Many algorithms, in fact, assume that the data are already on the same scale.

From a business perspective, having different scales is not a problem — it’s often expected. However, machine learning is rooted in mathematics.

For instance, consider the following scenario:

88 is a very different scale compared to 684, isn’t it? When the model performs mathematical calculations, it will end up assigning more weight to features with this kind of scale. In contrast, consider a range like -24 to -28, where the scale difference is much smaller.

This means the model will give disproportionate weight to features with larger scales.

As a result, this creates a series of problems, leading to an imbalanced model, a biased model — essentially, a model you don’t want. What you want is a model capable of achieving mathematical generalization.

That’s why many algorithms (though not all) benefit from data standardization, which involves putting all features on the same scale.

However, there’s an important detail: when standardizing, you must train the scaler using only the training data.

Then, you apply the scaler to the training, validation, and test sets. For now, I’ll focus on using just training and validation.

Take note: You cannot apply standardization before splitting the data into training and testing sets.

Standardization must happen after the split because of the fit process. The fit step trains the scaler—just as the name suggests—using the training data. Once the scaler is trained, you can then apply it to the training, validation, and test sets as needed.

Let’s now begin by creating the scaler using StandardScaler:

#33. Creating the scaler object for standardization.
scaler = StandardScaler()

We perform the FIT, which is the actual training process:

#34. Fitting the scaler to the training data.
scaler.fit(X_train)

I will define the name scalerfile for this scaler:

#35. Saving the scaler object to disk for future use.
scalerfile = 'scaler.sav'

And I will save it to disk:

#36. Saving and loading the scaler object using pickle.

# Save the scaler object to disk.
pickle.dump(scaler, open(scalerfile, 'wb'))

# Load the scaler object for future use.
scaler = pickle.load(open(scalerfile, 'rb'))

As soon as I save it, I will immediately load it for use in the next steps, and I’ll explain why. Then, I apply the standardization to our data matrices:

#37. Applying standardization to the data matrices.
X_train_tf = scaler.transform(X_train)
X_valid_tf = scaler.transform(X_valid)

And next, you now have the data properly standardized:

#38. Displaying the transformed training feature matrix.
X_train_tf

These data are now in a standardized format. Pay close attention — maximum attention here. The information remains exactly the same as in the original matrix above. I merely applied a mathematical trick to adjust the scale of the data. In other words, I modified the data without altering the underlying information.

You can change the data however you like, but you cannot modify the information. If you do, you’ll be changing the essence of the work we are doing. Do you agree with me?

This process of standardization is a mathematical trick very similar to what you learned with quadratic equations. For example, you would multiply both sides of the equation by two. Why? To simplify the equation until you found the value of x. The same idea applies here.

I am simplifying the data by applying a mathematical trick. Did I lose the essence of the information? No. Just like with quadratic equations, when you simplified them, you didn’t lose the essence of the value of x. It was simply a mathematical simplification.

Here, I’m doing the same thing: modifying the data without losing the information. If you modify the information, that’s wrong because you’ve altered the underlying pattern of the data. But the way I’m showing you ensures that only the data is modified.
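
To make the “mathematical trick” concrete: StandardScaler applies a z-score to each column, subtracting that column’s mean and dividing by its standard deviation, both learned during fit. A minimal sketch with a toy array (reusing the 88, 684, -24, -28 values mentioned earlier) shows that the manual calculation and the scaler agree:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column of values on very different scales
toy = np.array([[88.0], [684.0], [-24.0], [-28.0]])

# Standardize with the scaler
scaler_toy = StandardScaler().fit(toy)

# Standardize by hand: z = (x - mean) / standard deviation
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# Both approaches produce the same standardized values
print(np.allclose(scaler_toy.transform(toy), manual))  # True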

Why did I save the scaler to disk and immediately load it afterward? Later, I will use the same scaler again. Any transformation applied to the training data must also be applied to the test data and any new data.

Therefore, when I use the trained model to make predictions, I must apply the same standardization strategy. That’s why I saved the scaler to disk and immediately loaded it — to ensure the file is working.

In computing, anything can go wrong. Absolutely anything. When saving the file to disk, it might get corrupted, lose privileges, or even end up in the wrong folder. Anything can happen.

So, after saving the file to disk, I immediately load it back to verify that it works properly.

9. Predictive Modeling and Metric Reporting

From steps 1 to 8, we haven’t used machine learning yet, even though it could have been applied at certain points. For instance, machine learning can be utilized in balancing strategies, but in a typical project, we usually don’t engage with Machine Learning in the earlier stages.

Now, we are entering the stage of predictive modeling. Here, we will build a model that learns the relationship between input data and the target variable — if such a relationship exists. Once the model is trained, we can provide it with new data, and it will generate predictions, which is our ultimate goal.

That’s why this step is called predictive modeling. It is almost a world of its own — step 9 encompasses numerous possibilities. Let me first give you an overview of what we’ll be doing here:

1. Define functions to calculate metrics:

  • These will be detailed throughout the process.
  • Each metric is explained in the notebook, and I recommend reviewing them carefully.

2. Create versions of the model:

  • Version 1: Linear models.
  • Version 2: Probabilistic models.
  • Version 3: Decision tree models and boosting techniques.

3. Select the best version:

  • Apply cross-validation and hyperparameter optimization.
  • Evaluate and interpret the metrics of the best model.

4. Visualize results:

  • Create some plots to understand the model’s performance better.

5. Final steps:

  • Deploy the model.
  • Conclude the project.

This stage introduces an immense amount of content. Can you believe we’re still only halfway through this project? It’s incredible, isn’t it?

In predictive modeling, we will explore what is necessary to create the best model possible. Do you know the ideal algorithm for this dataset? Neither do I. That’s why we need to experiment.

Do you know the ideal combination of hyperparameters for each algorithm you test? Neither do I. That’s why we need to experiment.

How many versions will we create? Maybe 3, 4, 5, or even 6 — until we achieve the best model possible.

After generating a few versions, we must select the best model, then perform the final evaluation, deployment, and delivery.

This process can be highly iterative. You might create one version and realize it performs poorly. Then you go back, tweak the process, and create another version. Maybe it improves slightly. Then you wonder, “What if I try this?” And so, the cycle continues until you achieve the best model possible.

To start, let’s create the functions:

#39. Function to calculate specificity.
def calc_specificity(y_actual, y_pred, thresh):
    # Specificity = true negatives / all actual negatives, i.e., the proportion
    # of class-0 samples whose predicted probability falls below the threshold.
    return sum((y_pred < thresh) & (y_actual == 0)) / sum(y_actual == 0)

First, the calc_specificity function—this is used to calculate specificity, as we don’t have a built-in function for this in sklearn. Take a look below:

#40. Function to generate a metrics report.
def print_report(y_actual, y_pred, thresh):

    #40.a. Calculate AUC.
    auc = roc_auc_score(y_actual, y_pred)

    #40.b. Calculate accuracy.
    accuracy = accuracy_score(y_actual, (y_pred > thresh))

    #40.c. Calculate recall.
    recall = recall_score(y_actual, (y_pred > thresh))

    #40.d. Calculate precision.
    precision = precision_score(y_actual, (y_pred > thresh))

    #40.e. Calculate specificity.
    specificity = calc_specificity(y_actual, y_pred, thresh)

    print('AUC: %.3f' % auc)
    print('Accuracy: %.3f' % accuracy)
    print('Recall: %.3f' % recall)
    print('Precision: %.3f' % precision)
    print('Specificity: %.3f' % specificity)
    print(' ')

    return auc, accuracy, recall, precision, specificity

I have functions to calculate AUC, accuracy, recall, and precision, but not one for specificity. That’s fine — I know how to program in Python, so I’ll create my own function. Let this serve as an example for you.

“Oh, but there isn’t a built-in function in the framework!” Yes, frameworks aren’t perfect — they might lack certain functions. However, if you understand the concept and the mathematical formula, you can reproduce it using Python programming.

That’s exactly what I did in command #39, where I created a function to calculate specificity.

Next, in command #40, I created a function to print a metric report, which can also serve as a reference for your future projects. The notebook contains a complete description of each metric for your understanding.

After that, we move on to prepare the Threshold:

#41. Setting the threshold to 0.5 for labeling predicted samples as positive.
thresh = 0.5
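
The threshold is what converts predicted probabilities into class labels: anything above 0.5 is labeled 1 (needs maintenance), anything else is labeled 0. A minimal sketch with hypothetical probabilities, reusing the thresh defined above:

import numpy as np

# Hypothetical probabilities, as returned by predict_proba(...)[:, 1]
y_pred_probs = np.array([0.12, 0.48, 0.51, 0.93])

# Apply the threshold to obtain class labels
y_pred_labels = (y_pred_probs > thresh).astype(int)

print(y_pred_labels)  # [0 0 1 1]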

And we are ready to create the first version of our machine learning model, using a linear model.

9.1 Linear Models

We can now create the first version of our model. Let’s start by working with algorithms from the linear models category.

You can’t know in advance which algorithm will be the best. That’s why we do data science — to conduct experiments. We will test a few algorithms, and then we can say, “This algorithm from this category is ideal for this dataset.”

However, the algorithm that works well here might not perform as well in another project. This is why it’s important to learn as many machine learning algorithms as possible.

For this project, I’ve brought you three categories:

  • Linear models
  • Probabilistic models
  • Tree- and decision-based models

These categories cover a vast range of possibilities.

Now, you might ask: Which category should I start with?
I always recommend starting with the simplest category, which is linear models. And that’s exactly what we’re going to do.

In this case, since it’s a classification problem, we will use the logistic regression algorithm — perhaps one of the simplest yet most effective machine learning algorithms.

Why start with the simplest option?
Because it allows you to establish a benchmark — a baseline or starting point. This is the simplest algorithm I can create, and it provides a certain performance. Can I improve on this performance?

If yes, you can then explore more complex algorithms. As you gain experience, it becomes natural to start with more complex algorithms, knowing they might perform better. But for beginners, the best guideline is:

Start with the simplest algorithm, establish your baseline, and then aim to improve the model’s performance by testing algorithms from other categories.

Let’s now call the LogisticRegression function from sklearn:

#42. Building the logistic regression model.

#42.a. Create the classifier (object)
lr = LogisticRegression(max_iter=500, random_state=142)

#42.b. Train and create the model
modelo_dsa_v1 = lr.fit(X_train_tf, y_train)

#42.c. Predictions
y_train_preds = modelo_dsa_v1.predict_proba(X_train_tf)[:, 1]
y_valid_preds = modelo_dsa_v1.predict_proba(X_valid_tf)[:, 1]

print('\nLogistic Regression\n')

print('Training:\n')
#42.d. Generate the metrics for training.
v1_train_auc, v1_train_acc, v1_train_rec, v1_train_prec, v1_train_spec = print_report(y_train,
y_train_preds,
thresh)

print('Validation:\n')
#42.e. Generate the metrics for validation.
v1_valid_auc, v1_valid_acc, v1_valid_rec, v1_valid_prec, v1_valid_spec = print_report(y_valid,
y_valid_preds,
thresh)

This is now the machine learning algorithm.

In #42a, I will define two hyperparameters:

  • One for the maximum number of iterations.
  • Another for the random state, to ensure the random process starts with the same pattern consistently.

This will create the object lr.

  • Next, in #42b, I take the lr object and perform the fit (training) using the transformed training data (X_train_tf) and y_train (which does not require transformation).
  • Finally, in #42c, I extract the probability predictions using this model and print the metric report for you.

Logistic Regression

To simplify, I’m currently using only the training and validation data. The test data will be used later when evaluating the final version of the model.

Notice that we have metrics for both training and validation.

  • AUC is the area under the curve. Shortly, I will show a graph that illustrates AUC very well. It ranges from 0 to 1, with higher values being better. Here, we achieved 0.63, rounded.
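
If you want a preview of that graph, the ROC curve behind the AUC value can be plotted with the roc_curve function imported at the beginning of the notebook. A minimal sketch using the validation predictions computed in #42:

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# False positive rate and true positive rate for the validation predictions
fpr, tpr, thresholds = roc_curve(y_valid, y_valid_preds)

plt.plot(fpr, tpr, label='Logistic Regression (validation)')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()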

Important Note: When running this on your machine, the values might differ slightly due to your CPU’s calculation precision. Don’t forget this!

People often ask: “Why is my result different?” It’s because of your CPU’s precision. I’m using an M2 processor, so use this as a reference to compare with your machine.

Other metrics also range from 0 to 1, with higher values indicating better performance:

  • Accuracy
  • Recall
  • Precision
  • Specificity

When comparing validation metrics to training metrics, you want them to be similar. If there’s a significant discrepancy, this signals a potential issue:

Underfitting:

  • The model doesn’t learn; it fails to identify patterns in the data.
  • Training performance will be poor (a training AUC near 0.5 or below, such as 0.2, suggests underfitting).

Overfitting:

  • The model overlearns details from the training data and cannot generalize to other datasets.
  • In this case, training performance might be reasonable, but validation performance will be poor.

Now I ask you: Is this model good or bad?


In terms of metrics, they are similar across the two samples, which suggests that the model appears balanced. But to determine whether the model is good or bad, you need a comparison criterion, right? Otherwise, it becomes a matter of opinion — everyone has their own “guess.”

Here, we don’t guess. We deal with facts, analysis, and science.

The best way to evaluate whether a model is good is to compare it to another model. However, if you want to evaluate a single model, here’s a basic rule of thumb:

  • Above 0.50 in AUC and accuracy: The model is reasonable and has learned something.
  • Below 0.50: The model is very poor — do not use it.
  • 0.50 to 0.70: A model in this range may be usable but comes with a significant margin of error.
  • Above 0.70: This is where you start considering the model as good enough to use.

Achieving 100% is rare, and when it does appear it is usually just the effect of rounding. Your goal should always be to improve performance as much as possible.

So, can we definitively say this model is good? Not yet. It might even be the best model, but we’ll only know after comparing it to others. That’s why creating multiple versions is crucial.

One strategy would be to stick with logistic regression and fine-tune its hyperparameters. This is a valid option — tweaking hyperparameters and creating new versions of the lr model.

But then I ask you: Are linear models the ideal category for these data?

There’s only one way to find out: Create a version from another category.

For this reason, in version 2, we’ll work with probabilistic models.

9.2 Probabilistic Models

You’ve created the first version of your model and calculated the metrics. Here they are:

Logistic Regression

We now have three options moving forward. The first option is to consider this as the best model you can create and end the predictive modeling process. While this is a valid choice, it’s by far the worst. If someone asks you later, “Is this the best model you could create?” your answer would be, “I don’t know, I didn’t test other options.” This carries a risk but might happen if you lack time or resources to create other versions.

The second option is to continue refining the current approach, either by working further with logistic regression or exploring other algorithms within the linear models category. If you believe this category shows promise, you can refine the current algorithm or try alternatives.

The third option is to switch categories. Linear models might not be ideal for these data. For instance, you could explore probabilistic models, which change the way the algorithm learns from the data. Linear models rely on a set of mathematical calculations, while probabilistic models often use principles based on Bayes’ Theorem.

For this, I’ll use GaussianNB, a representative of the probabilistic category. One major advantage of this algorithm is its simplicity in explanation. A quick search will show you the mathematical formula of Bayes' Theorem. Essentially, this algorithm implements that formula programmatically, making it easy to explain how it reaches its results.
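
For reference, the formula behind this family of algorithms is Bayes' Theorem:

P(y | x) = P(x | y) · P(y) / P(x)

Here, y is the class (maintenance needed or not) and x is the vector of sensor readings: the algorithm estimates the probability of each class given the readings from the probability of the readings given the class.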

However, GaussianNB is "naive" because it assumes that all features are independent of one another, which is rarely true in practice. If this assumption holds, the algorithm performs well. If not, the results may fall short of expectations. Nonetheless, it’s worth experimenting with, and that’s exactly what we’ll do now:

#43. Building the Naive Bayes model.

#43.a. Create the classifier (object)
nb = GaussianNB()

#43.b. Train and create the model
modelo_dsa_v2 = nb.fit(X_train_tf, y_train)

#43.c. Predictions
y_train_preds = modelo_dsa_v2.predict_proba(X_train_tf)[:, 1]
y_valid_preds = modelo_dsa_v2.predict_proba(X_valid_tf)[:, 1]

print('\nNaive Bayes\n')

print('Training:\n')
#43.d. Generate the metrics for training.
v2_train_auc, v2_train_acc, v2_train_rec, v2_train_prec, v2_train_spec = print_report(y_train,
y_train_preds,
thresh)

print('Validation:\n')
#43.e. Generate the metrics for validation.
v2_valid_auc, v2_valid_acc, v2_valid_rec, v2_valid_prec, v2_valid_spec = print_report(y_valid,
y_valid_preds,
thresh)

We will create the classifier nb using GaussianNB in #43a. The training will be done in #43b.

Notice the pattern here—see how we create the model while following the same consistent structure. The only thing that changes is the algorithm, nothing else.

Using the transformed training data X_train_tf and the target data y_train, we perform the training.

Next, in #43c, we retrieve the probability predictions, calculate the metrics, and print them for you:

Logistic Regression vs. Gaussian Naive Bayes

It seems like something has happened here, hasn’t it? Look at the metrics for Model 1 and now for Model 2. We only made one change, just one. What was it? We changed the machine learning algorithm.

This demonstrates exactly what I’ve been telling you — it’s always worth experimenting with algorithms from different categories.

Is logistic regression a bad algorithm? Not at all. Logistic regression is excellent. It’s just that it’s not showing good performance for this dataset. Why? Likely because the dataset has characteristics that don’t align well with the rules of logistic regression.

So, what do you do? You change the category of algorithms. And you may discover that a different category is much better suited for your data.

What do we expect here? That the metrics for training and validation are similar, as is the case here. This is a great sign — it indicates that the model is balanced. It has learned mathematical generalization.

The metrics are proportional, similar — not identical — between training and validation. And the only thing we did was change the algorithm category, using Gaussian Naive Bayes. Interesting, isn’t it?

Now, once again, you’ll need to make a decision.

Do you think this model is good enough for your use case? If so, the project is complete. You can wrap it up, move directly to deployment, deliver the results, make the client happy, and move on to the next project.

But there’s always that lingering question, right? Can I improve the model’s performance by changing the algorithm category?

Personally, I can’t settle for just one or two versions. I always experiment with algorithms from different categories to ensure I can select the most suitable algorithm for the dataset at hand.

9.3 Decision Tree and Gradient Boosting Models

Let’s build the third version of our model using an algorithm from the decision tree and boosting category.

Here, I’ll use one of the market favorites: XGBoost.

XGBoost is widely used by data science practitioners, especially in competitions like those on Kaggle. Why? Because XGBoost delivers outstanding performance in the vast majority of cases.

XGBoost is essentially a group of decision trees employing a boosting strategy. Instead of creating a single model, it creates multiple models, where each decision tree helps to improve the next one.

This means XGBoost combines several weak models into one strong model. That’s the core idea behind boosting.
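To make the idea concrete, here is a minimal sketch of boosting itself (not XGBoost's actual implementation), using shallow scikit-learn trees fit one after another on the remaining errors; XGBoost generalizes this to other loss functions and adds regularization:

# A minimal sketch of the boosting idea: each shallow tree corrects the errors of the ensemble so far
from sklearn.tree import DecisionTreeRegressor
import numpy as np

def tiny_boost(X, y, n_rounds=50, learning_rate=0.1):
    prediction = np.zeros(len(y), dtype=float)    # start from a prediction of 0 for everyone
    trees = []
    for _ in range(n_rounds):
        residual = y - prediction                 # what the current ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=2) # a deliberately weak learner
        tree.fit(X, residual)                     # the weak tree learns the remaining error
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees, prediction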

Within this category, we have several algorithms that generally offer good performance, making them worth testing at the very least.

And that’s exactly what we’re going to do now.

#44. Building the eXtreme Gradient Boosting (XGBoost) classifier model.

#44.a. Create the classifier (object)
xgbc = XGBClassifier()

#44.b. Train and create the model
modelo_dsa_v3 = xgbc.fit(X_train_tf, y_train)

#44.c. Predictions
y_train_preds = modelo_dsa_v3.predict_proba(X_train_tf)[:, 1]
y_valid_preds = modelo_dsa_v3.predict_proba(X_valid_tf)[:, 1]

print('\neXtreme Gradient Boosting Classifier\n')

print('Training:\n')
#44.d. Generate the metrics for training.
v3_train_auc, v3_train_acc, v3_train_rec, v3_train_prec, v3_train_spec = print_report(y_train, y_train_preds, thresh)

print('Validation:\n')
#44.e. Generate the metrics for validation.
v3_valid_auc, v3_valid_acc, v3_valid_rec, v3_valid_prec, v3_valid_spec = print_report(y_valid, y_valid_preds, thresh)

Let’s now create the classifier xgbc. We’ll train it, extract the probabilities, and calculate the metrics—just as I did in the previous two versions.

Here’s an important tip for you: when transitioning from one version to another, make one small modification at a time. Otherwise, you won’t know what caused the change in performance. Makes sense, doesn’t it?

In our case, the only change I’m making for now is the algorithm — nothing else. Everything else remains the same.

Once we’ve selected the best model, I’ll move on to hyperparameter optimization for that version and then make further adjustments. But until then, it’s crucial to work incrementally from one version to the next, so you can identify what caused the effect.

For now, we’re only changing the algorithm. Let’s execute it:

Logistic Regression vs. Gaussian Naive Bayes vs. XGBoost Classifier

What do you observe in the metrics? We achieved an improvement compared to the probabilistic model. Training another version of the model was worth it, wasn’t it?

That’s the key takeaway I want to emphasize: you cannot know beforehand which algorithm will perform best. It’s simply impossible. And this is what you’ll face in every machine learning project — you’ll need to experiment with alternatives until you find the best possible model.

For didactic purposes, I’ll stop here, as there are still five more steps to show you in this project template. However, you could continue. You could explore other algorithms within each category or even experiment with more categories.

Notice that we achieved essentially 100% in training, though it drops slightly in validation. While 100% in training might seem like something to celebrate, it’s not. It can actually indicate overfitting, which is a common characteristic of XGBoost.

XGBoost learns so much — perhaps too much — that it captures the minutiae of the data. While this might sound paradoxical, it’s not what you want. You don’t want the model to learn the details of the data; you want it to learn the mathematical generalization.

The 100% performance in training could be a sign of overfitting, as evidenced by the margin of error in the validation data.

So, what’s next? Once again, it’s time to make a decision.

You might not have all the information you need right now to make the best decision, and that’s okay. Make your choice, move forward, and if you later realize it was the wrong decision, you can always go back and revise it. You can revisit and adjust your choices, creating another model.

In this case, my decision is as follows:

We’ve created three versions, and the XGBoost version (Version 3) has shown the best performance. So, I’ll proceed with Version 3.

To confirm whether there is overfitting, I’ll use cross-validation. After that, I’ll apply hyperparameter optimization to find the most accurate version of the model possible.

From this point on, I’ll focus exclusively on Version 3 with XGBoost.

10. Cross-Validation

We’ve created three versions of the machine learning model, using three algorithms from three different categories. The third version showed the best results and performance.

Can I now take this Version 3 model, deploy it, and start predicting whether new machines need maintenance? No!

But why not?

After all this effort to create the model, why can’t it be used yet?

Let’s address an important point: your job is not to create machine learning models. Your job is to solve business problems. Machine learning is just a means to achieve that goal.

This means you must ensure that you’re delivering the best model possible.

So, is the Version 3 model the best possible model? The honest answer is: I don’t know.

At this point, we have mechanisms to verify whether this model is truly good or not.

The first layer has already been completed — choosing the model. We worked with three versions and identified the one with the best performance. That’s done.

The next layer is to verify whether this model can actually be used. And we have mechanisms for that, such as cross-validation, which is the step we’ll focus on now.

XGBoost Classifier

For example, during the training of XGBoost, the metric reached 1, or 100%, in all cases. This isn’t necessarily a good sign — it could indicate a problem, such as overfitting. Therefore, I need to ensure that the model is actually working well.

The number 1 doesn’t inherently mean something good or bad. It needs to be investigated further. That’s exactly what I’ll do now in step 10, with cross-validation.

The purpose of cross-validation is to ensure the generalization ability of a predictive model, which is precisely what we aim for. I want a model that understands the mathematical relationship between the data, not one that has simply memorized the details of the training data.

Now, the question is: How do we verify this? How do we ensure the model’s ability to generalize?

This is interesting. When we trained the model, we used the training data (fit(X_train_tf, y_train)) based on the split we made earlier. The model learned from one single dataset, right? We only used X_train_tf and y_train—nothing else.

So, what if we trained this model multiple times with different samples of data? This would allow us to verify whether the metrics truly make sense and if they reflect good model performance.

This is exactly what cross-validation does. During cross-validation, I train multiple models with different data samples to verify whether the model consistently delivers the same performance pattern.

Great, isn’t it? That’s the purpose of cross-validation. Let’s apply it now:

#45. Setting up the cross-validation process for the XGBClassifier.

#45.a. Create the classifier
xgbc = XGBClassifier()

#45.b. Configure cross-validation
# For example, using 5 splits and the AUC (Area Under the Curve) scoring metric
n_splits = 5
score = 'roc_auc'

#45.c. Perform cross-validation
cv_scores = cross_val_score(xgbc, X_train_tf, y_train, cv=n_splits, scoring=score)

#45.d. Display the results
print(f"Cross-validation with {n_splits} splits")
print(f"AUC Score in Each Split: {cv_scores}")
print(f"Average AUC Score: {np.mean(cv_scores)}")

First, I’ll create the classifier — essentially setting up the structure of the model, which is the object itself. I’ll then configure the process to perform five splits and use AUC as the evaluation metric, just like in the previous models.

Next, I’ll call the function cross_val_score, passing the xgbc object, the training data (X_train_tf and y_train), and the number of splits (5).

Here’s the key detail: What happens in this process?

The cross_val_score function will take the dataset (X_train_tf and y_train) and create multiple divisions. In practice, the model will be trained first with a subset of the training data and evaluated with another subset for both X and y.

Then, it creates another subset, changing the data used for training and evaluation, and repeats the process. This is done for a total of five rounds.

For each round, the function calculates a score (in this case, AUC). At the end, it calculates the average score across all rounds.
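If it helps to see what cross_val_score does internally, here is a rough hand-written equivalent (a simplified sketch reusing xgbc, X_train_tf, and y_train; the real function also handles scorer details and edge cases for us):

# A simplified, hand-written version of what cross_val_score does internally
from sklearn.model_selection import StratifiedKFold

manual_scores = []
skf = StratifiedKFold(n_splits=5)
for fit_idx, eval_idx in skf.split(X_train_tf, y_train):
    fold_model = XGBClassifier()
    fold_model.fit(X_train_tf[fit_idx], y_train[fit_idx])                 # train on part of the data
    fold_preds = fold_model.predict_proba(X_train_tf[eval_idx])[:, 1]     # evaluate on the held-out part
    manual_scores.append(roc_auc_score(y_train[eval_idx], fold_preds))

print(manual_scores, np.mean(manual_scores))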

This method ensures that the model is evaluated on different subsets of data, providing a more reliable measure of its performance. Let’s execute this process:

Look at what we’ve achieved now: 0.99 in every split. This means that no matter which data samples are passed to the model, it consistently delivers high performance.

This is a strong indication that the model is not suffering from overfitting.

At first glance, observing the metrics, you might think:

  • “100% accuracy in training, but in validation, it makes some errors and doesn’t reach 100%.”
  • Could this be overfitting, where the model learns too much during training and fails to perform well on new data?

To verify this, we feed the model multiple different data samples. For each sample, we calculate the score and then take the average using the np.mean function (see command #45.d).

We divided the data into five splits, and the performance in each split is almost identical. This was precisely the goal.

Now, I have greater confidence in Version 3 of the model. It’s a model that is not overfitting and appears to have learned the generalization of the data effectively.

That’s the purpose of cross-validation — it provides an additional layer of confidence in your model. If you have the opportunity to perform cross-validation, it will give you a stronger sense that the model is a good fit for solving the business problem.

But… is there a way to push this even further? Could we tighten the screws just a bit more to improve performance?

There’s only one way to find out: doing data science.

11. GridSearchCV Hyperparameter Optimization

A machine learning algorithm is nothing more than a function in Python.

It’s essentially a block of code containing the mathematical operations that define the algorithm.

You call this function, pass some arguments to it, and it trains on the data to produce a model.

Since these arguments are Python function parameters, we can make adjustments to the hyperparameters to fine-tune the model’s performance.

%%time

#46. Define the classifier
xgbc = XGBClassifier()

#46.a. Define the hyperparameter space for optimization
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'subsample': [0.7, 0.8, 0.9]
}

#46.b. Set up GridSearchCV
grid_search = GridSearchCV(xgbc, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)

#46.c. Perform the search for the best hyperparameters
grid_search.fit(X_train_tf, y_train)

#46.d. Best hyperparameters found
best_params = grid_search.best_params_

#46.e. Retrieve the model trained with the best hyperparameters
modelo_dsa_v4 = grid_search.best_estimator_

#46.f. Predictions with the optimized model
y_train_preds_optimized = modelo_dsa_v4.predict_proba(X_train_tf)[:, 1]
y_valid_preds_optimized = modelo_dsa_v4.predict_proba(X_valid_tf)[:, 1]

#46.g. Evaluation of the optimized model
print('\neXtreme Gradient Boosting Classifier - Optimized\n')
print('Best hyperparameters:', best_params)

print('\nTraining:\n')
v4_train_auc, v4_train_acc, v4_train_rec, v4_train_prec, v4_train_spec = print_report(y_train, y_train_preds_optimized, thresh)

print('Validation:\n')
v4_valid_auc, v4_valid_acc, v4_valid_rec, v4_valid_prec, v4_valid_spec = print_report(y_valid, y_valid_preds_optimized, thresh)

For example, in the case of XGBoost, every key of the param_grid dictionary defined in command #46.a (max_depth, learning_rate, n_estimators, and subsample) is a hyperparameter.

You might wonder, “Wait a minute, when you created the XGBClassifier, you didn’t specify anything; the parentheses are empty, aren’t they?” Yes, exactly.

When you don’t specify hyperparameter values, frameworks like XGBoost or Scikit-Learn apply default values for each hyperparameter. So, the hyperparameters are there — you just didn’t specify them. The framework used its default settings.

But who guarantees that the default value is the correct value?

When I created the logistic regression model, I explicitly defined two hyperparameters: max_iter=500 and random_state=142.

I specified these values manually. You can do this empirically, adjusting the parameters manually if you already have some knowledge about what works best.

If you don’t specify anything and leave the parentheses empty, the framework completes it with default values. But do you know whether the default values are ideal? Do you think the framework knows the ideal values? No!
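If you are curious which defaults are being filled in, you can inspect them directly. This is a quick check; in recent XGBoost versions some entries show as None, which simply means XGBoost falls back to its own internal default (for example, max_depth=6):

# Inspect the defaults the framework applies when you leave the parentheses empty
defaults = XGBClassifier().get_params()
for name in ['max_depth', 'learning_rate', 'n_estimators', 'subsample']:
    print(name, '=', defaults[name])   # None means XGBoost's internal default is used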

So, what can we do? Hyperparameter optimization.

You select the hyperparameters you want to adjust, define a set of values to test for each, and let GridSearchCV handle the rest.

For example, let’s consider the max_depth hyperparameter, which defines the maximum depth of the decision trees created by XGBoost. I specified the values 3, 4, and 5 to test.

You might ask, “Can I test a value like 6?” Absolutely. “What about 50?” Sure, you can.

But how do you decide which values to test? That’s another decision you’ll need to make.

When you define the values for max_depth, what GridSearchCV does is create combinations of all the specified hyperparameters. It generates multiple models to test these combinations. Take a look:

#46.a. Define the hyperparameter space for optimization
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'subsample': [0.7, 0.8, 0.9]
}
It starts by creating the first model with the hyperparameters 3, 0.01, 100, and 0.7, calculates the metric, and then moves to the next combination (4, 0.01, 100, 0.7), systematically testing all possible combinations. With three values for each of the four hyperparameters, that is 3 × 3 × 3 × 3 = 81 combinations, and since each one is evaluated with cv=5, GridSearchCV performs 405 model fits during the search.

If too many values are tested, this process can take hours or days, so selecting a reasonable range is essential.
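If you want to gauge the size of a grid before committing to it, scikit-learn’s ParameterGrid lets you count the combinations up front (a quick check, reusing the param_grid defined in #46.a):

# Count the combinations GridSearchCV will test before running the search
from sklearn.model_selection import ParameterGrid
n_combinations = len(list(ParameterGrid(param_grid)))
print(n_combinations)        # 81 combinations
print(n_combinations * 5)    # 405 model fits with cv=5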

To define values, check the default hyperparameters in the documentation. Start with one value below and one above the default (3 and 5 if the default is 4).

If needed, refine by testing additional values (e.g., 1 and 7) until the best model is found. While this step isn’t mandatory, it ensures the most accurate model and is a best practice.

To execute, define a param_grid dictionary with hyperparameters and their values, create a GridSearchCV object (with cv=5 for cross-validation), and set n_jobs=-1 to maximize CPU utilization.

Call fit to train multiple models, select the best parameters, and use them for predictions. Metrics are calculated as with previous models.

Start with a small range (3–4 values per hyperparameter), analyze the results, and iteratively refine the grid for efficiency and accuracy.

Pro Tip: Start with a few values for each hyperparameter (e.g., 3–4), run an optimization round, analyze the results, and refine the grid. Avoid overloading param_grid with too many values, as this can make the process take days to complete.

XGBoost Classifier Optimized

Observe the best hyperparameters identified. The learning rate was 0.1, chosen from [0.01, 0.1, 0.2]. The maximum depth was 5, the number of estimators was 200, and the subsample was 0.7.

Notice that max_depth landed on the largest value in its list, which suggests it might be worth running another round with higher values for that hyperparameter to see if an even more accurate model can be achieved.

Do you see the idea? I’ll stop here to focus on demonstrating the concept, but you can continue testing if you’d like.

Now, let’s examine the metrics. The model maintained excellent performance in training. For validation, the scores were 0.993 and 0.962, compared to 0.993 and 0.959 from the previous iteration. Essentially, the performance is the same.

It demonstrates that we’re likely reaching the performance limit of XGBoost, and there isn’t much room for further improvement.

12. Selection of the Best Machine Learning Model

The selection of the best machine learning model is your opportunity to document everything you’ve done so far.

This is your chance to demonstrate your work, justify your decisions, and show how you arrived at the best model.

So, what did I do here?

#47. Creating a DataFrame with the calculated metrics
df_results = pd.DataFrame({'classifier': ['RL', 'RL', 'NB', 'NB', 'XGB', 'XGB', 'XGB_O', 'XGB_O'],
'data_set': ['train', 'validation'] * 4,
'auc': [v1_train_auc,
v1_valid_auc,
v2_train_auc,
v2_valid_auc,
v3_train_auc,
v3_valid_auc,
v4_train_auc,
v4_valid_auc],
'accuracy': [v1_train_acc,
v1_valid_acc,
v2_train_acc,
v2_valid_acc,
v3_train_acc,
v3_valid_acc,
v4_train_acc,
v4_valid_acc],
'recall': [v1_train_rec,
v1_valid_rec,
v2_train_rec,
v2_valid_rec,
v3_train_rec,
v3_valid_rec,
v4_train_rec,
v4_valid_rec],
'precision': [v1_train_prec,
v1_valid_prec,
v2_train_prec,
v2_valid_prec,
v3_train_prec,
v3_valid_prec,
v4_train_prec,
v4_valid_prec],
'specificity': [v1_train_spec,
v1_valid_spec,
v2_train_spec,
v2_valid_spec,
v3_train_spec,
v3_valid_spec,
v4_train_spec,
v4_valid_spec]})

I created a DataFrame containing each of the metrics for training and validation.

The classifier column identifies each model: RL (Logistic Regression), NB (Naive Bayes), XGB (XGBoost), and XGB_O (where “O” stands for optimized with hyperparameter tuning).

The data_set column alternates between train and validation, repeated four times to match the four models, and the remaining columns hold the metrics: AUC, accuracy, recall, precision, and specificity.

As the primary comparison metric, I used AUC, which I recommend deciding on before starting to build the models. Why AUC? It’s ideal for comparing models built with different algorithms, which is our case here.

While we have two XGBoost models, we also have Naive Bayes and Logistic Regression. AUC is particularly effective in evaluating models across categories and algorithms.

Finally, I prepared a plot — it’s always helpful to create a visual representation to simplify understanding and convey results more effectively.

#48. Building the plot

#48.a. Set the plot style
sns.set_style("whitegrid")
#48.b. Set the figure size
plt.figure(figsize=(16, 8))

#48.c. Bar plot
ax = sns.barplot(x='classifier', y='auc', hue='data_set', data=df_results)

#48.d. Set the x-axis label
ax.set_xlabel('Classifier', fontsize=15)

#48.e. Set the y-axis label
ax.set_ylabel('AUC', fontsize=15)

#48.f. Set the tick label size
ax.tick_params(labelsize=15)

#48.g. Add legend
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., fontsize=15)

#48.h. Display the plot
plt.show()

What do you notice here? The blue column represents training data metrics, while the orange column corresponds to validation metrics.

This summary highlights the results of our work.

Logistic Regression showed the worst performance, while Naive Bayes performed significantly better.

Finally, we see essentially a tie between the standard XGBoost model and the XGBoost model with hyperparameter optimization.

This step allows us to document all our work and provide a clear summary of the results, supporting our decisions.

#47. Displaying the comparison table of models
df_results

Finally, we created a table containing all the results, and now we will sort it based on the column corresponding to the metric we’ve chosen as the selection criterion.

#48. Comparison table of models with metrics in validation, sorted by AUC
df_results[df_results['data_set'] == 'validation'].sort_values(by='auc', ascending=False)

AUC Validation Score

Here, we’re filtering based on validation data because the decision must be made using validation metrics, not training data. We applied the filter, sorted the table by the selected metric, and arrived at the final result.

Which model should we use? The standard XGBoost, as it achieved the highest AUC.

This decision is based on a technical criterion, not guesswork, discussion, or doubt. While others might prefer a different criterion, that’s fine — as long as a clear criterion is chosen.

We’ve carried out the entire modeling process professionally:

  • Created models with different algorithms.
  • Applied cross-validation to check for overfitting, which wasn’t present.
  • Performed hyperparameter optimization.
  • Built a total of four models.

Using validation metrics and AUC as the criterion, we determined that the standard XGBoost is the best model. This is the model I’ll deliver to the decision-maker and deploy.

Do you understand the process? This approach will repeat in project after project. While the algorithms, techniques, or datasets might change, the process remains the same.

When running this on your machine, don’t be surprised if the numbers differ slightly from mine. Results can vary between environments because of floating-point behavior, library versions, parallel execution, and random seeds, all of which affect the last decimal places. For example, your run might show the optimized XGBoost as the best model. That’s fine; it reflects your execution, so simply adjust the decision accordingly.

One practical safeguard is to fix the random seeds wherever the algorithm accepts them, as in the sketch below.
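Here is a minimal example of pinning the seed, assuming you rebuild Version 3 with the same training data. XGBClassifier accepts a random_state parameter, and single-threaded training (n_jobs=1) further reduces nondeterminism at the cost of speed:

# Fixing the seed makes reruns more comparable (it does not guarantee bit-identical results)
xgbc = XGBClassifier(random_state=42, n_jobs=1)
modelo_dsa_v3 = xgbc.fit(X_train_tf, y_train)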

Now, let’s save the model to disk:

#49. Saving the best model to disk (Version 3, the standard XGBoost selected above)
pickle.dump(modelo_dsa_v3, open('best_model_dsa.pkl', 'wb'), protocol=4)

And that’s it — we now have the best model. So, what’s next? Another round of evaluation and metric interpretation, but this time using the test data, which we haven’t used until now.

This step isn’t strictly mandatory — you could skip it. Why? It’s primarily to give you extra confidence, ensuring that you’re truly delivering the best possible model.

Didn’t we already create a test sample earlier? Let’s make use of it now. This step, Stage 13, serves as a way to document the final performance of the selected best model.

13. Evaluation and Interpretation of Metrics

Now, we can evaluate and interpret the metrics using a different dataset — the test data.

To do this, I’ll load everything I saved earlier from disk. This is a great way to verify that the files are still valid. Remember, everything on a computer can fail — absolutely everything.

Many people are surprised when problems arise. “How could the file be corrupted? That’s impossible.” Yes, it’s very possible. These issues happen all the time.

So, when saving a file, always remember to load it back to ensure everything is functioning properly.

#50. Loading the best model, columns, and scaler

# Load the best model from disk
melhor_modelo = pickle.load(open('best_model_dsa.pkl', 'rb'))

# Load the input columns and scaler
cols_input = pickle.load(open('cols_input.sav', 'rb'))
scaler = pickle.load(open('scaler.sav', 'rb'))

# Load the data
df_train = pd.read_csv('train_data.csv')
df_valid = pd.read_csv('validation_data.csv')
df_test = pd.read_csv('test_data.csv')

# Create the X and Y matrices

# X
X_train = df_train[cols_input].values
X_valid = df_valid[cols_input].values
X_test = df_test[cols_input].values

# Y
y_train = df_train['TARGET_VARIABLE'].values
y_valid = df_valid['TARGET_VARIABLE'].values
y_test = df_test['TARGET_VARIABLE'].values

# Apply the transformation to the data
X_train_tf = scaler.transform(X_train)
X_valid_tf = scaler.transform(X_valid)
X_test_tf = scaler.transform(X_test)

Let’s load everything from disk — all the files we saved earlier. This includes the model, the column names, the scaler, and the data.

Once everything is loaded, I’ll prepare the matrices by defining X and Y. What needs to be done with the data? I must apply the scaler (the standardizer) again.

Why? Because I saved the data before standardization, so every time I load it, I need to reapply the standardization process to ensure consistency.

#51. Calculating the probabilities

y_train_preds = melhor_modelo.predict_proba(X_train_tf)[:, 1]
y_valid_preds = melhor_modelo.predict_proba(X_valid_tf)[:, 1]
y_test_preds = melhor_modelo.predict_proba(X_test_tf)[:, 1]

Next, I can proceed to make predictions with the model.

After preparing the data, I’ll call the model to generate predictions for the training, validation, and test data.

#52. Performance Evaluation.

thresh = 0.5

print('\nTraining:\n')
train_auc, train_accuracy, train_recall, train_precision, train_specificity = print_report(y_train, y_train_preds, thresh)

print('\nValidation:\n')
valid_auc, valid_accuracy, valid_recall, valid_precision, valid_specificity = print_report(y_valid, y_valid_preds, thresh)

print('\nTest:\n')
test_auc, test_accuracy, test_recall, test_precision, test_specificity = print_report(y_test, y_test_preds, thresh)

Now, I’ll evaluate the performance using our custom function. This represents the final version of the model.

So, now I have the metrics for training, validation, and test data, which is exactly what I need. The metrics don’t need to be identical — they just need to be similar.

If there’s a significant discrepancy, it indicates a probable issue. In our case, the metrics are very similar, which is excellent.

Next, let’s create the ROC curve. I’ll generate it for you here:

#53. Calculating the ROC curve and AUC for training, validation, and test data.

# Calculate the ROC curve for training data
fpr_train, tpr_train, thresholds_train = roc_curve(y_train, y_train_preds)
auc_train = roc_auc_score(y_train, y_train_preds)

# Calculate the ROC curve for validation data
fpr_valid, tpr_valid, thresholds_valid = roc_curve(y_valid, y_valid_preds)
auc_valid = roc_auc_score(y_valid, y_valid_preds)

# Calculate the ROC curve for test data
fpr_test, tpr_test, thresholds_test = roc_curve(y_test, y_test_preds)
auc_test = roc_auc_score(y_test, y_test_preds)

# Plotting the ROC curves
plt.figure(figsize=(16,10))
plt.plot(fpr_train, tpr_train, 'r-', label = 'AUC on Training: %.3f' % auc_train)
plt.plot(fpr_valid, tpr_valid, 'b-', label = 'AUC on Validation: %.3f' % auc_valid)
plt.plot(fpr_test, tpr_test, 'g-', label = 'AUC on Test: %.3f' % auc_test)
plt.plot([0,1], [0,1], 'k--') # Diagonal line for random performance
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
ROC curves

This ROC curve visually represents the model’s performance. The three colored lines correspond to AUC for training, validation, and test data, as indicated by the legend.

How do you interpret this graph? It packs a great deal of information into a single picture.

Notice the dashed diagonal line: it represents your minimum threshold. The model’s ROC curve must lie above this line, bowing toward the upper left corner.

If the curve falls below the diagonal, the model performs worse than random guessing and is essentially worthless; discard it and start over. The diagonal line corresponds to an AUC of 0.5, the bare minimum.

What you’re aiming for is the upper left corner. Why?

  • In the upper left corner, you achieve the highest true positive rate and the lowest false positive rate.
  • A false positive is a model error — you don’t want errors, you want accurate predictions. The true positive rate reflects correct predictions.

The diagonal line, by contrast, is much closer to the false positive region, which is undesirable.

In our case, the three AUC lines are very close to the upper left corner. This is excellent — regardless of the data sample used, the model demonstrates consistently strong performance.

Now, we have full confidence to move this model into production.

14. Model Deployment and Use with New Data

Deploying a Machine Learning model is often a source of confusion. The data scientist’s work ends at Stage 13. Once the best possible model is identified, the data scientist hands it off to a Machine Learning engineer and moves on to the next project.

Model deployment is typically not the responsibility of the data scientist, except in some cases where company roles overlap. However, it’s important to recognize that deploying a model is a completely different process, requiring skills more aligned with software engineering.

From Stages 1 to 13, the data scientist has already done an immense amount of work. Deploying the model requires other expertise, like creating web applications, APIs, or smartphone apps — tasks that fall under the purview of Machine Learning engineers.

How is deployment handled?

Deployment can take various forms, depending on the company’s needs:

  • A web application (requires knowledge of HTML, CSS, JavaScript).
  • An API to integrate the model with other software (developed using Python, JavaScript, Java, Rust, or other languages); a minimal sketch of this option follows the list.
  • A Jupyter Notebook, delivering predictions as a CSV file (as we’ll demonstrate).
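To make the API option concrete, here is a minimal sketch of what a prediction endpoint could look like, assuming Flask is available and reusing the artifacts saved earlier in the project (best_model_dsa.pkl, scaler.sav, cols_input.sav). A real deployment would add input validation, logging, and error handling:

# A minimal, illustrative prediction API built with Flask (not the project's actual deployment)
import pickle
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open('best_model_dsa.pkl', 'rb'))
scaler = pickle.load(open('scaler.sav', 'rb'))
cols_input = pickle.load(open('cols_input.sav', 'rb'))

@app.route('/predict', methods=['POST'])
def predict():
    # One machine per request: a JSON object with one value per sensor column
    payload = request.get_json()
    features = pd.DataFrame([payload])[cols_input]
    proba = model.predict_proba(scaler.transform(features.values))[:, 1]
    return jsonify({'maintenance_probability': float(proba[0])})

if __name__ == '__main__':
    app.run(port=5000)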

What’s next?

From this point forward, the company decides the workflow. If it involves a web app, a web developer will write the necessary code. If it involves an API, a backend engineer will handle the integration.

To clarify the process, here’s a simple example of deployment to demonstrate how to use the model effectively:

#54. Loading new data
new_machine_data = pd.read_csv('new_data.csv')

So, I’ll load new data that arrived in a CSV file. What are these new data?

They’re sensor measurements from IoT devices, just like the ones we’ve been working with throughout the project.

#55. Viewing the first few records of the new machine data
new_machine_data.head()

How many measurements? From X1 to X178 — the exact variables used to train the model.

Now, I need to provide this same number of variables to the trained model.

And yes, the model was trained with standardized data, so I’ll need to apply the scaler to the new data as well.

#56. Applying standardization to the new input data
new_machine_data_scaled = scaler.transform(new_machine_data)

I’ll need to apply the scaler to the new data, just as I did with the training, validation, and test data.

The same standardization process will be applied here to ensure consistency.

#57. Displaying the scaled new machine data
new_machine_data_scaled

Now the data is standardized. These values represent the same information as seen earlier in new_machine_data.head(), but with the scale adjusted.

#58. Class prediction using the best model
melhor_modelo.predict(new_machine_data_scaled)

# ----> array([0])

I then pass these standardized data to the model using the predict method, which returns the prediction. In this case, the result is zero. Based on the IoT sensor data, this machine does not require maintenance.

And that’s an example of deploying a Machine Learning model. Here, I’m using the model to solve the specific problem it was designed for.

Once again, the data scientist’s job is not to handle deployment. I hope this concept is now clear because it often causes confusion.

Deployment involves a range of other techniques and tools that go far beyond Machine Learning. In fact, from this point onward, there’s no Machine Learning involved anymore — it’s all about software engineering and application development.

For now, I’m using the file best_model_dsa.pkl, which contains the final trained model loaded into the melhor_modelo object.

This is the model — it’s trained, finalized, and saved as a file on disk. I’ve loaded it into memory for this session. I’ll now provide it with the standardized data and receive the predictions in return.

There’s no more Machine Learning happening here. Machine Learning ended with Stage 13. Now, we’re simply using the artifact — the model produced through all the previous work.

If the company desires, nothing prevents us from using a CSV file containing data from multiple machines (IoT sensor readings for each).

I can pass the entire dataset to the model, which will return predictions for each machine. These predictions can then be saved into a CSV file and handed off to decision-makers.

Alternatively, I could present the results in a dashboard, a Power BI graph, or any other visualization tool.
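As a rough sketch of that batch workflow (the file name new_machines_batch.csv is hypothetical; it assumes the same sensor columns used during training):

# Score a batch of machines and hand the results off as a CSV file
batch = pd.read_csv('new_machines_batch.csv')                 # hypothetical file, one row per machine
batch_scaled = scaler.transform(batch[cols_input].values)     # apply the same standardization
batch['maintenance_prediction'] = melhor_modelo.predict(batch_scaled)
batch['maintenance_probability'] = melhor_modelo.predict_proba(batch_scaled)[:, 1]
batch.to_csv('maintenance_predictions.csv', index=False)      # deliverable for decision-makers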

15. Conclusion of the Project and Delivering Results

We’ve reached the 15th and final stage of the project: the conclusion. This phase usually involves two professionals with distinct roles.

  1. The Model Creator:
  • Responsible for documentation, model evaluation, generating insight reports, and conducting practical demonstrations.

  2. The Machine Learning Engineer:
  • Focuses on creating a user manual, developing an implementation plan, setting up a monitoring strategy, and ensuring continuous feedback as the model is used in production.

The deployed model will continuously process new data, either from new machines or from the same machines at different times, and deliver predictions accordingly.

Deliverables and Workflow

Depending on the team size and client requirements, deliverables may vary:

  • A documentation file (Word or PowerPoint) summarizing:
  • The model evaluation process.
  • Key insights derived from the project.

Alternatively, it could include a practical demonstration of the model, similar to what has been shown throughout this project.

If the model is integrated into a web application, a user manual might be required to explain how to input data. However, such tasks typically fall within the domain of the Machine Learning engineer rather than the data scientist.

Challenges: Monitoring and Maintenance

Once deployed, the model requires continuous monitoring to address potential issues, such as:

  • Data Drift: A shift in data patterns over time. For instance, historical IoT sensor data may have shown low temperature readings, but if the company’s air conditioning malfunctions, new data might show higher temperature and humidity levels. These changes could lead to model errors, requiring retraining or adjustment.
  • Model Drift: A degradation in the model’s performance caused by changes in hardware (e.g., CPU precision) or software environments. This can also lead to prediction errors and necessitate intervention.

Both data drift and model drift must be handled through regular updates and monitoring.
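As a minimal sketch of a data-drift check, assuming you keep the original training data and collect a DataFrame of recent sensor readings (recent_sensor_readings.csv below is hypothetical), a two-sample Kolmogorov-Smirnov test from SciPy can flag columns whose distribution has shifted:

# Compare the distribution of each sensor in recent data against the training data
from scipy.stats import ks_2samp

recent_data = pd.read_csv('recent_sensor_readings.csv')       # hypothetical file of new readings

for col in cols_input:
    stat, p_value = ks_2samp(df_train[col], recent_data[col])
    if p_value < 0.01:                                         # a small p-value suggests a shifted distribution
        print(f'Possible drift in {col}: KS statistic = {stat:.3f}, p-value = {p_value:.4f}')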

Final Steps

In summary:

  1. Use Case: In this project, IoT sensor data (the new_machine_data readings) was used to predict maintenance needs. If data patterns change over time, the model must be retrained to ensure accuracy.
  2. Maintenance: Continuously monitor the model, document errors, and retrain when necessary to maintain its effectiveness.

Project Closure

With this, the project is completed, the results are delivered, and the client is satisfied. Time to move on to the next challenge.

Thank you, as always! 🐼❤️
All images, content, and text by Leo Anello.

Bibliography, References, and Useful Links

Core Machine Learning and Tools

  1. Scikit-learn Documentation: https://scikit-learn.org/stable/
  2. XGBoost Documentation: https://xgboost.readthedocs.io/
  3. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” by Aurélien Géron

Hyperparameter Optimization

  1. GridSearchCV — Scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Data Drift and Model Monitoring

  1. “Understanding Data Drift”: https://evidentlyai.com/blog/data-drift

IoT and Sensor Data

  1. “IoT Applications in Predictive Maintenance”: https://www.ibm.com/topics/iot-predictive-maintenance
