Machine Learning Made Simple: Building Your First Model

Honzik J
9 min read · Mar 15, 2023


Introduction

Behind every successful machine learning application lies a robust model building process. In this article, we’ll guide you through the key steps of model selection, training, and evaluation so that you can build effective machine learning models. We’ll cover various algorithm types, methods for training models, and evaluation metrics, applying these in the context of a churn model. Additionally, we’ll explore common challenges in the model building process and best practices for overcoming them. Whether you’re a seasoned data scientist or just starting out, this article will equip you with the knowledge you need to take your machine learning skills to the next level.

The model building process

Building a machine learning model is like raising a child — you teach it, you feed it data, and you hope it turns out smarter than you.

Whichever model we choose, the basic process for supervised learning is much the same. Model building is an iterative process with three basic steps:

  1. Train the model: Train the model on the training set using an appropriate algorithm. During training, the model learns to make predictions by adjusting its internal parameters based on the input data.
  2. Evaluate the model: Evaluate the model’s performance on the testing set using appropriate evaluation metrics, such as precision, recall, F1 score, or mean squared error.
  3. Improve the model: If the model performance is unsatisfactory, you can improve it by changing the model architecture, hyperparameters, or optimisation algorithm. You may also consider collecting more data or doing some feature engineering.
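The three steps above can be sketched as a small loop. This is only an illustration under stated assumptions: the data is synthetic (generated with make_classification, not the telco dataset), the decision tree and the list of max_depth values are hypothetical candidates, and candidates are compared on a validation split so that a test set could be kept aside for the final check.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the telco dataset)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

history = []
for depth in [1, 2, 3, 5, 8]:  # hypothetical candidate settings
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)                    # 1. train
    score = f1_score(y_val, model.predict(X_val))  # 2. evaluate
    history.append((depth, score))                 # 3. record, then try the next idea

# Keep the candidate with the best validation F1 score
best_depth, best_score = max(history, key=lambda item: item[1])
print(history)
print("Best max_depth:", best_depth, "F1:", best_score)
```

In a real project each pass through the loop might change the features or the algorithm rather than a single setting.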

You will likely repeat this process many times before you arrive at your final model. Let’s look at the process of doing this with a simple classification model to predict churn.

Model building example

In this example, we will build a churn model using the telco dataset. Let’s define our business problem first, then translate it into a machine learning problem:

Business problem: Our objective is to predict which customers are likely to churn, given a set of their attributes. This will allow the business to develop retention strategies to encourage customers to stay.

Machine learning problem: This is a supervised learning problem (as we have labelled data) and a binary classification problem (predicting one of two possible categories). Our target variable is churn, with possible values [yes, no].

Select an algorithm

Selecting a suitable algorithm is a crucial step in the model building process. Different algorithms are suited to different types of problems, and selecting the wrong one can result in poor performance. Here are some techniques for selecting the best model:

  • Consider the type of problem you’re trying to solve: Is it a regression problem (predicting a continuous value), a classification problem (predicting a categorical value), or a clustering problem (grouping similar data points together)?
  • Assess the complexity of the problem: Is a simple model enough, or does the problem warrant a more complex one?
  • Evaluate the size and quality of the data: Do you have enough data to train a complex model? Is the data noisy or incomplete?

Commonly used supervised learning algorithms

  • Classification problems: Logistic Regression, Decision Trees, Support Vector Machines, Neural Networks
  • Regression problems: Linear Regression, Regression Trees, Support Vector Regression, Neural Networks
  • Time Series problems: Autoregressive Integrated Moving Average (ARIMA), Exponential Smoothing (ETS), Long Short-Term Memory (LSTM)
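One way to choose between candidate algorithms is to compare them with cross-validation. The sketch below assumes synthetic data and two arbitrarily chosen classifiers from the list above; the dictionary keys and the settings are illustrative, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Hypothetical shortlist of candidate algorithms
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Mean F1 score over 5 cross-validation folds for each candidate
scores = {}
for name, clf in candidates.items():
    scores[name] = cross_val_score(clf, X, y, cv=5, scoring="f1").mean()

print(scores)
```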

Selected algorithm:

For this example, we will start simple and use logistic regression.

Let’s start by importing the required packages. For this model, we will be using modules from scikit-learn:

# Import necessary libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import seaborn as sns

Next, we can load our dataset:

# load the telco churn dataset
url = 'https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv'
df = pd.read_csv(url)

Data pre-processing

Then we can do some basic data pre-processing:

  • Drop our primary key column (irrelevant features can lead to overfitting, and privacy concerns may arise in a real-life setting)
  • Binary (one-hot) encode the categorical variables
  • Split our data into feature matrix X and target vector y. Our target, in this case, is ‘Churn_Yes’
  • Split into train and test sets

# Pre-processing

# Drop the primary key column
df.drop('customerID', axis=1, inplace=True)

# Binary (one-hot) encode categorical variables
df = pd.get_dummies(df, drop_first=True)

# Prepare feature matrix and target vector
X = df.drop('Churn_Yes', axis=1)
y = df['Churn_Yes']

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We now have four sets to work with: X_train, X_test, y_train and y_test.
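To see why the target column ends up being called ‘Churn_Yes’, it can help to run the encoding step on a tiny example. The toy DataFrame below is made up for illustration; only the column names customerID and Churn mirror the telco data.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "customerID": ["a1", "b2", "c3", "d4"],
    "Contract": ["Monthly", "Yearly", "Monthly", "Yearly"],
    "tenure": [1, 24, 3, 36],
    "Churn": ["Yes", "No", "Yes", "No"],
})

# Drop the identifier, then one-hot encode the text columns.
# drop_first=True drops the first category of each column, so a
# two-valued column such as Churn becomes a single indicator column.
toy = toy.drop(columns="customerID")
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))  # ['tenure', 'Contract_Yearly', 'Churn_Yes']
```

Numeric columns such as tenure pass through unchanged, while each text column is replaced by indicator columns.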

Model training

The next step is to train the model on the data. Training aims to adjust the model’s parameters to minimise the difference between its predictions and the actual values.

We can create a logistic regression model instance and train it on our training set. We do this by passing in our training features and training target. The model will learn the parameters based on these.

# Create a logistic regression object
model = LogisticRegression()

# Train the model on the training set
model.fit(X_train, y_train)
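To make "adjusting parameters to minimise the difference" more concrete, here is a minimal NumPy sketch of logistic regression trained with plain gradient descent on the log loss. It is a teaching illustration only: scikit-learn's LogisticRegression uses a different solver and adds regularisation by default, and the synthetic data, learning rate and iteration count are all assumptions.

```python
import numpy as np

# Synthetic data: two features, labels driven by a known weight vector
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = (X @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, b):
    """Average log loss of the current parameters on the data."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # avoids log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

w, b = np.zeros(2), 0.0
initial_loss = log_loss(w, b)

learning_rate = 0.5  # assumed value for this toy problem
for _ in range(500):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)  # gradient of the loss w.r.t. the weights
    grad_b = np.mean(p - y)          # gradient w.r.t. the intercept
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = log_loss(w, b)
print("Loss before:", initial_loss, "after:", final_loss)
print("Learned weights:", w)
```

Each step nudges the weights in the direction that reduces the loss, which is the sense in which the model "learns" from the training set.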

Model Evaluation

Once we’ve trained our model, we must evaluate it on the test set. This will tell us how well it generalises to unseen data.

We can use a number of evaluation metrics. For this example, we will begin by looking at the confusion matrix, as well as precision, recall and F1 score.

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Evaluate the performance of the model
conf_matrix = confusion_matrix(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=model.classes_)
disp.plot()

# Print the evaluation metrics
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Confusion matrix

A confusion matrix is a table used to evaluate the performance of a classification model by comparing the predicted labels to the true labels of a set of data. The matrix displays the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for each class.

The columns represent the predicted labels, while the rows represent the true labels. The diagonal of the matrix represents the correct predictions, while the off-diagonal elements represent the incorrect predictions.
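The layout described above can be checked on a handful of made-up labels. In this sketch the two label lists are invented for illustration; for a binary problem with labels 0 and 1, scikit-learn's confusion_matrix returns the counts in the order TN, FP, FN, TP when flattened.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = churn, 0 = no churn
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 0, 0]

cm = confusion_matrix(y_true, y_hat)
print(cm)
# [[4 1]
#  [2 1]]

# Rows are true labels, columns are predicted labels
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```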

Confusion matrix for the logistic regression model

Note that the true negative rate is much higher than the true positive rate. This means our model is much better at predicting if a customer will not churn than if they will churn.

Precision and recall

A confusion matrix can then be used to calculate various metrics such as accuracy, precision, recall, and F1 score, which provide a more comprehensive evaluation of the model’s performance.

Precision is the fraction of true positives (TP) among the total number of positive predictions (TP + FP). It represents the proportion of predicted positive instances that are actually positive.

  • Precision = TP / (TP + FP)

Recall is the fraction of true positives (TP) among the total number of actual positive instances (TP + FN). It represents the proportion of positive instances that are correctly predicted and is also known as sensitivity or true positive rate.

  • Recall = TP / (TP + FN)

F1 score is a metric that combines precision and recall into a single score that reflects the balance between them. It is calculated as follows:

  • F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
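As a worked example of the three formulas, the sketch below computes them directly from counts. The helper name precision_recall_f1 and the counts are hypothetical, and the function assumes the denominators are non-zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion matrix counts.

    Assumes tp + fp > 0, tp + fn > 0 and precision + recall > 0.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Made-up counts: 30 true positives, 10 false positives, 20 false negatives
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.75 0.6 0.667
```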

Evaluation in the context of our business problem

In the context of our churn model problem, we might be interested in either precision or recall.

If the company aims to reduce customer churn, recall may be a more important metric. This is because the cost of false negatives (predicting a customer will not churn when they actually do) may be high, as losing a customer can have significant financial consequences for the company. In this case, the goal would be to identify as many potentially churning customers as possible, even if it means some false positives (i.e., predicting a customer will churn when they actually do not).

On the other hand, if the cost of false positives (predicting a customer will churn when they actually do not) is high, then precision may be a more important metric. This could be the case if the company’s retention strategies are expensive or time-consuming.
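One common way to trade precision against recall is to change the probability threshold at which a customer is flagged as likely to churn. The sketch below assumes synthetic data with roughly a 75/25 class split, and the thresholds tried are arbitrary. model.predict uses a threshold of 0.5; here the threshold is applied manually to the output of predict_proba.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data (an assumption for illustration)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

metrics = {}
for threshold in [0.3, 0.5, 0.7]:
    y_hat = (proba >= threshold).astype(int)
    metrics[threshold] = {
        "precision": precision_score(y_test, y_hat, zero_division=0),
        "recall": recall_score(y_test, y_hat),
    }
    print(threshold, metrics[threshold])
```

Lowering the threshold flags more customers, so recall can only stay the same or rise, usually at the cost of precision.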

Is our model any good?

Let’s look at our metrics. In our case, our metrics are not very impressive:

Precision, recall and F1 score for the logistic regression model

Imbalanced data

An important note is that our dataset is skewed towards no churn, so even a useless model will predict correctly most of the time.

  • The ‘churn = yes’ outcome represents only 26.5% of our dataset.
  • This means that a model that predicts ‘churn = no’ every time would still achieve an accuracy of 73.5%.

If a broken clock is correct twice a day, then a broken churn model is correct 73.5% of the time.

This highlights the importance of using precision and recall rather than accuracy as a classification metric.
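The "always predict no churn" baseline is easy to reproduce. This sketch uses made-up labels with 26.5% positives rather than the real dataset, and scikit-learn's DummyClassifier as the useless model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 265 churners (1) and 735 non-churners (0)
y = np.array([1] * 265 + [0] * 735)
X = np.zeros((len(y), 1))  # placeholder features; this baseline ignores them

# Always predicts the most frequent class, here 'no churn'
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_hat = baseline.predict(X)

acc = accuracy_score(y, y_hat)
rec = recall_score(y, y_hat)
print("Accuracy:", acc)  # 0.735
print("Recall:", rec)    # 0.0
```

The accuracy looks respectable, but the recall of zero shows the model never identifies a single churner.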

Wash, rinse, repeat

If we’re not happy with our metrics, we try again! Don’t worry, this is a very common part of the model building process, and you will almost certainly not arrive at your final model on the first try. This is where the experimentation comes in.

To improve performance, we can improve either the data or the model.

Improving the data

One of the best things we can do is to go back to our pre-processing steps and experiment with the data we’re feeding into our model. Things to try are:

  • Feature selection: Adding or removing features and seeing how this affects our performance metrics.
  • Feature engineering: Creating new features from the existing ones to improve the quality and relevance of the data. This can include transforming the data, combining features, or adding new features based on domain knowledge.
  • Feature scaling: Scaling can help improve the performance of some algorithms. The scaling technique applied (or whether we use one at all) will depend on the algorithm we choose.
  • Different imputation strategies: If you have missing values, you can change the way you handle these.
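Imputation and scaling can be combined with the model in a scikit-learn pipeline, so that the same steps are applied to the training and test data. The sketch below is one possible arrangement under stated assumptions: synthetic data with some missing values injected, median imputation and standard scaling chosen arbitrarily.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a few missing values injected for illustration
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X[::25, 0] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Impute, then scale, then fit the classifier.
# The imputer and scaler learn their statistics from the training set only.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)

test_accuracy = pipeline.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Swapping the imputation strategy or the scaler is then a one-line change, which makes this kind of experiment cheap to repeat.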

Improving the model

We can also change things about the model. We can try the following:

  1. Algorithm selection: Sometimes, the choice of algorithm itself can impact the performance of our model. It is important to explore different types and choose the one that performs well and is suited to the problem at hand. It’s also important to consider what level of explainability you’ll need. A black-box model might perform the best, but you might need to explain how it’s arrived at its decisions.
  2. Ensemble methods: Using the ‘wisdom of the crowd’, we can combine multiple models to improve overall performance. This can be done using techniques such as bagging, boosting, or stacking, and is built into certain models such as random forests and gradient-boosted machines.
  3. Regularisation: This aims to control the complexity of the model and prevent it from overfitting, which improves the model’s generalisation ability. L1/L2 regularisation does this by adding a penalty term to the loss function during training; related techniques include dropout and early stopping.
  4. Hyperparameter tuning: Many machine learning models have hyperparameters that control the model’s behaviour and performance. Hyperparameter tuning involves finding the best values for these hyperparameters to improve the model’s performance.
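Regularisation and hyperparameter tuning often go together. In scikit-learn's LogisticRegression, the parameter C is the inverse of the regularisation strength, so a grid search over C tunes how strongly the model is penalised. The sketch below assumes synthetic data, and the grid of values is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Smaller C means stronger regularisation
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",  # choose the metric that matches the business problem
    cv=5,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```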

Now that we have a baseline model, we can iterate on it and build a better model (coming in a future article).

Summary of common challenges and best practices

While building machine learning models, you may encounter various challenges that can impact the performance of your models. Here are some common challenges and best practices for overcoming them:

  • Data quality: Poor data quality can result in inaccurate models. It’s important to clean and pre-process the data before training the models. This includes handling missing values, dealing with outliers, and transforming the data into a suitable format.
  • Overfitting: Occurs when the model fits the training data too closely and performs poorly on new data. To prevent overfitting, use regularisation techniques, collect more data, or use a simpler model.
  • Underfitting: Occurs when the model is too simple to capture the complexity of the problem. To overcome underfitting, use a more complex model, engineer more informative features, reduce regularisation, or tune the hyperparameters.
  • Imbalanced data: Imbalanced data occurs when one class in a classification problem has significantly fewer samples than the other class. Try techniques such as oversampling, undersampling, or synthetic data generation to handle imbalanced data.
  • Generalisation: Ensuring your model can generalise well to new, unseen data is crucial. Use techniques such as cross-validation, regularisation, or ensemble learning to improve the generalisation of your model.
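For imbalanced data, a simple option alongside the resampling techniques listed above is class weighting, which makes errors on the rarer class count for more during training. This sketch swaps in that related technique rather than oversampling or undersampling; the synthetic 90/10 data and the settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives (an assumption for illustration)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Compare recall with and without class weighting
recall_by_weighting = {}
for class_weight in [None, "balanced"]:
    model = LogisticRegression(class_weight=class_weight, max_iter=1000)
    model.fit(X_train, y_train)
    recall_by_weighting[str(class_weight)] = recall_score(
        y_test, model.predict(X_test)
    )

print(recall_by_weighting)
```

Balanced weighting typically raises recall on the minority class at some cost to precision, so it is worth checking both metrics.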

Conclusion

We have built a simple model to outline the model building process. While our model did not perform too well, we now have a baseline model we can iterate on and improve (coming in a future article).

By following the best practices outlined here and addressing the common challenges, you can build effective models that generalise well and make accurate predictions on real-world data. Always be mindful of data quality, pre-process thoroughly, and evaluate with respect to your business problem. By following the steps in this article, you’ll be building models like a pro in no time!
