Eliminate Underfitting and Overfitting with these tricks

Amado Vazquez Acuña
13 min read · Jan 29, 2023


Underfitting and overfitting are problems we face every day. In this article you will learn the most powerful techniques to eliminate them.

  • Underfit: Occurs when the model does not perform well on either the training or the validation data; for example, as we can see in the image, a linear regression is used to explain non-linear data. It can also be caused by missing important variables that would add value to the prediction.
  • Overfit: Occurs when the algorithm performs well on the training data but poorly on the validation data, so the model fails to generalize. It can be caused by outliers in the dataset, by overtraining the model, or by an improper choice of parameters for the statistical model.
  • Optimal: Occurs when the model performs well on both the training data and the validation data, achieving good generalization, so it can be used to estimate new cases. This is the goal we must reach: a robust, useful model for solving problems; a quick way to check which of these situations we are in is sketched below.
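A simple way to diagnose these three situations is to compare the model's score on the training data against its score on the validation data. The helper below is a minimal sketch, not a fixed rule; the `diagnose` function and its thresholds are illustrative assumptions.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def diagnose(model, X, y, gap_tol=0.10, low_score=0.5):
    # Illustrative thresholds: adjust them to your own problem.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                      random_state=42)
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    val_score = model.score(X_val, y_val)
    if train_score < low_score and val_score < low_score:
        return "underfit"   # poor on both sets
    if train_score - val_score > gap_tol:
        return "overfit"    # much better on training than on validation
    return "optimal"        # similar, good scores on both sets

# Hypothetical usage: print(diagnose(LinearRegression(), X, y))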

Techniques to eliminate Underfitting

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
insurance = pd.read_csv('insurence.csv')

We load the libraries and the data.

def preprocesing(df):

    df['sex'] = np.where(df.sex == 'male', 1, 0)
    df['smoker'] = np.where(df.smoker == 'yes', 1, 0)
    df = pd.get_dummies(df, prefix="", prefix_sep="")

    return df

insurance = preprocesing(insurance)

We create a function to pre-process the data.

def datatset(df):

    X = df.drop(columns=['charges']).values
    y = df.charges.values

    return X, y

X,y = datatset(insurance)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.3,
                                                     random_state=42)

We split the data.

from sklearn.linear_model import LinearRegression

lm = LinearRegression().fit(X_train, y_train)

We fit the model on the training data.

r2_train = lm.score(X_train,y_train)
r2_test = lm.score(X_test,y_test)

We obtain two acceptable coefficients of determination; however, we can still improve the performance of the model, for example by creating an auxiliary variable.

Creating new variable

fig = plt.subplots(1, 1, figsize=(20, 8))
_ = sns.scatterplot(data=insurance,
                    x="age",
                    y="charges",
                    hue="smoker")

The graph shows four types of clusters:

  • Non-smokers in good health, who as a consequence do not have severe medical problems.
  • Non-smokers who nevertheless have significant health problems.
  • Smokers who are in good health.
  • Smokers who have serious medical problems.

This can be simplified into two conditions: one where the medical condition is not so serious and one where the case is delicate.

We could create an additional feature to classify users according to their degree of health since, as we can see in the graph, health status clearly influences the medical charges.

smoker_no_split = insurance.query("smoker == 'no'").copy()
smoker_yes_split = insurance.query("smoker == 'yes'").copy()

We create two subsets of the data to make them easier to work with.

fig = plt.subplots(1, 1, figsize=(20, 8))
_ = sns.scatterplot(data=smoker_no_split,
                    x="age",
                    y="charges")

smoker_no_split['medical_problem'] = smoker_no_split["charges"].apply(
    lambda x: "severe" if x > 17000 else "light")

Values greater than $17,000 USD are classified as severe medical problems; the rest as light.

fig = plt.subplots(1, 1, figsize=(20, 8))
_ = sns.scatterplot(data=smoker_yes_split,
                    x="age",
                    y="charges")

smoker_yes_split["medical_problem"] = smoker_yes_split["charges"].apply(
    lambda x: "severe" if x > 32000 else "light")

Above $32,000 USD we create the same kind of grouping as for non-smokers, just with a different threshold.

smoker_yes_split["charges"]=smoker_yes_split["charges"] \ 
.apply(lambda x: 48000 if x > 48000 else x)

We cap the outliers at $48,000.

insurance_clear = pd.concat([smoker_no_split,smoker_yes_split])

We create a dataframe with the data already cleaned.

insurance_clear = preprocesing(insurance_clear)
insurance_clear.medical_problem = np.where(insurance_clear.medical_problem=="severe",1,0)

Again we pre-process the data.

X,y = datatset(insurance_clear)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.3,
                                                     random_state=42)

Again we split the data, this time using the cleaned dataset.

lm = LinearRegression().fit(X_train, y_train)

Again we fit the model on the training data.

Evaluation

from sklearn.model_selection import cross_val_score

We will use cross-validation to assess how well the model generalizes.

r2_train = lm.score(X_train,y_train)
r2_test = lm.score(X_test,y_test)
cv = cross_val_score(lm,X_test,y_test,cv = 10).mean()

The model produces very similar metrics for the training and validation data, and it also achieves a high cross-validation score, so it is capable of solving the problem. This shows the enormous power of adding complementary variables: we can greatly increase the predictive power of the model.

Sometimes the solution is not a more complex model; it is enough to understand the data and find interesting patterns.

Techniques to eliminate Overfitting

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Remove outliers

ford = pd.read_csv('ford.csv')

We load the data.

We observe a clear outlier caused by an error in data collection: it is impossible for a vehicle to be from a year later than 2024, given the date on which this article was written.

There is also no logical explanation for a 2020 Ford Mustang with 50 miles costing less than a special version of the Ford Focus; it is better to remove that record, since it can skew the model.

ford_clear = ford.query('year<=2024 and price<=48_000')

We remove outliers and select those cars priced below £48k.

plt.style.use('ggplot')

fig, ax = plt.subplots(1, 2, figsize=(15, 5))

ford.plot(kind="scatter",
          x="year",
          y="price",
          ax=ax[0],
          title="Dirty Data")

ford_clear.plot(kind="scatter",
                x="year",
                y="price",
                ax=ax[1],
                title="Clean Data")

plt.savefig('remove_outliers.png')
plt.show()

With the outlier removed, a much better distribution of the data is observed.

We create a correlation matrix; the variables most correlated with the price are year and mileage, since a more recent car will generally be more expensive, as it has less wear and tear, which naturally increases its price.
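A minimal sketch to reproduce the correlation matrix, assuming seaborn is available and using only the numeric columns of ford_clear (numeric_only requires a recent pandas version):

import seaborn as sns

corr = ford_clear.corr(numeric_only=True)

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="crest", ax=ax)
ax.set_title("Correlation Matrix")
plt.show()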

X = ford_clear.drop(columns = ['price'])
X_ohe = pd.get_dummies(X,prefix_sep="",prefix="").values
y = ford_clear.price.values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_ohe, y,
                                                     test_size=0.3,
                                                     random_state=42)

We separate the predictor variables from the target variable and split the data.

from sklearn.linear_model import LinearRegression
lm = LinearRegression().fit(X_train, y_train)

We feed the training data to our linear model.

The model shows excellent performance on the training data; however, its performance on the validation data is poor. It is incapable of generalizing and, as a consequence, it is not useful for predicting new data.
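The scores behind this observation can be computed in the same way as in the previous section (a brief sketch):

r2_train = lm.score(X_train, y_train)
r2_test = lm.score(X_test, y_test)
print(f"R2 train: {r2_train:.3f} | R2 test: {r2_test:.3f}")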

Variable Transformation

ford_clear.mileage = ford_clear.mileage.apply(np.sqrt)
ford_clear.price = ford_clear.price.apply(np.log)

Logarithmic and square-root transformations can smooth out outliers, since they pull the data towards values closer to a central value.

Thanks to the transformations we applied, a better correlation with the variable we are trying to estimate can be appreciated.


X = ford_clear.drop(columns = ['price'])
X_ohe = pd.get_dummies(X,prefix_sep="",prefix="").values
y = ford_clear.price.values

X_train, X_test, y_train, y_test = train_test_split(X_ohe, y,
                                                     test_size=0.3,
                                                     random_state=42)

Again we separate the data.

Model Evaluation

lm = LinearRegression().fit(X_train,y_train)

We feed the transformed data into a new model.

from sklearn.model_selection import cross_val_score

r2_train = lm.score(X_train, y_train)
r2_test = lm.score(X_test, y_test)
cv = cross_val_score(lm, X_test, y_test, cv=5).mean()

With the transformed data, the model offers excellent results for both the training and the test data; in addition, cross-validation yields a very good generalization score.

test = pd.DataFrame()
test['true_values'] = np.exp(y_test)
test['pred_values'] = np.exp(lm.predict(X_test))


test.plot(kind="scatter",
          x="true_values",
          y="pred_values",
          figsize=(15, 5),
          title="True Values vs Predicted Values",
          label="Predicted Values")

plt.plot(test.true_values, test.true_values,
         label="True Values")

plt.legend()
plt.savefig('pred_values_true_values.png')
plt.show()

The model can explain most of the data, so it is able to generalize successfully and can be used to solve the problem.

Number of estimators

This technique can be used with all the algorithms in the decision tree family (e.g., Random Forest, XGBoost, Gradient Boosting, CatBoost, LightGBM, etc.) to evaluate the behavior of the loss function on the training and validation data.

To apply this technique, it is recommended to plot from 100 up to 1000 estimators in steps of 10, and to fix a random state of your preference, since the performance of the model can vary between runs; the coarser grid also keeps the computational load on our equipment manageable.

df = pd.read_csv('insurence_clearv2.csv')

We load the data after performing data cleansing.


def dataset():
    return df.drop(["charges"], axis="columns"), df.charges.values

X, y = dataset()


from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, y,
                                                     test_size=0.33,
                                                     random_state=42)

We separate training and validation data.

from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

tf_colummns = make_column_transformer(
    (MinMaxScaler(), ["age", "bmi", "children"]),
    (OneHotEncoder(drop="if_binary"), ["region", "sex", "smoker", "medical_problem"]))

We define a column transformer and wrap it in a pipeline to make the data preprocessing easier.
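The pipeline_model helper used in the next snippet is not defined in the article; a minimal sketch, assuming it simply chains the column transformer with the estimator, could be:

def pipeline_model(estimator):
    # Chain the column transformer with the estimator so the preprocessing
    # is applied consistently on every fit and predict call.
    return Pipeline([("preprocessing", tf_colummns),
                     ("model", estimator)])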


from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

def evaluate(max_depth, lr):

    estimator_list = []
    mse_train_list = []
    mse_test_list = []

    estimators = np.arange(100, 1000, step=2)
    for estimator in estimators:

        model = GradientBoostingRegressor(max_depth=max_depth,
                                          n_estimators=estimator,
                                          learning_rate=lr,
                                          random_state=42)
        model = pipeline_model(model)
        model.fit(X_train, Y_train)
        pred_train = model.predict(X_train)
        pred_test = model.predict(X_test)

        mse_train = mean_squared_error(Y_train, pred_train)
        mse_test = mean_squared_error(Y_test, pred_test)

        estimator_list.append(estimator)
        mse_test_list.append(mse_test)
        mse_train_list.append(mse_train)

    return estimator_list, mse_test_list, mse_train_list


def dataframe_evaluate_trees(max_depth, lr):

    n_trees, mse_test, mse_train = evaluate(max_depth=max_depth, lr=lr)

    df_evaluate = pd.DataFrame({"n_trees": n_trees,
                                "mse_test": mse_test,
                                "mse_train": mse_train})

    return df_evaluate


n_estimators_df = dataframe_evaluate_trees(max_depth=4, lr=0.01)

  • In this case we use a Gradient Boosting model.
  • We create a function that stores the loss values for the training and validation data in lists. Since the dataset has very few rows, we can afford to use more estimators. This function receives as parameters the maximum depth of the trees and the learning rate.
  • We create a dataframe where the performance metrics are stored, to later make the graph.

fig, ax = plt.subplots(1, 1, figsize=(15, 5))

ax.set_title('Max depth 4')

n_estimators_df.plot(kind="scatter",
                     x="n_trees",
                     y="mse_train",
                     label="Train MSE",
                     ax=ax)

n_estimators_df.plot(kind="scatter",
                     x="n_trees",
                     y="mse_test",
                     label="Test MSE",
                     ax=ax)

plt.savefig('n_estimators.png')
plt.show()

We generate a graph.

In this particular case, the best range is between 600 and 650 estimators: beyond that range there is no significant improvement on the test data, while the loss on the training data keeps gradually improving, which is a symptom of the onset of overfitting, and the model also becomes more computationally expensive.

Deep Learning

We first scale the data so that the features are comparable to each other; neural networks train more efficiently on scaled data because they converge faster.
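The scaled arrays X_train_scaler and X_test_scaler used below are not created in the snippets shown; a minimal sketch, assuming the features have already been numerically encoded and using a MinMaxScaler fitted only on the training split, could be:

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data only, then apply it to both splits
# to avoid leaking information from the validation set.
scaler = MinMaxScaler()
X_train_scaler = scaler.fit_transform(X_train)
X_test_scaler = scaler.transform(X_test)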

import tensorflow as tf

Network Architecture

model = tf.keras.Sequential([
    tf.keras.layers.Dense(input_dim=9, units=128, activation="relu"),
    tf.keras.layers.Dense(units=64, activation="relu"),
    tf.keras.layers.Dense(units=64, activation="relu"),
    tf.keras.layers.Dense(units=1, activation="sigmoid")
])
  • The initial layer receives 9 input variables, since that is the number of independent variables, and uses a ReLU activation function, the most commonly used one.
  • The network has two hidden layers of 64 neurons each, also with ReLU activations.
  • Finally, the output layer uses a sigmoid function, since that is the one used for binary classification.
model.compile(loss = "bce",optimizer = "adam",metrics = "acc")
  • We use binary cross-entropy as the loss function, since it is a binary classification problem.
  • The optimizer will be Adam, since it generally gives very good results.
  • The chosen metric will be accuracy.
history = model.fit(X_train_scaler,
                    y_train,
                    batch_size=32,
                    epochs=100,
                    validation_data=(X_test_scaler, y_test),
                    validation_batch_size=16)

We start training the neural network.

plt.style.use('ggplot')

def history_plot(history_model):
    fig, ax = plt.subplots(1, 1, figsize=(15, 5))
    ax.set_title('Accuracy')
    ax.plot(history_model.history['acc'], label="Train Accuracy")
    ax.plot(history_model.history['val_acc'], label="Test Accuracy")
    ax.legend()
    plt.savefig('loss_function.png')

We create a graph to show the training behavior of the model.

There is clear evidence of overfitting: the accuracy on the training data increases considerably, while the accuracy on the validation data lags far behind.

Techniques

Regularization Techniques

  • L1: Pushes the weights of the variables that add little value towards zero, effectively discarding them. It is used when we do not know which variables add value to the model.
  • L2: Reduces the impact of the variables that influence less but still add some value, without zeroing them out. It is used when we believe all the independent variables help the estimation.
  • L1_L2: Combines both regularization methods (see the sketch below).
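As a reference, this is roughly how each of these regularizers is attached to a Keras layer; the penalty values are illustrative, not taken from the article.

from tensorflow.keras import layers, regularizers

dense_l1 = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1(0.001))
dense_l2 = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l2(0.001))
dense_l1_l2 = layers.Dense(64, activation="relu",
                           kernel_regularizer=regularizers.l1_l2(l1=0.001, l2=0.001))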

Dropout

tf.keras.layers.Dropout(rate = 0.3)
  • rate: Percentage of deactivated neurons.

It consists of randomly deactivating a percentage of neurons during training, preventing the model from memorizing overly specific patterns, which is what causes overfitting.

Early Stopping

from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(patience=3,
                           restore_best_weights=True)
  • patience: The number of epochs (3 in this case) without improvement in the monitored metric before training is stopped.
  • restore_best_weights: When training stops, the model restores the weights from the epoch with the best validation result.

It is used to stop training when the loss on the validation data no longer decreases, even though the training loss keeps improving.

Second Model

from tensorflow.keras.regularizers import L1L2
from tensorflow.keras.callbacks import EarlyStopping

model = tf.keras.Sequential([
    tf.keras.layers.Dense(input_dim=9, units=128, activation="relu",
                          kernel_regularizer=L1L2(0.001, 0.001)),
    tf.keras.layers.Dense(units=64, activation="relu"),
    tf.keras.layers.Dense(units=64, activation="relu"),
    tf.keras.layers.Dropout(rate=0.3),
    tf.keras.layers.Dense(units=1, activation="sigmoid")
])

model.compile(loss = "bce",optimizer = "adam",metrics = "acc")

We add all these strategies to deal with overfitting.

early_stop = EarlyStopping(patience=3,
                           restore_best_weights=True,
                           monitor='val_loss')

history = model.fit(X_train_scaler,
                    y_train,
                    batch_size=16,
                    epochs=64,
                    validation_data=(X_test_scaler, y_test),
                    validation_batch_size=8,
                    callbacks=[early_stop])

Better behavior is observed compared to the previous graph, since early stopping halted the training once the validation metrics stopped improving.

Unbalanced Dataset

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

We load the essential libraries.

df = pd.read_csv('Churn_Modelling.csv')

We load the dataset, which contains financial information on the customers of a bank, in order to estimate whether a user will unsubscribe or not.

plt.style.use('ggplot')
fig,ax = plt.subplots(1,1,figsize = (15,5))
ax.set_title('Unbalanced Dataset')
sns.countplot(data = df,x = "Exited",ax = ax)
plt.savefig('unbalanced_dataset.png')
plt.show()

Clearly there is an imbalance between the two categories, which can be detrimental because the model may give more weight to the majority class, so it is our responsibility as data scientists to know how to solve this type of problem.

First Model

After preprocessing the data and splitting it into training and validation sets, we can train our first model.
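The preprocessing itself is not shown in the article; a minimal sketch, assuming the usual columns of Churn_Modelling.csv (dropping the identifier columns and one-hot encoding Geography and Gender), could be:

from sklearn.model_selection import train_test_split

# Column names below assume the standard layout of Churn_Modelling.csv.
features = df.drop(columns=["RowNumber", "CustomerId", "Surname", "Exited"])
features = pd.get_dummies(features, columns=["Geography", "Gender"], drop_first=True)

X = features.values
y = df["Exited"].values

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.3,
                                                     random_state=42,
                                                     stratify=y)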

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=6,
                             n_estimators=400,
                             random_state=42)
clf.fit(X_train, y_train)

We create our first model proposal. In this case we use a Random Forest algorithm, which is quite similar to a decision tree except that it uses hundreds of trees; it works like a voting system, where each estimator generates its own prediction and the class with the most votes wins.

  • max_depth: Maximum depth of each tree.
  • n_estimators: Number of decision trees applied.
  • random_state: Random state of the model.
from sklearn.metrics import confusion_matrix

def cm_plot(cm):

    fig, ax = plt.subplots(1, 1, figsize=(15, 5))
    ax.set_title('Confusion Matrix')
    sns.heatmap(cm, annot=True, fmt=".1f", ax=ax, cmap="crest")

We create a function to plot the confusion matrix, to see how accurately the model classifies each class.

pred = clf.predict(X_test)
cm = confusion_matrix(y_test,pred)

cm_plot(cm)
plt.savefig('confusion_matrix_inbalaced.png')

We observe that the model misclassifies most of the examples of class 1, which is very detrimental. Fortunately, there are very effective techniques to deal with this problem.

Second Model

clf = RandomForestClassifier(max_depth=6,
                             n_estimators=400,
                             random_state=42,
                             class_weight="balanced")

clf.fit(X_train, y_train)

We keep almost the same configuration, except for one extra parameter.

  • class_weight: Balances the weight assigned to each category; it helps enormously when one class is heavily under-represented, as in this case.

We now see an improvement in performance, especially on the minority class, thanks to a simple adjustment of the algorithm's parameters.
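A brief sketch to verify the improvement, reusing the cm_plot helper on the untouched test set (the output file name is hypothetical):

pred = clf.predict(X_test)
cm = confusion_matrix(y_test, pred)

cm_plot(cm)
plt.savefig('confusion_matrix_balanced.png')  # hypothetical file name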

SMOTE

The SMOTE technique also serves to balance the data: it creates new synthetic samples for the under-represented categories, using the characteristics of similar existing samples.

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)

X_train_balanced,y_train_balanced = smote.fit_resample(X_train,y_train)
X_test_balanced,y_test_balanced = smote.fit_resample(X_test,y_test)

We generate the already balanced data.

clf = RandomForestClassifier(max_depth=6,
                             n_estimators=400,
                             random_state=42)

clf.fit(X_train_balanced,y_train_balanced)

We give the already balanced data to the model.

pred = clf.predict(X_test_balanced)
cm = confusion_matrix(y_test_balanced,pred)
cm_plot(cm)
plt.savefig('smote_matrix.png')

With this transformation of the data, the model performs better than with the original data; however, algorithms such as Random Forest and Logistic Regression allow the class weights to be balanced directly, achieving the same result in less time.
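For reference, this is roughly how the same class weighting looks with a logistic regression (a sketch, not taken from the original article):

from sklearn.linear_model import LogisticRegression

# class_weight="balanced" reweights each class inversely to its frequency,
# achieving a similar effect to resampling without creating new data.
log_clf = LogisticRegression(max_iter=1000, class_weight="balanced")
log_clf.fit(X_train, y_train)
print(log_clf.score(X_test, y_test))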

Thank you very much for reading this article. I hope you have learned how to recognize these problems and how to solve them. Remember that this is only the beginning; the most important thing is to practice by building multiple personal projects.

The notebooks used during the development of this article are available in the GitHub repository.
