A Comprehensive Guide to Predictive Modelling — Part 2

This series aims to explain the process of creating a regression model from preprocessing to model fitting.

Ojasvi Jain & Sidh Satam
10 min read · Apr 21, 2020


In this article, we preprocess our data and create machine learning models that help us predict the number of suicides accurately.

In the previous article, we looked at some data visualization methods. It can be found here:

Preprocessing the Data

Our preprocessing involves handling NA values, encoding categorical variables and scaling continuous variables to reduce computational time.

To learn more about preprocessing data, have a look at the following article:

To recap, let’s look at the first few rows of our dataset.

df.head()

For the purpose of this article, we take suicides_no as the dependent variable (y), i.e. we aim to predict the number of suicides using the information provided by the other columns.

Handling NA values

df.isna().sum()/df.shape[0]*100

As a rule of thumb, we drop a column if more than 30% of its values are NA or Null.

Since nearly 70% of the data is missing for the column ‘HDI for year’, we drop that column.

If we did not drop the column, our other option would be to fill the NA values using an imputation method such as the column mean or mode. This would cause around 70% of the data in that column to be synthesized, which would negatively impact our results.
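For illustration only, a simple mean-imputation sketch (applied to a copy of the dataframe, since we drop the column instead) could look like this:

# Hypothetical alternative (not used in this article): mean-impute
# 'HDI for year' on a copy of the dataframe instead of dropping the column.
df_imputed = df.copy()
df_imputed['HDI for year'] = df_imputed['HDI for year'].fillna(
    df_imputed['HDI for year'].mean())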

df.drop(['HDI for year'], axis=1, inplace=True)

Encoding and Scaling

We do not wish to encode or scale our dependent variable, so we assign it to a separate variable (y). We also wish to drop some columns before model fitting, so we assign the remaining columns to another variable (X).

We drop suicides/100k pop as it is essentially suicides_no divided by population. This implies that our dependent variable (suicides_no) would be highly correlated with this column, i.e. y would be highly dependent on a single variable in our X.

We drop country-year and gdp_for_year ($) as they are both derived from existing variables in our X. They therefore provide no new information beyond the existing columns, and the resulting multicollinearity would violate an assumption of linear regression models.
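As a quick sanity check of the redundancy argued above, one can compute the correlation between the target and the per-capita column before dropping it (a sketch; the exact value depends on the dataset version, but it should come out high if the columns are as related as we argue):

# Correlation between the target and the derived per-capita column;
# a high value confirms the redundancy argued above.
print(df['suicides_no'].corr(df['suicides/100k pop']))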

X, y = df.drop(['suicides_no', 'suicides/100k pop', 'country-year', ' gdp_for_year ($) '], axis=1), df['suicides_no']

Finding out variable types

Before we perform encoding and scaling, we need to decide which variables to encode and which to scale. Furthermore, we need to decide the method of encoding and scaling. To do this, we need to find out the type of each variable (i.e. nominal, ordinal, interval or ratio).

We do this by checking the datatypes and head of the data frame.

X.dtypes

Encoding Categorical Variables

From our previous output, we see that the categorical variables are country, sex, age and generation. From the head, we can see that the age column does not contain actual ages but is divided into bins, so it is treated as a categorical variable rather than a continuous one.

We treat these four variables as nominal variables hence we choose to one-hot encode them.

To learn more about categorical variable encoding take a look at the following article:

We use the pandas function get_dummies() for one-hot encoding. A similar function is available in the sklearn package as well.
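For reference, a roughly equivalent route using sklearn's OneHotEncoder might look like the sketch below (shown only for comparison; the column names follow the categorical variables identified above, and we continue with get_dummies):

from sklearn.preprocessing import OneHotEncoder

# Hypothetical sklearn alternative to get_dummies (not used further below).
# drop='first' mirrors drop_first=True; .toarray() densifies the sparse output.
cat_cols = ['country', 'sex', 'age', 'generation']
encoder = OneHotEncoder(drop='first')
encoded = encoder.fit_transform(X[cat_cols]).toarray()
print(encoded.shape)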

X = pd.get_dummies(X, drop_first=True)

Scaling Continuous Variables

From our previous output, we see that the continuous variables are year, population and gdp_per_capita ($).

For the year variable, we choose to calculate a new variable which indicates the number of years before the present year.

X['years_ago'] = 2020 - X['year']
X.drop(['year'], axis=1, inplace=True)
X.head()

We choose to apply MinMax scaling to our continuous variables using MinMaxScaler from the sklearn package.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X['population'] = scaler.fit_transform(X[['population']])
X['gdp_per_capita ($)'] = scaler.fit_transform(X[['gdp_per_capita ($)']])
X['years_ago'] = scaler.fit_transform(X[['years_ago']])
X.head()
Example of scaled columns

To learn more about continuous variable scaling take a look at the following article:

Finally, we split the data into train and test sets using the train_test_split function from the sklearn package.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Fitting the model

The main aspects to keep in mind while finding an appropriate model to fit are to ensure that:

  • The model considers all the variables that have a significant impact on the dependent variable
  • The model captures the variability in the features of our dataset, i.e. a linear model is not forced upon a non-linear data distribution.
  • The model is not overfitted on the training data.
  • The model minimizes the effect of outliers.

Since we need to predict the number of suicides, i.e. a continuous variable, we use regression models.

Evaluation Metric

To evaluate the overall fit of a regression model, we use two evaluation metrics, R² score and RMSE(Root Mean Squared Error).

R² is the proportion of variance explained, i.e. the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model. The R² score is at most 1 and usually lies between 0 and 1 (it can be negative for a model that fits worse than simply predicting the mean). Higher values are better because they mean the model explains more variance.

RMSE is one of the most widely used metrics for regression tasks and is the square root of the average squared difference between the target value and the value predicted by the model. It is preferred in some cases because the errors are squared before averaging, which places a high penalty on large errors. Hence RMSE is useful when large errors are undesirable.
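To make the two definitions concrete, here is a small toy computation of both metrics by hand (made-up numbers, purely for illustration):

import numpy as np

# Toy illustration of the two metrics on made-up numbers.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 38.0])

rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))   # Root Mean Squared Error
ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # proportion of variance explained
print(rmse, r2)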

To learn more about evaluation metrics for regression:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Dictionaries to collect each model's scores for comparison later.
RMSE = {}
R2 = {}

Linear Regression Model

Linear regression is a common Statistical Data Analysis technique. Linear regression models are used to show or predict the extent of the linear relationship between a dependent variable and one or more independent variables.

There are two types of linear regression, simple linear regression and multiple linear regression. We will be using multiple linear regression since we have multiple independent variables and a single dependent variable.

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)

r2 = r2_score(y_test, y_pred)
print('R2 score for Linear Regression:', r2)
R2['lin'] = r2

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE score for Linear Regression:', rmse)
RMSE['lin'] = rmse

An R² score of 0.5845 is now our baseline score for comparing other models.

We now attempt to fit a robust regressor to handle outliers, in case they are hindering our model fitting.

Out of RANSAC, Huber and TheilSen, we chose the Huber Regressor.
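The article does not show the selection code; one plausible way to compare the three candidates is a quick loop along these lines (a sketch only; TheilSen in particular can be slow on a wide one-hot-encoded matrix):

from sklearn.linear_model import RANSACRegressor, HuberRegressor, TheilSenRegressor
from sklearn.metrics import r2_score

# Quick side-by-side of the three robust regressors on the same split.
# Note: TheilSenRegressor may be very slow on this many columns.
for name, model in [('RANSAC', RANSACRegressor(random_state=42)),
                    ('Huber', HuberRegressor(max_iter=1000)),
                    ('TheilSen', TheilSenRegressor(random_state=42))]:
    model.fit(X_train, y_train)
    print(name, r2_score(y_test, model.predict(X_test)))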

Huber (Robust Regression)

Huber regression (Huber 1964) is a regression technique that is robust to outliers. The idea is to use the Huber loss function rather than the traditional least squares; we solve

minimize \sum_{i=1}^{m} \phi(y_i - x_i^T \beta)

for the variable \beta \in \mathbb{R}^n, where the loss \phi is the Huber function with threshold M > 0:

\phi(u) = u^2 for |u| \le M, and \phi(u) = 2M|u| - M^2 for |u| > M.

This function is identical to the least-squares penalty for small residuals, but on large residuals, its penalty is lower and increases linearly rather than quadratically. It is thus more forgiving of outliers.
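To make the piecewise definition above concrete, here is a small numpy sketch of the Huber penalty (note that sklearn's HuberRegressor uses a slightly different internal parameterization, with an epsilon threshold and a scale estimate):

import numpy as np

def huber_penalty(u, M=1.0):
    """Huber loss: quadratic for |u| <= M, linear (slope 2M) beyond it."""
    u = np.asarray(u, dtype=float)
    quadratic = u ** 2
    linear = 2 * M * np.abs(u) - M ** 2
    return np.where(np.abs(u) <= M, quadratic, linear)

print(huber_penalty([0.5, 1.0, 3.0], M=1.0))  # [0.25, 1.0, 5.0]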

from sklearn.linear_model import HuberRegressor

huber = HuberRegressor(max_iter=1000)
huber.fit(X_train, y_train)
y_pred = huber.predict(X_test)

r2 = r2_score(y_test, y_pred)
print('R2 score for Huber Regression:', r2)
R2['huber'] = r2

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE score for Huber Regression:', rmse)
RMSE['huber'] = rmse

As the R² score has dropped (0.4381), it is possible that in the process of down-weighting outliers our model has ignored some high-leverage points, causing the drop in score.

Assuming that the model is overfitted to the training data, we try regularization to prevent overfitting.

Out of Lasso (L1), Ridge (L2) and ElasticNet (L1 + L2), we chose Lasso.

Lasso (L1 Regularization)

Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where coefficient estimates are shrunk towards a central point, such as zero. Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients.

This type of regularization can result in sparse models with few non-zero coefficients; some coefficients can become exactly zero and be eliminated from the model. This ensures that the model considers only the variables that have a significant impact on the dependent variable, i.e. it performs an embedded method of feature selection.

from sklearn.linear_model import Lasso

lasso = Lasso(random_state=42)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

r2 = r2_score(y_test, y_pred)
print('R2 score for Lasso Regression:', r2)
R2['lasso'] = r2

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE score for Lasso Regression:', rmse)
RMSE['lasso'] = rmse
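To see the embedded feature selection described above in action, one can count how many coefficients the fitted Lasso has driven to exactly zero (a quick sketch using the lasso object fitted above):

import numpy as np

# Coefficients that Lasso shrank to exactly zero are effectively
# removed from the model (embedded feature selection).
n_zero = np.sum(lasso.coef_ == 0)
print(f'{n_zero} of {lasso.coef_.size} coefficients were set to zero')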

The R² score is nearly the same (0.5835) as the linear regression model. We suspect that this is because we are trying to fit a linear model on a non-linear dataset. To overcome this, we try fitting a regression model with polynomial features.

Polynomial Regression: extending Linear Models with basis functions

Polynomial Regression is a form of linear regression in which the relationship between the independent variable x and dependent variable y is modelled as an nth degree polynomial. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted by E(y |x).

Using PolynomialFeatures from the sklearn package, we extend the linear regression model by constructing polynomial (interaction) features from the existing features.
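To see what this transformation produces, here is a small standalone illustration on toy numbers (not our dataset); with degree=2 and interaction_only=True, the expansion adds pairwise products of the features but no squared terms:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two toy features a and b; the expansion yields [1, a, b, a*b].
toy = np.array([[2.0, 3.0],
                [4.0, 5.0]])
poly = PolynomialFeatures(degree=2, interaction_only=True)
print(poly.fit_transform(toy))
# [[ 1.  2.  3.  6.]
#  [ 1.  4.  5. 20.]]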

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Pipeline: expand the features with pairwise interaction terms,
# then fit a linear regression on the expanded feature set.
model = Pipeline([('poly', PolynomialFeatures(degree=2, interaction_only=True)),
                  ('linear', LinearRegression(fit_intercept=False))])
model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
print('R2 score for Polynomial Regression:', r2)
R2['poly'] = r2

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE score for Polynomial Regression:', rmse)
RMSE['poly'] = rmse

We have gotten a satisfactory R² score of 0.8833.

However, until now we have used only distance-based models for predicting the number of suicides. We now look at a non-Euclidean model: the Random Forest Regressor.

Random Forest Regression

Random forest is a supervised learning algorithm that uses an ensemble learning method for classification and regression. It is based on the ‘bagging’ technique, and each tree in the random forest is built independently (so the trees can be trained in parallel). It operates by constructing a multitude of decision trees at training time and outputting the mean prediction (since we are performing regression) of the individual trees. The results from each tree are aggregated, through averaging, into a single ensemble model that typically outperforms any individual decision tree.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

r2 = r2_score(y_test, y_pred)
print('R2 score for Random Forest Regression:', r2)
R2['rf'] = r2

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE score for Random Forest Regression:', rmse)
RMSE['rf'] = rmse
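As a quick sanity check of the ‘averaging’ description above, the forest’s prediction should equal the mean of its individual trees’ predictions (a sketch using the rf object fitted above; the individual trees expect a plain array, hence .values):

import numpy as np

# The ensemble prediction is the average of the per-tree predictions.
per_tree = np.stack([tree.predict(X_test.values) for tree in rf.estimators_])
print(np.allclose(per_tree.mean(axis=0), rf.predict(X_test)))  # expected: True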

We obtain an excellent R² score of 0.9889, showing that the non-Euclidean model performs better than the distance-based methods we used earlier.

Evaluating Models

We evaluate the performance of all the models by their R² score and RMSE value.

import seaborn as sns
import matplotlib.pyplot as plt

sns.set()
plt.figure(figsize=(20, 10))

plt.subplot(211)
plt.bar(range(len(R2)), R2.values(), align='center')
plt.xticks(range(len(R2)), list(R2.keys()))
plt.title('R2 Score')
plt.ylim(0, 1.2)
for i, v in enumerate(R2.values()):
    plt.text(i, v + 0.01, str(v), color='black',
             horizontalalignment='center')

plt.subplot(212)
plt.bar(range(len(RMSE)), RMSE.values(), align='center')
plt.xticks(range(len(RMSE)), list(RMSE.keys()))
plt.title('RMSE')
plt.ylim(0, 700)
for i, v in enumerate(RMSE.values()):
    plt.text(i, v + 10, str(v), color='black',
             horizontalalignment='center')

plt.show()

We see that Random Forest Regression has the highest R² score and lowest RMSE value implying it has performed the best of all the models.

Of the distance-based methods, the polynomial regression model has the highest R² score and lowest RMSE value, implying it has performed fairly well.

Understanding the results

As seen above, the random forest model gives us the best results. To understand which features have been given more importance by the random forest model, we plot the 10 most important features.

import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt

feature_importances = pd.Series(rf.feature_importances_,
                                index=X_train.columns).sort_values()[-10:]

sns.set()
matplotlib.rcParams['figure.figsize'] = (8.0, 5.0)
feature_importances.plot(kind="barh")
plt.title("Importance in the Random Forest Model")
plt.xlim(0, 0.55)
for i, v in enumerate(feature_importances):
    plt.text(v + 0.005, i, "{:.4f}".format(v), color='black',
             verticalalignment='center')
plt.show()

We see that population has the most significant impact on the number of suicides with an importance of 0.4266.

Similarly, it can be seen that sex_male has an importance of 0.1334. Since the sex column (which had only two values) was one-hot encoded and the first column was dropped, sex_male is the only column in our X that gives us information about gender. Thus sex is the second most significant factor in predicting the number of suicides.
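As a tiny illustration of why a two-valued column collapses into a single indicator under drop_first=True (toy data, not our dataset):

import pandas as pd

# With only two categories, dropping the first level leaves a single
# indicator column (here sex_male), which still carries all of the information.
toy = pd.DataFrame({'sex': ['male', 'female', 'male']})
print(pd.get_dummies(toy, drop_first=True))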

The entire code for the project can be found here:
