A Comprehensive Guide to Predictive Modelling — Part 1

This series aims to explain the process of creating a regression model from preprocessing to model fitting.

Ojasvi Jain & Sidh Satam
Apr 21, 2020

In this article, we look at some Data Visualization methods on our dataset to better understand our data.

Dataset Selection

The dataset we chose records the number of suicides in different countries from 1985 through 2016. It also includes factors such as GDP and population that may affect the suicide rate. The link to the dataset can be found here. Since suicide is such a pressing issue, it is imperative to understand the factors that may contribute to it. Through this case study, we hope to identify significant factors and also predict the number of suicides using regression algorithms.

Exploring the Data

Before we perform a regression analysis, we must perform Exploratory Data Analysis to understand what the data is trying to tell us.

The visualizations we have performed are inspired by the work done in the Kaggle kernel found here.

Before performing any Exploratory Data Analysis, we dropped the ‘HDI for year’ column. The reason for this is explained in Part 2 of our series.

Descriptive Statistics

Looking at the mean and standard deviation of each column gives us a general idea of the distribution of the data.
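As a minimal sketch of this step, pandas can produce the summary in one call. The toy DataFrame below is made up for illustration and only mimics a few columns of the dataset (the `population` column name is an assumption); the snippets in the rest of the article assume the usual aliases `pd`, `plt` and `sns`, and a DataFrame `df` that holds the real data.

```python
import pandas as pd

# Toy stand-in for the suicide dataset; the values are made up for illustration.
df_toy = pd.DataFrame({
    "year": [1990, 1990, 2000, 2000],
    "suicides_no": [10, 30, 20, 40],
    "population": [1000, 3000, 2000, 4000],
})

# describe() reports count, mean, std, min, quartiles and max for numeric columns.
summary = df_toy.describe()

# Keep only the two rows discussed in the text.
print(summary.loc[["mean", "std"]])
```

On the real data, the same call would be `df.describe()`.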

Pearson Correlation Heatmap

Pearson correlation measures the strength of the linear relationship between two variables, on a scale from -1 to 1.

The heatmap allows for easy visualization.

# numeric_only=True (pandas >= 1.5) skips text columns such as country
sns.heatmap(df.corr(numeric_only=True), cmap=sns.color_palette("GnBu_d"), annot=True, linewidths=0.25)
plt.show()

As can be seen, there is a relatively strong positive correlation between the population and the number of suicides per year. Hence, the larger the population of a country, the higher the number of suicides generally is.
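To make the number in each heatmap cell concrete, here is a small sketch that computes the Pearson coefficient by hand and compares it with pandas. The helper `pearson_r` and the sample values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Made-up values for illustration only.
population = [1000, 2000, 3000, 4000, 5000]
suicides = [12, 18, 35, 37, 55]

r_manual = pearson_r(population, suicides)
r_pandas = pd.Series(population).corr(pd.Series(suicides))  # Pearson by default

print(round(r_manual, 4), round(r_pandas, 4))
```

A value close to 1 means the two variables rise together almost linearly, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.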

Pairs plot

A pairs plot draws a scatterplot for every pair of variables in the dataset and shows the distribution of each individual variable on the diagonal. This lets us see both the distribution of single variables and the relationships between two variables.

sns.set()
sns.pairplot(df)
plt.show()

Similar to the correlation heatmap, we observe a general rise in the number of suicides in a year as the population rises. No other distinct insights are found.

Number of Suicides vs Year Segregated by Generation

Seaborn's lmplot is used to visualize a linear relationship as determined through regression.

It draws a scatterplot of the two variables (number of suicides and year), fits the regression model Suicides ~ year, and plots the resulting regression line along with a 95% confidence interval for that regression.

The data is segregated by generation so that each generation gets its own regression line.

g = sns.lmplot(x="year", y="suicides_no", hue="generation", truncate=True, height=5, data=df)
g.set_axis_labels("Year", "Suicides No")
plt.show()

As the output shows, the number of suicides has some variance, especially for the Boomer generation and to a lesser extent for the Silent Generation and Generation X. Still, the regression lines are nearly horizontal and similar for all generations, so generation does not appear to have a significant impact on how the number of suicides changes over the years.
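One way to put a number on "nearly horizontal" is to fit the same kind of line per generation and read off the slope. The sketch below does this with `numpy.polyfit` on a made-up toy frame; the helper `slope_per_group` is hypothetical and is not part of seaborn.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; column names follow the article's code.
df_toy = pd.DataFrame({
    "year": [1990, 2000, 2010, 1990, 2000, 2010],
    "suicides_no": [100, 110, 120, 50, 50, 50],
    "generation": ["Boomers"] * 3 + ["Generation X"] * 3,
})


def slope_per_group(df, group_col="generation", x="year", y="suicides_no"):
    """Fit y ~ x by least squares within each group and return the slopes."""
    slopes = {}
    for name, group in df.groupby(group_col):
        slope, _intercept = np.polyfit(group[x], group[y], deg=1)
        slopes[name] = float(slope)
    return slopes


slopes = slope_per_group(df_toy)
print(slopes)
```

A slope near zero corresponds to a flat regression line in the lmplot; calling `slope_per_group(df)` on the real data would give the change in suicides per year for each generation.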

Descriptive Statistics for Generation

f, ax = plt.subplots(1, 2, figsize=(18, 8))
# Pie chart: share of rows belonging to each generation
df['generation'].value_counts().plot.pie(explode=[0.1, 0.1, 0.1, 0.1, 0.1, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Generations Count')
ax[0].set_ylabel('Count')
# Bar chart of the same counts; recent seaborn versions need x passed by keyword
sns.countplot(x='generation', data=df, ax=ax[1])
ax[1].set_title('Generations Count')
plt.show()

Boxen plot for Number of Suicides vs Generation

A boxen plot represents the distribution of the data in more detail than a box plot, without distorting its appearance the way an improperly configured violin plot would.

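# On seaborn >= 0.13 the scale parameter is named width_method
sns.boxenplot(x="generation", y="suicides_no", color="b", scale="linear", data=df)
# Rotate the labels before tightening the layout so they are not clipped
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()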
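A boxen plot is also known as a letter-value plot: instead of a single box from the first to the third quartile, it stacks boxes at the median, the fourths, the eighths, and so on. The sketch below computes those boundaries with NumPy. The helper `letter_values` is hypothetical, and seaborn's own implementation picks the number of levels automatically and may compute the quantiles differently.

```python
import numpy as np


def letter_values(values, depth=3):
    """Return (tail, lower, upper) for the median, fourths, eighths, ..."""
    values = np.asarray(values, dtype=float)
    levels = []
    for k in range(1, depth + 1):
        tail = 0.5 ** k  # 0.5 -> median, 0.25 -> fourths, 0.125 -> eighths
        lower = float(np.quantile(values, tail))
        upper = float(np.quantile(values, 1 - tail))
        levels.append((tail, lower, upper))
    return levels


# Evenly spaced toy values from 0 to 100, for illustration only.
levels = letter_values(range(101), depth=3)
for tail, lower, upper in levels:
    print(f"tail={tail:<6} box from {lower} to {upper}")
```

Each extra level describes a thinner slice of the tails, which is why the boxen plot shows more of the shape of a skewed distribution than a box plot does.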

Number of Suicides vs Country

# Total number of suicides per country
suicidesNo = []
for country in df['country'].unique():
    suicidesNo.append(sum(df[df['country']==country].suicides_no))
suicidesNo = pd.DataFrame(suicidesNo, columns=['suicidesNo'])
country = pd.DataFrame(df.country.unique(), columns=['country'])
data_suicide_countr = pd.concat([suicidesNo, country], axis=1)
data_suicide_countr = data_suicide_countr.sort_values(by='suicidesNo', ascending=False)
# Horizontal bar chart of the 15 largest totals
sns.barplot(y=data_suicide_countr.country[:15], x=data_suicide_countr.suicidesNo[:15])
plt.show()

This graph shows the 15 countries with the highest number of suicides.

As the bar plot shows, the highest number of suicides occurs in the Russian Federation, followed by the United States and Japan.

This analysis, however, is not normalized based on the population of the countries.

Number of Suicides per 100k population vs Country

# Sum of the per-row suicides/100k pop values for each country
suicidesperpop = []
for country in df['country'].unique():
    suicidesperpop.append(sum(df[df['country']==country]['suicides/100k pop']))
suicidesperpop = pd.DataFrame(suicidesperpop, columns=['suicides/100k pop'])
country = pd.DataFrame(df.country.unique(), columns=['country'])
data_suicide_countr = pd.concat([suicidesperpop, country], axis=1)
data_suicide_countr = data_suicide_countr.sort_values(by='suicides/100k pop', ascending=False)
# Horizontal bar chart of the 15 largest values
sns.barplot(y=data_suicide_countr.country[:15], x=data_suicide_countr['suicides/100k pop'][:15])
plt.show()

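In the same graph based on suicides per 100k population, the Russian Federation is still at the top, but Japan has moved down and the United States is no longer in the top 15. Note that the code above sums the per-100k values of every row for a country, so the bar lengths serve as a ranking score rather than as an actual rate.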
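An alternative, shown here only as a sketch, is to compute a population-weighted rate: total suicides divided by total population, scaled to 100k. It assumes the dataset has a `population` column alongside `suicides_no`; the toy values and the `rate_per_100k` column name are made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration; "population" is an assumed column name.
df_toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "suicides_no": [10, 30, 5, 5],
    "population": [100_000, 300_000, 10_000, 10_000],
})

# Sum suicides and population per country in a single groupby.
totals = df_toy.groupby("country")[["suicides_no", "population"]].sum()

# Rate per 100k: total suicides over total population covered by the rows.
totals["rate_per_100k"] = totals["suicides_no"] / totals["population"] * 100_000

ranked = totals.sort_values("rate_per_100k", ascending=False)
print(ranked.head(15))
```

In this toy example country A has more suicides in total, but country B ranks first once population is taken into account. Because the rows span several years, the result is an average rate over the whole period rather than the rate of any single year.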

Number of readings segregated on Gender and Age Groups

sns.set()
# Hard-coded values for each age group, in the same order as the labels below
female_ = [175437, 208823, 506233, 16997, 430036, 221984]
male_ = [633105, 915089, 1945908, 35267, 1228407, 431134]
for i, age in enumerate(['15-24 years', '25-34 years', '35-54 years', '5-14 years', '55-74 years', '75+ years']):
    # One subplot per age group, arranged in a 3 x 2 grid
    plt.subplot(3, 2, i+1)
    plt.title(age)
    fig, ax = plt.gcf(), plt.gca()
    sns.barplot(x=['female', 'male'], y=[female_[i],male_[i]], color="#34495e")
fig.set_size_inches(8, 12)
plt.tight_layout()
plt.show()

Joint plot for Suicides/Population vs Year

A joint plot displays the bivariate distribution of two variables along with the marginal distribution of each variable.

fig = sns.jointplot(y='suicides/100k pop', x='year', data=df)
plt.show()

Joint plot for Suicides vs Year

Here the joint plot also shows the estimated linear regression fit on the joint axes.

# x and y must be passed by keyword in recent seaborn versions
sns.jointplot(x="year", y="suicides_no", data=df, kind="reg")
plt.show()

Joint plot for GDP per Capita vs Year

This joint plot uses kernel density estimation to show the distribution of GDP per capita against year as a contour plot.

# x and y must be passed by keyword in recent seaborn versions
g = sns.jointplot(x=df['year'], y=df['gdp_per_capita ($)'], kind="kde", height=7, space=0)
plt.show()

Now that we have explored our data and understand which factors to consider, we proceed to preprocess the data and get our dataset ready for prediction in the next article in this series:

The entire code for the project can be found here: