Studying Suicide Rates from 1985 to 2016 : Prediction using Machine Learning

Sahaana Das
Oct 12, 2019


The World Health Organization (WHO) has called on nations throughout the globe to make suicide prevention a “Global Imperative.”

Every year, close to 800,000 people succumb to worsening mental health and die by suicide. Ironically, despite such alarming figures, most countries still do not have a national strategy in place to prevent suicides. The problem is even larger than these numbers suggest: “for every one person who dies by suicide, 20 or more people are attempting it.”

Given the existing disastrous situation, the world’s biggest healthcare administrative bodies are recommending that suicide prevention should be achieved by the systematic consideration of “risk and protective factors and related interventions.”

In a world where technology finds its way into nearly every aspect of human life, it would be a shame if that same technology could not deliver insights that help identify potential victims of suicide.

Thus, keeping the above facts in mind, this blog uses Machine Learning algorithms to predict suicide rates by analyzing and finding signals correlated with increased suicide rates among different cohorts globally, across the socio-economic spectrum. The dataset used is provided by https://www.kaggle.com/.

Introduction

Let us begin by briefly understanding the technology we are using. Books describe Machine Learning as “a subset of artificial intelligence which focuses mainly on machines learning from their experience and making predictions based on that experience.”

In layman's terms, we find a way to enable computers or machines to make data-driven decisions rather than being explicitly programmed to carry out a certain task. These programs or algorithms are designed so that they learn and improve over time when exposed to new data.

The statistics above help in deciding which factors are relevant for effective prediction.

Dataset

In our problem, the data fed to the machine for effective prediction should be a measure of variability in depressive symptoms, along with other relevant factors such as younger age, mood disorders, childhood abuse, and a personal and parental history of suicide attempts.

Columns in the CSV file, covering 1985–2016:

  • Country
  • Year
  • Sex
  • Age
  • Number of suicides
  • Population
  • Suicides/100k population
  • Country-year
  • HDI for year
  • GDP per year ($)
  • GDP per capita ($)
  • Generation

Before loading the dataset into our code, we first import the necessary libraries.

NumPy is a package in Python used for scientific computing; matplotlib.pyplot is a plotting library used for 2D graphics; pandas is the most popular Python library used for data analysis.
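The three libraries can be imported under their conventional aliases:

```python
# Core scientific-Python stack used throughout this walkthrough.
import numpy as np                  # numerical arrays and math
import pandas as pd                 # tabular data analysis
import matplotlib.pyplot as plt    # 2D plotting
```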

Next, we import the dataset.

The dataset is a CSV file containing data in the above-mentioned columns.
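Loading the file comes down to a single pandas call. A minimal sketch, in which the column names are an assumption about the dataset's schema and a tiny in-memory sample stands in for the real file:

```python
import io
import pandas as pd

# In practice the Kaggle file would be loaded directly, e.g.:
#   df = pd.read_csv("master.csv")   # file name assumed
# For illustration, parse a small in-memory sample with a few of the columns.
sample = io.StringIO(
    "country,year,sex,age,suicides_no,population\n"
    "Albania,1987,male,15-24 years,21,312900\n"
    "Albania,1987,female,15-24 years,4,308000\n"
)
df = pd.read_csv(sample)
print(df.shape)  # (2, 6)
```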

Our model follows supervised learning, which consists of learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called the “target” or “labels.” Most often, y is a 1D array of length n_samples.

All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.

While assigning values to X, we drop columns that we do not require or that are less relevant to predicting the output.
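The split into X and y might look like the following sketch; the toy frame uses a subset of the dataset's columns, and the exact names are assumptions:

```python
import pandas as pd

# Toy frame with a subset of the dataset's columns (names assumed).
df = pd.DataFrame({
    "country": ["Albania", "Albania"],
    "year": [1987, 1988],
    "country-year": ["Albania1987", "Albania1988"],
    "suicides/100k pop": [6.71, 5.19],
    "gdp_per_capita ($)": [796, 769],
})

# "country-year" merely concatenates two existing columns, so it adds no
# information; the target column itself must also be excluded from X.
X = df.drop(columns=["country-year", "suicides/100k pop"])
y = df["suicides/100k pop"]
print(list(X.columns))
```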

Investigating Correlation

Correlation is a technique for investigating the relationship between two quantitative, continuous variables, for example, age and the number of suicides.

This involves investigating the connection between the scatterplot of bivariate data and the numerical value of the correlation coefficient.
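With pandas, the pairwise correlation coefficients can be computed in one call; the values below are illustrative, not taken from the real dataset:

```python
import pandas as pd

# Small numeric sample (values illustrative only).
df = pd.DataFrame({
    "suicides_no": [21, 16, 14, 9],
    "population": [312900, 308000, 289700, 274300],
    "gdp_per_capita": [796, 769, 833, 901],
})

# Pairwise Pearson correlation coefficients; values near 0 indicate
# little linear relationship between a pair of variables.
corr = df.corr()
print(corr.round(2))
```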

We observe that no two variables are linearly correlated.
We consider both GDP per capita and HDI because some countries, like the Soviet Union, had a high GDP per capita but did not distribute that wealth evenly.

Checking for Outliers

Many machine learning algorithms are sensitive to the range and distribution of attribute values in the input data. Outliers in input data can skew and mislead the training process of machine learning algorithms resulting in longer training times, less accurate models and ultimately poorer results.

We check for outliers in the input labels and data by plotting scatter plots of the columns.


Since we observe outliers with suicide rates of 125 and above based on GDP and HDI, it is preferable to drop them.
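Dropping those rows is a simple boolean filter; here is a minimal sketch with made-up rates, two of which exceed the cutoff:

```python
import pandas as pd

# Illustrative rates; two rows exceed the cutoff seen in the scatter plots.
df = pd.DataFrame({"suicides/100k pop": [6.7, 42.0, 130.5, 11.2, 150.0]})

# Keep only rows at or below the outlier threshold of 125.
df = df[df["suicides/100k pop"] <= 125]
print(len(df))  # 3 rows remain
```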

Data Preprocessing

Steps in Data Preprocessing

Data is preprocessed as per the model deployed. The generalized preprocessing we do initially is as follows.

We remove the commas from the values so that the data can be converted to float.
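A minimal sketch of that step, assuming the GDP column arrives as comma-separated strings (the column name is a guess at the dataset's schema):

```python
import pandas as pd

# GDP figures arrive as strings with thousands separators
# (the column name is an assumption about the dataset's schema).
df = pd.DataFrame({"gdp_for_year ($)": ["2,156,624,900", "2,126,000,000"]})

# Strip the commas, then cast the column to float.
df["gdp_for_year ($)"] = (
    df["gdp_for_year ($)"].str.replace(",", "", regex=False).astype(float)
)
print(df["gdp_for_year ($)"].iloc[0])  # 2156624900.0
```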

A machine learning pipeline helps automate machine learning workflows. It works by chaining a sequence of data transformations together into a model that can then be fitted and evaluated as a single unit.

Below, we pipeline steps to fill in missing values with the mean, scale and normalize the values, and encode categorical values using one-hot encoding.
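A sketch of such a pipeline with scikit-learn, where the column names and the toy frame are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with assumed column names; one population value is missing.
df = pd.DataFrame({
    "population": [312900.0, np.nan, 289700.0],
    "gdp_per_capita": [796.0, 769.0, 833.0],
    "sex": ["male", "female", "male"],
})

# Numeric columns: mean-impute then standardize; categorical: one-hot encode.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["population", "gdp_per_capita"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

Xt = preprocess.fit_transform(df)
print(Xt.shape)  # 2 scaled numeric columns + 2 one-hot columns
```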

The rest of the dataset is preprocessed as required by the respective models deployed.

Splitting the Dataset

As we work with datasets, a machine learning algorithm works in two stages: training and testing. Under supervised learning, we split the dataset roughly 80%–20% between the training and test sets.
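The split above is one line with scikit-learn's train_test_split; a minimal sketch on toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features each
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```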

Trying Linear Regression

Linear Regression is a machine learning algorithm based on supervised learning that performs a regression task: it predicts a dependent variable value (y) based on a given independent variable (x). This regression technique thus finds a linear relationship between the input x and the output y.

Performance Evaluation

There are various metrics that can be used to evaluate the performance of a Linear Regression model. We will use the RMSE (Root Mean Squared Error) value, which is a frequently used measure of the differences between values (sample and population values) predicted by a model and the values actually observed.
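Fitting the model and computing RMSE can be sketched as follows on toy data (the real code would use the train/test split from earlier):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Toy data close to y = 3x + 1.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([4.1, 6.9, 10.0, 13.1])

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y, pred))
print(round(rmse, 4))
```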

Trying Support Vector Regression

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane: given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. For regression, the same idea extends to Support Vector Regression (SVR), which fits a function while tolerating errors inside a margin.
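A minimal SVR sketch on toy data; the kernel and the C and epsilon values are illustrative choices that would normally be tuned, e.g. by grid search:

```python
import numpy as np
from sklearn.svm import SVR

# Toy data close to y = 2x.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.1, 6.0, 8.2, 10.0])

# RBF-kernel SVR; C controls regularization, epsilon the error-free margin.
model = SVR(kernel="rbf", C=100.0, epsilon=0.1).fit(X, y)
pred = model.predict([[3.0]])
print(round(float(pred[0]), 2))
```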

Performance Evaluation

Trying Decision Tree Regression

A Decision Tree is a decision-making tool that uses a flowchart-like tree structure to model decisions and all of their possible results, including outcomes, input costs and utility.

The decision-tree algorithm falls under the category of supervised learning algorithms. It works for both continuous and categorical output variables. We can see that if the maximum depth of the tree (controlled by the max_depth parameter) is set too high, the decision tree learns the fine details of the training data, including its noise, i.e. it overfits.
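A minimal sketch of a depth-capped regression tree on toy data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: y = 2x.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])

# Capping max_depth keeps the tree from memorizing the training noise.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
pred = tree.predict([[2.5]])
print(float(pred[0]))
```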

Performance Evaluation

Trying Random Forest Regression

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
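A forest of averaged trees can be sketched as follows; the hyperparameters here are illustrative defaults:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: y = 2x.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])

# Each tree trains on a bootstrap sample; the forest averages their outputs,
# which smooths out the overfitting of any single tree.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = forest.predict([[3.5]])
print(round(float(pred[0]), 2))
```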

Performance Evaluation

Comparison of Results by Different Algorithms

Let us take a look at the collected RMSE values by different algorithms used.

As is evident, the RMSE value is lowest for Random Forest Regression on the test set, and quite good on the training set as well; thus, it performs best for our model.

Scope for Improvement

The accuracy of the model can be further improved by backward elimination. It is a stepwise regression approach that begins with a full (saturated) model and, at each step, gradually eliminates variables from the regression model to find a reduced model that best explains the data.

Conclusion

This blog aimed to explain how different machine learning algorithms can be used to predict suicide rates based on relevant factors collected in the dataset. Hopefully, it succeeded in showing how a rising technology can help address a problem in dire need of solutions.

Find the link to the corresponding video explaining the blog in simpler terms:

Watch to get a clearer understanding.
