Data Analysis of the silent pandemic: Deaths by Suicide — Part 1

A deep dive into India’s 2019 suicide data using Correlation heatmap and K-means clustering

Published in

Analytics Vidhya

7 min read · Jun 30, 2021

Since January last year, Covid-related statistics have been visible everywhere, and like most of us I have been following them. India reported 150,036 Covid deaths up to 31-Dec-2020. Meanwhile, another silent pandemic has been raging underneath, and its counts are not in our faces every day. If we look closely, we can spot them as small news items everywhere.

If you have not guessed yet, I am talking about suicides. For the year 2019, India reported 139,123 deaths by suicide, which is by no means a small number. Through this series, we will explore India’s 2019 suicide data and glean insights.

What are we analyzing here?

NCRB (National Crime Records Bureau, India) publishes suicide data for every Indian state, broken down by cause, age, profession, education, economic status and social status. At the time of writing, the latest available data is for 2019.

In this part of the series, we will explore whether the state-wise suicide numbers correlate with any of the states’ socio-economic factors. We will go deeper into cause-wise, age-wise and education-wise visualization and insights in later parts.

I have taken:

- 2019 suicide data from NCRB

- State-wise economic and development indicators such as literacy rate, total fertility rate, states’ net per capita domestic product (NSDP), Gini coefficient (a measure of income inequality), unemployment rate, mean asset score and percentage Below Poverty Line (BPL) from other government survey reports and census data

- Alcohol consumption and substance abuse data from a report published by the Ministry of Social Justice and Empowerment in 2019

Exploring the data

Salient information from EDA

· The dataset has 36 rows (29 states and 7 union territories) and 25 columns before any clean-up or feature selection

· 5 columns had null values

· Serial number and all columns containing absolute suicide numbers and state’s population were dropped. This reduced the number of columns to 18.

· Suicide rate is our main feature of interest. It is defined as the number of suicides per lakh (100,000) population. Lakshadweep has the lowest suicide rate, 0, and Andaman & Nicobar Islands has the highest, 45.5.

· Nulls in the columns with missing values were imputed with the column median.
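The clean-up steps above can be sketched roughly as follows with pandas. This is only an illustration on a tiny made-up frame: the column names (`Sl_No`, `Suicides`, `Population_Lakh`, `Unemployment_Rate`) and all values are hypothetical stand-ins, not the actual NCRB columns or numbers.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; column names are hypothetical.
df = pd.DataFrame({
    "Sl_No": [1, 2, 3, 4],
    "State": ["A", "B", "C", "D"],
    "Suicides": [500, 1200, 0, 90],
    "Population_Lakh": [100.0, 80.0, 0.7, 4.0],
    "Unemployment_Rate": [5.0, np.nan, 7.0, 9.0],
})

# Suicide rate = number of suicides per lakh (100,000) population.
df["Suicide_Rate"] = df["Suicides"] / df["Population_Lakh"]

# Drop the serial number and the absolute counts; keep only the rate.
df = df.drop(columns=["Sl_No", "Suicides", "Population_Lakh"])

# Impute missing numeric values with each column's median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
```

The median is a reasonable default for a small dataset like this one because, unlike the mean, it is not pulled around by a single extreme state.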

Correlation between features

Given below are the correlations between the 18 features represented on a seaborn heat map. The features that have very high correlation can be seen in bright yellow or deep blue, depending on the direction of the correlation.
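A heat map of this kind can be produced along the following lines. This is a sketch on random stand-in data, assuming Pearson correlation (the pandas default); the feature names and the colour map are my assumptions and may differ from the ones used for the plots in this article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Random stand-in data: 36 rows (one per state/UT), 4 hypothetical features.
df = pd.DataFrame(
    rng.normal(size=(36, 4)),
    columns=["Suicide_Rate", "Gini", "TFR", "Per_Capita_NSDP"],
)

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

# Draw the matrix as an annotated heat map on a fixed [-1, 1] colour scale.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="YlGnBu", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap (toy data)")
fig.tight_layout()
```

Fixing `vmin` and `vmax` to -1 and 1 keeps the colours comparable between the heat map before and after dropping features.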

As a next step, highly correlated features were removed. The variable marking the top 10 states with alcoholics needing help (‘Top10_alcoholics_needing_help’) has almost zero correlation with suicide rate, so it was dropped as well. That variable is 0 for all but 10 rows.

Correlation heatmap after dropping all correlated variables

For the purpose of this analysis, correlations above 0.7 or below -0.7 are considered strong, and correlations between 0.5 and 0.7 or between -0.7 and -0.5 are considered moderate.
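These thresholds can be written down as a small helper. The function name is hypothetical, and the "weak" label for anything below 0.5 in absolute value is my reading of how the term is used later in the article rather than a stated rule.

```python
def correlation_strength(r: float) -> str:
    """Label a correlation coefficient using the thresholds in this article."""
    magnitude = abs(r)  # the sign only gives the direction, not the strength
    if magnitude > 0.7:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    return "weak"  # assumption: everything below 0.5 is treated as weak


# Suicide-rate correlations reported later in this article.
reported = {
    "Total Fertility Rate": -0.44,
    "Gini coefficient": -0.38,
    "Alcoholic Total Percentage": 0.38,
    "Per capita NSDP": 0.4,
}
labels = {name: correlation_strength(r) for name, r in reported.items()}
```

With these thresholds, all four suicide-rate correlations fall in the weak band.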

Some of these correlations between socio-economic and development indicators are interesting to observe: I have taken the data from disparate sources, and yet they do come together and make sense.

Well... almost all of them, except suicide rate. In a bit we’ll see where this one went against my expectation and why.

Strong Correlation

1. Gini coefficient is a measure of how unequally income is distributed across a population. A Gini coefficient of 0 implies perfectly equal income distribution, and 1 implies that one resident earned all the income and the rest earned none.

From this heatmap, it can be observed that Gini coefficient has a strong negative correlation with Mean asset score, Per Capita NSDP and Literacy rates.

2. Gini coefficient has a strong positive correlation with Percentage BPL (% of people below poverty line in the state).

3. Mean asset score has a negative correlation with Percentage BPL.

4. Various substance abuse scores are all strongly correlated with each other.

5. Per capita NSDP and Mean asset score have a strong positive correlation.
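To make the Gini coefficient from item 1 concrete, here is a small sketch that computes it from a list of incomes using the mean absolute difference formulation. The function name and the toy incomes are mine; the state-level Gini values in this analysis come from published reports and were not computed this way.

```python
import numpy as np


def gini(incomes):
    """Gini coefficient: mean absolute difference divided by twice the mean."""
    x = np.asarray(incomes, dtype=float)
    # Sum of |x_i - x_j| over all ordered pairs (i, j).
    total_abs_diff = np.abs(x[:, None] - x[None, :]).sum()
    return total_abs_diff / (2 * len(x) ** 2 * x.mean())


equal = gini([10, 10, 10, 10])      # everyone earns the same
one_earner = gini([0, 0, 0, 100])   # one resident earns everything
```

For a finite population of n residents, the one-earner case gives (n - 1) / n, which is 0.75 for the four residents here and approaches 1 as n grows.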

Moderate Correlation

1. Gini coefficient and total fertility rates have a moderate positive correlation.

(Total fertility rate is the average number of children that would be born to a woman over her lifetime. To sustain population levels, it needs to be about 2.1. Below that, the population declines; above that, it increases.)

2. Total literacy rate has a moderate positive correlation with Per capita NSDP and Mean asset score.

3. Total fertility rate has a moderate negative correlation with Per capita NSDP and Mean asset score.

4. Percentage BPL has a moderate negative correlation with Per capita NSDP and Literacy rate.

Suicide rate correlation with other variables:

1. Suicide rate shows a weak negative correlation with Total Fertility Rate (-0.44) and Gini coefficient (-0.38). In other words, states with higher fertility rates and more income inequality tend to have somewhat lower suicide rates.

2. It shows a weak positive correlation with Alcoholic Total Percentage (0.38) and per capita NSDP (0.4).

At the all-India level, 66.2% of suicides were by people earning below Rs. 1,00,000 per year, which is the lowest income group. This is why it was surprising to see suicide rate correlate negatively with Gini coefficient and positively with per capita NSDP.

Clustering approach to this data

Clustering is an unsupervised learning technique to group similar data points into clusters based on their similarities. In this section, we’ll cluster states based on suicide rates.

I used scikit-learn’s StandardScaler to scale all features, since clustering is a distance-based technique and unscaled features with large ranges would dominate the distances. The resulting clusters, however, were not built around suicide rates, and the variance of suicide rate within each cluster was too high.

I wanted the clusters to be formed based on suicide rates. In other words, I wanted more weight to be given to the Suicide_Rate feature. So, I multiplied this feature by 10 on the scaled dataset.

On running the K-means algorithm for K values from 2 to 10, the following WSS (within-cluster sum of squares, also known as elbow) plot was obtained.
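The scaling and weighting step could look roughly like this. It is a sketch on made-up values: only the `Suicide_Rate` name and the factor of 10 come from the text, while the other column names and the `SUICIDE_WEIGHT` constant are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy feature table with made-up values.
df = pd.DataFrame({
    "Suicide_Rate": [4.0, 10.0, 15.0, 32.0, 45.5],
    "Literacy_Rate": [65.0, 70.0, 75.0, 85.0, 90.0],
    "Per_Capita_NSDP": [40000.0, 90000.0, 120000.0, 200000.0, 150000.0],
})

# Standardize every feature to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df), columns=df.columns, index=df.index
)

# Give suicide rate more say in the distance computation.
SUICIDE_WEIGHT = 10
scaled["Suicide_Rate"] = scaled["Suicide_Rate"] * SUICIDE_WEIGHT
```

Because K-means works with squared Euclidean distances, multiplying a standardized feature by 10 multiplies its contribution to each squared distance by 100, so the clusters end up being driven mostly by that feature.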

3 and 4 seem good options for K. Looking at the Silhouette scores, 3 seems to fare better.
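The search over K can be sketched as below. The data here is synthetic (three well-separated blobs), standing in for the scaled and weighted feature matrix; `n_init` and `random_state` are my choices for reproducibility, not settings taken from the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Synthetic stand-in: 36 points in three well-separated 2-D blobs.
X = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(12, 2)) for c in (0.0, 5.0, 10.0)]
)

wss = {}         # within-cluster sum of squares, used for the elbow plot
silhouette = {}  # mean silhouette score, higher is better
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = model.inertia_
    silhouette[k] = silhouette_score(X, model.labels_)

# K with the highest mean silhouette score.
best_k = max(silhouette, key=silhouette.get)
```

WSS keeps falling as K grows, so the elbow plot only shows where the improvement flattens out; the silhouette score is a useful tie-breaker between neighbouring candidates such as 3 and 4.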

Using K=3 in scikit-learn’s KMeans and obtaining the labels, I grouped the dataset by cluster label and computed the mean of every variable.

Let us look at the 3 clusters sorted in increasing order of suicide rate. The rows are also colour coded as Green, Orange and Red to give an intuitive feel for the clusters.
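A minimal sketch of that grouping step, on a made-up two-column table (the real analysis clustered the scaled and weighted features, and the `Cluster` column name is hypothetical):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy table with made-up values for eight states.
df = pd.DataFrame({
    "Suicide_Rate": [3.0, 4.5, 5.0, 14.0, 15.5, 13.0, 30.0, 35.0],
    "Literacy_Rate": [70.0, 72.0, 68.0, 78.0, 80.0, 79.0, 90.0, 92.0],
})

# Fit K-means with K=3 and attach the cluster label to every row.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["Cluster"] = kmeans.fit_predict(df[["Suicide_Rate", "Literacy_Rate"]])

# Mean of every variable per cluster, sorted by increasing suicide rate.
cluster_means = df.groupby("Cluster").mean().sort_values("Suicide_Rate")
cluster_sizes = df["Cluster"].value_counts()
```

K-means assigns label numbers arbitrarily, so the label carries no ordering. That is why the clusters in this article appear in the order 1, 2, 0 once sorted by suicide rate.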

The 3 Clusters and means of the dimensions for each cluster

Cluster 1

Cluster 1 has the lowest suicide rate, 4.39, and contains 15 states. The literacy rate, % alcohol consumers, per capita NSDP and mean asset score are the lowest in this group. Its suicide rate is way below India’s average of 10.4 per lakh population. Given below are the states that fall under cluster 1 along with their suicide rates.

Cluster 2

This cluster has 16 states, and its suicide rate of 14.65 is higher than India’s average of 10.4. The literacy rate, unemployment rate, % alcohol consumption, TFR, mean asset score and Gini coefficient values lie between those of Cluster 1 and Cluster 0. The NSDP for these states is almost the same as the NSDP for the states with the highest suicide rates (cluster 0). The other substance abuse numbers do not show a consistent pattern across the clusters with respect to suicide rates. The following states fall under Cluster 2.

Cluster 0

This cluster, comprising 5 states, has the highest suicide rate (32.36), way above India’s average. It has the highest literacy rate, lowest unemployment, highest per capita NSDP, lowest TFR and lowest Gini coefficient. However, it also has the highest alcohol consumption percentage.

What do these clusters look like?

Plotted below are scatter plots of suicide rate against a few variables that showed some correlation. The 3 clusters are shown in different colours.

Conclusion

In this article, we have looked at how India’s 2019 state wise suicide rate data correlates with some of the socio-economic and development indicators of the states and union territories.

This analysis has revealed insights contrary to the expectation that the richer states would have lower suicide rates.

The correlation heatmap showed a weak positive correlation of ‘Suicide rate’ with ‘Alcoholics total percentage’ and ‘Per capita NSDP’, and a weak negative correlation with ‘Gini Coefficient’ and ‘Total fertility rate’. The means of the clusters formed through the K-means algorithm show a similar pattern.

A couple of things to note here: these correlations are weak, and correlation does not imply causation. In the next part of this series, we will look into the suicide data from a cause angle.

Written by Kalaivani K.G