Suicide Prediction — Part 1
As my final days at General Assembly approach, I have spent the past few weeks working on my capstone project. I particularly enjoyed being able to pick a subject, frame it as a data science problem, and come up with my own take. I picked a subject that not many people discuss publicly: mental health.
I am not one to strongly express my ideas or thoughts unless I need to, but mental health is one area that I care deeply about and advocate for freely. It is as crucial to our well-being as physical health, maybe even more so; however, it is the one area we hold back from talking about. I understand that the society we live in often promotes positive thoughts and feelings and suppresses negative ones. Because of this, we often do not pay much attention to what matters when it comes to mental health, and we build a stigma around it: people cannot ask for help when they truly need it, and it is easier for others to turn a blind eye. This impacts many aspects of our society, including homelessness, substance abuse, and suicide. Suicide has been among the top 5 causes of death since 1996, and it took second place in 2018, per the CDC. Now, we have a global pandemic on our hands on top of that. An article in The Conversation suggests that the ongoing COVID scare could also lead to a mental health epidemic.
So, let’s take a look at what I found.
Data Collection and Cleaning
I decided Reddit would be a great source for what people say about mental health issues, thanks to its anonymity and subject-oriented forums. I wanted to scrape posts from January 1st to June 15th so I could see how the trend developed from the start of the year to three months into quarantine. I could not do that in time: because of the number of requests I was sending out, I started getting error code 429 (HTTP's Too Many Requests response, and the scariest number I encountered), and the project deadline was tight. Luckily for me, a number of data scientists had already done similar research earlier this year and published their dataset. I used this dataset and trimmed it down to what I needed. The subreddits of interest were Alcoholism, Anxiety, Bipolar, Depression, Health Anxiety, Lonely, Mental Health, and Suicide Watch.
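For anyone who hits the same 429 wall, here is a minimal sketch of the usual remedy: wait and retry, honoring the server's Retry-After header when it is present and backing off exponentially otherwise. This is not the code I ran for the project. The function name `fetch_with_backoff`, the `(status, headers, body)` return shape, and the default delays are all assumptions made for illustration; `fetch` stands in for whatever HTTP call you actually use.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

# Hypothetical response shape: (status code, headers, body text).
Response = Tuple[int, Mapping[str, str], str]


def fetch_with_backoff(
    fetch: Callable[[], Response],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() again after a pause whenever it returns HTTP 429.

    Returns the body on a 200 response, or None on any other non-429
    status or once the retries are used up.
    """
    for attempt in range(max_retries + 1):
        status, headers, body = fetch()
        if status != 429:
            return body if status == 200 else None
        if attempt == max_retries:
            break
        # Prefer the server's hint (in seconds); otherwise double the wait each time.
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        sleep(delay)
    return None
```

Passing `sleep` in as a parameter is only there so the function can be tried out without actually waiting; in real use the default `time.sleep` is fine.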
Data Processing and Analysis
Since I was working with text data, the very first steps in taking a deep look were vectorizing the text and counting the number of words and sentences. I used the spaCy library for the word and sentence counts as well as for vectorization; however, spaCy vectorization did not turn out to be as powerful as TF-IDF.
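To give a feel for the counting step, here is a rough sketch that uses plain regular expressions as a stand-in for spaCy's tokenizer and sentence splitter. It is a simplification, not what I ran: spaCy handles abbreviations, punctuation, and contractions far more carefully. The function name `count_words_and_sentences` is made up for this example.

```python
import re
from typing import Tuple


def count_words_and_sentences(text: str) -> Tuple[int, int]:
    """Return (word count, sentence count) using simple regex rules."""
    # A "word" here is a run of letters, with apostrophes kept so "can't" is one word.
    words = re.findall(r"[A-Za-z']+", text)
    # A "sentence" here is any non-empty chunk between ., ! or ? marks.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(words), len(sentences)


# Made-up example text, not taken from the dataset.
print(count_words_and_sentences("I can't sleep. Does anyone else feel this way? Please help!"))
```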
- Impact of COVID-19
One thing that I particularly wanted to see was how COVID reshaped our daily lives and whether there were changes in our well-being. As expected, people started talking more about COVID once quarantine began and states issued stay-at-home orders in March, and posts mentioning loneliness increased accordingly.
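A trend like this can be measured by grouping posts by month and computing the share that mention a keyword. The sketch below assumes each post is a `(date, text)` pair and uses simple substring matching; the function name `monthly_mention_share`, the keyword list, and the example posts in the comment are all hypothetical, not the project's actual data or code.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def monthly_mention_share(
    posts: Iterable[Tuple[date, str]],
    keywords: Iterable[str],
) -> Dict[str, float]:
    """Map each "YYYY-MM" month to the fraction of its posts mentioning any keyword."""
    keys = [k.lower() for k in keywords]
    totals = defaultdict(int)
    hits = defaultdict(int)
    for posted_on, text in posts:
        month = posted_on.strftime("%Y-%m")
        totals[month] += 1
        lowered = text.lower()
        # Substring match, so "covid" also matches "COVID-19".
        if any(k in lowered for k in keys):
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}


# Example with made-up posts:
# monthly_mention_share(
#     [(date(2020, 2, 3), "work stress"), (date(2020, 3, 20), "COVID lockdown, so lonely")],
#     ["covid", "quarantine"],
# )
```

Reporting a share rather than a raw count matters here, because the total number of posts per month can change too.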
- Most common words
When analyzing text, the easiest way to examine the data is probably to look at the most commonly used words. I created two different visualizations: a word cloud and a bar graph of TF-IDF scores. As seen below, “try”, “help”, “friend”, and “talk” were some of the common words used by people struggling with mental illness in the word cloud (left), while “end”, “die”, “hate”, and “kill” were among the top words under TF-IDF vectorization (right).
In my next post, I will talk about how I created a target label and the models I tried for predicting suicidal communication. Please stay tuned!
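One plausible reason the two views can surface different words is that raw counts reward words that show up everywhere, while TF-IDF down-weights them. The sketch below implements a plain, unsmoothed TF-IDF by hand to make that visible. It is an illustration under my own assumptions, not the vectorizer behind the plot; library implementations typically smooth the IDF term, so their numbers will differ. The function names and the three toy documents are made up.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def top_by_count(docs: List[str], n: int) -> List[str]:
    """Most frequent tokens across all documents (roughly what a word cloud sizes by)."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    return [word for word, _ in counts.most_common(n)]


def mean_tfidf(docs: List[str]) -> Dict[str, float]:
    """Average unsmoothed TF-IDF score of each term over all documents."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    scores: Counter = Counter()
    for toks in tokenized:
        if not toks:
            continue
        for term, count in Counter(toks).items():
            tf = count / len(toks)
            idf = math.log(n_docs / doc_freq[term])  # 0 when a term is in every document
            scores[term] += tf * idf
    return {term: total / n_docs for term, total in scores.items()}


# Toy documents, made up for illustration.
docs = ["help me talk", "help me sleep", "help hate hate"]
scores = mean_tfidf(docs)
print(top_by_count(docs, 1))            # the most frequent word overall
print(max(scores, key=scores.get))      # the highest-scoring word under TF-IDF
```

In this toy example "help" is the most frequent word, but because it appears in every document its TF-IDF score is zero, and "hate" comes out on top instead.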
Citation:
Low, Daniel M., Rumker, Laurie, Talkar, Tanya, Torous, John, Cecchi, Guillermo, & Ghosh, Satrajit S. (2020). Natural language processing reveals vulnerable mental health support groups and heightened health anxiety on Reddit during COVID-19: an observational study. Retrieved from https://psyarxiv.com/xvwcy/