Wellness-Watch: Enhancing Counselling Outreach using Machine Learning

Datasigns SFU
SFU Professional Computer Science
12 min read · Apr 21, 2023


Authors: Guneet Kher, Inderjeet Singh Bhatti, Nidhi Kantekar, Siddharth Goradia, Swagata Dutta

Welcome to Wellness-Watch

Motivation and Background

Did you know that in 2019, the World Health Organization reported a health condition that affects 1 in 8 people worldwide? A condition that can strike anyone, regardless of age, gender, or social status. One that often goes undiagnosed, with symptoms that can be invisible to the untrained eye. It can be debilitating, leading to a host of physical and emotional problems. Yet many suffer in silence, too ashamed or afraid to seek help. What is this condition? It's not a virus or an infection, and it's not a physical injury. It's a condition that affects the mind: mental illness. And while it may not be as visible as other health conditions, it's just as serious and deserving of attention.

Despite the availability of effective prevention and treatment options, many people are unable to leverage these resources due to various barriers. One of the major reasons for this inaccessibility is the fear of experiencing stigma and discrimination if they come forward and seek help. This fear often prevents individuals from seeking the support they need. Additionally, many people do not realize that they may benefit from counselling or other interventions. These factors highlight the need for expanding the reach of mental health services to support those in need, and that’s where we want to step in.

Problem Statement

We want to leverage machine learning to build a solution that improves the accessibility and expands the outreach of counselling services to those who may need them. For this, we need to break down the barriers mentioned earlier:

Firstly, we need to address the widespread fear around sharing mental health concerns. People fear the stigma and judgement they may receive from others; if only there were a way to discuss their thoughts openly and anonymously. Wait, that's right, there is a way: social media platforms. Whether social media is good or bad is a never-ending debate, but here we would like to focus on the pros. Platforms like Reddit allow users to post anonymously in themed communities ('subreddits'), which takes away the fear of being judged and lets users openly discuss their concerns.

Secondly, we now have many strong text classification techniques that make use of transformer models such as BERT and RoBERTa. These models are trained on huge amounts of data, enabling them to effectively capture the statistical structure of natural language. We believe that by fine-tuning such models on relevant data, we can train a model that detects whether a sentence pertains to mental health and indicates whether counselling support might be needed.

Hence, this project aims to enhance the outreach of counselling and other mental health intervention services to those who might need it. Specifically, we want to answer the following questions:

  1. Can we accurately identify textual social media posts that are pertinent to mental health, including posts related to anxiety, depression, stress, and other mental health issues?
  2. Can we determine if the authors of these posts might need counselling support based on their post content including tone, emotions, and context?
  3. Can we identify any previously unknown subgroups or patterns within these posts that could help health professionals in their diagnosis, such as identifying common triggers, coping mechanisms, or recurring themes?

Data Science Pipeline

Data Collection

We started off by gathering relevant data from public forums, specifically subreddits. Subreddits are online themed discussion forums where users can post content and interact with each other. For this project, we collected ~600k posts made to subreddits that were:

  1. Related to mental health: There are several subreddits that focus on mental health themes such as r/Anxiety, r/Depression, r/SuicideWatch, and other relevant subreddits. The data from these subreddits was labelled “True” as being pertinent to mental health.
  2. Unrelated to mental health: Similarly, there is an abundance of subreddits that focus on other themes that do not contain content regarding mental health, such as r/Sports, r/Technology, r/LegalAdvice etc. The data from these subreddits was given the label “False”.
Frequency Distribution of mental health related subreddits

Data Integration

Since we extracted data from multiple subreddits using a wrapper we built around the Pushshift API (to work around its limits on how much data can be pulled per request), we ended up with a large number of JSON files. In this phase, we consolidated these files into our final corpus, stored as a CSV file. Although the CSV contained the standard fields you would expect from post data (user_id, post_time, post_url, etc.), our analysis focused on three of them: text content, subreddit, and flair.
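For illustration, here is a minimal sketch of the kind of wrapper we mean, assuming the Pushshift submission-search endpoint and its subreddit/before/after/size parameters as they existed at the time; it pages through a date range by moving the 'before' cursor, which is how the records-per-response limit is worked around. File names and sleep intervals are placeholders, not our exact implementation.

```python
import json
import time
import requests

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission"

def fetch_subreddit_posts(subreddit, start_utc, end_utc, page_size=100):
    """Yield submissions from `subreddit` between two UTC epoch timestamps."""
    before = end_utc
    while True:
        params = {
            "subreddit": subreddit,
            "after": start_utc,
            "before": before,
            "size": page_size,          # records-per-response limit we page around
            "sort": "desc",
            "sort_type": "created_utc",
        }
        resp = requests.get(PUSHSHIFT_URL, params=params, timeout=30)
        resp.raise_for_status()
        batch = resp.json().get("data", [])
        if not batch:
            break
        yield from batch
        before = batch[-1]["created_utc"]  # move the cursor past this page
        time.sleep(1)                      # be polite to the API

if __name__ == "__main__":
    # Example: dump one subreddit's posts to a JSON file for later integration.
    posts = list(fetch_subreddit_posts("depression", 1640995200, 1672531200))
    with open("depression.json", "w") as f:
        json.dump(posts, f)
```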

Data Exploration

After collecting the data, we performed exploratory data analysis (EDA) to better understand it. This involved identifying hot keywords in the posts through word clouds and frequency distributions. We also focused on understanding the "flairs" in our data and how we could leverage them for our goals. Flairs on Reddit let users tag their posts with predefined labels describing what the post relates to; these labels are defined per subreddit.

Frequency Distribution of Flairs
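As a rough illustration of this EDA step (the file name and the 'text'/'flair' column names below are assumptions based on the consolidated CSV described above), a word cloud and a flair frequency count can be produced in a few lines:

```python
from collections import Counter

import pandas as pd
from wordcloud import STOPWORDS, WordCloud

# Load the consolidated corpus produced in the data integration phase.
df = pd.read_csv("corpus.csv")

# Word cloud of hot keywords across all post text.
text = " ".join(df["text"].dropna().astype(str))
cloud = WordCloud(width=800, height=400, stopwords=STOPWORDS).generate(text)
cloud.to_file("wordcloud.png")

# Frequency distribution of flairs across the collected subreddits.
flair_counts = Counter(df["flair"].dropna())
print(flair_counts.most_common(20))
```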

Data Cleaning

The data cleaning phase involved removing noise from our data. Since the source contained free text, we built a basic NLP pipeline that included tokenization, stemming, lemmatization, and stop-word removal, applied as required by each model (topic modelling, for example, needed this pipeline, while the RoBERTa-based classifiers did not). The purpose of this phase was to remove irrelevant or duplicate data, standardize the text, and make it easier to model.
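A minimal sketch of such a cleaning pipeline, here using NLTK; the exact steps, thresholds, and libraries in our pipeline may differ, so treat this as illustrative:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text, lemmatize=True):
    """Lowercase, strip non-letters, tokenize, drop stop words, then stem or lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [t for t in word_tokenize(text) if t not in STOP_WORDS and len(t) > 2]
    if lemmatize:
        return [lemmatizer.lemmatize(t) for t in tokens]
    return [stemmer.stem(t) for t in tokens]

print(clean_text("I've been feeling anxious and can't sleep at night"))
```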

Data Analysis

In this phase, we analyzed the data to better understand patterns and find hidden topics. We used LDA topic modelling to identify latent topics within the data. This helped us better understand the nature of mental health-related posts and identify previously unknown subgroups or patterns in the posts that could help health professionals with their diagnosis. We also leveraged the data to train classifiers that identify relevance to mental health in text and the urgency of counselling support.

Word Clouds

Data Product

Finally, we deployed the models behind an endpoint that can be integrated into various applications. We built a web application that presents a social discussion forum with our models integrated at the backend. This platform was designed to let mental health professionals quickly gather insights from users' posts at the click of a button.

Methodology

To achieve our goals, we ran the gamut and utilized several techniques and algorithms, including transfer learning, active learning, and unsupervised learning.

Binary Classification - Transfer Learning:

As the entry point to our analysis, we first needed to classify posts into two classes: related or unrelated to mental health. For this, we built a binary classifier on top of the XLM-RoBERTa transformer language model. We chose it for two reasons. First, we expect social media posts to contain non-ASCII characters such as emojis, and since RoBERTa uses a byte-level BPE tokenizer, no characters are discarded; they are simply treated as bytes. Second, XLM-RoBERTa was pre-trained on multilingual data, so it can handle the traces of non-English text we may expect in some posts.

XLM-RoBERTa is pre-trained in a self-supervised fashion on 2.5 TB of text spanning 100 languages. For our use case, we fine-tuned the base model through transfer learning on 460k training data points extracted and tokenized during the data collection phase, producing a classifier that predicts whether a given text is related to mental health. Further, we identified posts that required urgent counselling support by analyzing the flairs in the data (e.g. 'selfharm', 'suicide ideation') and used these to fine-tune another model that scores the urgency of the counselling requirement.
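To give a concrete picture, here is a condensed sketch of this kind of fine-tuning setup using the Hugging Face Trainer; the file names, the 'text'/'label' column names (1 = mental-health related, 0 = unrelated), and the hyperparameters are illustrative, not our exact configuration:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Assumes CSVs exported from the corpus with 'text' and 'label' columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="wellness-watch-clf",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"])
trainer.train()
```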

Multi Class Classification — Active learning:

We believed that the flairs in our data could be very useful for our analysis. However, since assigning a flair to a post is optional on Reddit, more than half of our data did not contain flairs. Further, since our data came from many subreddits, each with its own flairs, our EDA surfaced ~400 unique flairs, many of which meant the same thing; for example, 'Rant', 'Vent', and 'Venting' could all be folded into one generalized bucket, 'Discussion'. We bucketed all flairs into 7 general categories such as 'Health', 'Discussion', and 'Work_Life'.

We wanted to build a multi-class classifier over these buckets to help counsellors identify the kind of support a person needs. However, our data was not labelled with the buckets; it was labelled with the original flairs that we had mapped to them. So we decided to use active learning to label the data with the new buckets. Using the mapping, we manually selected an initial balanced set that was most representative of each of the 7 buckets. Next, we trained a logistic regression model on this initial set and ran predictions against a small unlabelled set. We then selected the least confident predictions from that iteration and manually labelled them with the correct bucket. We repeated this training and tagging process for several iterations until the model could make confident and correct predictions.
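The loop below is a simplified sketch of this least-confidence strategy using scikit-learn; the function names, the TF-IDF features, and the 'annotate' callback (which stands in for the manual, human-in-the-loop labelling step) are illustrative assumptions rather than our exact pipeline:

```python
import numpy as np
import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def least_confident(model, X, k=50):
    """Return indices of the k rows the model is least confident about."""
    confidence = model.predict_proba(X).max(axis=1)
    return np.argsort(confidence)[:k]

def active_learning_loop(seed_texts, seed_labels, pool_texts, annotate, n_iter=5):
    """Grow the labelled set by repeatedly hand-labelling the least-confident posts."""
    vectorizer = TfidfVectorizer(max_features=20000)
    X_train = vectorizer.fit_transform(seed_texts)
    X_pool = vectorizer.transform(pool_texts)
    labels = list(seed_labels)
    clf = LogisticRegression(max_iter=1000)

    for _ in range(n_iter):
        clf.fit(X_train, labels)
        ask = least_confident(clf, X_pool)
        labels.extend(annotate(pool_texts[i]) for i in ask)    # manual labelling step
        X_train = scipy.sparse.vstack([X_train, X_pool[ask]])  # fold into training data
        keep = np.setdiff1d(np.arange(X_pool.shape[0]), ask)   # shrink the unlabelled pool
        X_pool = X_pool[keep]
        pool_texts = [pool_texts[i] for i in keep]

    return clf, vectorizer
```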

Topic Modelling - Unsupervised learning:

We used unsupervised learning, specifically Latent Dirichlet Allocation (LDA), for topic modelling to identify latent topics within our data. LDA is a probabilistic model which assumes that each document in a corpus is a mixture of topics, and each topic is a distribution over words.

We searched for the best-performing model with respect to the number of topics by running a grid search over candidate topic counts and comparing the coherence score of each trained model. Once all the iterations were completed, we picked the model with the highest coherence score. By applying LDA to our data in this way, we identified hidden topics that could further help mental health professionals in their diagnosis.
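A minimal sketch of this topic-count search using gensim; the candidate range, number of passes, and coherence measure shown here are placeholders, and `cleaned_docs` is assumed to be the list of tokenized posts from the cleaning pipeline:

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

def best_lda(cleaned_docs, topic_range=range(5, 21, 5)):
    """Train one LDA model per candidate topic count and keep the most coherent one."""
    dictionary = corpora.Dictionary(cleaned_docs)
    corpus = [dictionary.doc2bow(doc) for doc in cleaned_docs]

    best_model, best_score = None, -1.0
    for k in topic_range:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       passes=10, random_state=42)
        coherence = CoherenceModel(model=lda, texts=cleaned_docs,
                                   dictionary=dictionary,
                                   coherence="c_v").get_coherence()
        if coherence > best_score:
            best_model, best_score = lda, coherence
    return best_model, best_score
```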

Evaluation

Now let's get to the good part: the results. Once the models were trained, it was time to run inference. Owing to the power of transformer models, our fine-tuned RoBERTa classifier, when tested against a dataset of 115k data points, was 97.4% accurate at assigning posts to the correct class. Similarly, the urgency classifier, fine-tuned on 16k training data points and tested against 4k samples, was 97.08% accurate.

Further down the pipeline, our LDA model extracted 10 topics from the corpus, each consisting of a probability distribution over the words that make it up. After some manual analysis of the words in these topics, we were able to label each one. The topics included "Interpersonal relationships and anxiety", "Education/work or financial stress", "Anxiety and panic attacks", and so on.

Lastly, our multi-class classifier, built with logistic regression and trained on the dataset produced through active learning, achieved an accuracy of 75% and was reasonably good at assigning unseen posts to the correct buckets defined from the Reddit flairs.

Left: A post relevant to mental health which does not require urgent intervention Right: Predictions
Left: A post not relevant to mental health Right: Predictions
Left: A post from reddit relevant to mental health and requires urgent intervention Right: Predictions

Data Product

Our data product is a comprehensive solution that includes a frontend and a plug-and-play containerized backend, designed specifically for social media platforms. The frontend serves as a discussion forum where individuals can post about their mental health issues, providing a safe and supportive space for open discussions and sharing of experiences related to mental health.

The frontend is user-friendly and intuitive, allowing users to easily create posts, engage in discussions, and share their thoughts and experiences related to mental health. The interface is designed to prioritize privacy and security, offering options for users to post anonymously or with pseudonyms to protect their identity and promote open and honest discussions without fear of judgment or stigma. The frontend also includes features such as post moderation and content filtering to ensure that the discussions remain respectful and adhere to community guidelines.

The plug-and-play containerized backend of our data product is designed to be easily deployable on various social media platforms, allowing for seamless integration and operation. The backend consists of our machine learning models for identifying social media posts related to mental health and extracting insights to assist with diagnosis.
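As an illustration of what such an endpoint can look like (not our exact implementation; the model path is a placeholder for the fine-tuned classifier), a small FastAPI service wrapping the binary classifier might be as simple as:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI(title="Wellness-Watch API")

# Placeholder path: load the fine-tuned XLM-RoBERTa classifier from disk or the Hub.
classifier = pipeline("text-classification", model="path/to/fine-tuned-xlm-roberta")

class Post(BaseModel):
    text: str

@app.post("/predict")
def predict(post: Post):
    """Return whether a post looks mental-health related, with a confidence score."""
    result = classifier(post.text, truncation=True)[0]
    return {"label": result["label"], "score": result["score"]}
```

An app like this can be baked into a container image and deployed on a serverless platform, which gives the plug-and-play property described above.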

Lessons Learnt

Throughout building this solution, we learned several new skills:

  1. Data collection can be challenging: Gathering relevant data from public forums can be time-consuming and challenging due to the volume of data and potential noise. To counter these challenges, we implemented a wrapper around the Pushshift API to work around the records-per-response limit when extracting data over a large date range.
  2. Unsupervised learning can reveal hidden patterns: LDA topic modelling was an effective unsupervised learning method to identify patterns and hidden topics within the data. It served as a powerful tool to add to the insights we aimed to deliver through our project.
  3. Transformer models are incredibly powerful: Although fine-tuning large models such as RoBERTa for your use case can be time-consuming and compute-heavy, their performance makes it worth the effort. There are also frameworks such as PyTorch Lightning and Hugging Face Accelerate that can speed up training and make use of GPUs and TPUs.
  4. Active Learning: Not enough labelled data? Active learning is the key. Although it requires considerable manual intervention, once you build a pipeline for tagging, training and testing it can prove to be a powerful tool and is much better than having to label all of the data manually.
  5. Deployment of ML Models: We learned how to containerize a backend containing machine learning models. We leveraged GCP and deployed the containers in a serverless manner. We also learned how to push large models such as our fine-tuned RoBERTa classifiers to Hugging Face's Model Hub and share them with the rest of the Hugging Face community (a minimal sketch of this follows the list).
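For reference, the sketch below shows the kind of Hub upload described in point 5; the local directory and repository id are placeholders, not our actual artefacts, and it assumes you have already authenticated with the Hugging Face CLI:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned classifier from a local output directory (placeholder path).
model = AutoModelForSequenceClassification.from_pretrained("./wellness-watch-clf")
tokenizer = AutoTokenizer.from_pretrained("./wellness-watch-clf")

# Upload the weights, config, and tokenizer files to the Hugging Face Model Hub.
model.push_to_hub("your-username/wellness-watch-roberta")
tokenizer.push_to_hub("your-username/wellness-watch-roberta")
```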

Overall, we learned that combining various machine learning techniques can lead to an effective and potentially impactful solution to address such a complex problem and improve counselling outreach. Proper data modelling and tuning of models was crucial, and developing a data product like a web application is a convenient way to make the solution accessible to end-users.

Summary

The Wellness-Watch project aimed to enhance counselling outreach by using machine learning to detect and draw insights from mental health-related social media posts. The primary motivation was to address mental health stigma, which can keep individuals from feeling comfortable seeking counselling support. We collected relevant data from Reddit, performed exploratory data analysis to identify hot keywords, cleaned the data using NLP techniques, and performed topic modelling with LDA to surface patterns and hidden topics within the data.

To detect posts that may require counselling support, we used transfer learning to fine-tune a pre-trained XLM-RoBERTa model on 460k data points extracted from various relevant subreddits. It achieved roughly 97% accuracy when tested against 115k test samples, indicating its effectiveness at identifying mental health-related posts and at predicting the urgency of counselling need. We then used active learning with a logistic regression model to classify posts into the 7 buckets we defined. Classifying posts into these general categories by leveraging the flairs that authors themselves attach lets us convey further distinguishing insights to health professionals, even though we ourselves lack specialized domain knowledge.

Lastly, the data product was delivered in the form of API endpoints that can be integrated into various applications and established social media platforms such as Twitter and Reddit, enabling mental health professionals to identify, at scale, people who could benefit from counselling, intervene early, and provide personalized support.

Through this project, our vision of breaking down these barriers and tackling this health condition is realized by leveraging state-of-the-art machine learning models and doing data science for good.
