A Faster Way to Annotate Transcript Data in PTSD Therapy Sessions
Lessons learned from improving the process of annotating data.
Deep learning is worth the hype. You can generate faces indistinguishable from real ones and train models to detect fake news better than humans can. That’s what drew me to learn computer vision and natural language processing several months ago.
Now fast-forward a couple of months to last June. I was browsing LinkedIn and came across an interesting link that led me to an Omdena AI challenge.
I decided to apply, and lo and behold: I was accepted! I was super excited to join an awesome team of 40 other researchers and enthusiasts ready to make a change in the world with AI in only 8 weeks! And so we jumped right in :)
What’s the Challenge?
In the kickoff call, Christoph von Toggenburg talked about his exposure to post-traumatic stress disorder (PTSD). PTSD can develop when someone experiences a severe traumatic event and, instead of the trauma leveling off, it becomes a mental health condition.
Symptoms include panic attacks, anxiety, uncontrollable thoughts, and more, and they can be triggered whenever the person is reminded of the event.
“The difference between trauma and PTSD is that switch in your brain, and it becomes a part of your life. It is something you cannot reverse, but you can deal with the symptoms, and if treated properly, you can get much better” — Christoph
Over the last 20 years, Christoph has done humanitarian work with the UN and the Red Cross. He often traveled to war-torn places to help refugees and civilians. He has seen thousands of victims who went through traumatic experiences and, unfortunately, could not receive help because therapy services were inaccessible.
Christoph has also experienced PTSD himself: a truck he was riding in during an emergency mission in the Central African Republic was ambushed. He received treatment and experiences almost no symptoms now.
Now, Christoph is starting BEATrauma, an initiative to help people with PTSD all around the world. His vision is a mobile chatbot app that converses with users and uses machine learning to produce a PTSD risk assessment. That’s where we come in!
Omdena — Learning ML Through Collaboration!
Omdena is a global platform where AI engineers and enthusiasts from diverse backgrounds collaborate to solve real-world social problems and build meaningful careers.
I was part of a group of 40 other enthusiasts, experienced developers, and mentors from around the world, and we were all moved by Christoph’s story. We wanted to make a change and do good with AI. As a team, we began with an initial phase of researching PTSD and the different methods of therapy. And boy, were we motivated!
Our research pointed to CBT (cognitive behavioral therapy) as the best solution. In CBT, a therapist talks with the patient about their experiences and gradually “exposes” them to those experiences until they finally become comfortable with them. Knowing that we could build a conversational agent with NLP for this purpose, we set our sights on training data.
The Data Problems: Not Annotated, Not Enough
Data is not always easy to find, especially when dealing with sensitive user information like therapy sessions. Our in-house math and data science professor, Colton Magnant, was able to get his hands on around 1700 transcripts of therapy sessions, only about 50 of which were for PTSD.
From there, we split into 2 groups. One was in charge of risk assessment: creating a rule-based chatbot in Rasa with sentiment analysis to converse with the user, along with a backend classification model trained on transcript data to determine whether the user had PTSD. The other focused on CBT, training a sequence-to-sequence chatbot for therapy!
I decided to take a step back from NLP and focus on data annotation. Since the transcripts came completely unlabelled, we had to give each one a score between 0 and 1 so that the model could learn which patients had PTSD and which didn’t. Luckily, Alexis Carrillo Ramirez, who has experience with statistics and psychology, was able to guide our team of 7 through reading and scoring the transcripts!
The Annotation Process
- Understand each of the 6 criteria for PTSD, for example: exposure to actual or threatened death, serious injury, or sexual violence; persistent avoidance of stimuli associated with the traumatic event(s); and more!
- Keeping the criteria in mind, read an entire transcript (which can take 45 minutes to 1 hour).
- Score each of the 6 criteria with a 0, 0.5, or 1, where 0 means the symptom is not displayed at all, 0.5 means it is somewhat displayed, and 1 means it is clearly expressed.
- Follow a formula to take in all 6 numbers and spit out a number between 0 and 1 for the risk assessment for PTSD.
- Rinse and repeat for the other 49.
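The scoring steps above can be sketched in a few lines of Python. The formula our team actually followed is not reproduced in this post, so this sketch assumes a plain unweighted average of the six criterion scores; the function name `risk_score` and the example scores are hypothetical and only there for illustration.

```python
from statistics import mean

ALLOWED_SCORES = {0, 0.5, 1}  # the three values an annotator may assign
NUM_CRITERIA = 6              # number of PTSD criteria scored per transcript


def risk_score(criterion_scores):
    """Combine six per-criterion scores into one number between 0 and 1.

    ASSUMPTION: an unweighted average stands in for the real formula,
    which this post does not spell out.
    """
    if len(criterion_scores) != NUM_CRITERIA:
        raise ValueError(
            f"expected {NUM_CRITERIA} scores, got {len(criterion_scores)}"
        )
    for score in criterion_scores:
        if score not in ALLOWED_SCORES:
            raise ValueError(f"score must be 0, 0.5, or 1, got {score!r}")
    return mean(criterion_scores)


# Hypothetical annotation for a single transcript (not real data)
example_scores = [1, 0.5, 0, 1, 0.5, 0]
print(risk_score(example_scores))
```

Validating the inputs up front is a deliberate choice here: with several annotators entering scores by hand, a typo such as 5 instead of 0.5 is caught immediately instead of silently skewing the label.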
We faced two problems in our annotation process. The first was that it took far too long to annotate all the data. Between complications and busy schedules, it took around 2 weeks of hard work to finish. The second was that the transcripts were often unclear and difficult to understand.
We brainstormed several solutions to the annotation problem:
- Build a bag of words and their embeddings for each criterion, then run LDA (Latent Dirichlet Allocation) on top of them to classify each criterion and completely automate the process
- Use USE (Universal Sentence Encoder) to compute the cosine similarity between sentences and match those that belong to the same criterion
- Use GPT-2 to summarize each transcript to get the main idea, speeding up the annotations
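The sentence-matching idea from the list above boils down to comparing embedding vectors by cosine similarity. Here is a minimal sketch, assuming the embeddings already exist: the three-dimensional vectors are made-up toy values standing in for real Universal Sentence Encoder output (which has far more dimensions), and `cosine_similarity` and `best_match` are hypothetical helper names, not something we shipped.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def best_match(sentence_vec, criterion_vecs):
    """Return (criterion name, similarity) for the closest criterion."""
    scored = (
        (name, cosine_similarity(sentence_vec, vec))
        for name, vec in criterion_vecs.items()
    )
    return max(scored, key=lambda pair: pair[1])


# ASSUMPTION: toy embeddings, not real encoder output
criterion_vecs = {
    "exposure": [0.9, 0.1, 0.0],
    "avoidance": [0.1, 0.9, 0.2],
}
sentence_vec = [0.2, 0.8, 0.1]  # pretend embedding of one transcript sentence

name, similarity = best_match(sentence_vec, criterion_vecs)
print(name, round(similarity, 3))
```

In practice a similarity threshold would probably be needed as well, so that sentences unrelated to any criterion are left unmatched instead of being forced into the nearest one.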
Creating the Risk Assessment Chatbot
From there, we had to create a classification model that takes in user conversations and determines whether the user has PTSD. However, we didn’t have enough data to train a robust model from scratch. Luckily, thanks to a breakthrough with ULMFiT’s transfer learning technique, we have been able to achieve close to 80% accuracy so far, with more improvements to come!
I Have Learned So Much From This Experience!
When I first joined Omdena, I only understood data from a machine learning perspective. I didn’t know about data engineering or annotation, or the tremendous work it takes to clean data. Back then, I was just grabbing nearly perfectly manicured Kaggle datasets!
Now, I’ve realized that’s not how it works in the real world. Genuine data is messy, difficult to understand, and doesn’t come with documentation. From this challenge, I’ve learned so much about working with data and how to better understand it! Now we’re discussing working on a paper to show our findings and results for data annotation for therapy sessions to the world, which is very exciting :)
I’ve also learned that things don’t always turn out as planned. It’s quite easy to follow a data science course or tutorial and have it work exactly as you’d imagine. However, through working on this tremendous undertaking, I’ve realized that there are always hiccups along the way. We’ve had issues with data and with model accuracy, and the combination of the two forced us to scrap our ideas for CBT.
Nevertheless, we have still accomplished a ton and we’re almost ready to push out our risk assessment chatbot for BEATrauma! We’re excited and honored to make an impact in the world and I’m proud to be a part of this Omdena challenge!