What did we learn implementing GPT-2 and BERT for Conversational AI in Healthcare?

Nathan Chapman
Inception Health
Oct 8, 2020 · 12 min read


Healthcare is a data-centric field that would seem ideal for implementing novel AI methods and tools. With everything from complex, multi-dimensional records, to data-driven clinical decision processes, to the widespread use of modern technology, AI seems to have everything it needs to have an impact on healthcare practices.

Nonetheless, introducing a new algorithm to the healthcare field is no easy task. AI engineers must first ask a few questions, including: 1) How do we test the tool’s performance? 2) How do we demonstrate it is safe (no more harmful than current practices)? 3) How reliably does our tool need to perform? Each question will have different answers depending on the algorithm developed, and some algorithms cannot be proven safe or useful unless they are evaluated in a randomized controlled trial, the gold standard for evaluating drugs and medical devices. Developers must ensure that their model is both safe and reliable, and that it does not behave badly when presented with a new situation.[1]

Another issue that developers run into when training models on data generated by the current healthcare system is bias. Bias is a challenging, almost insurmountable limitation of the present tools and data. The most potent risk is that AI, by being trained to recognize decision patterns from historical data, can automate and amplify health inequalities and disparities at scale. A model must not treat two patients with identical diagnoses differently because they belong to different demographic groups. However, one of the most widely used models in healthcare has been shown to have a racial bias.[2] This particular algorithm is designed to generate a risk score for an individual and, based on that score, decide whether to recommend that the individual enroll in a program that would benefit their health. The tool is used in healthcare workflows covering millions of US patients, and it was recently found to discriminate against Black patients. So while two individuals may have the same health conditions, a white patient is more likely than a Black patient to get a higher risk score, which allows access to better health services. This is just one example of how important analyzing bias is when designing AI.

In addition, there are legal restrictions that impede the ability to create these tools in an agile way. While the healthcare industry may have immense amounts of data, a lot of it isn’t usable due to regulatory restrictions. A patient’s health information is protected by regulations such as HIPAA. This has a ripple effect that restricts access to, and hence the usefulness of, the data being generated in healthcare settings. There has been a lot of progress toward ensuring that the data is anonymized and that an algorithm cannot produce results that can be traced back to an individual.[3] While it may seem like a simple task, there is no efficient or fool-proof way to guarantee this. This makes healthcare an anemic environment for conducting AI research. Many research groups are trying to solve this problem, but not yet at the scale and velocity of other industries.

Due to these restrictions, the development of medical AI has been limited to a few research projects and a few AI startups. The openness and data-rich ecosystem of general AI are not being replicated in healthcare.

Research of Conversational AI in Healthcare

Taking those challenges into consideration, I spent the first few weeks of my internship understanding the ecosystem of AI and healthcare. One topic that became interesting was the combination of recent advances in conversational AI with the focus of our Inception Labs on automated, patient-guided experiences for clinical decisions.

Conversational AI algorithms are designed to answer questions and help guide a person based on prompts. These algorithms have been developed to interact with humans through a chat-based interface, like those used by modern instant messaging applications. In healthcare, conversational AI can enable an enhanced and personalized patient experience and scale data access, transparency, and health literacy. Patients can ask questions and get personalized answers. A chatbot can be used to check in on patients remotely while they are at home. By providing a chatbot, companies provide patients with assistance, feedback, and encouragement to alleviate any feelings of detachment during patient support.[4]

Conversational AI — an introduction

Conversational AI was originally imagined in 1950 by British computer scientist Alan Turing, who proposed an evaluation for algorithms called the Turing Test. In essence, a human talks to both a computer and another human. By asking both the same questions, the first person tries to figure out which respondent is the computer and which is the other human. If they cannot tell which is which, the algorithm is said to pass the Turing Test.[5] With modern-day technology, it is becoming easier to pass the Turing Test.

One of the first conversational algorithms developed in healthcare was called Eliza. Created by Joseph Weizenbaum in 1966, Eliza used 200 lines of code to examine the text for keywords, assign values to them, and transform the text into an output. Depending on the script, the keywords and their values vary, and the transformation changes as well.[6] Eliza is also often cited as one of the first programs, if not the first, to appear to pass the Turing Test.[7] It was made to converse in a human-like fashion within certain topics, making it hard to distinguish the computer from a human. However, it was not programmed for most other topics, which led to much less human-like interactions, so it fails the Turing Test more generally.
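To make the keyword-and-transformation idea concrete, here is a minimal Python sketch in the spirit of Eliza. It is not Weizenbaum’s original script: the rules, ranks, replies, and function names below are made up for illustration, and the real program used a much richer script format.

```python
import re

# Hypothetical rules: (keyword, rank, decomposition pattern, reassembly template).
# A higher rank means the keyword is considered more important.
RULES = [
    ("mother", 3, r".*\bmother\b.*", "Tell me more about your family."),
    ("i feel", 2, r".*\bi feel (.+)", "Why do you feel {0}?"),
    ("i am", 1, r".*\bi am (.+)", "How long have you been {0}?"),
]

# Swap first- and second-person words so the echoed fragment reads naturally.
REFLECTIONS = {"my": "your", "me": "you", "i": "you", "am": "are"}

FALLBACK = "Please go on."


def reflect(fragment):
    """Replace first-person words with their second-person counterparts."""
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.split())


def respond(text):
    """Pick the highest-ranked matching keyword and transform the input."""
    cleaned = re.sub(r"[^a-z' ]", "", text.lower()).strip()
    candidates = [rule for rule in RULES if rule[0] in cleaned]
    for _keyword, _rank, pattern, template in sorted(candidates, key=lambda r: -r[1]):
        match = re.match(pattern, cleaned)
        if match:
            return template.format(*(reflect(g) for g in match.groups()))
    return FALLBACK  # no keyword found: fall back to a content-free reply


print(respond("I feel tired of my job."))
print(respond("I am worried about my mother"))
print(respond("Hello there"))
```

The fallback reply is what makes such a bot feel hollow outside the topics its script covers, which matches the limitation described above.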

How can Conversational AI be applied to healthcare?

By moving some care online, whether to a video call or a messaging app, many resources can eventually be automated, optimized, and scaled to ensure that the best, safest care is delivered to all patients. Using conversational AI can improve this efficiency, as a patient can contact a digital “avatar” 24/7.[4] A chatbot based on conversational AI, or a combination of conversational AI and a human agent, can answer questions, gather information, schedule appointments, or even be a personal health assistant.

How do you create a Conversational AI in Healthcare?

The same process used to develop a conversational AI in general can be applied in healthcare. There are currently many tools and frameworks for training conversational AI, each with its advantages and disadvantages. There are simple methods, where you map input directly to output, and there are more complex methods, where you train an algorithm to recognize patterns in the input to generate its output.[8]

Scale from 1–5, with 1 being the worst, 5 being best.

During my internship, I planned to develop an end-to-end pipeline for conversational AI based on a generative model. As I discussed above, the first step was to understand the realm of healthcare; the second was to understand the progress made over the last few years in conversational AI. Taking both domains as input, I set out to implement a prototype and explore the practicality of conversational AI in healthcare.

Creating such a pipeline requires a lot of thought and planning. We had to ensure that our model responds to questions and statements both accurately and appropriately, scales well, and conforms to all regulations concerning the data used for training and the data shared during a conversation with a patient. A generative model would fit our use case best, as we have a lot of different bases to cover, and we want to ensure that it provides a unique and personable experience.

Recognizing that building a whole chatbot from scratch would be a very demanding task, we chose to build on open-source solutions. These solutions would allow us to focus less on the structure and nitty-gritty of the bot, and more on the fine-tuning and performance of the model through transfer learning. Transfer learning is a development technique for fine-tuning already trained models for a new or more specific domain. By fixing the weights of earlier layers in the model and allowing the new dataset to change only the later layers, you enable the model to keep the knowledge it gained initially while also learning from the new dataset and reducing the risk of overfitting to it.
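The layer-freezing idea can be sketched in a few lines of Python. This is an illustrative sketch rather than the code we ran: `freeze_early_layers`, `trainable_flags`, `Param`, and `Layer` are hypothetical names, and the two stand-in classes only mimic the `parameters()` / `requires_grad` convention that PyTorch modules follow, so the example runs without a deep learning library installed.

```python
class Param:
    """Stand-in for a trainable weight tensor (hypothetical, for illustration)."""

    def __init__(self):
        self.requires_grad = True


class Layer:
    """Stand-in for a network layer that exposes its weights via parameters()."""

    def __init__(self, n_params):
        self._params = [Param() for _ in range(n_params)]

    def parameters(self):
        return iter(self._params)


def freeze_early_layers(layers, num_frozen):
    """Turn off gradient updates for the first `num_frozen` layers."""
    for layer in layers[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False


def trainable_flags(layers):
    """Report, per layer, whether every weight is still trainable."""
    return [all(p.requires_grad for p in layer.parameters()) for layer in layers]


# A toy four-layer "model": keep the first three layers fixed and
# leave only the last one free to adapt to the new dataset.
model_layers = [Layer(n_params=2) for _ in range(4)]
freeze_early_layers(model_layers, num_frozen=3)
print(trainable_flags(model_layers))
```

With a real PyTorch model, the same loop would run over the model’s own list of layers, and training would then only change the weights that still have `requires_grad` set to `True`.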

While exploring our options, we found numerous solutions that provide either the code for a model you must train on your own data or a pre-trained model that you can use as is and fine-tune if you want. Each of these solutions takes a unique approach. Pre-trained models offer an easy way to bootstrap, but since they are designed to be universal, they often lack knowledge or specific information about certain topics or interactions. There are a few that we looked at specifically, and they offered unique advantages for our use case.

ParlAI to the rescue

The two models I’ll be diving into in this blog are both provided by HuggingFace and were used through the ParlAI framework. ParlAI is a handy library that includes datasets, models, and many more options relating to conversational AI. By adopting the ParlAI framework, you gain access to multiple open-source projects, allowing you to focus more on your use cases and development. I was surprised by how little these open-source projects are used in healthcare. We explored PersonaChat and a BERT-powered Q and A bot. Both of these models were pre-trained and can be fine-tuned with domain-specific data. More information on both models follows.

Using a Persona

There are many ways that you can design a bot to feel more personable and interactive. By augmenting each response with a persona combined with the history of the current chat, you get a conversation that feels a little bit more human, as the bot will have a sense of self. A persona is just a few sentences that explain what the bot ‘does.’ For example, one of the personas is “I do this. I like that. I have this. I enjoy doing this other thing.” By conditioning the bot on this data (i.e., as metadata), we can inject some contextual knowledge about the persona so that the bot feels more human and can answer questions about what it does. However, in our use case, this adds minimal value in its current state, considering our goal isn’t an intensely human-like interaction, but rather one that is valuable in the sense that it helps patients get the information and support that they need.
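As a rough sketch of what “augmenting with a persona and a history” can look like, the Python snippet below flattens a persona and the most recent turns into one text prompt for a generative model. The marker strings, the `build_prompt` helper, and the example persona are assumptions made for illustration; the exact input format depends on the model being fine-tuned.

```python
# Illustrative marker strings; a real model defines its own special tokens.
BOS, USER, BOT = "<bos>", "<speaker1>", "<speaker2>"


def build_prompt(persona, history, max_turns=4):
    """Join the persona and the last `max_turns` turns into a single prompt.

    persona: list of sentences describing the bot.
    history: list of (speaker, text) pairs, where speaker is "user" or "bot".
    """
    recent = history[-max_turns:]  # drop older turns to keep the prompt short
    segments = [BOS + " " + " ".join(persona)]
    for speaker, text in recent:
        tag = USER if speaker == "user" else BOT
        segments.append(f"{tag} {text}")
    segments.append(BOT)  # the model is expected to continue as the bot
    return " ".join(segments)


# Hypothetical persona and conversation.
persona = ["I am a doctor.", "I practice medicine."]
history = [
    ("user", "Hi"),
    ("bot", "Hello, how can I help?"),
    ("user", "What do you do?"),
]
print(build_prompt(persona, history))
```

Because the persona is repeated in every prompt, the bot can answer “What do you do?” consistently, but anything not written in the persona or the recent turns is invisible to it.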

We did explore options and possibilities that would add value via this model. The two main ideas we explored were augmenting the persona with a doctor’s persona (“I am a doctor. I practice medicine. I have been a doctor for x years. I work at Froedtert and MCW”) or even using a patient’s medical file as a ‘persona’ so that the bot can answer questions based on the patient’s history and conditions. Both of these ideas require additional exploration, which is beyond the scope of my internship (Hint: I am planning to continue this research).

Having a doctor as a character would only add a more human feel to the model; it would not improve responses to medical questions and terms. That would require additional data to fine-tune the model. There is currently no openly sourced medical conversation data available to us, so creating such a dataset would require additional time and resources.

Using a patient’s medical file is also a very intriguing and novel idea to explore. By using the medical history of a patient to augment responses, we can ensure that the conversation is personalized and unique, but we run into issues with this idea. For example, we would need to train and validate our model using sensitive patient data, which as discussed is not easy and would require more in-depth consideration of the data handling, the ethical consequences, and the potential side effects of such approaches. Additional security measures would need to be taken, which is currently outside of my scope. However, the idea itself has a lot of promise and is worth exploring.

Using a Q and A Bot

We decided to use a HuggingFace Transformers model based on pre-trained BERT to answer questions based on the text provided.

The first thing that stood out was that it was super easy to set up and use. You install the library in Python, then give it the provided tokenizer, the text to answer questions from, and a few lines of code to get a result from the model. It is also astoundingly intuitive. We provided an anonymized medical exam record, and it was able to answer some questions whose answers were stated in the text as well as some whose answers could be inferred from it. Another valuable feature is that we can fine-tune this model with labeled clinical data to optimize it for the clinical task that we are solving. Overall, we were very optimistic that this AI framework can be used to “ingest” a medical record, so that a patient can ask questions and get valuable answers based on their medical history.
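For readers who want to try this, the sketch below shows roughly how little code is involved. It assumes the `transformers` package is installed and uses the publicly available `distilbert-base-cased-distilled-squad` checkpoint as a stand-in, since the exact checkpoint is an assumption on my part; `answer_from_record` is a hypothetical helper name. The optional `qa` argument lets you pass in any callable with the same interface, for example a fine-tuned model.

```python
def answer_from_record(question, record_text, qa=None):
    """Answer a question using only the supplied record text as context.

    If no `qa` callable is given, a HuggingFace question-answering pipeline
    is created on first use (this downloads the model weights).
    """
    if qa is None:
        # Imported lazily so the helper can be used with a custom `qa`
        # callable even when transformers is not installed.
        from transformers import pipeline

        qa = pipeline(
            "question-answering",
            model="distilbert-base-cased-distilled-squad",  # assumed checkpoint
        )
    result = qa(question=question, context=record_text)
    # The pipeline returns a dict that includes the extracted answer span
    # and a confidence score.
    return result["answer"], result["score"]
```

Calling `answer_from_record("How old is the patient?", record_text)` with an anonymized record in `record_text` returns the extracted answer span and the model’s confidence score, which is useful for deciding when the bot should decline to answer.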

Nonetheless, there were some issues with this model. One of the problems that we ran into is that the maximum length of the input is limited to 512 tokens (sub-word units, not characters), and the question counts toward that limit. That means that you couldn’t fit a full anonymized medical report in as the context.

This clip was taken from the Distilled BERT Q and A model after it was given text from an anonymized prostate exam.
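One common way around the input length limit is to split a long record into overlapping windows, ask the question against each window, and keep the most confident answer. The sketch below is a workaround under stated assumptions, not something we implemented: `sliding_windows` and `best_answer` are hypothetical helpers, whitespace-separated words stand in for real model tokens, and the default window sizes are arbitrary values chosen to leave headroom for the question.

```python
def sliding_windows(tokens, max_len=256, overlap=64):
    """Split a token list into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens so that an answer sitting
    on a boundary still appears whole in at least one window.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the text
    return windows


def best_answer(question, record_text, qa, max_len=256, overlap=64):
    """Run `qa` on every window and return the highest-scoring result.

    `qa` is any callable taking `question` and `context` keyword arguments
    and returning a dict with at least "answer" and "score" keys.
    Returns None when the record is empty.
    """
    tokens = record_text.split()  # crude stand-in for the model's tokenizer
    results = [
        qa(question=question, context=" ".join(window))
        for window in sliding_windows(tokens, max_len, overlap)
    ]
    return max(results, key=lambda r: r["score"]) if results else None
```

The trade-off is that each window is scored in isolation, so questions whose answer depends on information spread across distant parts of a report remain hard for this kind of model.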

What did I learn?

I will conclude by summarizing both this research and what I learned this summer. To begin, I learned a lot about the complications, implications, and applications that AI has in the realm of Healthcare. I also learned a lot about neural nets and training models. I got a much better understanding of conversational AI.

I personally conclude that researchers should develop an open-source healthcare chatbot dataset using ParlAI that can be provided to the community at large to enable the development of a robust, industry-wide conversational AI ecosystem.

Healthcare and AI have quite a complicated relationship. While there are tons of data generated every day by healthcare providers, almost none of it can be used for training algorithms. However, there have been algorithms that use de-identified data to create “fake” patients with realistic health conditions. This data is useful, but it is hard to get. I also noticed that the healthcare industry is an amazing domain for disruption, AI automation, and digital transformation. The reason more solutions aren’t out there is that there are many regulations and processes that each product must go through before it can be put into practice, and even then, not all these tools are useful.

Conversational AI is an incredibly intricate field. There are many models that can be fine-tuned to our liking, but very few of them are production-ready for our use case. I provided two examples of incredibly useful algorithms that provide solutions, but neither is even close to being ready for release.

Finally, I recommend that the community collaborate to build and release an open-source dataset focused on healthcare conversational AI using the ParlAI framework. From what I have learned so far, ParlAI is the most useful library for conversational AI, as it has tons of datasets and pre-trained models. Therefore, if we contribute a dataset that provides chatbot-usable data for healthcare, we can train or fine-tune a healthcare conversational model and leapfrog healthcare to the latest developments in conversational AI. This would allow us to train models such as BERT and GPT-2/3, and hopefully to release these tools in a real healthcare setting to help patients and improve health outcomes.


1. Ranschaert ER, Morozov S, Algra PR. Artificial Intelligence in Medical Imaging: Opportunities, Applications and Risks. Springer; 2019.

2. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366: 447–453.

3. Wen A, Fu S, Moon S, El Wazir M, Rosenbaum A, Kaggal VC, et al. Desiderata for delivering NLP to accelerate healthcare AI advancement and a Mayo Clinic NLP-as-a-service implementation. NPJ Digit Med. 2019;2: 130.

4. Fadhil A. Beyond Patient Monitoring: Conversational Agents Role in Telemedicine & Healthcare Support For Home-Living Elderly Individuals. arXiv [cs.CY]. 2018. Available: http://arxiv.org/abs/1803.06000

5. Nicolas Bayerque G. A short history of chatbots and artificial intelligence. In: VentureBeat [Internet]. VentureBeat; 15 Aug 2016 [cited 1 Jul 2020]. Available: https://venturebeat.com/2016/08/15/a-short-history-of-chatbots-and-artificial-intelligence/

6. Epstein J, Klinkenberg WD. From Eliza to Internet: a brief history of computerized assessment. Comput Human Behav. 2001;17: 295–314.

7. Weizenbaum J. ELIZA — a computer program for the study of natural language communication between man and machine. Commun ACM. 1966;9: 36–45.

8. Yao M. 6 Technical Approaches For Building Conversational AI. In: TOPBOTS [Internet]. 11 Sep 2018 [cited 26 Jun 2020]. Available: https://www.topbots.com/building-conversational-ai/

9. Wolf T. 🦄 How to build a State-of-the-Art Conversational AI with Transfer Learning. In: HuggingFace [Internet]. 9 May 2019 [cited 21 Aug 2020]. Available: https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313

10. Sanh V. 🏎 Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT. In: HuggingFace [Internet]. 28 Aug 2019 [cited 21 Aug 2020]. Available: https://medium.com/huggingface/distilbert-8cf3380435b5