Medical Advice Generator (MAG): a chatbot designed for medical studies

Andrew Rosenswie
Jul 28, 2023


Written by: Kátia M. Barros, Juan Brugada, and Andrew Rosenswie

Introduction:

Germany’s labor shortage has prompted the country to actively seek skilled workers, including medical professionals, to address the issue (source: DW). To facilitate this process, immigration reforms have been implemented at the federal level. One notable exception to these reforms remains, however: non-EU medical professionals must still pass a knowledge test known as the “Kenntnisprüfung.”

Germany places a strong emphasis on attracting highly qualified medical professionals from outside the European Union. While the immigration reforms aim to streamline the entry process for skilled workers, the country also insists that non-EU medical professionals meet the necessary standards and qualifications. The Kenntnisprüfung therefore serves as a means of assessing the proficiency and expertise of these professionals before they are granted the opportunity to practice medicine in Germany.

One key aspect of the Kenntnisprüfung is that it applies to all medical professionals, regardless of how long they have been practicing their respective disciplines. Candidates may attempt the test up to three times; anyone who fails all three attempts is not permitted to practice medicine in Germany.

Objective:

Our project aimed to provide assistance to these doctors from outside the EU by developing an interactive chatbot, named Medical Advice Generator (MAG).

Methodology:

MAG is equipped with extensive medical knowledge and is trained using question and answer (Q&A) pairings sourced from four publicly available datasets: PubMedQA (Jin et al., 2019), MedQA (Jin et al., 2020), MMLU (Hendrycks et al., 2021) and MedMCQA (Pal et al., 2022).

As MMLU covers a wide range of subjects, only the following topics were used: anatomy, clinical knowledge, college medicine, human aging, medical genetics, nutrition, professional medicine and virology. Since each data source came in a different format, all of them were transformed into the same Q&A format. For the multiple-choice sources, questions or answers that are meaningless without the option list, such as ‘None of the above’, ‘All of the above’, ‘All of these questions’, ‘Both of the statements are correct’, ‘Which of the following is incorrect?’ and ‘Which of the following statements is incorrect’, were deleted. In the end, we collected 570,712 Q&As for MAG’s embedded knowledge base.
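To make this transformation concrete, below is a minimal sketch of the filtering and normalization step. The field names, the example record, and the exact marker list are illustrative assumptions, not the precise code used for MAG.

```python
from typing import List, Optional

# Options/answers that are meaningless once the multiple-choice context is removed.
AMBIGUOUS_MARKERS = [
    "none of the above",
    "all of the above",
    "all of these questions",
    "both of the statements are correct",
    "which of the following is incorrect",
]

def to_qa_pair(question: str, options: List[str], correct_index: int) -> Optional[str]:
    """Convert one multiple-choice item into a plain Q&A string, or drop it."""
    answer = options[correct_index]
    combined = f"{question} {answer}".lower()
    if any(marker in combined for marker in AMBIGUOUS_MARKERS):
        return None  # discard items that cannot stand alone as a Q&A pair
    return f"Question: {question}\nAnswer: {answer}"

# Hypothetical MedMCQA-style record, used only to illustrate the call.
record = {
    "question": "Which vitamin deficiency causes scurvy?",
    "options": ["Vitamin A", "Vitamin B12", "Vitamin C", "None of the above"],
    "correct_index": 2,
}
qa = to_qa_pair(record["question"], record["options"], record["correct_index"])
if qa is not None:
    print(qa)
```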

To create MAG, we wrote a Python program that leverages the LlamaIndex library. Additionally, MAG employs the text-davinci-003 Generative Pre-trained Transformer (GPT) Large Language Model (LLM) provided by OpenAI. This model encompasses a staggering 175 billion parameters and has been trained on an extensive dataset of 45 TB of text data.

We configured the prompt handling around fixed token budgets for inputs and outputs. Figure 1 illustrates MAG’s design as a flowchart. The maximum input size of a given query was 4,096 tokens, while the maximum output was 2,000 tokens. The chunk overlap was set to 20 tokens and the chunk size limit to 600 tokens. To persist the index, we used GPTSimpleVectorIndex and stored it in a .json file. When the user submits a query, the query() method is called and the response is returned.

Figure 1: Step-by-Step flowchart for MAG.

By utilizing the text-davinci-003 LLM and the knowledge derived from the collected Q&A pairings, MAG possesses a wealth of medical information and is capable of providing accurate and detailed responses to user queries. MAG’s embeddings over diverse medical datasets ensure that it can address a wide range of medical topics and assist doctors effectively. We set the inference temperature, commonly referred to as the creativity level, to 0.1. With the temperature close to 0, MAG responds with precise and concise answers to a given question.
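For readers who want to reproduce this setup, the sketch below shows how the pieces described above fit together. It assumes an early-2023 release of llama_index in which GPTSimpleVectorIndex, PromptHelper, and save_to_disk/load_from_disk were the public API (these names changed in later releases); the qa_data folder, the index file name, and the sample query are placeholders.

```python
from langchain.llms import OpenAI
from llama_index import (
    GPTSimpleVectorIndex,
    LLMPredictor,
    PromptHelper,
    ServiceContext,
    SimpleDirectoryReader,
)

# Token budgets described above: 4096 input tokens, 2000 output tokens,
# 20-token chunk overlap, 600-token chunk size limit.
prompt_helper = PromptHelper(
    max_input_size=4096,
    num_output=2000,
    max_chunk_overlap=20,
    chunk_size_limit=600,
)

# text-davinci-003 at temperature 0.1 for precise, concise answers.
llm_predictor = LLMPredictor(
    llm=OpenAI(model_name="text-davinci-003", temperature=0.1, max_tokens=2000)
)
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, prompt_helper=prompt_helper
)

# "qa_data/" is a placeholder folder containing the Q&A pairs as text files.
documents = SimpleDirectoryReader("qa_data").load_data()
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
index.save_to_disk("mag_index.json")

# At query time, the saved index is loaded and query() returns the response.
index = GPTSimpleVectorIndex.load_from_disk("mag_index.json", service_context=service_context)
response = index.query("What is the first-line treatment for community-acquired pneumonia?")
print(response)
```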

ChatGPT Limitations:

One notable characteristic of ChatGPT is its default inference temperature of 1.0. This temperature setting affects the creativity and length of the responses generated by the model, which can be beneficial in certain contexts. However, it also presents a potential drawback: while ChatGPT’s responses may summarize a given topic, they can fail to provide specific information.

It is important to consider this limitation when utilizing ChatGPT for medical information. Although the model can provide a broad understanding of a medical concept or topic, it may not always furnish the nuanced or specialized information one would expect from a domain expert or a highly specific source. This is the gap that MAG is designed to fill.

Another limitation of ChatGPT is its knowledge cut-off: it was trained on information only as recent as September 2021, which is a constraint when more up-to-date medical information is required. MAG’s medical knowledge base, in contrast, can be updated and extended by building on its existing vector index file.

Qualitative Evaluation:

To evaluate MAG’s medical predictions, 36 Q&A pairs were extracted from five medical textbooks (Schmitz and Martin, 2008; Rasul and Syed, 2009; Kumar and Clark, 2011; Longmore et al., 2014; Collins, 2018). The textbook answers were used as our ground truth for model evaluation; this is discussed in more detail in the following section.

Two medical experts were asked to rate the quality of information in the responses as ‘very good’, ‘good’, ‘acceptable’, ‘poor’, or ‘very poor’. Additionally, they were asked to classify MAG’s outputs as ‘green’ (follow MAG’s advice), ‘yellow’ (be cautious with MAG’s response), or ‘red’ (do not follow MAG’s advice).

As shown in Figure 2, the average rating of the quality of information given by MAG was ‘good’ or ‘very good’ for 24 out of the 36 questions. As for the classification, the average rating was ‘yellow’ or ‘green’ for 33 out of the 36 answers.

Figure 2: Average response from two medical experts in relation to 36 answers given by MAG.

Quantitative Evaluation:

To assess the validity of MAG’s output, we posed the same 36 questions from the medical textbooks to ChatGPT and recorded its predictions. We leveraged the Hugging Face platform to calculate eight benchmark metrics commonly used in the literature, namely BLEU (Papineni et al., 2002), MAUVE (Pillutla et al., 2021), METEOR (Banerjee and Lavie, 2005), ROUGE-1 (Lin, 2004), ROUGE-2 (Lin, 2004), ROUGE-L (Lin, 2004), ROUGE-L_sum (Lin, 2004), and F1-score_avg (Rajpurkar et al., 2016). On these metrics, MAG demonstrated superior performance compared to ChatGPT, as shown in Table 1; a sketch of how such scores can be computed follows the table.

Table 1: MAG and ChatGPT-3.5 metrics calculated by comparison with the answers from the textbooks.
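For illustration, the sketch below shows how such scores can be computed with the Hugging Face evaluate library. The two short lists stand in for the 36 model answers and textbook references, and MAUVE is omitted here because it additionally requires a featurization model.

```python
import evaluate

# Placeholders for the 36 model answers and the corresponding textbook answers.
predictions = ["Aspirin inhibits platelet aggregation.", "Insulin lowers blood glucose."]
references = ["Aspirin irreversibly inhibits platelet aggregation.", "Insulin reduces blood glucose levels."]

bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
meteor = evaluate.load("meteor").compute(predictions=predictions, references=references)
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
# `rouge` contains rouge1, rouge2, rougeL and rougeLsum.

# SQuAD-style average F1 over token overlap between prediction and reference.
squad = evaluate.load("squad").compute(
    predictions=[{"id": str(i), "prediction_text": p} for i, p in enumerate(predictions)],
    references=[
        {"id": str(i), "answers": {"text": [r], "answer_start": [0]}}
        for i, r in enumerate(references)
    ],
)

print(bleu["bleu"], meteor["meteor"], rouge["rougeL"], squad["f1"])
```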

Conclusions:

MAG demonstrated superior performance compared to ChatGPT in the task of answering medical questions, and can thus help non-EU medical doctors study for the Kenntnisprüfung, the test required to become a licensed doctor in Germany. This advantage stems from our use of embeddings, which allow the LLM (text-davinci-003) to draw on external data (the medical Q&A pairings) and produce more accurate and relevant responses.

Even though we achieved extraordinary performance using the text-davinci-003 GPT LLM, further improvements to MAG could come from using more up-to-date models in the OpenAI GPT family, such as GPT-4, and from experimenting with alternative models such as Google’s LaMDA or the open-source BLOOM model available through Hugging Face.

From MAG’s evaluation process, we have concluded that human input is necessary for validation. Medical professionals are required to evaluate MAG’s answers, given the strict validation process required for AI applications aimed at the healthcare sector. To enhance the credibility of MAG’s responses, future studies should involve a more extensive vetting group comprising a diverse range of medical professionals. This approach will help ensure the validity of MAG’s performance.

The quality of the answers we get from MAG depends on the quality of the external knowledge we used, in this case the 570,712 Q&A pairings. To get more advanced answers from MAG, we would need to provide more specialized Q&A pairings, for example in particular domains of medicine such as cardiology or psychiatry. Through the LlamaIndex framework that we used, it is possible to add new specialized knowledge and build on top of the existing index created from the Q&A pairings, as sketched below.
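As a minimal sketch of what adding new knowledge could look like with the same early-2023 llama_index API used above (the index file name and the inserted Q&A pair are placeholders):

```python
from llama_index import Document, GPTSimpleVectorIndex

# Load the existing MAG index and insert a new, more specialized Q&A pair.
index = GPTSimpleVectorIndex.load_from_disk("mag_index.json")
new_pair = Document(
    "Question: What is the classic ECG finding in acute pericarditis?\n"
    "Answer: Diffuse concave ST-segment elevation with PR-segment depression."
)
index.insert(new_pair)
index.save_to_disk("mag_index.json")
```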

At the moment, medical doctors preparing to take the Kenntnisprüfung to become licensed in Germany rely on specialized knowledge databases, which contain a wealth of medical information but are not interactive. This is where MAG will prove to be a very useful tool, thanks to its interactivity and the possibility of growing its current medical knowledge base to include more specialized topics.

For more information, please visit our GitHub repository.

References:

Banerjee, S. and Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.

Collins, R. D. Differential diagnosis and treatment in primary care. 6th Edition. Wolters Kluwer, 2018.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J. Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.

Jin, Q., Dhingra, B., Liu, Z., Cohen, W., Lu, X. PubMedQA: A Dataset for Biomedical Research Question Answering. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2567–2577, 2019.

Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H., Szolovits, P. What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. arXiv preprint arXiv:2009.13081, 2020.

Kumar, P., Clark, M. 1000 Questions and Answers in Clinical Medicine. 2nd Edition, Elsevier, 2011.

Lin, C. Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74–81, 2004.

Longmore, M., Wilkinson, I. B., Baldwin, A., Wallin, E. Oxford Handbook of Clinical Medicine. 9th Edition, Oxford University Press, 2014.

Pal, A., Umapathi, L. K., Sankarasubbu, M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering. Proceedings of the Conference on Health, Inference, and Learning, PMLR 174:248–260, 2022.

Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318, 2002.

Pillutla, K., Swayamdipta, S., Zellers, R., Thickstun, J., Welleck, S., Choi, Y., Harchaoui, Z. MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers. NeurIPS 2021.

Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2383–2392, 2016.

Rasul, N., Syed, M. Differential Diagnosis in Primary Care. 1st Edition, Wiley-Blackwell, 2009.

Schmitz, P. G., Martin, K. J. Internal Medicine: Just the Facts. 1st Edition, McGraw-Hill Medical, 2008.
