Generating biomedical abstracts with GPT-2

Yannis Papanikolaou
Nov 1, 2019


An example from Healx’s Healnet graph, powered by our NLP tools.


  • We have developed a scientific abstract generator, available at
  • Our abstract generator is powered by OpenAI’s GPT-2 model and fine-tuned on PubMed abstracts.
  • In developing this model, we wanted to better understand GPT-2, as well as its strengths and limitations in understanding language from the biomedical domain.

At Healx, we’re constantly on the lookout for biomedical knowledge we can extract from raw text to assist us in generating our hypotheses towards discovering new treatments for rare diseases. Extracting knowledge from text is inevitably linked to Natural Language Understanding, which is, in and of itself, challenging.

Add to that the complex and constantly changing biomedical jargon we work with, and you can see why we’re always asking ourselves: what is the best way to build strong models that extract entities or relations? And how do we deal with noise, ambiguities or concept drift over time?

So, a few months ago, when OpenAI’s researchers released their much-discussed language model, known as GPT-2, capable of convincingly generating text, we absolutely wanted to get our hands on it. Go to to try out the scientific abstract generator we developed, using GPT-2.

Given title: Effect of garlic consumption in systemic lupus erythematosus.

Generated abstract

OBJECTIVES: To determine the effect of garlic consumption in systemic lupus erythematosus (SLE) and to evaluate the safety of garlic consumption. METHODS: This was a single-center, randomized, controlled clinical trial. We randomly assigned patients with SLE to receive garlic (G; n = 30) or placebo (P; n = 30) for 12 mo. The primary endpoint was the percentage of patients with an improvement of SLE activity (SLEDAI-2K), which is a composite measure of SLE activity. RESULTS: The G and P groups did not differ significantly at baseline or at 12 mo. At 12 mo, the G and P groups showed no difference in improvement of the SLEDAI-2K (mean difference -0.1, 95% CI: -2.4 to 2.4; P = 0.8). The G and P groups also had significant reductions in the number of adverse events (P < 0.001). No severe adverse events were recorded. CONCLUSIONS: In this study of SLE patients receiving G (n = 30) or P (n = 30), the consumption of garlic for 12 mo did not improve their disease or reduce adverse events compared with the control.

What is GPT-2?

GPT-2 is the second version of OpenAI’s Generative Pretrained Transformer (GPT) model. OpenAI used the Transformer neural architecture as the building block for a large language model that was pre-trained on huge amounts of text, with the objective of generating new content. The model can be fine-tuned on new data to solve particular tasks.

But GPT-2 has been just one milestone in a year full of advances in NLP. These advances have mainly come on the back of the Transformer and crazy amounts of hardware horsepower, and have led to one language model after another raising the bar in Natural Language Understanding tasks. First it was GPT, then BERT, followed by GPT-2, ERNIE, XLNet, RoBERTa and most recently ALBERT. Apparently, there are more to come, with even fancier names, though hopefully not from Sesame Street.

To give a sense of the extent of these advances, the best performance on the RACE reading comprehension task rose from 44.1% in 2017 to 89.4% (Google’s ALBERT, Oct 2019).

Unlike the other models, GPT-2 and its predecessor GPT were pre-trained with a focus on generating text rather than solving supervised learning tasks.

What we’ve been doing with GPT-2

In fact, GPT-2 delivered alarmingly good results, so good that OpenAI took the controversial decision to release its models publicly only gradually, allegedly to prevent misuse and to better prepare the community for the challenges the model brought about.

We initially experimented with the smallest model (124M parameters), but were not happy with the results. When the third model (774M parameters) became available, we were ready to put our plan into action: we gathered slightly more than 1% (half a million) of PubMed abstracts, and fine-tuned the model on them.

Our goal was to observe how well the model would learn language structure and be able to replicate it. We also wanted to see how novel the generation would be and, a more ambitious goal, to see to what extent the digested abstracts would actually lead the model to learn plausible facts.

What we’ve learned

At the very least, this whole experience has been enlightening with regard to the capabilities of GPT-2, as well as the pitfalls to avoid when fine-tuning or using it, so we decided to share some details here to help other researchers or users in their experiments with it.

For our experiments, we fed titles into our model from articles that had already been published on PubMed and inspected the results:

  • The model can successfully follow its “train of thought” across sentences.
  • It successfully learned abbreviations and sometimes generated new, meaningful ones.
  • It learned to properly structure an abstract into sections, such as Objectives, Methods and Conclusions, as often found in biomedical abstracts. That could be interesting in terms of summarising or extracting factual data.
  • It’s not that good at maths:

…The study was carried out with 1250 participants, of whom 728 and 743 were children and adults, respectively, from Spain…

  • On multiple occasions, it understood ranges of numbers:

…We studied 12 patients with a median age of 44.8 years (range, 21.3–58.1 years); most patients were female (71.4%), were white (87.5%), and had a mean AP duration of 15.9 days (range, 8–23 days). CVVHF was performed for a median of 19.0 hours (range, 8–30.0 hours)….

  • It can successfully come up with new drugs. Well, sort of:

Intravenous and intraperitoneal administration of nafenib improves muscle mass and function in rats with cerebral vasospasm. BACKGROUND: Nafenib, an oral, cytotoxic, cysteine-rich cysteine protease inhibitor, was tested in animal models of cerebral vasospasm by treating animals with nafenib intravascularly or intraparenterally.

The last example comes from an unconditionally generated article. Either way, we’d be really interested in knowing the chemical structure of “nafenib”!
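Because the generated abstracts follow the labelled-section layout noted in the list above, pulling those sections apart is straightforward. The following is a minimal sketch, not part of our pipeline: the function name `split_sections` and the regular expression are illustrative assumptions, and the heuristic simply treats any run of three or more capital letters (and spaces) followed by a colon as a section label. It tolerates a missing space before a label, which sometimes occurs in generated text.

```python
import re
from typing import Dict

# Heuristic (an assumption for this sketch): a section label is a run of
# three or more capital letters/spaces followed by a colon, e.g. "METHODS:".
# Two-letter tokens such as "CI:" are deliberately too short to match.
SECTION_LABEL = re.compile(r"\b([A-Z][A-Z ]{2,}):")


def split_sections(abstract: str) -> Dict[str, str]:
    """Split a structured abstract into {label: text}, in order of appearance."""
    matches = list(SECTION_LABEL.finditer(abstract))
    sections: Dict[str, str] = {}
    for i, match in enumerate(matches):
        # A section's text runs until the next label, or to the end of the string.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(abstract)
        sections[match.group(1).strip()] = abstract[match.end():end].strip()
    return sections


if __name__ == "__main__":
    sample = "OBJECTIVES: To test garlic. METHODS: A trial.RESULTS: No difference."
    for label, text in split_sections(sample).items():
        print(f"{label}: {text}")
```

A rule this simple will misfire on unusual labels or on capitalised acronyms that happen to precede a colon, so it should be read as a starting point for inspection rather than a robust parser.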

GPT-2 in application

Despite its surprising results, GPT-2 has seen relatively little use so far. Tabnine has used it for code auto-completion. Recently, researchers used it for summarization, showing that careful fine-tuning can lead to competitive results. Another interesting application has been in creating templates for medical education tests.

We believe that GPT-2’s deeper implications and strengths, as well as its full potential, are not yet fully understood. We also firmly believe that a number of applications will eventually benefit from it, such as biomedical question answering or summarization.

And we believe that the key in all this will be clever fine-tuning. We’re using the scientific abstract generator we created to improve our understanding of the millions of abstracts we process. And, as we said: we process these abstracts to extract biomedical knowledge from them, because, ultimately, everything we do is towards advancing treatments for rare diseases.

In the meantime, we’re getting ready for the next state-of-the-art model.

Technical stuff

  • To fine-tune our model we used gpt-2-simple, and to make the 774M model fit into our 16GB GPUs, we relied on the very useful comments on that GitHub issue.
  • We tried different parameters, but eventually ended up with a learning rate of 2e-5, a single epoch over the data and a maximum sequence length of 256 (that was necessary due to the 774M model’s memory restrictions).
  • Fine-tuning on a single V100 GPU with 16GB of memory took around 6 days.
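For readers who want to try something similar, here is a minimal sketch of how the settings listed above might be passed to gpt-2-simple. It is an illustration under stated assumptions, not the exact script behind the generator: the helper names, the dataset path and the token count are hypothetical, and the memory-saving changes discussed in the GitHub issue are not reproduced. As far as we know, the stock `finetune` call does not take a sequence-length argument, so the value 256 appears here only in the rough estimate of how many steps make up one pass over the data.

```python
import math

# Hyperparameters reported in the list above.
LEARNING_RATE = 2e-5
MAX_SEQ_LEN = 256
MODEL_NAME = "774M"


def steps_for_one_epoch(total_tokens: int, seq_len: int = MAX_SEQ_LEN,
                        batch_size: int = 1) -> int:
    """Roughly how many training steps it takes to see every token once.

    This is an approximation: each step is assumed to consume
    batch_size * seq_len tokens.
    """
    if total_tokens <= 0 or seq_len <= 0 or batch_size <= 0:
        raise ValueError("total_tokens, seq_len and batch_size must be positive")
    return math.ceil(total_tokens / (seq_len * batch_size))


def finetune_sketch(dataset_path: str, total_tokens: int) -> None:
    """Fine-tune the 774M model on a plain-text file (hypothetical path)."""
    # Imported lazily because gpt-2-simple needs TensorFlow installed.
    import gpt_2_simple as gpt2

    gpt2.download_gpt2(model_name=MODEL_NAME)  # fetches the pre-trained weights
    sess = gpt2.start_tf_sess()
    gpt2.finetune(
        sess,
        dataset=dataset_path,
        model_name=MODEL_NAME,
        steps=steps_for_one_epoch(total_tokens),
        learning_rate=LEARNING_RATE,
    )


if __name__ == "__main__":
    # The token count is made up, purely to show the arithmetic.
    print(steps_for_one_epoch(total_tokens=1_000_000))
```

The step estimate matters because gpt-2-simple is driven by a step count rather than an epoch count, so "one epoch" has to be translated into a number of steps before training starts.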