CNN insights: Introduction to our dataset and classification problem. Part 3 of 7

Finding patients with a particular disease in a large collection of free-text clinical notes

Noel Kennedy

Series links

Part 1 : Introduction

Part 2 : What do convolutional neural networks learn about images?

Part 3 : Introduction to our dataset and classification problem

Part 4 : Generating text to fit a CNN

Part 5 : Hiding input tokens to reveal classification focus

Part 6 : Scoring token-sequences by their relevance

Part 7 : Series conclusion

Our dataset is the VetCompass™ corpus, a very large collection of clinical free-text notes aggregated from veterinary clinics around the UK. VetCompass™ is quickly becoming one of the largest clinical free-text corpora in the world (for any species, including humans!).

Large scale clinical free-text data

Here are some figures to give you an idea of the scale of the VetCompass™ corpus:

  • 9.5m patients (pets) from the UK
  • 55m clinical notes
  • 181m treatment records
  • 2.7bn tokens
  • 1.3m clinical codes applied ‘in-clinic’ (2% of total visits)
  • 220k clinical codes applied retrospectively by researchers
  • A further 18m patients in the import process pipeline
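
To put those figures in proportion, here is a quick back-of-the-envelope calculation. It only divides the rounded numbers listed above, so treat the results as rough averages rather than measured statistics.

```python
# Rounded figures taken from the list above.
patients = 9.5e6   # 9.5m patients
notes = 55e6       # 55m clinical notes
tokens = 2.7e9     # 2.7bn tokens

notes_per_patient = notes / patients
tokens_per_note = tokens / notes

print(f"about {notes_per_patient:.1f} notes per patient")   # about 5.8 notes per patient
print(f"about {tokens_per_note:.0f} tokens per note")       # about 49 tokens per note
```

So a typical note is short, and most of the corpus's size comes from the sheer number of notes.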

Purpose of the VetCompass™ programme

The VetCompass™ programme is a collection of dozens of ongoing research projects aimed at improving animal welfare:

The Veterinary Companion Animal Surveillance System (VetCompass™) is an international initiative focused on improving companion animal health. This not-for-profit research project is a collaboration between The Royal Veterinary College (RVC) and the University of Sydney through which we aim to investigate the range and frequency of companion animal health problems and identify important risk factors for the most common disorders.

You can read about our current and completed projects, including many of our research outputs, on the main VetCompass website. Most of our projects are veterinary epidemiology projects, but we also work with computer scientists on methods applied to large-scale clinical data.

We are looking for collaborators who are interested in text mining, natural language processing and machine learning applied to clinical data. We would especially like to hear from research groups at universities, or others, who want to try out methods that will port to human clinical text but who can’t get access to large volumes of clinical data.

Get in touch if this sounds like you! (Noel Kennedy :

Background to our domain problem (we used a CNN to solve it)

I’m going to quote our forthcoming work here because it is a good introduction to the problem domain that we used a CNN to address:

Clinicians write clinical notes which refer to diseases that their patients don't actually have. For some diseases, the vast majority of disease references are written in the notes of patients who don’t have the disease.

Disease references are often negated (“ruled out pancreatitis”), hypothetical (“at risk of developing pancreatitis”), generic (“pancreatitis is more common in men”), historical (“previous history of pancreatitis”), refer to another person (“father had pancreatitis”), hedged (“could be pancreatitis”), or part of a differential diagnosis (“ddx: pancreatitis, gastroenteritis or appendicitis”).

Source: Kennedy et al., 2018 (forthcoming)

So why is this a problem? Well, our epidemiologists perform free-text searches, for example trying to find patients with pancreatitis by searching for “pancreatitis”. If you execute this search on our 55 million clinical notes you will get a huge number of hits but, because of the phenomena above, not all of those hits are patients who have pancreatitis. Our epidemiologists get thousands upon thousands of hits for a term, and then have to read through thousands of patients’ notes where the patient doesn’t actually have pancreatitis…

We call this problem the false positive (FP) problem because the hits that get returned by our clinical search engine are FPs if the patient doesn’t actually have the disease the epidemiologist is interested in.
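
Here is a minimal sketch of the FP problem in Python. The notes are toy strings built from the example phrases quoted earlier, and the true/false labels are my own illustrative assumptions, not VetCompass™ data. `keyword_search` is a hypothetical stand-in for a free-text search, not our clinical search engine.

```python
# Toy notes: (text, does the patient actually have pancreatitis?).
# The labels are illustrative assumptions, not real data.
NOTES = [
    ("patient has pancreatitis", True),
    ("ruled out pancreatitis", False),              # negated
    ("at risk of developing pancreatitis", False),  # hypothetical
    ("previous history of pancreatitis", False),    # historical
    ("could be pancreatitis", False),               # hedged
    ("vaccination booster given", False),           # no disease reference
]


def keyword_search(notes, term):
    """Return every (text, has_disease) pair whose text mentions the term."""
    term = term.lower()
    return [(text, has_disease) for text, has_disease in notes if term in text.lower()]


hits = keyword_search(NOTES, "pancreatitis")
true_positives = [text for text, has_disease in hits if has_disease]
false_positives = [text for text, has_disease in hits if not has_disease]
precision = len(true_positives) / len(hits)

print(f"{len(hits)} hits, {len(false_positives)} false positives, precision {precision:.2f}")
# prints: 5 hits, 4 false positives, precision 0.20
```

Every note that mentions the term is returned, but only one of the five hits is a patient who has the disease. Someone still has to read the other four.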

We define FP as: a disease reference in a patient's notes if that patient wasn't diagnosed with the disease at the time the note was written. In comparison, an example true positive (TP) disease reference would be an assertion that the author of the note believes that the patient truly has the disease in question at that time (“patient has pancreatitis”).

Source: Kennedy et al., 2018 (forthcoming)

We used a CNN classifier to determine if a particular disease reference was an FP or a TP. We were pleased with our results and we wanted to get some insight into what our CNN was fitting to in order to get these results. What had the CNN learned about our clinical notes in order to be able to classify a disease reference as being a TP or an FP? We adapted the three methods that gave insight into image-based CNNs to work on our text-based CNN to see if we could interpret what representations the text-based CNN had learned.
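
This post doesn't describe our architecture, so the sketch below is not our model. It is a generic, pure-Python illustration of how a text CNN can turn a token sequence into a TP probability: look up an embedding per token, slide one convolutional filter over pairs of adjacent tokens, keep the strongest activation (max-over-time pooling), and pass it through a sigmoid. The embeddings, the filter and the output weights are all hand-picked assumptions chosen so the toy behaves sensibly. A real CNN learns them from labelled data.

```python
import math

# Hand-picked 2-dimensional toy embeddings (assumptions for illustration only).
EMBEDDINGS = {
    "has": [1.0, 0.0],
    "ruled": [-1.0, 0.0],
    "out": [-1.0, 0.0],
    "pancreatitis": [0.0, 1.0],
}
UNKNOWN = [0.0, 0.0]

# One convolutional filter spanning two tokens: one weight row per position.
FILTER = [[1.0, 0.0], [0.0, 1.0]]
OUTPUT_WEIGHT, OUTPUT_BIAS = 2.0, -2.0


def convolve(tokens):
    """Apply the filter to every 2-token window, followed by a ReLU."""
    vectors = [EMBEDDINGS.get(token, UNKNOWN) for token in tokens]
    width = len(FILTER)
    activations = []
    for start in range(len(vectors) - width + 1):
        window = vectors[start:start + width]
        total = sum(
            weight * value
            for weights, values in zip(FILTER, window)
            for weight, value in zip(weights, values)
        )
        activations.append(max(0.0, total))
    return activations


def tp_probability(text):
    """Max-over-time pooling over the activations, then a sigmoid output unit."""
    pooled = max(convolve(text.lower().split()), default=0.0)
    return 1.0 / (1.0 + math.exp(-(OUTPUT_WEIGHT * pooled + OUTPUT_BIAS)))


print(f"{tp_probability('patient has pancreatitis'):.2f}")  # 0.88
print(f"{tp_probability('ruled out pancreatitis'):.2f}")    # 0.12
```

In this toy the filter fires on “has pancreatitis” but not on “out pancreatitis”, so the first phrase scores as a likely TP and the second as a likely FP. Working out what the learned filters of a real model respond to is exactly the kind of question the rest of the series looks at.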


You should now have a rough idea of our corpus and classification problem, which will help you follow the insights gleaned from our CNN in the rest of the series.

Next post : Generating text to fit a CNN
