CNN insights: What do convolutional neural networks learn about free text? Part 1 of 7
Exploring representations learned by CNNs about free text
Deep learning techniques ‘learn their own representations’ of the inputs we give them. But what representations do they learn? We have some idea of what convolutional neural networks (CNNs, also known as convnets) are doing when trained on images, but there seems to be less insight into what they are doing with text data. Sure, we know we can investigate the word embedding space, but what representations have been learned about sequences of word embeddings, i.e. phrases and sentences?
This blog series discusses techniques for getting insight into the features your CNN extracts when you train it as a text classifier. The series grew out of work I did with my collaborators on a very large clinical dataset, using a CNN as a text classifier. Part of that project was written up as a research paper, but quite a few ideas that didn’t fit into the paper seemed interesting enough to present here instead.
Although the domain insights we obtained using these methods were specific to our particular dataset and classification challenge, the methods themselves are more widely applicable: anyone who wants to understand how their CNN is fitting to their dataset can use them.
Series overview
We will take a look at what CNNs do when they are trained to classify images and discuss three methods for exploring the representations learned by CNNs trained on imaging data. In later posts, we will adapt these three methods to interrogate the representations CNNs learn about text data. You can also read about the specific insights we gained into how a CNN works on our domain problem. To give these insights some context, we also introduce our dataset, VetCompass™, one of the largest clinical corpora in the world, comprising 9.5 million animal records, and discuss the domain problem that we trained a CNN to solve for us.
Highlights
To do well in our task, the CNN must fit to diagnostic language in the clinical notes, i.e. identify sections of the text where a clinician is writing about making or ruling out a diagnosis. Two of the techniques that I go into detail about later reveal (at least partly) what the CNN is doing. In short: it does often fit to the most relevant sections of the text.
Here are some example short diagnostic token sequences that the CNN fitted to:
risk of cushings / diabetes
rule out cushings / hypot4
rule out cushings / addisons
risks of cushings / dm
rule out cushings , t4
risk of demodex or sarcoptic
rule out diabetis , cushings
indicative of cushings or addisons
risks eg diabetes , arthritis
risk of spay in season
rule out cushings / hypothyroidism
diseases eg diabetes / hyperthyroid
diseases like cushings or diabetes
risks , diabetes , cushings
rule out chellietella or sarcoptes
too possible cushings or hypothyroid

Here is an example visualisation indicating the most relevant diagnostic tokens in a sentence. Don’t worry too much about how to interpret this chart here, but the longer the bars in the chart, the more relevant the CNN thought the token was to a diagnosis of cardiomyopathy:
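To give a rough idea of how bars like these can be produced, here is a minimal sketch in the spirit of the token-hiding technique covered in Part 5 (an illustration, not the exact code from our project). It assumes a trained Keras binary classifier `model` with a sigmoid output that takes padded sequences of token ids, and a hypothetical `pad_id` used to hide tokens:

```python
import numpy as np
import matplotlib.pyplot as plt

def token_relevances(model, token_ids, pad_id=0):
    """Relevance of each token = drop in predicted probability when it is hidden."""
    base = model.predict(np.array([token_ids]), verbose=0)[0, 0]
    scores = []
    for i in range(len(token_ids)):
        occluded = list(token_ids)
        occluded[i] = pad_id  # hide one token at a time
        prob = model.predict(np.array([occluded]), verbose=0)[0, 0]
        scores.append(base - prob)
    return scores

def plot_relevances(tokens, scores):
    """Horizontal bar chart: the longer the bar, the more relevant the token."""
    ys = np.arange(len(tokens))
    plt.barh(ys, scores)
    plt.yticks(ys, tokens)
    plt.gca().invert_yaxis()  # first token at the top
    plt.xlabel('relevance (drop in predicted probability)')
    plt.tight_layout()
    plt.show()
```

Re-running the classifier once per token is slow but simple; Part 6 discusses scoring token sequences by their relevance more directly.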
Series links
Part 1 : Introduction
Part 2 : What do convolutional neural networks learn about images?
Part 3 : Introduction to our dataset and classification problem
Part 4 : Generating text to fit a CNN
Part 5 : Hiding input tokens to reveal classification focus
Part 6 : Scoring token-sequences by their relevance
Part 7 : Series conclusion
Note to implementors
If you want to implement some of the ideas in this series, you should already know a bit about the following topics before you start (this isn’t necessary for the general reader):
- You roughly know what word embeddings are (e.g. word2vec) and how to create them
- You roughly know how to apply CNNs to text
All my code examples assume that you have a training set and have already built and trained a CNN model; the examples are written in Keras.
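For concreteness, here is a minimal sketch of the kind of model these posts assume: a Keras CNN text classifier over padded sequences of token ids. The vocabulary size, sequence length, and layer sizes are illustrative assumptions, not the configuration from our project:

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 20000  # illustrative, not the values from our project
EMBED_DIM = 100
SEQ_LEN = 200       # notes padded/truncated to this many token ids

model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),   # learned word embeddings
    layers.Conv1D(128, 3, activation='relu'),  # filters over 3-token windows
    layers.GlobalMaxPooling1D(),               # strongest response per filter
    layers.Dense(1, activation='sigmoid'),     # binary label, e.g. "diagnosis present?"
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```

A single Conv1D layer with 3-token filters is the simplest version; using several filter widths in parallel is a common variation.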
Next part : Part 2 : What do convolutional neural networks learn about images?
