CNN insights: Scoring token-sequences by their relevance. Part 6 of 7

Method 3 : rank input sentences by a filter’s activation function score

Noel Kennedy
10 min read · May 21, 2018

Series links

Part 1 : Introduction

Part 2 : What do convolutional neural networks learn about images?

Part 3 : Introduction to our dataset and classification problem

Part 4 : Generating text to fit a CNN

Part 5 : Hiding input tokens to reveal classification focus

Part 6 : Scoring token-sequences by their relevance

Part 7 : Series conclusion

This method is somewhat similar to the sentence-generation method. It works as follows : we choose a particular filter that we want to examine but, instead of generating new synthetic sentences to maximise the filter’s activation score, we feed real sentences from our training set into the network and simply record the filter’s activation score for every sentence. We end up with each sentence in the training set having an activation score for each filter. Then we take the highest-scoring sentences as the best examples of what the filter is looking for.

This method avoids the pitfall of the generative approach because the ranking will be of real token sequences, not the machine-generated sentences we had before.

Experimental setup

  • Fit the CNN to the training corpus

  • Feed each sentence from the training corpus into the network and perform a forward pass
  • Record the activation scores of each filter for each sentence
  • For each filter, sort all the sentences in the training set by the filter’s activation score descending so the most strongly activating sentences are first
  • Take the top n sentences from the sorted list and look for patterns to gain insight into what the filter has fitted to

Code
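Below is a minimal sketch of the ranking procedure. It assumes a PyTorch model whose first convolutional layer is exposed as model.conv1, and an encode function that turns a token list into a (1, seq_len) tensor of token ids; these names and shapes are illustrative placeholders, not the exact architecture used in this series.

import torch

def rank_sentences_by_filter(model, sentences, encode, filter_index, top_n=20):
    """Return the top_n sentences that most strongly activate one filter."""
    activations = {}
    scored = []

    # A forward hook records the first conv layer's output on every pass;
    # the assumed output shape is (1, n_filters, seq_len').
    def hook(module, inputs, output):
        activations["conv1"] = output.detach()

    handle = model.conv1.register_forward_hook(hook)
    model.eval()
    with torch.no_grad():
        for sentence in sentences:
            model(encode(sentence))  # full forward pass over a real sentence
            act = activations["conv1"][0, filter_index]  # shape (seq_len',)
            # Score the sentence by the filter's strongest activation and
            # remember where it fired, for later inspection.
            score, position = act.max(dim=0)
            scored.append((score.item(), position.item(), sentence))
    handle.remove()

    # Sort descending so the most strongly activating sentences come first.
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_n]

Because the hook also records where each filter fired most strongly, a setup like this lets you slice out the filter-width token window around that position, which yields n-gram examples like the ones listed below.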

Results

This method works well and has produced some genuinely interesting insights into our dataset.

Recall our CNN was trained to disambiguate disease references in veterinary clinical data. Often clinicians write about diseases in patient notes when that patient doesn’t have the disease. We wanted to train a CNN to classify disease references according to whether the patient was diagnosed with that disease or not.

I applied the sentence-ranking method which is the subject of this post, and here is what I found.

Notes on selection of examples : I’ve cherry-picked some filters to demonstrate our insights into our dataset and how the CNN has learned to disambiguate disease references. I haven’t cherry-picked the example token sequences; I just de-duplicated them for clarity, since precisely the same phrase often occurs multiple times because these phrases also appeared multiple times in the source data. I also removed any potentially identifying information.

Filters which represented differential diagnoses

Clinicians sometimes list a number of potential diagnoses in what is called a differential diagnosis list. These diagnoses are the short-list of diagnoses that the clinician thinks might explain the patient’s symptoms, but they are (most often) mutually exclusive possibilities. By definition, then, a patient can’t have been diagnosed with any disease mentioned in their differential diagnosis list, so the performance of our classifier on these references is critical to overall classification success.

We found multiple filters which fitted to the tokens in differential diagnosis lists and were most strongly activated by them.

Note that the filters below are activating on punctuation as well as diagnostic terms; it seems that surrounding punctuation is a strong signal that a disease reference is not a diagnosis. (The -LRB- token is the Penn Treebank-style escape for an opening parenthesis produced during tokenisation.)

Filter 1:
diabetes , epi , hac
psychogenic , di , cushings
pancreatitis , epi , cushings
endocrine eg cushings or hypot4
malassezia , hypothyroidism
diabetes , thyroid , cushings
endocrinopathy -LRB- thyroid , cushings
malabsorption , epi , cushings
endocrinopathy -LRB- cushings , addisons
dz , hypot4 , cushings
diabetes , cushings , hypot4
endocrine -LRB- hypothyroidism , cushings
malabsorption , epi , cushings
diabetes , thyroid , cushings
alopecia , hypot4 , cushings
hypothyroidism , cushings , diabetis
demodex , hypothyroid , cushings
hypothyroidism , cushings , diabetes
hypothyroidism , cushings , addisions
Filter 2:
, hypert4 , neoplasia ?
ddx cushings , liver diz
: hypert4 , hepatic diz
: hypert4 , hepatic diz
, cushings , hypot etc
- cushings / hypot ?
-LRB- cushings , hypot4 etc
, diabetes , hypert ?
was haemolysis yestersay iatrogenic ?
- hypot4 , pss ?
, hepatic , endocrinopathy etc
, cushings , hypot4 ?
, dm , neoplasia ?
: hypert4 , hepatic diz
: hypert4 , hepatic diz
- hypot4 , pss ?
- renal , renal ?
be hypert or hcm ?
, dm , gastritis ?
to hyperglycaemia , neoplasia ?
Filter 3:
diabetes , hyperthyroidism from ddx
diabetes , ibd etc etc
diabetes , hepatopathy , neoplasia
demodicosis / dermatophytosis / neoplasia
demodex , scabies , etc
diabetes / calcinosis cutis etc
demodex / sarcoptes / etc
cushings / colangiohepatitis / neoplasia
demodicosis / dermatophytosis / sarcoptes
demodicosis / dermatophytosis / follicular
demodectic / sarcoptic mange ?
hyperthyroidism , neoplastic etc etc
demodecosis / scabies / dermatophtosis
demodex , dermatophytosis , hormonal
demodecosis / dermatophytosis / allergy
sarcoptes , dermatophytosis , etc
cushings , hypothyroidism , neoplasia
demodex , scabies ? ?
demodecosis , epitheliotrophic lymphoma etc
demodex vs dermatophytosis , poss

Filters that fitted to a mixture of differential diagnoses and hypotheticals / negations

Negations are easy to understand: the clinician states that the patient doesn’t have the disease. Hypothetical or conditional references occur when a clinician mentions a disease in a context where it could be diagnosed only conditionally on some event that has yet to happen.

Semantically these fragments are a mixed bunch, but they are all certainly useful features for classifying whether or not a patient has the disease.

Filter 1:
risk of cushings / diabetes
rule out cushings / hypot4
rule out cushings / addisons
endocrine eg cushings or hypot4
cushings , diabetes or hypothyroid
risks of cushings / dm
infectious , cushings , hypot
hormonal -LRB- cushings , hypot4
rule out cushings , t4
risk of demodex or sarcoptic
rule out diabetis , cushings
indicative of cushings or addisons
risks eg diabetes , arthritis
risk of spay in season
rule out cushings / hypothyroidism
diseases eg diabetes / hyperthyroid
diseases like cushings or diabetes
risks , diabetes , cushings
rule out chellietella or sarcoptes
too possible cushings or hypothyroid
Filter 2:
unlikley hypot , unlikley cushings
hypothyroid , cushings ...
eg cushings or hypot4 .
unlikely cushing 's adv o
iatrogenic cushings , adv re
, cushings , hypot etc
poss cushings and poss pu
underlying cushings , hypothroidism ?
poss cushings also ? ?
cushings / primary liverdz
eg diabetes , cushings etc
ddx cushings , neoplasia ,
, cushings , hypot4 ?
o cushings , warned re
underlying cushings , hypothroidism ?
uncontrolled cushings most likely reason
poss hypot4 ? ? ?
incase demodex etc advsied re
poss cruciate , arthritic etc
poss cruciate , adv sed
Filter 3:
no tibial thrust ,
present , tibial thrust ,
rule out cushings / addisons
, no tibial thrust /
but no tibial thrust and
draw on tibial thrust test
rule out meniscus disease ,
reveal advanced stifle osteoarthritis ,
rule out cushings / hypothyroidism
rule out elbow dysplasia .
rule out elbow dysplasia .
with positive tibial thrust ,
rule out elbow dysplasia /
rule out elbow dysplasia xx:xx:xx
rule out elbow dysplasia and
rule out stifle / other
detect some tibial thrust but
evidence of stifle effusion ;
potential mild stifle instability -

Filters that fitted to specific health care findings

These are filters that seemed to have fitted tightly to specific clinical processes or findings, rather than being generically useful across all diseases.

Filter 1:
, fna lump on elbow
noticed a lump under elbow
2cm by 1x2 l elbow
small scab just below elbow
small wart behind l elbow
removed small wart rf elbow
soft tissue shoulder / elbow
mainly of shoudler / elbow
hand side just behind elbow
large s / c elbow
open lumpy mass on elbow
lumps 1 behind each elbow
small wound behind l elbow
graze wound lat l elbow
excema lesion caudal lf elbow
near the axilla / elbow
noticed red area on elbow
, lipoma behind l elbow
chest wall behind r elbow
3cm diameter behind r elbow
Filter 2:
patellar reflex bilat nad
/ tibial reflex ok /
/ tibial reflex ok /
hl patella reflex ok /
/ patella reflexes ok /
, patella reflex ok ,
rule out demodex / sarcoptes
patellar reflex bilat nad
patella reflex normal .
patella reflex normal /
rule out demodex / sarcoptes
rule out demodex / sarcoptes
patellar reflexes ok .
patella reflex normal /
rule out demodecosis / dermatophytosis
incase underlying demodex / sarcoptes
risk develop dry eye -RRB-
, patella reflex ok ,
patellar reflex ok on
, patella reflex norm .
Filter 3:
thinking should delay spay until
would she consider spey ?
adv re lap spay at
wants to do spay at
discuss diet spay etc
nails / discussed spay incl
neutering - wil spay at
sure would prefer spay sooner
season , adv spay in
does want lap spay will
season , adv spay in
up about possible spay as
vaccines advise re spay /
breed - discussed spay .
wanted to book spay asap
deafness - adv spay not
worming 3mthly adv spey 3
2w , discussed spay etc
classes , adv spay in
date for the spay .

Insurance claims

If an insurance claim is made for a disease, it is very strong evidence that the disease was diagnosed. Many filters learned this and fitted to various aspects of the insurance-claim process.

Filter 1:
diabetes cont
diabetes claim
diabetes cont
ins chq
ins chq
diabetes cont
Filter 2:
insurance claim completed
insurance chq received
insurance claim completed
insurance cheque paid
insurance claim arthritus
insurance claim arthritus
insurance claim illness
ins claim hypertension
insurance claim allergies
insurance claim invoices
insurance claim arthritus
insurance claim allergies
insurance claim diabetes
insurance claim arthritus
insurance claim completed
insurance claim arthritus
Filter 3:
cont on insulin 2iu bid
claim continuation diabetic ketoacidosis xx/xx/xx
o gives insulin 1iu bid
restart insulin @ 1iu
sent continuation diabetes dates claimed
continuing with insulin 2iu bid
sent continuation diabetes dates claimed
continuation patellar luxation dates claimed
sent continuation diabetes dates claimed
> give 1iu neutral insulin
on insulin pzi 3iu sid
claim dry eye sicca xx/xx/xx
payment for optimmune px optimmune
onset keratoconjunctivitis sicca left eye
petplan cont keratoconjunctivitis sicca xx/xx/xx
bilateral keratoconjunctivitis sicca bilateral keratomalacia
increase insulin 3iu bid
[YEAR] history keratoconjunctivitis sicca keratomalacia
bilateral keratoconjunctivitis sicca bilateral keratomalacia

Filters that fitted to specific diseases or phrases

Some filters just picked out individual words or n-grams and ignored the surrounding text. Presumably these filters were useful in combination with the representations produced by other filters.

Filter 1:
claimed
claimed
claimed
claimed
claimed
claimed
claimed
claimed
[etc etc]
Filter 2:
diabetes mellitus
diabetes mellitus
diabetes mellitus
diabetes mellitus
diabetes mellitus
diabetes mellitus
diabetes mellitus
[etc etc]
Filter 3:
worming reminder - skipped on
worming reminder - skipped on
worming reminder - skipped on
worming reminder - skipped on
worming reminder - skipped on
worming reminder - skipped on
worming reminder - skipped on
[etc etc]

Can we use this method to interpret deeper filters?

All the examples shown above were taken from filters that were in the first convolutional layer. The method works very well on filters in this layer as it is easy to find a direct relationship between the activation score and a particular sequence of tokens in the sentence. Deeper filters in a CNN trained on text should fit to more abstract or complex representations of sentences. It would be very interesting if we could interpret what these filters have learned about their input sentence. Unfortunately, this method doesn’t work on deeper filters in this particular CNN (although if I changed the architecture of the network it might work).

The architecture of the CNN used in this series doesn’t have a direct relationship between filters in deeper layers and shorter n-gram sequences. The deeper convolutional layers in our CNN activate over the whole sentence; they don’t score each subsequence of tokens within the sentence in the same way that the first convolutional layer does. When we lose the relationship between a filter and a particular sequence of tokens and replace it with a relationship between a filter and a sentence, it seems that we lose the ability to interpret what the filter is fitting to.

The reason there is no direct relationship between token sequences and deeper filters in our particular CNN is that after each of our filter layers we have a ‘max pooling’ layer. The max-pooling layers throw away positional information, so we don’t know which tokens in the sentence activated a filter most strongly, only which sentences activated it most strongly. The sole exception is the first filter layer, because this was the only filter layer not preceded by a max-pooling layer in our CNN architecture.
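A toy example makes this information loss concrete (illustrative numbers only, not our network):

import torch
import torch.nn as nn

pool = nn.MaxPool1d(kernel_size=4)
a = torch.tensor([[[0.9, 0.1, 0.2, 0.3]]])  # peak at position 0
b = torch.tensor([[[0.1, 0.2, 0.3, 0.9]]])  # peak at position 3
print(pool(a), pool(b))  # both print tensor([[[0.9000]]])
# The two inputs pool to identical outputs, so any layer downstream of
# the pooling can no longer tell which token produced the activation.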

Nevertheless, I adapted this method to work on deeper filters in our CNN by simply listing the max-activating sentences rather than the max-activating token sequences. I wasn’t able to interpret what the filters were fitting to at all: i.e. the top-n sentences per filter seemed to have nothing in common that I could discern.

It would be interesting to remove the max-pooling layers and to re-run the experiment to see if the deeper filters fitted to abstract (but interpretable) concepts in our dataset.

Conclusion

This method is quite useful for interrogating what the filters in a CNN have fitted to. True, the method produces a bit of noise, in that the semantics of the phrases an individual filter fits to are not always coherent, to our minds at least; but we don’t really require machines to reason like us, or to use a categorisation system that makes sense to us. We can certainly see that the machine has learned several effective heuristics directly from the data, and these have allowed it to achieve good overall performance on our classification task.

Next post : Part 7 : Series conclusion
