Word Embedding Comparison for Disease Named Entity Recognition
This is a quick comparison of word embeddings for a Named Entity Recognition (NER) task on diseases and adverse conditions. It is not in any way exhaustive and was motivated primarily by wanting to try the ELMo embeddings, which have gained popularity this year, on a health natural language processing task.
Named Entity Recognition:
NER can be very useful in a lot of NLP tasks. One example most people instantly understand is being able to distinguish which kind of apple is meant in "Apple's stock fell today" versus "Apples are a great snack." NER is also very good at deciding whether a few words like "Jennifer Love Hewitt" are several words that could each be first names but are in fact one name.
There are vast treasure troves of healthcare text data, and NER can be useful here as well. It can extract and correctly classify phrases for high-risk diseases and symptoms from quickly written doctor notes. Acronyms are always a challenge, but we might be able to decide whether "At high risk for CA" written in the notes means the state or cancer/carcinoma (a skin-related cancer).
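To make that concrete, here is a tiny illustration using spaCy's off-the-shelf NER. spaCy is only for illustration; it is not the model used in the experiments below, and it assumes the en_core_web_sm model is installed.

```python
# Quick NER illustration with spaCy (illustrative only; assumes
# "python -m spacy download en_core_web_sm" has been run).
import spacy

nlp = spacy.load("en_core_web_sm")

for text in ["Apple's stock fell today.", "Apples are a great snack."]:
    doc = nlp(text)
    print(text, "->", [(ent.text, ent.label_) for ent in doc.ents])

# The first "Apple" is typically tagged as an organization, while the second
# sentence usually yields no entity at all -- exactly the kind of
# disambiguation NER gives us.
```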
Data:
Disease and Adverse Effects NER dataset that I used: https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/downloads/corpus-for-disease-names-and-adverse-effects.html
Embeddings:
The only word embeddings that weren't trained on Wikipedia in some form are the EHR (Electronic Health Record) embeddings. The number of words and the dimensionality of the word vectors vary quite a bit and, of course, affect performance.
So what are ELMo embeddings and why are they special? Unlike most of the widely used embedding methods, ELMo (Embeddings from Language Models) tokens are contextualized by the whole sentence they appear in, not just the tokens to their left and right. As mentioned in the ELMo paper, there are similar competing approaches such as context2vec and CoVe, both of which the ELMo authors show it outperforms where a direct comparison is possible. Not mentioned in the paper but worth noting are Facebook's InferSent and Google's Universal Sentence Encoder; these last two focus on sentence-level representations rather than the word-level embeddings that ELMo produces. The general idea in all cases is to get at the lexical layers that lie hidden within words.
ELMo Embeddings: https://allennlp.org/elmo
GloVe Embeddings (Common Crawl, 42B tokens, 1.9M vocab, uncased, 300d vectors): https://nlp.stanford.edu/projects/glove/
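To make the "contextualized" part concrete, here is a minimal sketch using the ElmoEmbedder interface from allennlp. The sentences and the layer-averaging choice are mine; other ELMo model sizes can be loaded by passing explicit options/weights files.

```python
# Sketch: the same word gets different ELMo vectors in different sentences,
# whereas a static lookup like GloVe would return the identical vector.
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # downloads the default pretrained weights on first use

sent_a = "the patient presented with acute heart failure".split()
sent_b = "the heart of the matter is the billing code".split()

# embed_sentence returns an array of shape (3 layers, num_tokens, dim);
# averaging the layers is one simple way to get a single vector per token.
vecs_a = elmo.embed_sentence(sent_a).mean(axis=0)
vecs_b = elmo.embed_sentence(sent_b).mean(axis=0)

heart_a = vecs_a[sent_a.index("heart")]
heart_b = vecs_b[sent_b.index("heart")]

cosine = heart_a @ heart_b / (np.linalg.norm(heart_a) * np.linalg.norm(heart_b))
print(f"cosine('heart' in A, 'heart' in B) = {cosine:.3f}")
# With GloVe the two vectors would be identical (cosine = 1.0) because the
# lookup ignores context; with ELMo they differ.
```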
Results:
All scores reported are the F1-Micro in BIO format over the Disease and Adverse Effect entities. Essentially, an entity is counted only if it is recognized as a whole: e.g. "heart failure atrial fibrillation chf" [Disease] counts if the entire span is tagged, but not if only "heart failure atrial fibrillation" is.
F1-Micro is the harmonic mean of micro-averaged precision and recall (see below). It is a reasonable (more on that below) choice of metric since it accounts for the class imbalance between the Adverse Effect and Disease entities, and we don't have a particular metric we need to optimize for:
Micro-Precision: (TP1+TP2)/(TP1+TP2+FP1+FP2)
Micro-Recall: (TP1+TP2)/(TP1+TP2+FN1+FN2)
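Here is a toy check of that whole-entity, micro-averaged scoring using the seqeval package. The scorer choice is mine and the tag sequences are made up for illustration.

```python
# Toy check of entity-level, micro-averaged scoring on BIO tags.
from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [["B-Disease", "I-Disease", "O", "B-Adverse", "O"]]
y_pred = [["B-Disease", "I-Disease", "O", "O",         "O"]]

# The Disease span matches exactly; the Adverse span is missed entirely.
print(precision_score(y_true, y_pred))  # 1/1 = 1.0
print(recall_score(y_true, y_pred))     # 1/2 = 0.5
print(f1_score(y_true, y_pred))         # harmonic mean ~ 0.667

# A boundary error gets no credit: tagging only the first Disease token
# counts as a false positive, so only the Adverse match scores.
y_pred_partial = [["B-Disease", "O", "O", "B-Adverse", "O"]]
print(f1_score(y_true, y_pred_partial))  # 0.5
```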
ELMo Embeddings (5.5B tokens, 200d): 0.779 ± 0.02
EHR/Biomedical Text Embeddings (approx. 3B words, w2v CBOW, 200d): 0.493 ± 0.05
GloVe (42B, 300d): 0.811 ± 0.04
GloVe (6B, 50d): 0.750 ± 0.04
GloVe (6B, 100d): 0.780 ± 0.01
GloVe (6B, 200d): 0.804 ± 0.04
GloVe (6B, 300d): 0.816 ± 0.03
FastText (16B, 300d): 0.791 ± 0.05
The results (if they show anything) seem to suggest the comparison is not a fair one. That said, perhaps the most surprising result was the very poor performance of the EHR embeddings. It perhaps highlights the need for a much larger corpus. Also, these were trained using the word2vec CBOW method, which would not be my first choice for rare disease words; this likely shows up in the comparison to GloVe, which has the ability to weight rare words through its co-occurrence frequencies. The AllenNLP website, which hosts the ELMo embeddings, does point out that they left out a comparison to GloVe because they didn't feel the two were equivalent comparisons.
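For reference, the CBOW vs. skip-gram switch is a one-flag change in gensim. This sketch is only illustrative; I don't know the exact pipeline the EHR vectors were actually trained with, and the toy corpus below is made up.

```python
# Hedged sketch: training word2vec with skip-gram instead of CBOW in gensim.
from gensim.models import Word2Vec

corpus = [
    ["patient", "denies", "chest", "pain", "or", "dyspnea"],
    ["history", "of", "atrial", "fibrillation", "and", "chf"],
]  # in practice: an iterator over millions of tokenized clinical notes

# sg=0 is CBOW (what the EHR vectors reportedly used); sg=1 is skip-gram,
# which tends to represent rare terms better. Note: in gensim < 4.0 the
# vector_size parameter is called size.
model = Word2Vec(sentences=corpus, vector_size=200, window=5,
                 min_count=1, sg=1, workers=4)
print(model.wv["dyspnea"].shape)  # (200,)
```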
All that being said, F1 is a "safe place" for a data scientist, but it should be thought through for your particular NER task. Boundary errors, as it turns out, are one of the biggest sources of error in biological applications. Optimizing for exact-match F1 might cause us to discard a prediction on "Left flank" because only "flank" was tagged as a location, when tagging "flank" as a location is still a partial and significant success. Labeling error is likely the most significant concern.
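A quick, made-up illustration of how differently exact-span scoring and a relaxed token-overlap view treat the same boundary error:

```python
# Spans are (start, end) token offsets for one hypothetical sentence where
# the gold entity is "Left flank" but only "flank" was predicted.
gold = {(0, 2)}  # "Left flank" as one location entity
pred = {(1, 2)}  # only "flank" was tagged

exact_hits = gold & pred
print("exact-match credit:", len(exact_hits))  # 0 -> counted as a total miss

gold_tokens = {i for s, e in gold for i in range(s, e)}
pred_tokens = {i for s, e in pred for i in range(s, e)}
print("token-overlap credit:",
      len(gold_tokens & pred_tokens) / len(gold_tokens))  # 0.5
```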
Another consideration is the potential sparsity of entities in the corpus you are training on. Most openly available biomedical text datasets are based on research articles, or even just their abstracts, which are much more densely packed with objects that would be identified as biomedical concepts. The tone of these datasets is also much more academic and may not capture the indications for certain diseases that may be important.
Edit: As I was writing this post, I came across Facebook's meta-embeddings, a mechanism for determining which embeddings are most effective for your prediction task. Paper here. Perhaps one of the most interesting uses of this ensemble-type approach is being able to look at the variation of specialized words between embeddings. The authors argue that the modeler should not be choosing their embeddings in the first place, but should instead leave it to the objective rigor of DME (Dynamic Meta-Embeddings).
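For the curious, here is a simplified PyTorch sketch of the DME idea: project each embedding type into a common space and let a learned attention decide, per token, how much weight each embedding set gets. The projection size, names, and dimensions here are mine, not the paper's exact configuration.

```python
# Simplified Dynamic Meta-Embeddings sketch (not the paper's exact setup).
import torch
import torch.nn as nn


class DynamicMetaEmbedding(nn.Module):
    def __init__(self, input_dims, proj_dim=256):
        super().__init__()
        # One projection per embedding set, mapping to a shared size.
        self.projections = nn.ModuleList(
            [nn.Linear(d, proj_dim) for d in input_dims])
        self.scorer = nn.Linear(proj_dim, 1)  # attention score per set

    def forward(self, embedding_list):
        # embedding_list: one (batch, seq_len, dim_i) tensor per embedding set
        projected = torch.stack(
            [proj(e) for proj, e in zip(self.projections, embedding_list)],
            dim=2)                                  # (batch, seq, n_sets, proj_dim)
        weights = torch.softmax(self.scorer(projected), dim=2)  # (batch, seq, n_sets, 1)
        return (weights * projected).sum(dim=2)     # (batch, seq, proj_dim)


# e.g. combine 300d GloVe-like and 200d ELMo-like token vectors
dme = DynamicMetaEmbedding([300, 200])
glove_like = torch.randn(2, 10, 300)
elmo_like = torch.randn(2, 10, 200)
print(dme([glove_like, elmo_like]).shape)  # torch.Size([2, 10, 256])
```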
Another thing to keep in mind is that many of these general-purpose, broadly labeled embeddings do not clearly document their data cleansing, their tokenization, or what in particular they were optimizing for.
Further Reading:
If you want to read more about the best and latest in word embeddings, I found this one helpful and relevant.
If you wanted to get a deeper dive on where things were headed last year: http://ruder.io/word-embeddings-2017/
The paper that goes with the NER dataset used in this post: Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature
Thanks for reading! Thoughts? Comments?