How to Use scispaCy for Biomedical Named Entity Recognition, Abbreviation Resolution and UMLS Linking
I find scispaCy interesting and decided to share part of my exploration of the library. I hope this makes working with scispaCy easier for someone. The Google Colaboratory notebook for this article can be found here.
As of the time of writing, scispaCy has two entity mention models (small and medium) and four NER models optimized for different kinds of entities. Check here to view the models and the entities they work for.
We will explore three models here: one entity mention model,
en_core_sci_md, and two NER models,
en_ner_bc5cdr_md (for disease and chemical entities) and
en_ner_bionlp13cg_md (for cancer, organ, tissue, organism, cell, amino_acid, gene_or_gene_products, anatomical_entities etc.)
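As a minimal sketch of how these three models might be loaded, the snippet below keeps their names in one place and loads a pipeline on demand. The names SCISPACY_MODELS and load_model are hypothetical helpers introduced here for illustration, and the sketch assumes spaCy, scispaCy and the model packages are already installed in the environment.

```python
# Hypothetical registry of the three models explored in this article.
SCISPACY_MODELS = {
    "en_core_sci_md": "entity mention model",
    "en_ner_bc5cdr_md": "NER model for disease and chemical entities",
    "en_ner_bionlp13cg_md": "NER model for cancer, organ, tissue, organism, cell and related entities",
}


def load_model(name):
    """Load one of the models above as a spaCy pipeline.

    Assumes the model package has already been installed; spaCy is imported
    lazily so the registry can be inspected without it.
    """
    if name not in SCISPACY_MODELS:
        raise ValueError(f"Unknown model {name!r}; expected one of {sorted(SCISPACY_MODELS)}")
    import spacy

    return spacy.load(name)
```

Calling load_model("en_core_sci_md") would then return a pipeline that can be passed to the functions in the rest of the article.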
The test document used in this article can be found here.
display_entities() accepts a model and a document and returns a displaCy visualization together with the entities found in the text. The function will be used with three different scispaCy models on the test document. It can be adjusted as needed, for example to render the dependency parse instead of the entities.
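The original code is not reproduced here, so the following is only a sketch of what display_entities() could look like. It assumes the model argument is an already loaded pipeline (for example the result of spacy.load("en_core_sci_md")); the helper extract_entities and the style parameter are additions made for illustration.

```python
def extract_entities(doc):
    """Return (text, label) pairs for every entity span in a processed doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def display_entities(model, document, style="ent"):
    """Run `model` on `document`, render it with displaCy, and return the entities.

    `model` is assumed to be a loaded spaCy/scispaCy pipeline and `document`
    a plain string. `style` is passed through to displaCy.
    """
    # Imported lazily so extract_entities() can be used without spaCy installed.
    from spacy import displacy

    doc = model(document)
    # jupyter=True asks displaCy to draw inline, as in a Colab notebook.
    displacy.render(doc, style=style, jupyter=True)
    return extract_entities(doc)
```

With style="ent" displaCy highlights the entity spans; passing style="dep" renders the dependency parse instead.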
show_medical_abbreviation() accepts a model and a document and returns the abbreviations found in the text together with their resolved long forms. The function can be adjusted as needed. I deduplicate the results so that only unique values are returned.
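A possible shape for show_medical_abbreviation() is sketched below. It assumes a spaCy v3-style pipeline, where scispaCy's abbreviation detector is added by its registered name "abbreviation_detector"; older scispaCy releases added the component differently. The helper unique_abbreviations is a hypothetical name used here to keep the deduplication step separate.

```python
def unique_abbreviations(doc):
    """Collect unique (short form, long form) pairs, keeping first-seen order."""
    seen = set()
    pairs = []
    for abrv in doc._.abbreviations:
        pair = (abrv.text, abrv._.long_form.text)
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


def show_medical_abbreviation(model, document):
    """Return the unique abbreviations in `document` with their long forms.

    `model` is assumed to be a loaded scispaCy pipeline.
    """
    # Importing the class registers the "abbreviation_detector" factory with spaCy.
    from scispacy.abbreviation import AbbreviationDetector  # noqa: F401

    # Add the detector only once, so the function can be called repeatedly.
    if "abbreviation_detector" not in model.pipe_names:
        model.add_pipe("abbreviation_detector")
    doc = model(document)
    return unique_abbreviations(doc)
```

Keeping the pairs in a list while tracking seen values in a set preserves the order in which abbreviations appear in the text, which a plain set would not.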
unified_medical_language_entity_linker() accepts a model and a document, returns information on the named entities, and links each entity to the Unified Medical Language System (UMLS) to return its Concept Unique Identifier (CUI), definition, aliases and match score. As of the time of writing, this scispaCy feature is in alpha: the entity linker takes a while to load and there are still user warnings, but it's totally worth it and interesting to try out.
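The sketch below shows one way unified_medical_language_entity_linker() might be written. It assumes a recent scispaCy release on spaCy v3, where the linker is added as "scispacy_linker", candidate links are exposed as entity._.kb_ents, and concepts are looked up through linker.kb.cui_to_entity; earlier releases used different attribute names. The helper describe_links and the top_k parameter are hypothetical additions for illustration.

```python
def describe_links(entity, linker, top_k=1):
    """Return one dict per linked UMLS concept for a single entity span."""
    rows = []
    # Each candidate is a (CUI, score) pair, best match first.
    for cui, score in entity._.kb_ents[:top_k]:
        concept = linker.kb.cui_to_entity[cui]
        rows.append({
            "entity": entity.text,
            "cui": concept.concept_id,
            "name": concept.canonical_name,
            "definition": concept.definition,
            "aliases": concept.aliases,
            "score": score,
        })
    return rows


def unified_medical_language_entity_linker(model, document, top_k=1):
    """Link every entity in `document` to UMLS and return the concept details.

    `model` is assumed to be a loaded scispaCy pipeline. The first call is
    slow because the linker has to load its knowledge base.
    """
    # Importing the class registers the "scispacy_linker" factory with spaCy.
    from scispacy.linking import EntityLinker  # noqa: F401

    if "scispacy_linker" not in model.pipe_names:
        model.add_pipe("scispacy_linker", config={"linker_name": "umls"})
    linker = model.get_pipe("scispacy_linker")
    doc = model(document)
    return [row for ent in doc.ents for row in describe_links(ent, linker, top_k)]
```

Raising top_k returns more candidate concepts per entity, which can help when the best-scoring match is not the intended one.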
If you worked through this, I hope you had a great time too and that I did a good job taking you through scispaCy.