How to Use scispaCy for Biomedical Named Entity Recognition, Abbreviation Resolution and UMLS Entity Linking

Wuraola Oyewusi
Aug 11 · 3 min read

scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text. https://allenai.github.io/scispacy/

I think scispaCy is interesting, so I decided to share part of my exploration of the library. I hope this makes working with scispaCy easier for someone. The Google Colaboratory Notebook for this article can be found here.

At the time of writing, scispaCy has two entity mention models (small and medium) and four NER models optimized for different kinds of entities. Check here to view the models and the entities they work for.

We will explore three models here: one entity mention model, en_core_sci_md, and two NER models, en_ner_bc5cdr_md (for disease and chemical entities) and en_ner_bionlp13cg_md (for cancer, organ, tissue, organism, cell, amino_acid, gene_or_gene_products, anatomical_entities, etc.).

Install library and models
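The original install cell was an image, so here is a hedged stand-in rather than the exact commands used. The library itself installs with `pip install scispacy`, and each model is installed with pip from the release URL listed on the scispaCy page linked above (the URLs are version-specific, so they are not reproduced here). The helper `missing_packages` below is a hypothetical name of my own; it only reports which of the packages used in this article are not yet installed.

```python
import importlib.util

# The three scispaCy models explored in this article.
MODELS = ["en_core_sci_md", "en_ner_bc5cdr_md", "en_ner_bionlp13cg_md"]


def missing_packages(names):
    """Return the names from `names` that cannot be found as installed packages."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Each scispaCy model is an ordinary Python package, so it can be checked like one.
missing = missing_packages(["spacy", "scispacy", *MODELS])
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("spaCy, scispaCy and all three models are installed.")
```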

The test document used in this article can be found here.

Snippet of sample document

The Python function display_entities() accepts a model and a document, and returns a displaCy image and the word entities. The function will be used on three different scispaCy models and the test document. It can be adjusted as needed, e.g. to view dependency parsing instead of entities, use displacy.render(doc,jupyter=True,style='dep').
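Since the original function was shown as an image, the following is a minimal sketch of what display_entities() might look like, not the exact code from the notebook. The `render` flag and the lazy import of displaCy are my own additions so the function can also be used outside a notebook.

```python
def display_entities(nlp, document, render=True):
    """Run `nlp` over `document` and return a list of (entity text, label) pairs.

    When `render` is True, the entities are also drawn inline with displaCy
    (this assumes a Jupyter or Colab notebook).
    """
    doc = nlp(document)
    if render:
        # Imported here so the function still works without rendering.
        from spacy import displacy
        displacy.render(doc, style="ent", jupyter=True)
    return [(ent.text, ent.label_) for ent in doc.ents]


# Example usage (requires the model to be installed):
# import spacy
# nlp = spacy.load("en_core_sci_md")
# for text, label in display_entities(nlp, document):
#     print(text, label)
```

The same call works for en_ner_bc5cdr_md and en_ner_bionlp13cg_md; only the name passed to spacy.load() changes.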

A Python function that displays entities and labels
View of entity mentions image
Word Entity Mention and Label
View of Bionlp13cg Named Entities
Bionlp13cg Named Entities and Label
View of Bc5cdr Named Entities
Bc5cdr Named Entities and Label

The function show_medical_abbreviation() accepts a model and a document, and returns abbreviated words and their resolutions. The function can be adjusted as needed. I deduplicated the list so that only unique values are returned.
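As a rough sketch of the idea (again, not the exact notebook code), the function below reads the abbreviations that scispaCy's AbbreviationDetector attaches to the document. It assumes the detector has already been added to the pipeline; the commented lines show how that is done in newer and older spaCy versions.

```python
def show_medical_abbreviation(nlp, document):
    """Return unique (abbreviation, long form) pairs found in `document`.

    Assumes `nlp` already contains scispaCy's AbbreviationDetector.
    """
    doc = nlp(document)
    unique = {}
    for abrv in doc._.abbreviations:
        # Keep the first resolution seen for each abbreviation.
        unique.setdefault(abrv.text, str(abrv._.long_form))
    return list(unique.items())


# Example usage (requires scispaCy and a model to be installed):
# import spacy
# from scispacy.abbreviation import AbbreviationDetector
# nlp = spacy.load("en_core_sci_md")
# nlp.add_pipe("abbreviation_detector")      # spaCy 3.x style
# # nlp.add_pipe(AbbreviationDetector(nlp))  # older spaCy 2.x style
# for short, long_form in show_medical_abbreviation(nlp, document):
#     print(short, "->", long_form)
```

A dict is used instead of a set so the abbreviations come back in the order they first appear in the text.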

A Python function that resolves medical abbreviations
Detected Medical abbreviations and their resolution

The function unified_medical_language_entity_linker() accepts a model and a document, and links each named entity to the Unified Medical Language System (UMLS). For every entity it returns the Concept Unique Identifier (CUI), definition, aliases and the match score of the link. At the time of writing, this is an alpha feature in scispaCy: the entity linker takes a while to load and there are still user warnings, but it is well worth trying out.
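The sketch below shows one way such a function could be written. It is an approximation under stated assumptions, not the notebook's code: it takes the linker as a third argument (the original accepts only a model and a document), and it uses the attribute names from recent scispaCy releases (`ent._.kb_ents`, `linker.kb`). Older releases, including the alpha version described above, exposed these as `ent._.umls_ents` and `linker.umls`.

```python
def unified_medical_language_entity_linker(nlp, document, linker):
    """Return one dict per (entity, UMLS candidate) pair found in `document`.

    Assumes `linker` is the scispaCy entity linker already added to `nlp`.
    """
    doc = nlp(document)
    rows = []
    for ent in doc.ents:
        # Each candidate is a (CUI, score) pair, best match first.
        for cui, score in ent._.kb_ents:
            concept = linker.kb.cui_to_entity[cui]
            rows.append({
                "entity": ent.text,
                "cui": cui,
                "score": score,
                "canonical_name": concept.canonical_name,
                "definition": concept.definition,
                "aliases": concept.aliases,
            })
    return rows


# Example usage (requires scispaCy and a model; the linker downloads a
# large index the first time it is loaded):
# import spacy
# from scispacy.linking import EntityLinker
# nlp = spacy.load("en_ner_bc5cdr_md")
# nlp.add_pipe("scispacy_linker",
#              config={"resolve_abbreviations": True, "linker_name": "umls"})
# linker = nlp.get_pipe("scispacy_linker")
# for row in unified_medical_language_entity_linker(nlp, document, linker):
#     print(row["entity"], row["cui"], row["score"])
```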

A Python function that links named entities to the UMLS database
Bc5cdr Named Entities and UMLS links
Bionlp13cg Named Entities and UMLS links

If you worked through this, I hope you had a great time too and that I did a good job taking you through scispaCy.
