How to Use scispaCy for Biomedical Named Entity Recognition, Abbreviation Resolution and UMLS Linking

scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text. https://allenai.github.io/scispacy/

I found scispaCy interesting and decided to share part of my exploration of the library. I hope this makes working with scispaCy easier for someone. The Google Colaboratory notebook for this article can be found here.

At the time of writing, scispaCy has two entity mention models (small and medium) and four NER models optimized for different kinds of entities. Check here to see the models and the entities they cover.

We will explore three models here: one entity mention model, en_core_sci_md, and two NER models, en_ner_bc5cdr_md (for disease and chemical entities) and en_ner_bionlp13cg_md (for cancer, organ, tissue, organism, cell, amino_acid, gene_or_gene_products, anatomical_entities, etc.).

Install library and models

The test document used in this article can be found here.

Snippet of sample document

The Python function display_entities() accepts a model and a document, and returns a displaCy image along with the entities found and their labels. It will be used with three different scispaCy models on the test document. The function can be adjusted as needed, e.g. to view the dependency parse instead of the entities, use displacy.render(doc, jupyter=True, style='dep').
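The original code was embedded as an image, so here is a minimal sketch of what such a function might look like rather than the exact code from the notebook. It assumes the model is a spaCy pipeline already loaded with spacy.load(), and it returns the entities as a list of (text, label) tuples. The render flag and the extract_entities() helper are illustrative additions, not part of the original.

```python
from typing import List, Tuple


def extract_entities(doc) -> List[Tuple[str, str]]:
    """Collect (entity text, entity label) pairs from a processed spaCy Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def display_entities(model, document: str, render: bool = True) -> List[Tuple[str, str]]:
    """Run `model` on `document`, optionally draw the entities, and return them."""
    doc = model(document)
    if render:
        # Imported lazily so the function can also be called with render=False
        # outside a notebook, without touching the visualizer.
        from spacy import displacy

        # style="ent" highlights entities; style="dep" draws the dependency parse.
        displacy.render(doc, style="ent", jupyter=True)
    return extract_entities(doc)
```

In a notebook with the models installed, this could be called as display_entities(spacy.load("en_core_sci_md"), text), where text holds the test document, and then repeated with the two NER models.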

A Python function that displays entities and labels
View of entity mentions image
Word Entity Mention and Label
View of Bionlp13cg Named Entities
Bionlp13cg Named Entities and Label
View of Bc5cdr Named Entities
Bc5cdr Named Entities and Label

The function show_medical_abbreviation() accepts a model and a document, and returns the abbreviations found together with their resolutions (long forms). The function can be adjusted as needed. I convert the results to a set so that only unique values are returned.
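Since the original code is only available as an image, the following is a rough sketch of how such a function could be written, not the notebook's exact code. It assumes a recent scispaCy release running on spaCy v3, where the detector is added by its registered name "abbreviation_detector"; older releases used a different add_pipe call. The collect_abbreviations() helper is a hypothetical name introduced here, and it removes duplicates with a dict instead of a set so that the first-seen order is kept.

```python
from typing import List, Tuple


def collect_abbreviations(doc) -> List[Tuple[str, str]]:
    """Return unique (short form, long form) pairs found by the abbreviation detector."""
    seen = {}
    for abrv in doc._.abbreviations:
        # Each abbreviation is a span; its definition is stored on `_.long_form`.
        seen[(abrv.text, abrv._.long_form.text)] = None  # dict keeps first-seen order
    return list(seen)


def show_medical_abbreviation(model, document: str) -> List[Tuple[str, str]]:
    """Add scispaCy's abbreviation detector to `model` if needed, then resolve abbreviations."""
    # Importing the module registers the "abbreviation_detector" factory with spaCy.
    from scispacy.abbreviation import AbbreviationDetector  # noqa: F401

    if "abbreviation_detector" not in model.pipe_names:
        model.add_pipe("abbreviation_detector")
    return collect_abbreviations(model(document))
```

The detector only resolves abbreviations that are defined somewhere in the document itself, typically in the form "long form (ABBR)", so an abbreviation that is never spelled out will not appear in the result.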

A Python function that resolves medical abbreviations
Detected Medical abbreviations and their resolution

The function unified_medical_language_entity_linker() accepts a model and a document, and links each named entity to the Unified Medical Language System (UMLS). For every entity it returns the Concept Unique Identifier (CUI), definition, aliases and match score. At the time of writing, this is an alpha feature in scispaCy: the entity linker takes a while to load and still emits user warnings, but it’s totally worth it and interesting to try out.
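The original function was shown only as an image, so what follows is a hedged sketch of one way to write it, not the author's exact code. It assumes a recent scispaCy release on spaCy v3, where the linker is added by its registered name "scispacy_linker" with linker_name set to "umls"; the API was different in the alpha version described in this article. The link_entities() helper, the top_k parameter and the dictionary keys are illustrative names introduced here.

```python
from typing import Dict, List


def link_entities(doc, cui_to_entity, top_k: int = 1) -> List[Dict]:
    """For each entity in `doc`, look up its best `top_k` UMLS candidates."""
    rows = []
    for ent in doc.ents:
        # `kb_ents` holds (CUI, score) pairs, best match first.
        for cui, score in ent._.kb_ents[:top_k]:
            concept = cui_to_entity[cui]
            rows.append({
                "entity": ent.text,
                "cui": concept.concept_id,
                "name": concept.canonical_name,
                "definition": concept.definition,
                "aliases": concept.aliases,
                "score": score,
            })
    return rows


def unified_medical_language_entity_linker(model, document: str, top_k: int = 1) -> List[Dict]:
    """Add the UMLS linker to `model` if needed, then link the entities in `document`."""
    # Importing the module registers the "scispacy_linker" factory with spaCy.
    from scispacy.linking import EntityLinker  # noqa: F401

    if "scispacy_linker" not in model.pipe_names:
        # Adding the pipe downloads and loads the UMLS index, which is slow.
        model.add_pipe("scispacy_linker", config={"linker_name": "umls"})
    linker = model.get_pipe("scispacy_linker")
    return link_entities(model(document), linker.kb.cui_to_entity, top_k)
```

Entities for which the linker finds no candidate are simply skipped, so the result can contain fewer rows than there are entities in the document.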

A Python function that links named entities to the UMLS database
Bc5cdr Named Entities and UMLS links
Bionlp13cg Named Entities and UMLS links

If you worked through this, I hope you had a great time too and that I did a good job of taking you through scispaCy.

Written by

Data Scientist | Pharmacist
