The Similarity of Medical Texts Using the UMLS Ontology
The goal of this article is to experimentally calculate a similarity score of medical texts based on the distance of extracted terms in the UMLS ontology.
In contrast to similarity measures based on word embeddings trained without supervision, the idea is to use the expert-engineered UMLS. The approach only takes into account the similarity of words that are valid concepts in the ontology, ignoring syntax and position in the text.
Validation of the results is done via the hand-annotated BIOSSES dataset.
Unified Medical Language System (UMLS)
The UMLS integrates and distributes key terminology, classification and coding standards, and associated resources to promote the creation of more effective and interoperable biomedical information systems and services, including electronic health records. It is developed by the US National Library of Medicine. The Metathesaurus maps between source vocabularies such as ICD-10, MeSH, and SNOMED CT. It contains over 4 million names for nearly 1 million concepts and, more importantly, 12 million relations among the concepts. Each relation states whether one concept is broader than, narrower than, a parent of, a child of, or a sibling of another concept in the hierarchy.
While there are no fees for the UMLS Metathesaurus, SNOMED CT might require you to get a license. There are no fees if you are from the US or another IHTSDO member country. Otherwise, your fee is calculated according to the World Bank Country Income Categories. If you are a developer, you can also apply for a free license. More details can be found on the National Institutes of Health SNOMED page.
UMLS::Interface and UMLS::Similarity
Bridget T. McInnes and Ted Pedersen wrote a paper on different semantic similarity and relatedness measures (Path, LCH, WUP, ...) between pairs of biomedical concepts within the UMLS ontology. They also published two amazing open-source Perl modules: UMLS::Interface provides an interface to a UMLS installation in a database, while UMLS::Similarity implements the measures.
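To give a feeling for what these three measures compute, here is a minimal sketch on a tiny, made-up is-a hierarchy. The concept names and the `PARENT` table are hypothetical and not taken from UMLS, and the formulas follow the textbook definitions (path length counted in nodes, root depth of 1); the exact conventions inside UMLS::Similarity may differ.

```python
import math

# Hypothetical toy is-a hierarchy (child -> parent); not real UMLS content.
PARENT = {
    "disease": None,
    "infection": "disease",
    "bacterial infection": "infection",
    "viral infection": "infection",
    "neoplasm": "disease",
}


def ancestors(node):
    """Return the chain from node up to the root, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(node))


def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    seen = set(ancestors(a))
    return next(n for n in ancestors(b) if n in seen)


def path_length(a, b):
    """Number of nodes on the shortest path between a and b."""
    return depth(a) + depth(b) - 2 * depth(lcs(a, b)) + 1


def path_sim(a, b):
    """Path: inverse of the shortest path length."""
    return 1.0 / path_length(a, b)


def lch_sim(a, b):
    """Leacock-Chodorow: path length scaled by twice the taxonomy depth."""
    max_depth = max(depth(n) for n in PARENT)
    return -math.log(path_length(a, b) / (2.0 * max_depth))


def wup_sim(a, b):
    """Wu-Palmer: depth of the common subsumer relative to both depths."""
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))
```

In this toy tree, `path_sim("bacterial infection", "viral infection")` is 1/3 because the shortest path runs through three nodes. Note that LCH is not bounded by 1, which is one reason the measures have to be rescaled before they can be compared.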
I previously wrote a custom spaCy pipeline to extract UMLS terms from a text (available here). For this article, however, a Python package called quickumls will be used.
The BIOSSES dataset
BIOSSES contains 100 sentence pairs that were selected from the ‘TAC2 Biomedical Summarization Track Training Data Set’. The pairs were annotated by human experts with a similarity score within the range [0,4].
Consider the following example which one expert rated with a perfect 4:
- Hydrolysis of β-lactam antibiotics by β-lactamases is the most common mechanism of resistance for this class of antibacterial agents in clinically important Gram-negative bacteria.
- In Gram-negative organisms, the most common β-lactam resistance mechanism involves β-lactamase mediated hydrolysis resulting in subsequent inactivation of the antibiotic.
It is available for download here.
Unfortunately, it is biased towards the highest category, and the sample size is relatively small.
Setup
Getting the infrastructure up and running requires the following steps:
- Get a SNOMED-CT license
- Download UMLS
- Set up UMLS using the MetamorphoSys tool to customize the installation and select which vocabularies you want to include or exclude. This will generate a database load script
- Set up QuickUMLS following the instructions
- Load UMLS into a MySQL database
- Install UMLS::Similarity via CPAN (do not forget to set up your my.cnf with your credentials beforehand)
Most of the process is documented best in the following README. A piece of general advice is to be generous with resources before starting the installation.
Extraction of terms and creation of a distances lookup table
In the following code snippet, we are going to download the dataset from GitHub (meta, texts) into Pandas DataFrames, extract the UMLS terms, generate a list of potential pairings, and export it to a CSV file for further processing.
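As a rough stand-in rather than the original code, the extraction and pairing steps could look like the sketch below. It assumes that the matcher behaves like a QuickUMLS instance, i.e. that `matcher.match(text)` returns a list of candidate lists whose entries are dicts with a `cui` key. The helper names `extract_cuis`, `cross_text_pairs`, and `write_pairs` are my own, and the sketch only pairs CUIs across the two texts of one sentence pair, which may be narrower than the pairing used for the numbers in this article.

```python
import csv
from itertools import product


def extract_cuis(matcher, text):
    """Collect the distinct CUIs that a QuickUMLS-style matcher finds in text.

    Assumes matcher.match(text) returns a list of candidate lists, where each
    candidate is a dict with a "cui" key.
    """
    return {cand["cui"] for group in matcher.match(text) for cand in group}


def cross_text_pairs(cuis_a, cuis_b):
    """Pair every CUI of text A with every CUI of text B.

    Each pair is sorted so that (x, y) and (y, x) collapse into one entry,
    which assumes the similarity measure is symmetric.
    """
    return {tuple(sorted(pair)) for pair in product(cuis_a, cuis_b)}


def write_pairs(pairs, handle):
    """Write the pairings as a two-column CSV with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["cui1", "cui2"])
    writer.writerows(sorted(pairs))
```

With a real installation, the matcher would be created with something like `QuickUMLS(path_to_index)`, where `path_to_index` is a placeholder for wherever you installed the QuickUMLS data, and the pairings of all sentence pairs would be merged into one set before being written with `write_pairs`.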
The following is an example of which terms have been extracted from a text pairing:
Text 1: ‘At the onset of mitosis, LATS2 is activated by phosphorylation and plays important roles in G2/M transition in cultured cells’
[‘cultured cells’, ‘phosphorylation’, ‘activated’, ‘mitosis’, ‘amitosis’, ‘transition’]
Text 2: ‘Lats2/Kpm is homologous to Lats1 and undergoes cell cycle-dependent phosphorylation’
[‘cell cycle’, ‘phosphorylation’, ‘homologous’]
The exported CSV now contains around 4 million pairings. Lookups require complex calculations and expensive database queries, so we precompute the similarity values using the Perl UMLS::Similarity module. For this purpose, I wrote the following Perl script, which takes a CSV file and generates a new one containing the pairings together with the different distance measures.
Now we can load the table with the results and optimize the index for fast access. As the different measures return values in different ranges, we scale them to the range [0,1].
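The scaling and indexing step can be sketched in plain Python as follows. The sketch assumes min-max scaling per measure and assumes that each result row is a dict with the keys `cui1`, `cui2`, and one key per measure, with `None` marking a pairing without a valid value. These names and the choice of min-max scaling are assumptions for illustration; the same idea carries over to a DataFrame with a two-column index.

```python
def min_max_scale(values):
    """Linearly map values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_lookup(rows, measures=("path", "lch")):
    """Build {(cui1, cui2): {measure: scaled value}} for fast access.

    Keys are sorted CUI pairs, so the lookup is independent of pair order.
    Pairings without a valid value for a measure get no entry for it.
    """
    lookup = {}
    for measure in measures:
        valid = [row for row in rows if row.get(measure) is not None]
        if not valid:
            continue
        scaled = min_max_scale([row[measure] for row in valid])
        for row, value in zip(valid, scaled):
            key = tuple(sorted((row["cui1"], row["cui2"])))
            lookup.setdefault(key, {})[measure] = value
    return lookup
```

A dictionary keyed by the sorted pair plays the role of the optimized index here: every later lookup is a single hash access instead of a scan over the table.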
Not all concepts carry information about their relationships, so we only get around 130 thousand pairings with a valid distance.
Aggregation of similarity scores
After aggregating all required information into an evaluation DataFrame, it is easy to calculate the similarity scores by averaging all distances available for the Cartesian product of the CUIs in both texts. In addition, the number of available distances is stored. The error is then defined as the absolute difference between the expert rating and the calculated score.
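A minimal sketch of this aggregation, using plain Python instead of a DataFrame: it assumes a lookup of the form `{(cui1, cui2): {measure: value}}` with sorted CUI pairs as keys and values already scaled to [0,1]. The function names are hypothetical. The text above does not say on which scale the error is measured, so rescaling the score to the 0 to 4 rating range in `absolute_error` is an assumption on my part.

```python
from itertools import product
from statistics import mean


def text_similarity(cuis_a, cuis_b, lookup, measure):
    """Average all available pair values between two CUI sets.

    Returns (score, number of pairings that had a value); the score is None
    if no pairing between the two texts has a value for the measure.
    """
    values = []
    for a, b in product(cuis_a, cuis_b):
        entry = lookup.get(tuple(sorted((a, b))))
        if entry is not None and measure in entry:
            values.append(entry[measure])
    if not values:
        return None, 0
    return mean(values), len(values)


def absolute_error(expert_rating, score, max_rating=4.0):
    """Absolute difference after rescaling the [0, 1] score to the rating range.

    The rescaling by max_rating is an assumption, not taken from the article.
    """
    return abs(expert_rating - score * max_rating)
```

Returning the count alongside the score makes it possible to check later whether sentence pairs with only a few available pairings are the ones with large errors.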
Evaluation of results
The LCH measure outperforms the path distance, with a mean error of 0.88 compared to approximately 1. The number of CUI pairings available for the calculation of the similarity does not seem to have a large impact on the quality of the result, so the extracted concepts themselves need further investigation.
The approach probably needs a lot more fine-tuning, such as filtering out concepts that add little value. Especially for cases that do not contain any relevant terms, a different metric is needed. Several experiments that combine UMLS, word embeddings (BioWordVec), and Word Mover’s Distance yield better results.
Related papers: