Advanced Phrase Matching in Clinical Texts in Healthcare NLP

Leveraging TextMatcherInternal for Precise Phrase Matching in Healthcare Texts

Gökhan TÜRER
John Snow Labs
5 min read · Jul 26, 2024


Photo by Bermix Studio on Unsplash

TL;DR: The TextMatcherInternal annotator in Healthcare NLP is a rule-based tool for exact phrase matching in healthcare text analysis. We'll cover its key parameters and demonstrate its implementation through a practical example. Additionally, we'll explain how to save and load the model for future use. This annotator can support tasks such as clinical document analysis and patient data management.

In healthcare, accurate text analysis can improve data-driven decisions and patient outcomes. One of the tools available for this in Spark NLP is the TextMatcherInternal annotator, designed to match exact phrases in documents. In addition to the variety of Named Entity Recognition (NER) models available in our Models Hub, such as the Healthcare NLP MedicalNerModel, which uses a Bidirectional LSTM-CNN architecture, and BertForTokenClassification, our library also features rule-based annotators including ContextualParser, RegexMatcher, EntityRulerInternal, and TextMatcherInternal. In this blog post, you'll learn how to use TextMatcherInternal effectively in your healthcare NLP projects.

What is TextMatcherInternal?

The TextMatcherInternal annotator matches exact phrases provided in an external file against a document. It offers several configurable parameters, allowing you to tailor its behavior to specific use cases.

Parameters:

  • setEntities: Sets the external resource for the entities. It takes the following arguments:
      path (str): Path to the external resource.
      read_as (str, optional): How to read the resource. Default: ReadAs.TEXT.
      options (dict, optional): Options for reading the resource. Default: {"format": "text"}.
  • setCaseSensitive: Determines whether the match is case-sensitive (Default: True).
  • setMergeOverlapping: Decides whether to merge overlapping matched chunks (Default: False).
  • setEntityValue: Sets the value for the entity metadata field if not specified in the file.
  • setBuildFromTokens: Indicates whether the annotator should derive the CHUNK from TOKEN.
  • setDelimiter: Specifies the delimiter between phrases and entities.
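
To make the effect of case sensitivity and overlap merging more concrete, here is a minimal pure-Python sketch of exact phrase matching. It is not the library's implementation: the helper names find_phrases and merge_overlapping are hypothetical, and the sketch works on raw characters with exclusive end offsets, so the annotator's actual behavior (tokenization, offsets, metadata) may differ.

```python
import re

def find_phrases(text, phrases, case_sensitive=True):
    """Return (begin, end, matched_text) for every exact occurrence of each phrase.

    `end` is exclusive here (Python slice convention); this is a choice of the sketch.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    matches = []
    for phrase in phrases:
        # Word boundaries keep "aspirin" from matching inside a longer word.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for m in re.finditer(pattern, text, flags):
            matches.append((m.start(), m.end(), m.group()))
    return sorted(matches)

def merge_overlapping(text, matches):
    """Collapse matches whose character spans overlap into one longer span."""
    merged = []
    for begin, end, _ in sorted(matches):
        if merged and begin < merged[-1][1]:
            prev_begin, prev_end = merged[-1]
            merged[-1] = (prev_begin, max(prev_end, end))
        else:
            merged.append((begin, end))
    return [(b, e, text[b:e]) for b, e in merged]

text = "She takes Aspirin 100mg daily."
phrases = ["Aspirin 100mg", "aspirin"]

strict = find_phrases(text, phrases, case_sensitive=True)
relaxed = find_phrases(text, phrases, case_sensitive=False)
merged = merge_overlapping(text, relaxed)

print(strict)
print(relaxed)
print(merged)
```

In this sketch, the case-sensitive run finds only the "Aspirin 100mg" entry, because the lowercase "aspirin" entry does not match the capitalized word. The case-insensitive run finds both entries over the same stretch of text, and merging collapses them into a single "Aspirin 100mg" span.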

Setting Up Spark NLP Healthcare Library for Enhanced Text Matching

First, you need to set up the Spark NLP Healthcare library. Follow the detailed instructions provided in the official documentation.

Additionally, refer to the Healthcare NLP GitHub repository for sample notebooks demonstrating setup on Google Colab under the “Colab Setup” section.

# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

# Upload your John Snow Labs license file (Google Colab only)
from google.colab import files
print('Please upload your John Snow Labs license using the button below')
license_keys = files.upload()

from johnsnowlabs import nlp, medical
import pandas as pd

# Install the licensed libraries and start the Spark session used as `spark` below
nlp.install()
spark = nlp.start()

Using TextMatcherInternal for Drug Name Recognition in Clinical Texts

Let’s dive into a practical example using the TextMatcherInternal annotator.

Example: Matching Drug Names in Clinical Text

In this example, we’ll match specific drug names mentioned in clinical text using the TextMatcherInternal annotator.

First, create a source file containing all the chunks or tokens you aim to capture, one per line, each followed by its label. In the example below, we use # as the delimiter to separate each phrase from its entity label, so we set the parameter with setDelimiter("#"). (You can include multiple entity labels within the same file.)

matcher_drug = """
Aspirin 100mg#Drug
aspirin#Drug
paracetamol#Drug
amoxicillin#Drug
ibuprofen#Drug
lansoprazole#Drug
"""

with open('matcher_drug.csv', 'w') as f:
    f.write(matcher_drug)
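
As a quick sanity check before handing such a file to the annotator, the sketch below parses delimiter-separated content in plain Python. The helper name parse_entity_lines is hypothetical, and skipping blank lines and rejecting malformed lines are assumptions of this sketch, not a description of how TextMatcherInternal reads the file.

```python
def parse_entity_lines(content, delimiter="#"):
    """Split each non-empty line into a (phrase, label) pair."""
    pairs = []
    for line_number, line in enumerate(content.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption of this sketch: blank lines carry no entry
        # Split on the last delimiter so the label is whatever follows it.
        phrase, sep, label = line.rpartition(delimiter)
        if not sep or not phrase.strip() or not label.strip():
            raise ValueError(
                f"line {line_number}: expected 'phrase{delimiter}label', got {line!r}"
            )
        pairs.append((phrase.strip(), label.strip()))
    return pairs

sample = """
Aspirin 100mg#Drug
aspirin#Drug
paracetamol#Drug
"""

pairs = parse_entity_lines(sample)
labels = sorted({label for _, label in pairs})

print(pairs)
print(labels)
```

A check like this catches lines with a missing delimiter or an empty label before they reach the pipeline.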

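Next, we define the Spark NLP pipeline with a DocumentAssembler, a Tokenizer, and the TextMatcherInternal annotator, fit it on a sample clinical text, and transform that text into result. The full pipeline code is listed in the TextMatcherInternalModel section below. The matches are then displayed as follows:

import pyspark.sql.functions as F

# Show results: zip the parallel annotation arrays and emit one row per matched chunk
result.select(F.explode(F.arrays_zip(
        result.matched_text.result,
        result.matched_text.begin,
        result.matched_text.end,
        result.matched_text.metadata)).alias("cols"))\
    .select(F.expr("cols['0']").alias("chunk"),
            F.expr("cols['1']").alias("begin"),
            F.expr("cols['2']").alias("end"),
            F.expr("cols['3']['entity']").alias("label"))\
    .show(truncate=70)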

Output:

This code creates a CSV file containing drug names and their labels, then sets up a pipeline with a DocumentAssembler, Tokenizer, and TextMatcherInternal. The pipeline processes a sample clinical text, extracting and displaying the matched drug names along with their positions and labels.
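
The display step relies on zipping parallel arrays (matched text, begin, end, metadata) and emitting one row per match. The plain-Python sketch below mirrors that idea on a toy sentence. The sentence and variable names are illustrative only, not pipeline output, and the inclusive end offset is an assumption of the sketch.

```python
# Toy input standing in for the annotation columns; offsets are computed, not copied
# from any real pipeline run.
text = "aspirin for pain, paracetamol for fever"
chunks = ["aspirin", "paracetamol"]

begins = [text.index(chunk) for chunk in chunks]
# Assumption: end offsets are inclusive, so the last character is at begin + len - 1.
ends = [begin + len(chunk) - 1 for begin, chunk in zip(begins, chunks)]
metadata = [{"entity": "Drug"}, {"entity": "Drug"}]

# Equivalent in spirit to arrays_zip + explode: one dict (row) per matched chunk.
rows = [
    {"chunk": chunk, "begin": begin, "end": end, "label": meta["entity"]}
    for chunk, begin, end, meta in zip(chunks, begins, ends, metadata)
]

for row in rows:
    print(row)
```

Each row then carries the same four fields that the Spark query selects: chunk, begin, end, and label.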

TextMatcherInternalModel

After fitting the TextMatcherInternal annotator, you can save the resulting model for future use. Let's rebuild the drug-name example from above and save it.

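from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from sparknlp_jsl.annotator import TextMatcherInternal

# Convert raw text into documents, then split documents into tokens
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Match the phrases listed in matcher_drug.csv, ignoring case
entityExtractor = TextMatcherInternal() \
    .setInputCols(["document", "token"]) \
    .setEntities("matcher_drug.csv") \
    .setOutputCol("matched_text") \
    .setCaseSensitive(False) \
    .setDelimiter("#")

matcher_pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    entityExtractor])

data = spark.createDataFrame([["John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsillitis, ibuprofen for his inflammation, and lansoprazole for his GORD."]]).toDF("text")

matcher_model = matcher_pipeline.fit(data)
result = matcher_model.transform(data)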

Saving the Model

matcher_model.stages[-1].write().overwrite().save("matcher_internal_model")

Loading and Using the Saved Model

To load and use the saved model, you can use the TextMatcherInternalModel via load.

from sparknlp_jsl.annotator import TextMatcherInternalModel

# Load from the same relative path used when saving (on Colab this resolves under /content)
loaded_matcher = TextMatcherInternalModel.load("matcher_internal_model") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("matched_text") \
    .setCaseSensitive(False) \
    .setDelimiter("#")

pipeline = Pipeline(stages=[documentAssembler,
                            tokenizer,
                            loaded_matcher])

pipeline_model = pipeline.fit(data)
result = pipeline_model.transform(data)

This code snippet demonstrates how to save the trained TextMatcherInternal model to disk and load it back for further use. The loaded model is then used in a new pipeline to process clinical text.

Enhancing Healthcare NLP Projects with TextMatcherInternal

The TextMatcherInternal annotator in Spark NLP matches predefined phrases within a document precisely and efficiently. Its parameters, such as case sensitivity and the delimiter setting, let you adapt the matching to your data and improve the accuracy of your NLP projects. This flexibility makes it suitable for various healthcare applications, from identifying drug names to extracting specific medical conditions.

For more comprehensive examples, detailed documentation, and further insights into utilizing TextMatcherInternal, visit the Spark NLP Workshop.
