Unifying Entity Extraction: Combining NER and Regex with Healthcare NLP’s ChunkConverter

Bridging the Gap Between Rule-Based and Model-Based Annotations

Published in John Snow Labs, Jul 8, 2024

TL;DR: ChunkConverter unifies regex and NER entity extractions in Spark NLP pipelines by converting regex chunks to a standard chunk format with entity labels, enabling integrated downstream processing.

What is ChunkConverter?

ChunkConverter is an annotator in the Healthcare NLP library that converts chunks produced by a RegexMatcher annotator into chunks that carry an entity label in their metadata. Many NLP tasks need rule-based entity extraction alongside pre-trained NER models (Spark NLP's MedicalNerModel, which uses a bidirectional LSTM-CNN architecture, and BertForTokenClassification). ChunkConverter is useful when you want to merge the entities identified by those models with entities found by regex rules in the RegexMatcher annotator. Once the regex chunks are in the unified chunk format, all identified entities can be treated consistently in the downstream pipeline steps.

When to Use It

You’ll want to use ChunkConverter when you have a pipeline that is extracting entities both from NER models and from regex rules. The NER model chunks and regex chunks will be in different formats initially. ChunkConverter standardizes the regex chunks to the same chunk format as the NER extractions, allowing you to handle all entities uniformly after this step.
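The idea can be illustrated outside Spark with plain Python dictionaries. The sketch below is not the library's implementation: the record layout, the helper names `to_entity_chunk` and `merge_chunks`, and the sample NER chunk are hypothetical, and storing the regex label under `identifier` and the entity label under `entity` is an assumption made only for illustration.

```python
# Illustrative only: plain dicts stand in for Spark NLP annotations.
from typing import Dict, List

Chunk = Dict[str, object]


def to_entity_chunk(regex_chunk: Chunk) -> Chunk:
    """Return a copy whose label is stored under 'entity' instead of 'identifier'."""
    metadata = dict(regex_chunk["metadata"])
    metadata["entity"] = metadata.pop("identifier")
    return {**regex_chunk, "metadata": metadata}


def merge_chunks(*chunk_lists: List[Chunk]) -> List[Chunk]:
    """Combine several chunk lists and order the chunks by start offset."""
    merged = [chunk for chunks in chunk_lists for chunk in chunks]
    return sorted(merged, key=lambda chunk: chunk["begin"])


regex_chunks = [
    {"begin": 0, "end": 9, "result": "PROCEDURE:",
     "metadata": {"identifier": "SECTION_HEADER"}},
]
ner_chunks = [  # hypothetical NER output, not produced by a real model here
    {"begin": 12, "end": 28, "result": "Excisional biopsy",
     "metadata": {"entity": "PROCEDURE"}},
]

unified = merge_chunks([to_entity_chunk(c) for c in regex_chunks], ner_chunks)
labels = [(c["result"], c["metadata"]["entity"]) for c in unified]
print(labels)
```

After the conversion step, every record exposes its label under the same key, so downstream code no longer needs to know which extractor produced a chunk.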

Usage

Let’s start with a quick Spark NLP introduction before moving on to the ChunkConverter’s usage details.

Spark NLP & LLM

The Healthcare Library is a powerful component of John Snow Labs’ Spark NLP platform, designed to facilitate NLP tasks within the healthcare domain. This library provides over 2,200 pre-trained models and pipelines tailored for medical data, enabling accurate information extraction, NER for clinical and medical concepts, and text analysis capabilities. Regularly updated and built with cutting-edge algorithms, the Healthcare library aims to streamline information processing and empower healthcare professionals with deeper insights from unstructured medical data sources, such as electronic health records, clinical notes, and biomedical literature.

John Snow Labs’ GitHub repository serves as a collaborative platform where users can access open-source resources, including code samples, tutorials, and projects, to further enhance their understanding and utilization of Spark NLP and related tools.

John Snow Labs also offers periodic certification training to help users gain expertise in utilizing the Healthcare Library and other components of their NLP platform.

John Snow Labs’ demo page provides a user-friendly interface for exploring the capabilities of the library, allowing users to interactively test and visualize various functionalities and models, facilitating a deeper understanding of how these tools can be applied to real-world scenarios in healthcare and other domains.

Setup

To set up the John Snow Labs Healthcare NLP library, follow the detailed instructions in the official documentation.

Additionally, you can refer to the Healthcare NLP GitHub repository, which includes sample notebooks. Each notebook contains an initial part that demonstrates how to set up Healthcare NLP on Google Colab, under a section named “Colab Setup”.

Below, you’ll find the essential code snippets from the “Colab Setup” section to help you get started quickly:

# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

from johnsnowlabs import nlp, medical
import pandas as pd
from pyspark.sql import functions as F  # needed later to explode the results

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

ChunkConverter

Input/Output Annotation Types

Input: DOCUMENT, CHUNK
Output: CHUNK

Let’s look at an example using ChunkConverter to convert regex chunks to entity chunks:

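# Define regex rules, one per line: <regex><delimiter><label>
# (?:...) is a non-capturing group. There is no \b after the colon because
# a word boundary cannot occur between ":" and the whitespace that follows it.
rules = r'''
\b[A-Z]+(?:\s+[A-Z]+)*:, SECTION_HEADER
'''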

# Write rules to file 
with open('regex_rules.txt', 'w') as f:
    f.write(rules)

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

regex_matcher = nlp.RegexMatcher()\
    .setInputCols("sentence")\
    .setOutputCol("regex")\
    .setExternalRules(path="./regex_rules.txt", delimiter=",")

chunkConverter = medical.ChunkConverter()\
    .setInputCols("regex")\
    .setOutputCol("chunk")

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        regex_matcher,
        chunkConverter,
    ])

text = """
POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy.
PROCEDURE:  Excisional biopsy of right cervical lymph node.
ANESTHESIA:  General endotracheal anesthesia.
Specimen:  Right cervical lymph node.
EBL: 10 cc.
COMPLICATIONS:  None.
FINDINGS: Enlarged level 2 lymph node was identified and removed and sent for pathologic examination.
FLUIDS:  Please see anesthesia report.
URINE OUTPUT:  None recorded during the case.
INDICATIONS FOR PROCEDURE:  This is a 43-year-old female with a several-year history of persistent cervical lymphadenopathy. She reports that it is painful to palpation on the right and has had multiple CT scans as well as an FNA which were all nondiagnostic. After risks and benefits of surgery were discussed with the patient, an informed consent was obtained. She was scheduled for an excisional biopsy of the right cervical lymph node.
PROCEDURE IN DETAIL:  The patient was taken to the operating room and placed in the supine position. She was anesthetized with general endotracheal anesthesia. The neck was then prepped and draped in the sterile fashion. Again, noted on palpation there was an enlarged level 2 cervical lymph node.A 3-cm horizontal incision was made over this lymph node. Dissection was carried down until the sternocleidomastoid muscle was identified. The enlarged lymph node that measured approximately 2 cm in diameter was identified and was removed and sent to Pathology for touch prep evaluation. The area was then explored for any other enlarged lymph nodes. None were identified, and hemostasis was achieved with electrocautery. A quarter-inch Penrose drain was placed in the wound.The wound was then irrigated and closed with 3-0 interrupted Vicryl sutures for a deep closure followed by a running 4-0 Prolene subcuticular suture. Mastisol and Steri-Strip were placed over the incision, and sterile bandage was applied. The patient tolerated this procedure well and was extubated without complications and transported to the recovery room in stable condition. She will return to the office tomorrow in followup to have the Penrose drain removed.
"""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

result_df = result.select(F.explode(F.arrays_zip(result.regex.result,
                                                 result.regex.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("regex"),
                          F.expr("cols['1']").alias("metadata"))

result_df.show(50, truncate=False)

#OUTPUT:
+--------------------------+----------------------------------------------------------+
|regex                     |metadata                                                  |
+--------------------------+----------------------------------------------------------+
|POSTOPERATIVE DIAGNOSIS:  |{identifier -> SECTION_HEADER, sentence -> 0, chunk -> 0} |
|PROCEDURE:                |{identifier -> SECTION_HEADER, sentence -> 1, chunk -> 0} |
|ANESTHESIA:               |{identifier -> SECTION_HEADER, sentence -> 2, chunk -> 0} |
|EBL:                      |{identifier -> SECTION_HEADER, sentence -> 4, chunk -> 0} |
|COMPLICATIONS:            |{identifier -> SECTION_HEADER, sentence -> 5, chunk -> 0} |
|FINDINGS:                 |{identifier -> SECTION_HEADER, sentence -> 6, chunk -> 0} |
|FLUIDS:                   |{identifier -> SECTION_HEADER, sentence -> 7, chunk -> 0} |
|URINE OUTPUT:             |{identifier -> SECTION_HEADER, sentence -> 8, chunk -> 0} |
|INDICATIONS FOR PROCEDURE:|{identifier -> SECTION_HEADER, sentence -> 9, chunk -> 0} |
|PROCEDURE IN DETAIL:      |{identifier -> SECTION_HEADER, sentence -> 13, chunk -> 0}|
+--------------------------+----------------------------------------------------------+

In the provided pipeline above:

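\b[A-Z]+(?:\s+[A-Z]+)*:, SECTION_HEADER: This rule matches one or more all-uppercase words followed by a colon, such as "DIAGNOSIS:" or "URINE OUTPUT:", and tags each match as SECTION_HEADER. Mixed-case headers such as "Specimen:" are not matched, which is why that line is absent from the output.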
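The section-header pattern can be tried outside Spark before it goes into a pipeline. The following is a minimal sketch that uses Python's `re` module as a stand-in for the Java regex engine Spark NLP runs on; the two should agree for an ASCII-only pattern like this one, but that is an assumption. The pattern is written with a non-capturing group `(?:...)` and ends at the colon. The helper name `first_header` is hypothetical.

```python
import re

# Assumption: non-capturing group written as (?:...), pattern ends at the colon.
SECTION_HEADER = re.compile(r"\b[A-Z]+(?:\s+[A-Z]+)*:")

lines = [
    "POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy.",
    "Specimen:  Right cervical lymph node.",
    "EBL: 10 cc.",
    "INDICATIONS FOR PROCEDURE:  This is a 43-year-old female.",
]


def first_header(line):
    """Return the first section header found in the line, or None."""
    match = SECTION_HEADER.search(line)
    return match.group(0) if match else None


headers = [first_header(line) for line in lines]
print(headers)

# With \b after the colon, a word character must follow the colon directly,
# so a header followed by a space is not matched.
strict = re.compile(r"\b[A-Z]+(?:\s+[A-Z]+)*:\b")
strict_hit = strict.search("EBL: 10 cc.")
print(strict_hit)
```

The mixed-case "Specimen:" line yields no match, and the stricter variant with a trailing `\b` finds nothing in "EBL: 10 cc." because a space follows the colon.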

The code writes this rule to a file named regex_rules.txt. The file is then loaded with setExternalRules in the RegexMatcher annotator, using a comma as the delimiter between the pattern and its label.
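Because a malformed rule only surfaces once the pipeline runs, it can help to sanity-check the rules text first. The sketch below is a hypothetical helper, `check_rules`, and not part of Spark NLP. It assumes the label follows the last delimiter on each line, which may not be how RegexMatcher itself parses the file, and it compiles patterns with Python's `re`, which differs from Java's regex engine in some details.

```python
import re
from typing import List, Tuple


def check_rules(text: str, delimiter: str = ",") -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split each non-empty line into (pattern, label) and collect problems."""
    rules, problems = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped in this sketch
        # Assumption: the label is whatever follows the last delimiter.
        pattern, found, label = line.rpartition(delimiter)
        pattern, label = pattern.strip(), label.strip()
        if not found or not pattern or not label:
            problems.append(f"line {number}: expected <regex>{delimiter}<label>")
            continue
        try:
            re.compile(pattern)
        except re.error as error:
            problems.append(f"line {number}: invalid regex ({error})")
            continue
        rules.append((pattern, label))
    return rules, problems


good = r'''
\b[A-Z]+(?:\s+[A-Z]+)*:, SECTION_HEADER
'''
bad = r'''
\b[A-Z]+(?\s+[A-Z]+)*:, SECTION_HEADER
'''

good_rules, good_problems = check_rules(good)
bad_rules, bad_problems = check_rules(bad)
print(good_rules, good_problems)
print(bad_problems)
```

The second rule is rejected because `(?` must be followed by a valid group specifier such as `:`.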

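A sample operative note was used to exercise the pipeline. The pipeline was fitted on the dataframe and then used to transform it to produce the results.

The results were then exploded into a dataframe with two columns, regex and metadata, with one row per match. The columns shown come from the RegexMatcher output column (regex); the converted chunks are written to the chunk column.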
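The explode step can be pictured with plain Python lists. This is only an analogy for what `arrays_zip` followed by `explode` does to one row of the annotation column, not Spark code; the values are copied from the first two rows of the output table.

```python
# Parallel arrays, as held in a single row of the annotation column
results = ["POSTOPERATIVE DIAGNOSIS:", "PROCEDURE:"]
metadata = [
    {"identifier": "SECTION_HEADER", "sentence": "0", "chunk": "0"},
    {"identifier": "SECTION_HEADER", "sentence": "1", "chunk": "0"},
]

# arrays_zip pairs elements by position; explode emits one row per pair
rows = [{"regex": text, "metadata": meta} for text, meta in zip(results, metadata)]

for row in rows:
    print(row["regex"], row["metadata"]["identifier"])
```

Each resulting row holds one matched text together with its own metadata map, which is the shape of the table shown in the output.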

Conclusion

ChunkConverter provides a simple but powerful way to unify entities extracted by different methods in Spark NLP pipelines. By converting regex chunks to a standard chunk format with entities, you can seamlessly integrate rule-based and model-based entity extractions for comprehensive analysis.