Contextual Parser in Spark NLP: Extracting Medical Entities Contextually


This article is part of a series describing the annotators and main features of Spark NLP for Healthcare. Each article includes examples and a Jupyter Notebook for testing.

Entity extraction in the text

Extracting different entities from medical records is a common task in healthcare. These can be the Problem a patient had, the Treatment received, and the Tests completed, or posology entities such as Drug, Dosage, and so on. This operation converts raw text into structured data that can be analyzed with statistical approaches or displayed in reports.

Entity extractors in Spark NLP

Sometimes in clinical texts you need to extract disease codes, dosages, or dates with predefined rules. The first idea that may come to your mind: why wouldn’t I use a simple regex? Sounds logical and easy.

But what if you need to integrate it into an existing pipeline and apply some transformations afterwards? Of course, with Spark NLP pipelines you can always finalize the preprocessed text, apply a user-defined function (UDF), and create a new pipeline. But why repeat ourselves? There is an easier solution.

Contextual Parser

The Spark NLP team created a special annotator for this operation as part of the pipeline, which allows you to use the regex results in subsequent transformers (or just extract the information you need).

The regex for the ContextualParser is stored in a JSON file. What is nice about this format is that it contains not only the regex itself but also additional settings that either simplify the regex we are using or save time on adding multiple word variations to a dictionary. The dictionary here is a way to reference a fixed vocabulary for matching instead of defining a regex in the JSON. It is very useful if you don’t feel like writing a complex regex today but need a working version of the pipeline.

ContextualParserApproach() comes from the sparknlp_jsl.annotator module and has the following settable parameters. See the full list here.

  • setJsonPath() -> Sets the location of the JSON file with the regex rules
  • setCaseSensitive() -> optional: whether matching should be case sensitive; default is false
  • setPrefixAndSuffixMatch() -> optional: whether both the prefix and the suffix must match to annotate a hit
  • setDictionary() -> optional: sets the path to a dictionary file in TSV or CSV format

Suppose you are working on a statistical report of cancer cases in a region, and your dataset is a large collection of clinical reports. You want to extract entities like pT1bN0M0, cT4bcN2M1, cT3cN2, and so on (which follow a defined pattern) to calculate the total count of each stage. Here is an example of the text you have:

A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or lung. If the primary site is not clearly identified , this case is cT4bcN2M1, Stage Grouping 88. N4 A child T?N3M1  has soft tissue aM3 sarcoma and the staging has been left unstaged. Both clinical and pathologic staging would be coded pT1bN0M0 as unstageable cT3cN2.Medications started.

In the JSON file, you define the name of the entity you are extracting, the regex value, and the matchScope that tells the parser whether to make a full-token or a sub-token match:

{
"entity": "Stage",
"ruleScope": "sentence",
"regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]*",
"matchScope": "token"
}

Ignore the ruleScope for the moment; it is always at the sentence level, which means the parser looks for a match in each sentence.

The result will be:

expectedResult = ["pT1bN0M0", "T5", "cT4bcN2M1", "T?N3M1", "pT1bN0M0", "cT3cN2.Medications"]

If you use matchScope at the sub-token level, the pipeline will output:

expectedResult = ["pT1b", "T5", "cT4bc", "T?", "pT1b", "cT3c"]
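The core regex itself can be tried out with plain Python’s re module before wiring it into the pipeline. This is only a rough approximation of the sub-token behavior above (it does not reproduce the token-expansion logic of matchScope: token):

```python
import re

sample_text = ("A patient has liver metastases pT1bN0M0 and the T5 primary site "
               "may be colon or lung. If the primary site is not clearly identified , "
               "this case is cT4bcN2M1, Stage Grouping 88. N4 A child T?N3M1  has "
               "soft tissue aM3 sarcoma and the staging has been left unstaged. "
               "Both clinical and pathologic staging would be coded pT1bN0M0 as "
               "unstageable cT3cN2.Medications started.")

# Same regex as in the JSON rule; inside a character class, "^" in a
# non-leading position and the letters cpyrau are matched literally.
pattern = r"[cpyrau]?[T][0-9X?][a-z^cpyrau]*"
print(re.findall(pattern, sample_text))
# -> ['pT1b', 'T5', 'cT4bc', 'T?', 'pT1b', 'cT3c']
```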

We have already covered why you may need the Contextual Parser; here is how you can integrate it into an existing Spark NLP pipeline:

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

stage_contextual_parser = ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity_stage") \
    .setJsonPath("data/Stage.json")

parser_pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    stage_contextual_parser])

empty_data = spark.createDataFrame([[""]]).toDF("text")
parser_model = parser_pipeline.fit(empty_data)
light_model = LightPipeline(parser_model)
annotations = light_model.fullAnnotate(sample_text)[0]

The contextLength parameter sets the maximum distance that prefix or suffix words can be from the word to match, whereas context defines words that must appear immediately before or after the word to match.
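As an illustration, a rule could restrict matches to stage mentions that appear near certain words. The key names prefix, suffix, and contextLength below follow the Contextual Parser rule-file format as I understand it; treat the exact keys as an assumption to verify against the documentation:

```json
{
  "entity": "Stage",
  "ruleScope": "sentence",
  "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]*",
  "matchScope": "token",
  "prefix": ["staging", "coded"],
  "suffix": ["metastases"],
  "contextLength": 100
}
```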

Another useful feature of the parser is the dictionary parameter. To use it, you define a set of words that you want to match and the word that will replace each match.

For example, with the definition below, you are telling the ContextualParser that when the words woman, female, or girl are matched, they will be replaced by female, and when man, male, boy, or gentleman are matched, they will be replaced by male.

female    woman    female    girl
male      man      male      boy    gentleman

So, for example for this text:

At birth, the typical boy is growing slightly faster than the typical girl, but the velocities become equal at about seven months, and then the girl grows faster until four years. From then until adolescence no differences in velocity can be detected.

The expected output of the annotator will be:

expectedResult = ["boy", "girl", "girl"]

and the replacement words can be extracted from the metadata:

expectedMetadata =
[{"field" -> "Gender", "normalized" -> "male", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "0"},
{"field" -> "Gender", "normalized" -> "female", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "0"},
{"field" -> "Gender", "normalized" -> "female", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "0"}]

For the dictionary, you just need to define a CSV or TSV file where the first element of each row is the normalized word and the remaining elements are the values to match.
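The normalization itself is easy to emulate in plain Python, which helps when sanity-checking a dictionary file before handing it to the parser. This is just a sketch of the idea, not the Spark NLP implementation:

```python
import re

# Each row: first element is the normalized word, the rest are values to match
# (the same layout as the TSV dictionary file above).
rows = [
    ["female", "woman", "female", "girl"],
    ["male", "man", "male", "boy", "gentleman"],
]
normalize = {match: row[0] for row in rows for match in row[1:]}

text = ("At birth, the typical boy is growing slightly faster than the typical "
        "girl, but the velocities become equal at about seven months, and then "
        "the girl grows faster until four years.")

matches = [w for w in re.findall(r"[A-Za-z]+", text) if w in normalize]
normalized = [normalize[w] for w in matches]
print(matches)     # -> ['boy', 'girl', 'girl']
print(normalized)  # -> ['male', 'female', 'female']
```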

Here is the Jupyter Notebook, where you can find more examples of annotator usage and full pipelines.

Conclusion

In this article, we introduced ContextualParserApproach(), one of the annotators that allows us to extract entities from text. It is a simplified annotator that you can use when regular expressions are enough to extract your entities and the rules are specific. If you need to take context information into account as well, I suggest you check our pretrained NerDL models, or even train your own model using the NerDL annotator.

Here are the links to the articles that will help you with that. Don’t forget to follow our page and stay tuned!

Problems?

Feel free to ping us on GitHub or just join our Slack channel!
