What is New in Spark NLP for Healthcare v3.4.1?

Muhammet S. · Published in spark-nlp · 9 min read · Feb 14, 2022

Spark NLP for Healthcare v3.4.1 is out! Let's take a look together at what is released in this new version.

Highlights

  • Brand new Spanish deidentification NER models
  • Brand new Spanish deidentification pretrained pipeline
  • New clinical NER model to detect supplements
  • New EntityChunkEmbeddings annotator
  • New RxNorm sentence entity resolver model
  • New MedicalBertForSequenceClassification annotator
  • New MedicalDistilBertForSequenceClassification annotator
  • New MedicalDistilBertForSequenceClassification and MedicalBertForSequenceClassification models
  • Redesign of the ContextualParserApproach annotator
  • getClasses method in RelationExtractionModel and RelationExtractionDLModel annotators
  • Label customization feature for RelationExtractionModel and RelationExtractionDL models
  • useBestModel parameter in MedicalNerApproach annotator
  • Early stopping feature in MedicalNerApproach annotator
  • Multi-Language support for faker and regex lists of Deidentification annotator
  • Spark 3.2.0 compatibility for the entire library
  • Saving visualization feature in spark-nlp-display library
  • Deploying a custom Spark NLP image (for opensource, healthcare, and Spark OCR) to an enterprise version of Kubernetes: OpenShift
  • New speed benchmarks table on Databricks
  • New & Updated notebooks
  • List of recently updated or added models

Brand New Spanish Deidentification NER Models

We trained two new NER models to detect PHI (protected health information) that may need to be deidentified in Spanish texts. The ner_deid_generic and ner_deid_subentity models are trained with in-house annotations. Both models are also available in variants trained with RoBERTa Spanish clinical embeddings and with sciwiki 300d embeddings (see the sketch after the example below).

  • ner_deid_generic : Detects 7 PHI entities in Spanish (DATE, NAME, LOCATION, PROFESSION, CONTACT, AGE, ID).
  • ner_deid_subentity : Detects 13 PHI sub-entities in Spanish (PATIENT, HOSPITAL, DATE, ORGANIZATION, E-MAIL, USERNAME, LOCATION, ZIP, MEDICALRECORD, PROFESSION, PHONE, DOCTOR, AGE).

Example :

...
embeddings = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d", "es", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

deid_ner = MedicalNerModel.pretrained("ner_deid_generic", "es", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

deid_sub_entity_ner = MedicalNerModel.pretrained("ner_deid_subentity", "es", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_sub_entity")
...

text = """Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2020
y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.."""
result = model.transform(spark.createDataFrame([[text]], ["text"]))
Spanish deidentification model results
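The RoBERTa variants mentioned above are used the same way, only swapping the embeddings stage. A minimal sketch, assuming a Spanish biomedical RoBERTa embeddings model (the embeddings model name below is an assumption; ner_deid_generic_roberta comes from the model list at the end of this post):

# "roberta_base_biomedical" is an assumed embeddings model name
roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

deid_ner_roberta = MedicalNerModel.pretrained("ner_deid_generic_roberta", "es", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")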

Brand New Spanish Deidentification Pretrained Pipeline

We developed a clinical deidentification pretrained pipeline that can be used to deidentify PHI from Spanish medical texts. The detected PHI is masked or obfuscated in the resulting text. The pipeline can mask, fake, or obfuscate the following entities: AGE, DATE, PROFESSION, E-MAIL, USERNAME, LOCATION, DOCTOR, HOSPITAL, PATIENT, URL, IP, MEDICALRECORD, IDNUM, ORGANIZATION, PHONE, ZIP, ACCOUNT, SSN, PLATE, SEX, and IPADDR.

from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline("clinical_deidentification", "es", "clinical/models")

sample_text = """Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910."""

result = deid_pipeline.annotate(sample_text)

print("\n".join(result['masked']))
print("\n".join(result['masked_with_chars']))
print("\n".join(result['masked_fixed_length_chars']))
print("\n".join(result['obfuscated']))
Spanish Deidentification Pretrained Pipeline Result

New Clinical NER Model to Detect Supplements

We are releasing the ner_supplement_clinical model that can extract the benefits of using drugs for certain conditions, labeling detected entities as CONDITION and BENEFIT. This model is trained on the dataset released by spaCy in their HealthSea project. Here is the benchmark comparison of both versions:

Spark NLP vs Spacy-HealthSea benchmark

Example :

...
clinical_ner = MedicalNerModel.pretrained("ner_supplement_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_tags")
...

sample_text = "Excellent!. The state of health improves, nervousness disappears, and night sleep improves. It also promotes hair and nail growth."

# `ner_model` is the pipeline fitted with the stages above
results = ner_model.transform(spark.createDataFrame([[sample_text]], ["text"]))
ner_supplement_clinical results

New RxNorm Sentence Entity Resolver Model

sbiobertresolve_rxnorm_augmented_re : This model maps clinical entities and concepts (such as drugs/ingredients) to RxNorm codes using sbiobert_base_cased_mli Sentence BERT embeddings. You don't need to specify the relations between the entities; they are calculated on the fly inside the annotator via EntityChunkEmbeddings.

Example :

...
rxnorm_resolver = SentenceEntityResolverModel\
    .pretrained("sbiobertresolve_rxnorm_augmented_re", "en", "clinical/models")\
    .setInputCols(["entity_chunk_embeddings"])\
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")
...

New EntityChunkEmbeddings Annotator

We have a new EntityChunkEmbeddings annotator that computes a weighted average vector representing entity-related vectors. The input usually consists of chunks of recognized named entities produced by MedicalNerModel. Relations between entities are specified with the setTargetEntities() parameter; an internal relation extraction model finds the related entities and creates a combined chunk. The embedding for that chunk is calculated according to the weights specified in the setEntityWeights() parameter.

For instance, the chunk warfarin sodium 5 MG Oral Tablet contains DRUG, STRENGTH, ROUTE, and FORM entity types. Since the DRUG label is the most prominent label for resolver models, we can now assign it a higher weight (i.e. {"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2}, as shown below). In other words, the embeddings of these labels are multiplied by the assigned weights, e.g. DRUG by 0.8.

For more details and examples, please check Sentence Entity Resolvers with EntityChunkEmbeddings Notebook in the Spark NLP workshop repo.

Example :

...

drug_chunk_embeddings = EntityChunkEmbeddings\
    .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("drug_chunk_embeddings")\
    .setMaxSyntacticDistance(3)\
    .setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})\
    .setEntityWeights({"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2})

rxnorm_resolver = SentenceEntityResolverModel\
    .pretrained("sbiobertresolve_rxnorm_augmented_re", "en", "clinical/models")\
    .setInputCols(["drug_chunk_embeddings"])\
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")

rxnorm_weighted_pipeline_re = Pipeline(stages=[
    documenter,
    sentence_detector,
    tokenizer,
    embeddings,
    posology_ner_model,
    ner_converter,
    pos_tagger,
    dependency_parser,
    drug_chunk_embeddings,
    rxnorm_resolver])

sample_texts = ["The patient was given metformin 500 mg, 2.5 mg of coumadin and then ibuprofen.",
                "The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG"]

data_df = spark.createDataFrame([[t] for t in sample_texts], ["text"])
results = rxnorm_weighted_pipeline_re.fit(data_df).transform(data_df)
RxNorm resolver model results with EntityChunkEmbeddings annotator

New MedicalBertForSequenceClassification Annotator

We developed a new annotator called MedicalBertForSequenceClassification. It can load BERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks.
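Here is a minimal sketch of how the annotator plugs into a pipeline, using the bert_sequence_classifier_gender_biobert model released below (the surrounding stages are standard Spark NLP components):

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from sparknlp_jsl.annotator import MedicalBertForSequenceClassification

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

sequence_classifier = MedicalBertForSequenceClassification\
    .pretrained("bert_sequence_classifier_gender_biobert", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, tokenizer, sequence_classifier])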

New MedicalDistilBertForSequenceClassification Annotator

We developed a new annotator called MedicalDistilBertForSequenceClassification. It can load DistilBERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. A usage example is shown after the model list below.

New MedicalDistilBertForSequenceClassification and MedicalBertForSequenceClassification Models

We are releasing a new MedicalDistilBertForSequenceClassification model and three new MedicalBertForSequenceClassification models.

  • bert_sequence_classifier_ade_biobert: a classifier for detecting if a sentence is talking about a possible ADE (TRUE, FALSE)
  • bert_sequence_classifier_gender_biobert: a classifier for detecting the gender of the main subject of the sentence (MALE, FEMALE, UNKNOWN)
  • bert_sequence_classifier_pico_biobert: a classifier for detecting the class of a sentence according to the PICO framework (CONCLUSIONS, DESIGN_SETTING, INTERVENTION, PARTICIPANTS, FINDINGS, MEASUREMENTS, AIMS)
  • distilbert_sequence_classifier_ade : a DistilBertForSequenceClassification model for classifying whether clinical texts contain an ADE (TRUE, FALSE)

Example :

...
sequenceClassifier = MedicalDistilBertForSequenceClassification\
.pretrained('distilbert_sequence_classifier_ade', 'en', 'clinical/models') \
.setInputCols(['token', 'document']) \
.setOutputCol('class')
...

sample_text = "I felt a bit drowsy and had blurred vision after taking Aspirin."

result = sequence_clf_model.transform(spark.createDataFrame([[sample_text]]).toDF("text"))
distilbert_sequence_classifier_ade model result

Redesign of the ContextualParserApproach Annotator

  • We’ve dropped the annotator’s contextMatch parameter and removed the need for a context field when feeding a JSON configuration file to the annotator. Context information can now be fully defined using the prefix, suffix and contextLength fields in the JSON configuration file.
  • We’ve also fixed issues with the contextException field in the JSON configuration file - it was mismatching values in documents with several sentences and ignoring exceptions situated to the right of a word/token.
  • The ruleScope field in the JSON configuration file can now be set to document instead of sentence. This allows you to match multi-word entities like "New York" or "Salt Lake City". You can do this by setting "ruleScope" : "document" in the JSON configuration file and feeding a dictionary (csv or tsv) to the annotator with its setDictionary parameter. These changes also mean that we've dropped the updateTokenizer parameter since the new capabilities of ruleScope improve the user experience for matching multi-word entities.
  • You can now feed in a dictionary in either vertical or horizontal orientation. You can set that with the following parameter: setDictionary("dictionary.csv", options={"orientation":"vertical"})
  • Lastly, there was an improvement made to the confidence value calculation process to better measure successful hits.

For more explanation and examples, please check this Contextual Parser medium article and Contextual Parser Notebook.
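As a quick illustration, here is a minimal sketch of the redesigned configuration (the entity name, regex, and context values below are made up for illustration):

import json

# Context is now fully defined via prefix, suffix, and contextLength;
# there is no context field and no contextMatch parameter anymore.
dosage_config = {
    "entity": "Dosage",
    "ruleScope": "sentence",  # set to "document" to match multi-word entities
    "regex": "\\d+",
    "prefix": ["took", "given"],
    "suffix": ["mg", "ml"],
    "contextLength": 30
}

with open("dosage_config.json", "w") as f:
    json.dump(dosage_config, f)

from sparknlp_jsl.annotator import ContextualParserApproach

contextual_parser = ContextualParserApproach()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("entity")\
    .setJsonPath("dosage_config.json")\
    .setCaseSensitive(False)
# optionally: .setDictionary("dictionary.csv", options={"orientation": "vertical"})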

getClasses Method in RelationExtractionModel and RelationExtractionDLModel Annotators

Now you can use the getClasses() method to check the relation labels of RE models (RelationExtractionModel and RelationExtractionDLModel), just like with MedicalNerModel().

Example :

clinical_re_model = RelationExtractionModel\
    .pretrained("re_temporal_events_clinical", "en", "clinical/models")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")

clinical_re_model.getClasses()
re_temporal_events_clinical model classes
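The DL variant works the same way; here is a minimal sketch using the redl_ade_biobert model from the next section:

re_dl_model = RelationExtractionDLModel\
    .pretrained("redl_ade_biobert", "en", "clinical/models")\
    .setInputCols(["re_ner_chunks", "sentences"])\
    .setOutputCol("relations")

re_dl_model.getClasses()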

Label Customization Feature for RelationExtractionModel and RelationExtractionDL Models

We are releasing a label customization feature for Relation Extraction and Relation Extraction DL models via the .setCustomLabels() parameter.

Example :

...
reModel = RelationExtractionModel.pretrained("re_ade_clinical", "en", "clinical/models")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade", "ade-drug"])\
    .setCustomLabels({"1": "is_related", "0": "not_related"})

redl_model = RelationExtractionDLModel.pretrained("redl_ade_biobert", "en", "clinical/models")\
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunks", "sentences"])\
    .setOutputCol("relations")\
    .setCustomLabels({"1": "is_related", "0": "not_related"})
...

sample_text = "I experienced fatigue and muscle cramps after taking Lipitor but no more adverse after passing Zocor."

# `model` is the pipeline fitted with the stages above
result = model.transform(spark.createDataFrame([[sample_text]]).toDF("text"))
Customized label results of Relation Extraction models

useBestModel Parameter in MedicalNerApproach Annotator

Introducing the useBestModel param in the MedicalNerApproach annotator. This param preserves and restores the model that achieved the best performance during training. The priority order is: metrics on testDataset (micro F1), then metrics on validationSplit (micro F1); if neither is set, the training loss is tracked instead.

med_ner = MedicalNerApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    ...
    .setUseBestModel(True)

Early Stopping Feature in MedicalNerApproach Annotator

Introducing an earlyStopping feature for MedicalNerApproach(). You can stop training at the point when performance on the test/validation dataset starts to degrade. Two params were added to MedicalNerApproach() in order to use this feature:

  • earlyStoppingCriterion : (float) sets the minimal improvement of the test metric required to keep training. The metric monitored is the same as in useBestModel (macro F1 when using a test/validation set, loss otherwise). The default is 0, which means no early stopping is applied.
  • earlyStoppingPatience : (int) the number of epochs without improvement that will be tolerated. The default is 0, which means training stops the first time the current epoch's performance is no better than the previous epoch's.

Example :

med_ner = MedicalNerApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    ...
    .setTestDataset(test_data_parquet_path)\
    .setEarlyStoppingCriterion(0.01)\
    .setEarlyStoppingPatience(3)

Multi-Language Support for Faker and Regex Lists of Deidentification Annotator

We have a new .setLanguage() parameter to use the internal faker and regex lists with multi-language texts. When deidentifying German or Spanish texts, set this parameter to de for German or es for Spanish. The default value is en.

Example :

deid_obfuscated = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("obfuscated")\
    .setMode("obfuscate")\
    .setLanguage("de")\
    .setObfuscateRefSource("faker")

Spark 3.2.0 Compatibility for the Entire Library

Now we can use Spark 3.2.0 with Spark NLP for Healthcare by setting spark32=True in the sparknlp_jsl.start() function.

! pip install --ignore-installed -q pyspark==3.2.0

import sparknlp_jsl
spark = sparknlp_jsl.start(SECRET, spark32=True)

Saving Visualization Feature in spark-nlp-display Library

We have a new save_path parameter in spark-nlp-display library for saving any visualization results in Spark NLP.

Example :

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
visualiser.display(light_result[0], label_col='ner_chunk', document_col='document', save_path="display_result.html")

Deploying a Custom Spark NLP Image (for opensource, healthcare, and Spark OCR) to an Enterprise Version of Kubernetes: OpenShift

Spark NLP for opensource, healthcare, and Spark OCR is now available on OpenShift, an enterprise version of Kubernetes. For deployment, please refer to:

Github Link: https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/platforms/openshift

Youtube: https://www.youtube.com/watch?v=FBes-6ylFrM&ab_channel=JohnSnowLabs

New Speed Benchmarks Table on Databricks

We prepared a speed benchmark table by running a clinical BERT for token classification pipeline with various numbers of partitions and writing the results in Parquet or Delta format.
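A hedged sketch of the pattern used in the experiment (the paths, the partition count, and the pipeline_model name are illustrative):

# Repartition the input, run the fitted pipeline, and persist the results.
input_df = spark.read.parquet("clinical_notes.parquet").repartition(64)

result_df = pipeline_model.transform(input_df)

# write to Parquet ...
result_df.write.mode("overwrite").parquet("benchmark_results_parquet")

# ... or to Delta
result_df.write.format("delta").mode("overwrite").save("benchmark_results_delta")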

You can find the details here: Clinical Bert For Token Classification Benchmark Experiment.

New & Updated Notebooks

To see more, please check: Spark NLP Healthcare Workshop Repo

List of Recently Updated or Added Models

  • bert_sequence_classifier_ade_en
  • bert_sequence_classifier_gender_biobert_en
  • bert_sequence_classifier_pico_biobert_en
  • distilbert_sequence_classifier_ade_en
  • bert_token_classifier_ner_supplement_en
  • clinical_deidentification_es
  • ner_deid_generic_es
  • ner_deid_generic_roberta_es
  • ner_deid_subentity_es
  • ner_deid_subentity_roberta_es
  • ner_nature_nero_clinical_en
  • ner_supplement_clinical_en
  • sbiobertresolve_clinical_abbreviation_acronym_en
  • sbiobertresolve_rxnorm_augmented_re

For all Spark NLP for Healthcare models, please check the Models Hub page.
