Spark NLP Walkthrough, powered by TensorFlow

This article is a walkthrough of solving a realistic use case of natural language understanding in healthcare using Spark NLP Deep Learning Named Entity Recognition (DL-NER).

What is Spark NLP?

Spark NLP is an open source natural language processing library built on top of Apache Spark and Spark ML. It provides an easy API to integrate with ML Pipelines and is commercially supported by John Snow Labs. Spark NLP’s annotators utilize rule-based algorithms, machine learning and, for some of them, TensorFlow running under the hood to power specific deep learning implementations. We’ll be exploring the latter here in more detail.

The library covers many common NLP tasks, including tokenization, stemming, lemmatization, part-of-speech tagging, sentiment analysis, spell checking, named entity recognition, and more. The full list of annotators, pipelines and concepts is described in the online reference. All of them are open source and can be used to train models with your own data. The library also provides pretrained models, although these serve mainly as a way of getting a feel for how the library works rather than for production use.

As a native extension of the Spark ML API, the library lets you train, customize and save models so they can run on a cluster or other machines, or be stored for later use. It is also easy to extend and customize models and pipelines, as we’ll do here.
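As a minimal sketch of that save-and-reuse capability, standard Spark ML persistence applies to fitted pipelines. The path below is hypothetical, and fittedModel stands for the result of some pipeline.fit(...) call:

import org.apache.spark.ml.PipelineModel

// persist a fitted pipeline to storage reachable by the cluster
fittedModel.write.overwrite().save("/models/my_nlp_pipeline")

// load it back later, possibly on another machine or cluster
val restored = PipelineModel.load("/models/my_nlp_pipeline")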

What are we going to do?

This article focuses on a DL-NER annotator that will extract healthcare problems from a corpus of clinical notes in PDF format. Several interesting features come together in this story:

  • Large PubMed-based word embeddings, tried at various dimension sizes, provide good enough coverage for production use cases: in our own experiments, 99.7% or more of the tokens had a specific vector equivalent (a sketch of how such a coverage figure can be computed follows this list). Note: we are not allowed to redistribute these PubMed word embeddings, but you may be able to find them on the internet.
  • TensorFlow based model that combines CNN and LSTM layers for sub-word analysis. This is part of the Spark NLP library, and it includes the graphs required both to train and to predict.
  • Spark framework, including a cluster-based scenario, distributed work environment, and storage.
  • Spark NLP annotators dealing with several different stages of NLP.
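As a rough illustration of how a coverage figure like the one above can be measured, here is a small standalone sketch. The file paths are hypothetical, and this is not how Spark NLP computes coverage internally:

import scala.io.Source

// vocabulary of the embeddings file: the first whitespace-separated field of each line
val embeddingsVocab: Set[String] =
  Source.fromFile("/home/embeddings/clinical.embeddings.100d.txt")
    .getLines()
    .map(line => line.split("\\s+", 2).head.toLowerCase)
    .toSet

// distinct tokens observed in the corpus (here read from a plain text file)
val corpusTokens: Set[String] =
  Source.fromFile("/home/data/clinical_tokens.txt")
    .getLines()
    .flatMap(_.split("\\s+"))
    .map(_.toLowerCase)
    .toSet

val covered = corpusTokens.count(embeddingsVocab.contains)
println(f"Token coverage: ${covered.toDouble / corpusTokens.size * 100}%.2f%%")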

But, what is Spark?

Spark is an open source, Apache-licensed analytics engine and library for large-scale computation. By learning its API, users can easily produce programs that automatically run in distributed computing environments through the concept of RDDs (resilient distributed datasets).

Over time, Spark has made room for data scientists by implementing the DataFrame API, which lets users focus on data as schema-based tables without having to learn anything about the underlying RDDs. The power of Spark is that it can keep these datasets in distributed memory, making it very fast for iterative algorithms that access the data repeatedly.
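For instance, a minimal sketch of loading data as a DataFrame and keeping it in cluster memory might look like this (assuming a SparkSession named spark; the file path and schema are hypothetical):

// read a schema-based table without touching the underlying RDDs directly
val notes = spark.read
  .option("header", "true")
  .csv("/home/data/clinical_notes.csv")
  .cache() // keep the DataFrame in distributed memory for repeated access

notes.printSchema()
notes.count() // the first action materializes the cache; later passes reuse it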

Spark shines when you have distributed storage, data that is large enough, and algorithms that can run in parallel to spread the load. Another nice thing is that Spark has APIs for Java, Python and R, aside from Scala, which is its main language.

Spark has a module called Spark ML which introduces several machine learning components: Estimators, which are trainable algorithms, and Transformers, which are either the result of training an Estimator or an algorithm that doesn’t require training at all. Both Estimators and Transformers can be part of a Pipeline, which is nothing more than a sequence of stages that execute in order and typically depend on each other’s results.
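To make the distinction concrete, here is a small, self-contained Spark ML example with toy data (nothing Spark NLP specific yet, and assuming a SparkSession and its implicits are in scope): Tokenizer and HashingTF are Transformers, LogisticRegression is an Estimator, and fitting the Pipeline yields a PipelineModel, itself a Transformer.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val training = Seq(
  (0L, "spark nlp runs on spark ml", 1.0),
  (1L, "completely unrelated text", 0.0)
).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")     // Transformer
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features") // Transformer
val lr = new LogisticRegression().setMaxIter(10)                              // Estimator

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val pipelineModel = pipeline.fit(training) // training the Estimator stage yields a Transformer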

Spark-NLP introduces NLP annotators that fit within this framework, and its algorithms are meant to predict in parallel. All annotators are either Estimators (which we call AnnotatorApproach, e.g. NerCrfApproach) or Transformers (which we call AnnotatorModel, e.g. SentimentDetectorModel); transformers that do not require training at all go simply by their name (e.g. Tokenizer).

Spark NLP is open source with an Apache 2.0 license, so you are welcome to examine the full source code.

Knowledge prerequisites

Even though we gave a very short Spark intro above, it is recommended that the reader be familiar with:

  • Apache Spark 2.x
  • TensorFlow

We won’t be covering the basics of these topics here. The notebooks presented below are written in both Scala and Python, as a way of showing that the library works in both languages.

Getting started, basic component introductions

Working together: Spark-NLP and TensorFlow

The general flow of annotators and annotations in Spark NLP works as follows. As part of a Pipeline you can have various annotators of a different nature, such as a part-of-speech tagger or a tokenizer, and somewhere among these the NerDL annotator will sit in the pipeline. The types of annotators are:

  1. Rule-based annotators, which apply standard strategies utilizing Regular Expressions and simple math and statistics (such as a Tokenizer, or a dictionary-based Sentiment Analysis detector).
  2. Machine learning models, which include their own statistics and math implementations, with a training stage (called an AnnotatorApproach) and a trained-model counterpart to Spark ML Transformers (called an AnnotatorModel in Spark-NLP).
  3. A third family is the one of interest here: the deep learning models. These include a pre-generated graph object and a full API in Spark-NLP that allows training them right away by just providing some corpora and parameters. Once trained, the annotator model will automatically link the prediction stage to TensorFlow, feeding its session with the data its layers need. Such an annotator acts the same way as any other annotator in Spark-NLP.

It is also possible to train the complete model on TensorFlow, in Python, and convert it to a Spark-NLP Annotator Model equivalent. We’ll approach this later in the article.

In this article, we use Spark-NLP 1.7.3, and TensorFlow 1.8.0 with its Java bindings.

Word embeddings

Working with raw text is not always convenient when we want to find relationships between characters and words. Hence, word embeddings are a key entry point for understanding text in context. They are fundamental for measuring word similarity and distance in a mathematical sense, allowing us to compute relationships between their vector representations in space.
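As a tiny sketch of what this mathematical view means, the usual measure between two word vectors is the cosine of the angle between them. The vectors below are made up; real ones would come from an embeddings lookup with 100 or 200 dimensions:

def cosineSimilarity(a: Array[Float], b: Array[Float]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

// two hypothetical 4-dimensional vectors for near-synonym clinical terms
val fever   = Array(0.12f, 0.83f, -0.40f, 0.05f)
val pyrexia = Array(0.10f, 0.80f, -0.35f, 0.07f)
println(cosineSimilarity(fever, pyrexia)) // close to 1.0 for near-synonyms, near 0.0 for unrelated words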

Spark NLP includes a unique capability for distributing word embeddings efficiently and automatically within a distributed Spark cluster. As of Spark NLP 1.7.3, these embeddings can be provided in three different formats: TEXT, BINARY or SPARK, the latter being an internal serialized form of the embeddings that gets the most out of them when working in cluster mode. The library makes sure that workers on different nodes are able to reach these resources.

Corpus and OCR processing

Starting with version 1.6.0, Spark NLP includes OCR capabilities. It provides distributed OCR and can handle both digital text and scanned documents. This enables converting PDF files into a ready-to-go DOCUMENT annotation, which is the entry point to Spark-NLP. Such an annotation can later be sentence-detected and even exploded into multiple sentences per row. The output will then be an already-distributed Spark DataFrame, which can be the target for whatever requirements our DL-NER will have (we’ll get into this later).

The scenario: Searching for healthcare problems in text and returning database code

Imports and core concepts

Let’s get down to work. First things first, we add the necessary imports:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.NerConverter
import org.apache.spark.ml.Pipeline

base._ and annotator._ include basic annotators and concepts required for general usage of Annotators.

NerConverter is a particular annotator that converts the NER’s IOB-formatted output into a chunk of text corresponding to the entity actually found in the text. This will be a useful input for the EntityResolver annotator later. Things will make sense as we go.

Finally, Pipeline is a spark class for combining different annotators in a single place, before fitting and predicting.

Our dataset will be a DataFrame composed of PubMed PDF clinical abstracts; we use Spark-NLP’s OCR reader here.

import com.johnsnowlabs.nlp.util.io.OcrHelper
val clinicalRecords = OcrHelper.createDataset(spark, "/home/data/clinical_records/", "text", "text_metadata")
clinicalRecords.show()

We are taking a small sample of the PubMed DataFrame to keep this understandable. This will be the dataset we predict on.

The workhorses: Spark NLP annotators

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentences")

val token = new Tokenizer()
  .setInputCols(Array("sentences"))
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normal")

val ner = NerDLModel.load("/home/models/i2b2_normalized_109")
  .setInputCols("normal", "document")
  .setOutputCol("ner")

val nerConverter = new NerConverter()
  .setInputCols("document", "normal", "ner")
  .setOutputCol("ner_converter")

Let’s decompose this code a little bit, although yes, it is as easy as it looks.

  1. DocumentAssembler is the annotator taking the target text column, making it ready for further NLP annotations, into a column called document.
  2. SentenceDetector will identify sentence boundaries in our paragraphs. Since we are reading entire file contents on each row, we want to make sure our sentences are properly bounded. This annotator correctly handles page breaks, paragraph breaks, lists, enumerations, and other common formatting features that distort the regular flow of the text, which helps the accuracy of the rest of the pipeline. The output column for this is the sentences column.
  3. Tokenizer will separate each meaningful word, within each sentence. This is very important on NLP since it will identify a token boundary.
  4. Normalizer will clean up each token, taking as its input the token column from our Tokenizer and putting normalized tokens in the normal column. Cleaning up includes removing any non-character strings. Whether this helps a model or not is a decision for the model maker.
  5. NerDLModel is our pipeline hero. This annotator takes as inputs the normalized tokens plus the original document strings, since every annotator declares the specific inputs it requires. In this scenario, we are loading a healthcare model from disk (which is licensed as part of John Snow Labs’ Spark NLP for Healthcare). In the next section, we’ll explain how this model was generated.
  6. NerConverter is an auxiliary annotator specifically made for the NER models of Spark NLP. This one will take the entity tags and build up the chunk of text which represents the entity found, relative to the original sentence.
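To make the IOB-to-chunk step concrete, here is a rough, standalone sketch of what that conversion does conceptually. This is not the library’s implementation; the tokens and tags are made up:

// conceptual IOB chunking: "B-PROBLEM" starts an entity, "I-PROBLEM" continues it, "O" is outside
val tagged = Seq(
  ("Patient", "O"), ("denies", "O"),
  ("chest", "B-PROBLEM"), ("pain", "I-PROBLEM"),
  ("and", "O"), ("shortness", "B-PROBLEM"), ("of", "I-PROBLEM"), ("breath", "I-PROBLEM")
)

val chunks = tagged.foldLeft(List.empty[List[String]]) {
  case (acc, (token, tag)) if tag.startsWith("B-")                 => List(token) :: acc
  case (acc, (token, tag)) if tag.startsWith("I-") && acc.nonEmpty => (token :: acc.head) :: acc.tail
  case (acc, _)                                                    => acc
}.map(_.reverse.mkString(" ")).reverse

println(chunks) // List(chest pain, shortness of breath)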

Putting things together: Spark Pipelines

val pipeline = new Pipeline().setStages(Array(
    document,
    sentenceDetector,
    token,
    normalizer,
    ner,
    nerConverter
))
val model = pipeline.fit(Seq.empty[String].toDF("text"))

As you can see, a pipeline takes as stages each one of our annotators, in the right order. Notice we are calling fit(), which is Spark ML’s training function, but we are passing an empty DataFrame with a column called text to comply with DocumentAssembler’s and general pipeline requirements. Since we are not really training anything here, an empty DataFrame is fine.
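One small practical note: for Seq.empty[String].toDF("text") to compile, the Spark SQL implicits must be in scope. Assuming a SparkSession named spark, the fit call above is equivalent to:

import spark.implicits._ // brings toDF into scope

val emptyData = Seq.empty[String].toDF("text") // just a schema, no rows
val model = pipeline.fit(emptyData)            // nothing is actually trained here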

Sidebar: But, where did NerDLModel come from originally?

NerDLModel is the result of a training process, originated by the NerDLApproach Spark ML estimator. Under the hood, this estimator trains a TensorFlow deep learning model.

There are two ways of loading TensorFlow models into Spark NLP: utilizing pre-generated graphs, or using the Python model reader (NerDLModelPythonReader), which reads TensorFlow models into a NerDLModel. We’ll present both here.

Utilizing Spark NLP’s pre-generated graphs, we can train directly from the library through a familiar API, as follows (in Python):

from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

nerTagger = NerDLApproach()\
    .setInputCols(["sentence", "token"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(109)\
    .setExternalDataset("/home/data/clinical.train")\
    .setValidationDataset("/home/clinical.validation")\
    .setTestDataset("/home/clinical.testing")\
    .setEmbeddingsSource("/clinical.embeddings.100d.txt", 100, 2)\
    .setRandomSeed(0)\
    .setIncludeEmbeddings(True)\
    .setVerbose(2)

converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_span")

finisher = Finisher()\
    .setInputCols(["ner_span"])\
    .setIncludeKeys(True)

pipeline = Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        nerTagger,
        converter,
        finisher
    ]
)

As you can see, the Pipeline interface is the same. If we were fitting this scenario, we would also fit on an empty dataset, since such training is coming from ExternalDataset — files that are outside the pipeline.

NER models in Spark NLP, as of today, can only be trained on the CoNLL format, which looks like this:
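Here is a small, made-up fragment in the CoNLL-2003 layout (token, part-of-speech tag, syntactic chunk tag, and the IOB entity label in the fourth column):

-DOCSTART- -X- -X- O

The       DT  B-NP O
patient   NN  I-NP O
reported  VBD B-VP O
chest     NN  B-NP B-PROBLEM
pain      NN  I-NP I-PROBLEM
.         .   O    O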

Basically, sentence-bounded rows with one token per row and its associated label in the fourth column, using the IOB format. We generated the CoNLL dataset using Spark-NLP’s tokenizer, in order to make sure we follow the same standard.

Once trained, we just save it, as follows:

val model = pipeline.fit(Seq.empty[String].toDF("text"))
model.stages(3).asInstanceOf[NerDLModel].write.overwrite().save("/home/models/i2b2_normalized_109")

To be able to train such a NerDL model, we needed to provide this TensorFlow graph:

char_cnn_blstm_10_100_100_25_30.pb

This graph was generated after creating the appropriate model in Python (as explained further below).

The TensorFlow Model

Our NER model follows a Bi-LSTM with Convolutional Neural Networks scheme, utilizing word embeddings for token and sub-token analysis (in this case, 100 dimensions, though we usually end up running with 200 dimensions, which yielded better results).

with tf.device('/gpu:0'):
    with tf.variable_scope("char_repr") as scope:
        # shape = (batch size, sentence, word)
        self.char_ids = tf.placeholder(tf.int32, shape=[None, None, None], name="char_ids")
        # shape = (batch_size, sentence)
        self.word_lengths = tf.placeholder(tf.int32, shape=[None, None], name="word_lengths")

    with tf.variable_scope("word_repr") as scope:
        # shape = (batch size)
        self.sentence_lengths = tf.placeholder(tf.int32, shape=[None], name="sentence_lengths")

    with tf.variable_scope("training", reuse=None) as scope:
        # shape = (batch, sentence)
        self.labels = tf.placeholder(tf.int32, shape=[None, None], name="labels")
        self.lr = tf.placeholder_with_default(0.005, shape=(), name="lr")
        self.dropout = tf.placeholder(tf.float32, shape=(), name="dropout")

We put this information into a class called NerModel and then add the appropriate layers, whose full implementation does not fit here.

# this will create a graph as per requirements of the model
ner = NerModel()
# now we’re adding nodes to our TF graph
ner.add_cnn_char_repr(101, 25, 30)
ner.add_pretrained_word_embeddings(30)
ner.add_context_repr(10, 200)
ner.add_inference_layer(False)
ner.add_training_op(5.0)
ner.init_variables()

This leads us to the second way to load Python TensorFlow models into a NerDLModel: reading them into Spark NLP with the provided reader, as opposed to the first approach, which was saving the graph into a *.pb file and training from Spark NLP as shown earlier.

For the former, following our own custom saving format, we do the following to generate the graph file, and then train from Spark NLP as shown above:

def save_models(self, folder):
    # collect any pre-existing saver nodes in the default graph
    saveNodes = list([n.name for n in tf.get_default_graph().as_graph_def().node
                      if n.name.startswith('save/')])
    # add a Saver (and its save/* nodes) if the graph does not have one yet
    if len(saveNodes) == 0:
        saver = tf.train.Saver()
    # write the variables and serialize the graph into saved_model.pb
    variables_file = os.path.join(folder, 'variables')
    self.ner.session.run('save/control_dependency', feed_dict={'save/Const:0': variables_file})
    tf.train.write_graph(self.ner.session.graph, folder, 'saved_model.pb', False)

Or for the latter, reading it with our reader, once trained directly in Python:

import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModelPythonReader
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsFormat
import com.johnsnowlabs.nlp.util.io.ResourceHelper

val model = NerDLModelPythonReader.read(
    "/home/model/tensorflow/python_tf_model",
    ResourceHelper.spark,
    normalize = true,
    WordEmbeddingsFormat.BINARY
)
model.write.overwrite().save("/home/models/i2b2_normalized_109")

You may have noticed that we trained using a GPU device. If the graph was trained with GPUs, it will attempt to use them wherever you read it from (i.e. in the Spark NLP pipelines), for both training and prediction. We use the allow_soft_placement TensorFlow config option to continue anyway if a GPU device is not available. To use a GPU, it may be required to include additional dependencies in the Spark NLP build.sbt.

Moving forward: Back to our Pipeline

We now have our pipeline. Let’s predict with it:

val result = model.transform(clinicalRecords)

result.show()

This gives you an idea of what is going on; however, it isn’t a convenient way to see the results. Let’s dig deeper by adding a Finisher annotator and transforming only our desired annotators’ output. It goes like this:

val finisher = new Finisher()
.setInputCols("normal", "ner")
.setOutputCols("finished_normal", "finished_ner")
val finished_result = finisher.transform(result.select("normal", "ner"))
finished_result.take(1).foreach(println(_))

This is better, but not yet optimal. It still requires extra work to eyeball the tokens and match the results. But Spark-NLP has something useful for dealing with small samples and predictions: LightPipelines.

Running small predictions and samples with LightPipelines

LightPipelines are Spark NLP specific pipelines, equivalent to a Spark ML Pipeline but meant to deal with smaller amounts of data. They’re useful when working with small datasets, debugging results, or running training or prediction from an API that serves one-off requests.

LightPipelines are easy to create and save you from dealing with Spark Datasets. They are also very fast and, while they work only on the driver node, they still execute computation in parallel. Let’s take a look:

val lightPipeline = new LightPipeline(model)
val annotations = lightPipeline.annotate(
"The findings of many authors show that reduced fetal growth is followed by increased mortality from cardiovascular disease in adult life. They are further evidence that cardiovascular disease originates, among other risk factors, through programming of the bodies structure and metabolism during fetal and early post-natal life. Wrong maternal nutrition may have an important influence on programming."
)
annotations("normal").zip(annotations("ner")).foreach(println(_))

As you can see, LightPipelines are very fast and a convenient way of trying out a single sentence against your entire Pipeline model.

Conclusion

We hope this article gives you an insight into what can be done using Spark NLP, TensorFlow and a bit of creativity. The library comes out of the box with TensorFlow graphs for named entity recognition and assertion status detection. More will be added in the future.

Spark NLP keeps improving: we’ve already delivered 27 new releases since the beginning of 2018! The community is growing quickly as well. The most active discussions happen in the Spark NLP Slack community (you can join by sending a request by email) and on the Spark NLP GitHub page. We read both daily. Give it a go!