Spark NLP 101: LightPipeline

Veysel Kocaman
Nov 7 · 7 min read

Faster runtime inference with Spark NLP pipelines


This is the second article in a series in which we are going to write a separate article for each annotator in the Spark NLP library. You can find all the articles at this link.

This article builds on Introduction to Spark NLP: Foundations and Basic Components (Part-I). Please read that first if you want to learn more about Spark NLP and its underlying concepts.

In machine learning, it is common to run a sequence of algorithms to process and learn from data. This sequence is usually called a Pipeline.

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. That is, the data are passed through the fitted pipeline in order. Each stage’s transform() method updates the dataset and passes it to the next stage. With the help of Pipelines, we can ensure that training and test data go through identical feature processing steps.

Each annotator applied adds a new column to a data frame that is fed into the pipeline

Now let’s see how this can be done in Spark NLP using Annotators and Transformers. Assume that we have the following steps that need to be applied one by one on a data frame.

  • Split text into sentences
  • Tokenize
  • Normalize
  • Get word embeddings

And here is how we code this pipeline up in Spark NLP.

from pyspark.ml import Pipeline

I am going to load a dataset and then feed it into this pipeline to let you see how it works.

sample DataFrame (5452 rows)

After running the pipeline above, we get a trained pipeline model. Let's transform the entire DataFrame.

result = pipelineModel.transform(df)
result.show()

It took 501 ms to transform the first 20 rows; transforming the entire data frame takes around 11 seconds.

%%time

It looks good. But what if we want to save this pipeline to disk and then deploy it to get runtime transformations on a given line of text (one row)?

from pyspark.sql import Row

Transforming a single line of short text took 515 ms, nearly as long as transforming the first 20 rows! So this is not good. Actually, this is what happens when you try to use distributed processing on small data. Distributed processing and cluster computing are mainly useful for processing large amounts of data (aka big data). Using Spark for small data is like bringing an ax to a fistfight :-)

That said, due to its inner mechanisms and optimized architecture, Spark can still be useful for moderately sized data that could be handled on a single machine. But when it comes to processing just a few lines of text, it is not recommended unless you use Spark NLP.

Let us make an analogy to help you understand this. Spark is like a locomotive racing a bicycle: the bike will win if the load is light, since it is quicker to accelerate and more agile; with a heavy load, the locomotive might take a while to get up to speed, but it will be faster in the end.

Spark is like a locomotive racing a bicycle

So, what are we going to do if we want to have a faster inference time? Here comes LightPipeline.

LightPipeline

LightPipelines are Spark NLP-specific pipelines, equivalent to Spark ML Pipelines but meant to deal with smaller amounts of data. They are useful when working with small datasets, debugging results, or running training or prediction from an API that serves one-off requests.

Spark NLP LightPipelines are Spark ML pipelines converted into single-machine, multi-threaded tasks that become more than 10x faster for smaller amounts of data (small is relative, but roughly 50k sentences is a good upper bound). To use one, we simply plug in a fitted (trained) pipeline and then annotate plain text. We don't even need to convert the input text to a DataFrame in order to feed it into a pipeline that accepts DataFrame input in the first place. This feature is quite useful when it comes to getting a prediction for a few lines of text from a trained ML model.

from sparknlp.base import LightPipeline

Here are the available methods in LightPipeline (such as annotate(), fullAnnotate(), and transform()). As you can see, we can also use a list of strings as input text.

LightPipelines are easy to create and also save you from dealing with Spark Datasets. They are also very fast and, although they run only on the driver node, they execute the computation in parallel. Let's see how this applies to our case described above:

from sparknlp.base import LightPipeline

Now it takes 28 ms! Nearly 20x faster than using Spark ML Pipeline.

As you can see, annotate() returns only the result attributes. Since the embedding array is stored under the embeddings attribute of the WordEmbeddingsModel annotator, we set parse_embeddings=True to parse the embedding array as well. Otherwise, we would only get the tokens attribute from the embeddings output. For more information about the attributes mentioned, see here.

If we want to retrieve the full annotation content, we can use fullAnnotate(), which returns a list of dictionaries holding the entire annotations.

result = lightModel.fullAnnotate("How did serfdom develop in and then leave Russia ?")

fullAnnotate() returns the content and metadata as Annotation objects. According to the documentation, the Annotation type has the following attributes:

annotatorType: String, 
begin: Int,
end: Int,
result: String, (this is what annotate returns)
metadata: Map[String, String],
embeddings: Array[Float]

So, if we want to get the beginning and end of any sentence, we can just write:

result[0]['sentences'][0].begin
>> 0

You can even get metadata for each token with respect to embeddings.

result[0]['embeddings'][2].metadata
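To make the shape of the fullAnnotate() output concrete without a running Spark session, here is a plain-Python mock (the Annotation fields mirror the list above; the sample values are illustrative, not real model output):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Annotation:
    # Mirrors the Spark NLP Annotation attributes listed above
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: List[float] = field(default_factory=list)

# fullAnnotate() returns a list (one entry per input text) of dicts keyed
# by output column name, each holding a list of Annotation objects.
sentence = "How did serfdom develop in and then leave Russia ?"
result = [{
    "sentences": [Annotation("document", 0, len(sentence) - 1, sentence)],
    "token": [
        Annotation("token", 0, 2, "How", {"sentence": "0"}),
        Annotation("token", 4, 6, "did", {"sentence": "0"}),
    ],
}]

# Same access pattern as in the article:
print(result[0]["sentences"][0].begin)   # -> 0
print(result[0]["token"][1].result)      # -> 'did'
```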

Unfortunately, we cannot get anything from non-Spark NLP annotators via LightPipeline. That is, when we use a Spark ML feature like word2vec inside a pipeline along with Spark NLP annotators and then use a LightPipeline, annotate() only returns the results from the Spark NLP annotations, as no result field comes out of Spark ML models. So LightPipeline will not return anything from non-Spark NLP stages, at least for now. We plan to write a Spark NLP wrapper for all the ML models in Spark ML soon. Then we will be able to use LightPipeline for machine learning use cases in which we train a model in Spark NLP and then deploy it to get faster runtime predictions.

Conclusion

Spark NLP LightPipelines are Spark ML pipelines converted into single-machine, multi-threaded tasks that become more than 10x faster for smaller amounts of data. In this article, we talked about how you can convert your Spark pipelines into Spark NLP LightPipelines to get a faster response for small data. This is one of the coolest features of Spark NLP. You get to enjoy the power of Spark for processing and training, and then get faster inference through LightPipelines, as if everything were running on a single machine.

We hope that you have already read the previous articles on our official Medium page and started to play with Spark NLP. Here are the links to the other articles. Don't forget to follow our page and stay tuned!

Introduction to Spark NLP: Foundations and Basic Components (Part-I)

Introduction to Spark NLP: Installation and Getting Started (Part-II)

Spark NLP 101: Document Assembler

** These articles are also being published on John Snow Labs' official blog page.

spark-nlp

Natural Language Understanding Library for Apache Spark.

Veysel Kocaman

Written by

Senior Data Scientist and PhD Researcher in ML

