Spark NLP: Unlocking the Power of Question Answering

Discover the incredible capabilities of Spark NLP for advanced Question Answering models.

Gursev Pirge
John Snow Labs
11 min read · May 24, 2023

Photo by National Cancer Institute on Unsplash

TL;DR: Question answering is the task of building systems that retrieve information from a given knowledge source or corpus and generate an appropriate response to questions that humans ask in natural language. Spark NLP provides several pre-trained models that can be used for question answering.

Question answering (QA) in natural language processing (NLP) refers to the task of automatically answering questions posed in natural language, such as English or Spanish. The goal of QA is to provide accurate and relevant answers to questions that are asked by users or generated automatically, using various sources of information such as text documents, knowledge bases, or structured data.

The main goal of QA is to create intelligent machines that can answer questions in the same way that humans do, by retrieving information from a given knowledge source or corpus and generating an appropriate response.

QA systems can be evaluated with metrics such as accuracy, precision, and recall, using benchmark datasets such as SQuAD (Stanford Question Answering Dataset) or TREC (Text REtrieval Conference). Results on SQuAD are usually reported as exact match (EM) and token-level F1.
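To make these metrics concrete, here is a minimal sketch of SQuAD-style exact match and token-level F1 in plain Python. The function names and the normalization steps (lowercasing, dropping punctuation and English articles) are assumptions modeled on the common SQuAD evaluation convention, not part of Spark NLP.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Paris", "paris.")
f1 = f1_score("approximately 100 billion neurons", "100 billion")
print(em, round(f1, 3))
```

A prediction that contains the reference answer plus extra words gets no credit under exact match but partial credit under F1, which is why the two are usually reported together.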

QA systems have many practical applications, such as customer support chatbots, search engines, and virtual assistants (Siri, Alexa etc.), where they can help users quickly find the information they need without having to navigate through large amounts of data manually. QA is also used in fields such as education, healthcare, and finance, where it can help improve decision-making and information retrieval.

In this article, we will discuss the MultiDocumentAssembler annotator and Transformers-based Question Answering annotators of Spark NLP. We also have another post for TAPAS (Table Parser) — Question Answering Systems, which are specifically designed to handle natural language questions over tabular data.

Every Spark NLP pipeline starts by transforming raw data into the Document annotation type. DocumentAssembler and MultiDocumentAssembler are the annotators that prepare data in a format that Spark NLP can process. The difference between them is that MultiDocumentAssembler accepts several input columns at once, such as a question and its context, which is useful when the texts must be processed together.

MultiDocumentAssembler annotator processes the input documents and outputs a DataFrame, where each row corresponds to a single document and contains both the raw text and metadata fields as separate columns. This DataFrame can then be used as input to other Spark NLP components.

Transformers are a type of neural network architecture used in NLP that has revolutionized the field. They were first introduced in the paper “Attention is All You Need” by Vaswani et al. in 2017. The paper introduces the Transformer, a neural network architecture that relies on a self-attention mechanism to process input sequences, such as sentences or paragraphs of text. Unlike traditional recurrent neural networks, Transformers can process sequences in parallel, making them faster and more efficient.

Transformers are a natural fit for answering natural language questions because they can model complex relationships between words and phrases within a text sequence.

Natural language questions are questions that are formulated using the same language that humans use in everyday conversations, both spoken and written. They reflect the way people naturally communicate with each other.

Examples of natural language questions include open-ended questions that ask for information or opinions, such as "What do you think about this idea?" or "How was your day today?" They can also include closed-ended questions that require a specific answer, such as "Did you like the movie?" or "Are you feeling better?"

The ability to understand and generate natural language questions is an important aspect of natural language processing, as it requires machines to be able to interpret and produce language that is similar to the way that humans use language.

In particular, transformers are able to identify the most relevant parts of a text sequence for answering a given question, by attending to the relevant parts of the text through self-attention.

The self-attention mechanism in transformers allows them to weigh the importance of different parts of the input sequence when making predictions. This means that when answering a question, the transformer can focus on the most relevant parts of the text, taking into account the entire sequence instead of just a small window of words.
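The weighting idea can be illustrated with a small sketch of scaled dot-product attention in plain Python. This is a simplified illustration: the token vectors are made-up numbers, and the queries, keys, and values are set to the same vectors, whereas a real Transformer derives them through learned projections and uses several attention heads.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim)
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors (hypothetical numbers); here Q = K = V for simplicity
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how much one token attends to every token in the sequence, including itself, so every position can draw on the whole input rather than a fixed window.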

QA with Transformers is an NLP approach for answering questions using Transformer-based deep learning models. In the context of question answering, Transformer-based models are trained on large amounts of text data, such as Wikipedia or news articles, to learn to extract relevant information from a given passage of text and generate an answer to a user’s question. The models achieve this by using a combination of self-attention mechanisms and multi-layer neural networks to encode the input text, attend to relevant information, and generate an output answer.

In this post, you will learn how to use Spark NLP to perform question answering using transformer-based models.

Spark NLP has multiple approaches for question answering from a text. In this article, we will discuss:

  1. Using MultiDocumentAssembler, an annotator that prepares data into a format that is processable by Spark NLP.
  2. Using various transformer-based models for question answering.
  3. Answering questions over tables with TAPAS (covered in a separate post).

Let us start with a short Spark NLP introduction and then discuss the details of question answering techniques with some solid results.

Introduction to Spark NLP

Spark NLP is an open-source library maintained by John Snow Labs. It is built on top of Apache Spark and Spark ML and provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment.

Since its first release in July 2017, Spark NLP has grown into a full NLP tool, providing:

  • A single unified solution for all your NLP needs
  • Transfer learning and implementing the latest and greatest SOTA algorithms and models in NLP research
  • The most widely used NLP library in industry (5 years in a row)
  • The most scalable, accurate and fastest library in NLP history

Spark NLP comes with 17,800+ pretrained pipelines and models in 250+ languages. It supports most NLP tasks and provides modules that can be used seamlessly in a cluster.

Spark NLP processes the data using Pipelines, a structure that contains all the steps to be run on the input data:

Figure: Spark NLP pipelines

Each step contains an annotator that performs a specific task such as tokenization, normalization, or dependency parsing. Each annotator takes one or more input annotations and outputs a new annotation.

An annotator in Spark NLP is a component that performs a specific NLP task on a text document and adds annotations to it. An annotator takes an input text document and produces an output document with additional metadata, which can be used for further processing or analysis. For example, a named entity recognizer annotator might identify and tag entities such as people, organizations, and locations in a text document, while a sentiment analysis annotator might classify the sentiment of the text as positive, negative, or neutral.

Setup

To install Spark NLP in Python, simply use your favorite package manager (conda, pip, etc.). For example:

pip install spark-nlp
pip install pyspark

For other installation options for different environments and machines, please check the official documentation.

Then, simply import the library and start a Spark session:

import sparknlp

# Start Spark Session
spark = sparknlp.start()

MultiDocumentAssembler

MultiDocumentAssembler is the entry point of the question answering pipeline, and it prepares data in a format that Spark NLP can process. This annotator can read either a String column or an Array[String].

MultiDocumentAssembler can be used as a preprocessing step before applying other Spark NLP annotators; in this case, before transformer-based question answering annotators.

Question Answering with Transformers

Question Answering involves finding an answer to a question based on a given context. With the advent of deep learning and Transformer-based models, the performance of QA systems has improved significantly.

Transformer-based models such as BERT, ALBERT, and RoBERTa use self-attention mechanisms to process input sequences and have achieved state-of-the-art results in several NLP tasks, including QA.

During inference, the model takes the context and question as input and generates a probability distribution over all possible answer spans in the context. The answer span with the highest probability is returned as the predicted answer.
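The span selection step can be sketched in plain Python. The sketch assumes the common extractive QA setup in which the model emits one start score and one end score per context token and a span is scored by their sum. The tokens, the scores, and the best_span helper are hypothetical, and Spark NLP's internal implementation may differ.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end, score) of the highest-scoring valid span.

    A span is valid when end >= start and it covers at most
    max_answer_len tokens.
    """
    best = (0, 0, float("-inf"))
    for start, s_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Hypothetical context tokens and per-token scores for the question
# "How many neurons are there in the human brain?"
tokens = ["It", "is", "made", "up", "of", "approximately",
          "100", "billion", "neurons", "."]
start_scores = [0.1, 0.0, 0.2, 0.1, 0.3, 2.5, 3.1, 0.4, 0.2, 0.0]
end_scores = [0.0, 0.1, 0.1, 0.0, 0.2, 0.3, 0.9, 3.4, 2.0, 0.1]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

Restricting the search to spans where the end does not precede the start, and capping the span length, keeps the model from returning impossible or overly long answers.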

Spark NLP provides several pre-trained Transformer-based models for question answering and they use the following transformers:

· AlbertForQuestionAnswering,

· BertForQuestionAnswering,

· DeBertaForQuestionAnswering,

· DistilBertForQuestionAnswering,

· LongformerForQuestionAnswering,

· RoBertaForQuestionAnswering,

· XlmRoBertaForQuestionAnswering.

Let us work on some examples with BertForQuestionAnswering. The other annotators listed above work in much the same way; please check their documentation for more detailed information.

As discussed above, MultiDocumentAssembler will be the first stage in the pipeline to transform the text to document. The next step is to use the BertForQuestionAnswering transformer to produce the answer to the question.

Let us work on more complex text and questions in addition to one simple case. First define the sample texts and questions:

texts = [
    # Adjacent string literals are concatenated, so each passage is one string
    "The human brain is an incredibly complex organ that is responsible for "
    "controlling all of our bodily functions, as well as our thoughts, "
    "emotions, and behaviors. It is made up of approximately 100 billion "
    "neurons.",

    "One of the key features of the human brain is its ability to change and "
    "adapt in response to experiences, a process known as neuroplasticity. "
    "This means that the brain can reorganize itself, forming new connections "
    "between neurons and even generating new neurons in certain areas, "
    "based on the demands placed upon it. Neuroplasticity is what allows us "
    "to learn and develop new skills throughout our lives, and it is also "
    "what enables us to recover from injuries and diseases that damage the "
    "brain.",

    "Winston Churchill, one of the most iconic figures in modern history, "
    "was born in the city of Oxford, England on November 30, 1874. "
    "His birthplace was a grand home known as Blenheim Palace, which is "
    "located in the village of Woodstock, just a few miles outside of Oxford.",

    "France is a country located in Western Europe. Its capital is Paris."
]

question = [
"How many neurons are there in the human brain?",
"What is neuroplasticity?",
"What is the birthplace of Winston Churchill?",
"What is the capital of France?"
]

Create a Spark dataframe:

data = spark.createDataFrame(
[
[texts[0], question[0]],
[texts[1], question[1]],
[texts[2], question[2]],
[texts[3], question[3]],
]
).toDF("context","question")

data.show(truncate=100)

Now, define the question answering pipeline and fit the data:

# Import the required modules and classes
from sparknlp.base import MultiDocumentAssembler, LightPipeline
from pyspark.ml import Pipeline
from sparknlp.annotator import BertForQuestionAnswering

# Step 1: Transforms raw texts to `document` annotation
document_assembler = MultiDocumentAssembler()\
.setInputCols("question", "context")\
.setOutputCols("document_question", "document_context")

# Step 2: Get the answers
question_answering = BertForQuestionAnswering.pretrained("bert_base_cased_qa_squad2") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")

#Define pipeline
pipeline = Pipeline(stages=[document_assembler, question_answering])

empty_data = spark.createDataFrame([["",""]]).toDF("question", "context")

# Fit the dataframe to get the model
model = pipeline.fit(empty_data)

Transform the dataframe of texts and questions to get predictions (answers):

model.transform(data)\
.selectExpr("document_question.result as Question", "answer.result as Short_Answer")\
.show(truncate=False)

Light Pipeline

Let’s also use LightPipeline here to get the answer. LightPipeline is a Spark NLP specific Pipeline class equivalent to the Spark ML Pipeline. The difference is that its execution does not follow Spark principles. Instead, it computes everything locally (but in parallel) in order to achieve fast results when dealing with small amounts of data.

context = (
    "The human brain is an incredibly complex organ that is responsible "
    "for controlling all of our bodily functions, as well as our thoughts, "
    "emotions, and behaviors. It is made up of approximately 100 billion "
    "neurons."
)

question = "How many neurons are there in the human brain?"

light_model = LightPipeline(model)

light_result = light_model.annotate([question],[context])

light_result

Or just the answer to the question:

light_result[0]["answer"]

One-liner alternative

In October 2022, John Snow Labs released the open-source johnsnowlabs library that contains all the company products, open-source and licensed, under one common library. This simplified the workflow, especially for users working with more than one of the libraries (e.g., Spark NLP + Healthcare NLP). This new library is a wrapper around all of John Snow Labs’ libraries, and can be installed with pip:

pip install johnsnowlabs

Please check the official documentation for more examples and usage of this library. To run Question Answering with one line of code, we can simply run:

# Import the NLP module which contains Spark NLP and NLU libraries
from johnsnowlabs import nlp

nlp.load("en.answer_question.squadv2.bert.base_cased.by_deepset").predict("""How many neurons are there in the human brain?|||"The human brain is an incredibly complex organ that is responsible for controlling all of our bodily functions, as well as our thoughts, emotions, and behaviors. It is made up of approximately 100 billion neurons.""")

After using the one-liner model, the result shows the answer to the question.

The one-liner is based on default models for each NLP task. Depending on your requirements, you may want to use the one-liner for simplicity or customize the pipeline to choose specific models that fit your needs.

NOTE: when using only the johnsnowlabs library, make sure you initialize the Spark session with the configuration you have available. Since some of the libraries are licensed, you may need to set the path to your license file. If you are only using the open-source libraries, you can start the session with spark = nlp.start(nlp=False). By default, the start function enables the licensed Healthcare NLP library with nlp=True, but we can set that to False and still use all the resources of the open-source libraries such as Spark NLP, Spark NLP Display, and NLU.

For additional information, please consult the following references.

Conclusion

In this article, we tried to get you familiar with the basics of question answering. Question answering with Transformer-based models is a highly effective approach for answering natural language questions based on a given context. With their ability to capture complex relationships between words and their context, transformers have become the go-to model for many NLP tasks, including question answering.

Spark NLP also provides a variety of pre-trained models, built on top of state-of-the-art transformer models such as BERT and RoBERTa, which provide accurate and efficient answers to complex questions.

Spark NLP’s question answering solutions can be used in a wide range of applications, including chatbots, virtual assistants, and search engines. They can also be used in industries like healthcare, where accurate and efficient answers to complex questions are essential.
