Semantic Similarity with Spark NLP
With the explosion of text data in today’s digital age, understanding the meaning behind words is more important than ever. Natural Language Processing (NLP) experts are constantly developing new ways to process, analyze, and understand text data, and determining the similarity in meaning between different pieces of text is a crucial aspect for a variety of NLP applications. In this article, we will dive into the world of semantic similarity and explore how it can be used to improve NLP applications such as information retrieval, recommendation systems, text summarization, and question answering.
We will also delve into the DNN-based techniques, specifically the transformer-based methods available in Spark NLP, and show how they can be exposed as a REST API using Apache Spark, Scala, and the Play Framework. Additionally, we will explore the advantages of using Scala with Spark over PySpark, including performance, expressiveness, and compatibility with Java-based libraries.
This article includes a use case for movie recommendations using semantic similarity with GloVe and BERT embeddings in Spark NLP. The full code can be found in this GitHub repository. Feel free to download and use it as a reference or to build upon for your own projects.
Background: Semantic Similarity
With the sheer volume of text being generated these days, NLP has become a central concern for AI practitioners. Determining the similarity in meaning between different pieces of text, such as words, sentences, or documents, is crucial for a variety of NLP applications like information retrieval, text summarization, text classification, essay evaluation, machine translation, and question answering, among others.
There are several techniques to measure semantic similarity between texts; the best known are:
- Knowledge-based
- Corpus-based
- Deep Neural Network-based
- Hybrid
In this article, we will focus on the Deep NN-based technique, in particular the Transformer-based method.
Background: Spark and Spark NLP
Apache Spark is a distributed computing system that processes a huge volume of data at scale. It uses a cluster of computers to divide and process data in parallel, which allows it to handle very large datasets. Spark is commonly used for big data processing, machine learning, and other data-intensive tasks.
Spark NLP is a natural language understanding library built on top of Apache Spark that leverages Spark MLlib pipelines and allows you to run NLP models at scale, including state-of-the-art (SOTA) Transformers. This makes it a production-ready NLP platform: you can go from a simple PoC on a single driver node to a multi-node cluster and process large amounts of data in a matter of minutes.
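To give a feel for the API, here is a minimal Scala sketch that starts Spark NLP and runs a publicly available pretrained pipeline (the pipeline name explain_document_dl is just an illustrative choice, not the one used later in this article):

import com.johnsnowlabs.nlp.SparkNLP
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

object QuickStart extends App {
  // Start a Spark session with Spark NLP on the classpath
  val spark = SparkNLP.start()

  // Download (and cache) a pretrained pipeline from the Models Hub
  val pipeline = PretrainedPipeline("explain_document_dl", lang = "en")

  // Annotate a single piece of text; returns a Map of annotation types to values
  val annotations = pipeline.annotate("Semantic similarity with Spark NLP is straightforward.")
  println(annotations("token").mkString(", "))

  spark.stop()
}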
Background: Scala and Play Framework
Scala is a programming language designed to combine the best aspects of object-oriented and functional programming. It is known for its conciseness and expressiveness, and its static type system helps to avoid bugs in large and complex applications. Additionally, because it runs on the Java Virtual Machine (JVM), it has access to the vast ecosystem of Java libraries, while also being able to target JavaScript runtimes via Scala.js.
The Play Framework is a web application framework for building web applications using Java and Scala. Play emphasizes simplicity and ease of use, and it is based on a lightweight, stateless, web-friendly architecture that is designed to make it easy to develop web applications with minimal boilerplate code.
Real-world Use Cases
Semantic similarity is already solving a number of real-world problems; to name a few:
- Information Retrieval: A search engine can use it to match queries to relevant documents.
- Recommendation Systems: To recommend similar products or services to users.
- Research: In medical research, semantic similarity can be used to find similar cases in large medical databases, or to identify related concepts in scientific literature.
These are just a few examples of the many potential uses of semantic similarity. With the increasing amount of text and data available, the ability to find similarities in meaning is becoming more and more important across many different verticals such as Healthcare, Finance, and Legal domains where NLP solutions are highly valuable.
Advantages of Scala with Spark over Pyspark
There are clear benefits to using Spark with Scala compared to PySpark, particularly in terms of performance and expressiveness.
- Performance: Since Spark’s core engine is implemented in Scala, it can be more efficient when processing data using the native Scala API, as it doesn’t have to go through the overhead of the Py4j library which PySpark uses to connect Python with the Scala-based Spark Core engine.
- Expressiveness: Scala is a more expressive language than Python, which means you can do more with fewer lines of code. It is also statically typed, which gives better type safety and makes it easier to reason about the code.
- Java-based libraries compatibility: Scala runs on the JVM, giving access to all Java libraries, which opens up many options and tools to use with Spark that could be hard or even impossible to use from PySpark.
Keep in mind that PySpark is well suited to data exploration and prototyping. It is also a good choice if you already have an existing Python codebase that you want to integrate with Spark, or if your dev teams are more familiar with Python or with ML libraries such as scikit-learn, PyTorch, etc.
Building a Recommendation System
In this article, we are going to focus on the ranking stage of a recommendation system that uses a funnel-based approach. The first stage filters the data using a fast ML algorithm such as K-NN; then we rank the remaining candidates with a more sophisticated mechanism, in this case our semantic similarity engine.
Assumptions:
The movie dataset is the result of a preprocessing step, and the best candidates for recommendation were established using K-NN.
Implementation
To maximize performance and minimize latency, we are going to keep two Spark NLP pipelines in memory, so that they are loaded only once (at server start) and simply reused every time an API request comes in for inference.
To do this, we have to call some method during bootstrapping to load the required Spark NLP models. So, at the root of the application, we create a Module class that loads the models into memory.
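A minimal sketch of such a module, assuming a PipelineService trait implemented by PipelineServiceImpl (sketched just below); the exact wiring in the repository may differ:

import com.google.inject.AbstractModule

// Play picks this class up automatically because it is named Module and lives
// at the application root.
class Module extends AbstractModule {
  override def configure(): Unit = {
    // Eager singleton: pipelines are loaded from disk at server start,
    // not lazily on the first request
    bind(classOf[PipelineService]).to(classOf[PipelineServiceImpl]).asEagerSingleton()
  }
}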
PipelineServiceImpl must be annotated as a Singleton; this way the object is created once at start-up and loaded from disk, so that on every subsequent access it is already in memory.
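A minimal sketch of the service, assuming the pipelines were previously fitted and saved to an illustrative local path (the field and path names are assumptions, not the repository's exact code):

import javax.inject.Singleton
import org.apache.spark.ml.PipelineModel
import com.johnsnowlabs.nlp.SparkNLP

trait PipelineService {
  def embeddingsPipeline: PipelineModel
}

@Singleton
class PipelineServiceImpl extends PipelineService {
  // Created once at start-up by Guice; later accesses reuse the in-memory models
  private val spark = SparkNLP.start()

  println("Loading glove embeddings pipeline from disk")
  // Illustrative path; a second pipeline (e.g. BERT) can be loaded the same way
  override val embeddingsPipeline: PipelineModel =
    PipelineModel.load("models/embeddings_pipeline")
}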
The embeddings_pipeline is a pretrained pipeline model that we saved offline after experimentation. The pipeline has the following stages:
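A typical GloVe-based pipeline of this kind looks roughly as follows (stage and column names are the common Spark NLP defaults and are an assumption here, not necessarily the exact stages saved in the repository):

import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotator.{Tokenizer, WordEmbeddingsModel, SentenceEmbeddings}
import org.apache.spark.ml.Pipeline

// Turns raw text into a "document" annotation
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Splits the document into tokens
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// Looks up a pretrained GloVe vector for each token
val wordEmbeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

// Averages token vectors into a single sentence vector
val sentenceEmbeddings = new SentenceEmbeddings()
  .setInputCols("document", "embeddings")
  .setOutputCol("sentence_embeddings")
  .setPoolingStrategy("AVERAGE")

// Exposes the vectors as plain Spark ML vectors for the similarity math
val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("sentence_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val embeddingsPipeline = new Pipeline().setStages(Array(
  documentAssembler, tokenizer, wordEmbeddings, sentenceEmbeddings, embeddingsFinisher
))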
For these code snippets, we are using WordEmbeddings, but you can use more sophisticated embeddings like BERT or DistilBERT, among many others available in the Spark NLP Models Hub. Check the complete code that uses BERT and DistilBERT here.
Now, to expose the endpoint that will process the recommendation requests, we will use a RecommendationsController.
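A sketch of what such a controller could look like, assuming the query text arrives as a text query parameter and that RecommendationService exposes a recommend method returning (title, score) pairs (both are assumptions, not the repository's exact API):

package controllers

import javax.inject.{Inject, Singleton}
import play.api.libs.json.Json
import play.api.mvc.{AbstractController, Action, AnyContent, ControllerComponents}

@Singleton
class RecommendationsController @Inject() (
    cc: ControllerComponents,
    recommendationService: RecommendationService // assumed visible from this package
) extends AbstractController(cc) {

  // Handles GET /movieRecommendations?text=...
  def movieRecommendations(): Action[AnyContent] = Action { request =>
    request.getQueryString("text") match {
      case Some(text) =>
        // recommend() is assumed to return the top 5 (title, score) pairs
        val recommendations = recommendationService.recommend(text)
        Ok(Json.toJson(recommendations.map { case (title, score) =>
          Json.obj("title" -> title, "score" -> score)
        }))
      case None =>
        BadRequest(Json.obj("error" -> "Missing 'text' query parameter"))
    }
  }
}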
Remember to add the route to the routes config file:
GET /movieRecommendations controllers.RecommendationsController.movieRecommendations()
RecommendationService encapsulates all the complexity of recommending 5 movies for a given text, which basically includes (see the sketch after this list):
- Computing Similarity
- Ranking Score
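At the heart of both steps is a similarity score between the query embedding and each candidate movie's embedding. Below is a minimal sketch using cosine similarity, the usual choice for semantic similarity; the repository's exact ranking logic may differ:

// Cosine similarity between two sentence-embedding vectors
def cosineSimilarity(a: Array[Float], b: Array[Float]): Double = {
  require(a.length == b.length, "Vectors must have the same dimension")
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Rank candidate movies by similarity to the query and keep the top 5
def topRecommendations(
    query: Array[Float],
    candidates: Seq[(String, Array[Float])] // (title, embedding) pairs
): Seq[(String, Double)] =
  candidates
    .map { case (title, emb) => title -> cosineSimilarity(query, emb) }
    .sortBy { case (_, score) => -score }
    .take(5)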
Now, if we start the application with sbt runProd, we can see in the service's startup output that all models are loaded into memory before the endpoint is made available.
[warn] o.a.h.u.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Loading glove embeddings pipeline from disk
2023-01-15 09:40:38.227309: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Loading pretrained glove embeddings pipeline from disk
[info] play.api.Play - Application started (Prod) (no global state)
[info] p.c.s.AkkaHttpServer - Listening for HTTP on /0:0:0:0:0:0:0:0:9000
With the Akka HTTP server up, we can finally make a request to this endpoint and get the movie recommendations. We used Postman to make the requests.
First, let’s see the results with the text “a story involving a criminal person”
As we can see, GloVe has good results, and even seems a little better than DistilBERT, but BERT has the best results, as expected.
Let’s check the behavior with another text: “a beautiful girl”
The results are quite intriguing. GloVe has fair results, with one result that mentions a woman but does not mention her appearance; it even shares one result that BERT also inferred. DistilBERT has one result that does not mention the appearance of a woman. BERT also has fair results, but at first glance it does not look as good as GloVe. However, upon further examination, BERT’s scores are lower than GloVe’s, which suggests that if we added an additional filter on the scores, such as a threshold of 0.56, BERT would give the more reliable results.
Do you want to know more?
- Check the example notebooks in the Spark NLP Workshop repository, available here
- Visit John Snow Labs and Spark NLP Technical Documentation websites
- Follow us on Medium: Spark NLP and Veysel Kocaman
- Write to support@johnsnowlabs.com for any additional requests you may have