Get started with Machine Learning in Java using Spark NLP
It is increasingly common for software developers to require Machine Learning technology in their applications. While Python is the de facto standard environment for Machine Learning, this might not be an ideal fit when building web applications or enterprise software. Learn below how to train and run Machine Learning models in Java, using the Spark NLP open source library.
At Parito, we are building AI-driven applications that apply machine learning algorithms to text data. For the back-end we are using Java and Spring Boot, which provides a robust and secure framework for building REST web services. While looking at options for the Machine Learning component, we came across Spark NLP, an open source library for Natural Language Processing based around the Machine Learning library (MLlib) in Apache Spark.
The claim was that this would be easy to use in Java, as it’s Scala-based. The reality is that it took a bit of fiddling to get things going, so we thought it would be worth sharing our findings, as good Java examples seem to be a bit scarce.
Note that this is purely a technical article on getting Spark NLP running within a Java/Spring Boot REST service. In further articles I will explain more about what Spark NLP has to offer functionally, and evaluate how effective it is to integrate it in this fashion.
For a fully working example, check here: https://github.com/willprice76/sparknlp-examples/tree/develop/sparknlp-java-example
Basic setup — Dependencies
Firstly, it’s important to understand the underlying dependencies of Spark NLP. At the time of writing, the latest version was built on Apache Spark 2.4.4, and uses Scala 2.11 (the Scala version is included in the name of the artifact).
I will show how to get a REST API working with Spring Boot Web, so you need to add the following dependencies:
- Spring Boot
- Spark MLlib
- Spark NLP
You will end up with a basic pom.xml something like this:
To avoid SLF4J conflicts you will need to exclude log4j from the Spark MLlib dependency (or you can just use log4j if you prefer).
Now we can add the Spring Boot application class:
… and a controller with a test hello world method:
Now compile and run your application. You should be able to test the controller using a GET request to the URL http://localhost:8080/hello and get a response with the text “Hello world”.
Initialize Spark NLP and download a pre-trained pipeline
Now we are going to check that we can run Spark NLP. Update your controller as follows:
All we are doing here is initializing Spark NLP, starting a Spark Session in the constructor, and downloading a pre-trained pipeline for sentiment analysis (more explanation to follow).
Now compile and run the application to check everything is working OK.
If you get an exception like this:
…Constructor threw exception; nested exception is java.lang.IllegalArgumentException: Unsupported class file major version 55
…you are not running Java 8 — check your IDE run/debug configuration, as this can be different to what’s specified in your project pom.
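The "major version 55" in that message is the class file format version, which corresponds to Java 11 (Java 8 produces major version 52). If you want the mismatch to be obvious at startup, a small check along these lines may help. This is only a sketch: the class `JavaVersionCheck` and its methods are hypothetical names made up for this example and are not part of Spark or Spark NLP.

```java
// Hypothetical helper for illustration; not part of Spark or Spark NLP.
final class JavaVersionCheck {

    // Class file major versions are offset by 44 from the Java release:
    // 52 -> Java 8, 55 -> Java 11.
    static int javaReleaseForClassFileMajor(int major) {
        return major - 44;
    }

    // Parses the "java.specification.version" system property, which is
    // "1.8" on Java 8 and "9", "10", "11", ... on later releases.
    static int parseSpecificationVersion(String spec) {
        String release = spec.startsWith("1.") ? spec.substring(2) : spec;
        return Integer.parseInt(release);
    }

    static boolean isJava8(String spec) {
        return parseSpecificationVersion(spec) == 8;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        if (!isJava8(spec)) {
            System.err.println("Expected Java 8 but running on Java "
                    + parseSpecificationVersion(spec));
        }
    }
}
```

Calling `isJava8` with the runtime's specification version before the Spark session is created would let you fail fast with a clearer message.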
Pipelines are simply a number of processing stages which transform the data from the previous stage. Spark NLP caches the downloaded pipeline in the cached_pretrained folder in your home directory.
If you open up this folder you will see the sentiment analysis pipeline in a folder named something like analyze_sentiment_en_2.4.0_2.4_1580483464667. Within it you will find the pipeline stages as numbered subfolders within the stages folder. In this case we have 5 stages:
I won’t go into details, but the first four stages prepare the text data by breaking it into sentences, tokenizing it and correcting spelling mistakes. The final stage is a machine learning model which can infer the sentiment from the processed text data. It’s important to understand a bit about pipelines if you are going to work with Spark NLP, especially when you train your own models (coming later in this article).
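To make the idea of chained stages concrete, here is a toy sketch in plain Java. It is not how Spark NLP implements pipelines, and none of the names (`ToyPipeline`, `splitSentences`, `normalize`, `toySentiment`) exist in the library; the "model" is just a keyword lookup standing in for a trained stage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Toy illustration of the pipeline idea only; these are not Spark NLP APIs.
final class ToyPipeline {

    // Each stage receives the output of the previous stage.
    private final List<Function<List<String>, List<String>>> stages = new ArrayList<>();

    ToyPipeline addStage(Function<List<String>, List<String>> stage) {
        stages.add(stage);
        return this;
    }

    // Runs the stages in order, feeding each result into the next stage.
    List<String> run(List<String> input) {
        List<String> current = input;
        for (Function<List<String>, List<String>> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    // Preparation stage: split each document into sentences on '.', '!' or '?'.
    static List<String> splitSentences(List<String> docs) {
        List<String> out = new ArrayList<>();
        for (String doc : docs) {
            for (String sentence : doc.split("[.!?]+")) {
                String trimmed = sentence.trim();
                if (!trimmed.isEmpty()) {
                    out.add(trimmed);
                }
            }
        }
        return out;
    }

    // Preparation stage: lower-case every sentence.
    static List<String> normalize(List<String> sentences) {
        List<String> out = new ArrayList<>();
        for (String sentence : sentences) {
            out.add(sentence.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Final stage: a stand-in for the model, labelling by keyword.
    static List<String> toySentiment(List<String> sentences) {
        List<String> out = new ArrayList<>();
        for (String sentence : sentences) {
            boolean positive = sentence.contains("good") || sentence.contains("great");
            out.add(positive ? "positive" : "negative");
        }
        return out;
    }
}
```

A pipeline would be assembled with `new ToyPipeline().addStage(ToyPipeline::splitSentences).addStage(ToyPipeline::normalize).addStage(ToyPipeline::toySentiment)`. The order matters: the final stage relies on the lower-casing done by the stage before it.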
Generate some insights
We are now ready to start using machine learning to generate insights from data. This is sometimes also known as scoring or annotating.
We are going to use the pre-trained model for sentiment analysis which we downloaded in the previous step. This means you can provide text data as input, and the model will infer if the sentiment in the text is positive or negative. In a commercial context this could be useful if you are trying to automatically gauge overall customer satisfaction based on large volumes of data from, for example, product reviews, customer support incidents, or social media postings.
Add a score method to your controller as follows:
Here we simply use the annotate method on the pipeline we downloaded to infer the sentiments on an array of input strings. Getting the sentiment results out is a bit of a fiddle, mostly due to the conversion between Scala and Java objects. I am new to this, so if you have a neater way please let me know!
Run the application and use curl, Postman or some other tool to POST some data to your controller to try it out. For example, the command:
Gives a response, corresponding to the 2 input sentences, of:
Train your own model
So far so good, but what happens if the pre-trained model doesn’t always infer the insight you expect? This is often the case: pre-trained models are trained on data in a different context or domain from where you want to apply them, which gives them a particular bias. You probably want to train your own model, on your own data.
In order to do this, you will need to create your own pipeline, and you will need some data with known outcomes (in our case pre-labelled with sentiment; positive or negative).
We create a training pipeline using the same type of model as in the pre-trained pipeline. It’s not exactly the same pipeline, as for simplicity we skip the stages that break the text into sentences and do spell checking. Add the following method to your controller:
This simply defines the minimum pipeline we need to train this particular sentiment analysis model, ensuring that the name of the output column of each stage matches the input columns for subsequent stages.
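Mismatched column names are an easy mistake to make when wiring stages together by hand. The sketch below shows one way to check the wiring up front in plain Java. It is an illustration only: `ColumnWiringCheck` and `StageSpec` are hypothetical helpers, not Spark or Spark NLP classes, and the stage and column names in the usage note are assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Spark or Spark NLP.
final class ColumnWiringCheck {

    // Minimal description of a stage: a name, its input columns and its output column.
    static final class StageSpec {
        final String name;
        final List<String> inputCols;
        final String outputCol;

        StageSpec(String name, List<String> inputCols, String outputCol) {
            this.name = name;
            this.inputCols = inputCols;
            this.outputCol = outputCol;
        }
    }

    // Returns one message per input column that neither the source data nor an
    // earlier stage provides. An empty list means the wiring is consistent.
    static List<String> findMissingInputs(Set<String> sourceCols, List<StageSpec> stages) {
        Set<String> available = new HashSet<>(sourceCols);
        List<String> problems = new ArrayList<>();
        for (StageSpec stage : stages) {
            for (String inputCol : stage.inputCols) {
                if (!available.contains(inputCol)) {
                    problems.add(stage.name + " needs missing column '" + inputCol + "'");
                }
            }
            // The stage's output becomes available to all later stages.
            available.add(stage.outputCol);
        }
        return problems;
    }
}
```

For example, with a source column `text`, an assembler stage writing `document` and a tokenizer stage that mistakenly reads `doc`, the check would report that the tokenizer needs the missing column `doc`.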
Now add a new class to represent the input data (text + sentiment):
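A minimal sketch of such a class is shown here. The field names `text` and `sentiment` are assumptions taken from the description above; check the linked example repository for the exact version used there.

```java
// Sketch of the input type; field names are assumptions based on "text + sentiment".
// In a real project this would normally be a public class in its own file.
class TextData {

    private String text;
    private String sentiment;

    // No-argument constructor so that bean-based tools (for example a JSON
    // mapper) can create instances and populate them through the setters.
    public TextData() {
    }

    public TextData(String text, String sentiment) {
        this.text = text;
        this.sentiment = sentiment;
    }

    public String getText() {
        return text;
    }

    public void setText(String text) {
        this.text = text;
    }

    public String getSentiment() {
        return sentiment;
    }

    public void setSentiment(String sentiment) {
        this.sentiment = sentiment;
    }
}
```

Keeping this as a plain JavaBean with getters and setters means the same class can serve as the request body element and as a row of labelled training data.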
And create an endpoint to train the pipeline with a list of TextData elements:
The Pipeline.fit() method is where the training happens, returning a PipelineModel which can be used for scoring. We convert this into a LightPipeline as this is a more efficient way to score when working with smaller datasets on a single machine.
Finally we need to update the pom.xml with some additional dependencies, which help us work with data using Spark.
Normally you would need a significant amount (hundreds or thousands of rows) of labelled data to train a model properly, but just to check if things are working, we can do a simple POST request to this new endpoint:
And then test the (overwritten) scoring pipeline with a request to the /sentiment/score endpoint.
It’s important to understand that the Spark pipeline concept relies on a full data set when you do retraining, not increments. We have now overwritten the pre-trained pipeline with a very badly trained pipeline (only 2 rows of training data), so this example is not particularly useful except to walk through how to code up the training process.
If you want some decent-sized datasets for sentiment analysis there are plenty out there, but for the best quality results use data you have curated from the domain where you want to apply machine learning.
There’s a lot more to Spark NLP, including training and using deep learning models, which I plan to share in future articles. For now, you should be able to integrate the Spark NLP library into your own Java application, create and train simple Machine Learning pipelines, and use them to derive insights from text data.
It’s great that a Java application developer can get started with NLP-based Machine Learning without too much difficulty, but if you are considering Spark NLP for a production project, it’s important to first evaluate the impact of being tied to Java 8.
As we added more functionality to our application we came across more and more dependency and compatibility issues and had to downgrade Guava and Jackson versions among other libraries.
It’s also a pain to handle Scala objects in Java. Finally, remember that Spark is designed for heavy loads and distributed computing; it’s not really intended to train or score large datasets within a single lightweight microservice.
For these reasons it might be best to plan to isolate the training part of your application in a separate Scala-based service, with access to a Spark cluster for the heavy lifting.
Many thanks to the Spark NLP team at John Snow Labs for making their work open source and easy to integrate. If you want to know more, check the website, documentation and repo with pretrained models and pipelines. They also operate a Slack channel: spark-nlp.slack.com