Spark NLP: Installation on Mac and Linux (Part-II)

Veysel Kocaman
Oct 8, 2019 · 9 min read

This is the second article in a series of blog posts to help Data Scientists and NLP practitioners learn the basics of the Spark NLP library from scratch and easily integrate it into their workflows.



In our first article, we gave a nice intro to Spark NLP and its basic components and concepts. If you haven’t read the first part yet, please read it first.

Spark NLP is an open-source natural language processing library, built on top of Apache Spark and Spark ML. It is written in Scala and includes Scala and Python APIs for use from Spark. It has no dependency on any other NLP or ML library. For those who haven’t worked with Apache Spark itself, the installation part can be a little tiresome and tricky. Let’s see how you can install Spark NLP in different environments. Most of the steps discussed here are already in the official GitHub repo and documentation.


1. Installing Spark and Java

Spark NLP is built on top of Apache Spark 2.4.3, and you should have at least this version installed. If you have already installed Spark, please check its version. If all looks fine, you can skip this section.

If you cannot upgrade your existing Apache Spark installation and still want to try Spark NLP, there are some previous versions of Spark NLP that are compatible with Apache Spark 2.3.x at most.

Assuming that you haven’t installed Apache Spark yet, let’s start with the Java installation first. There are many ways to install Java on your computer, and the most important thing is to pay attention to its version.

The latest version of Java at the time of writing this article is Java 12, and Apache Spark does not officially support Java 12! So do not simply use Homebrew to install Java, as it will install the latest version by default, and that causes many issues. And when we say Java, we mean the JDK (Java Development Kit), not just the Java runtime.

We suggest that you install JDK 8. To install it, go to the official website and, under “Java SE Development Kit 8u191”, choose the package that works for your OS, then install it locally.

Available JDK packages for Java SE 8 on the download page

Actually, if you are using a Mac or Linux, you can still use Homebrew to install JDK 8. Here are the steps.

First of all, we need to tap a brew repo. Execute the following command and it will add more repositories to brew.
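The AdoptOpenJDK tap is a good option here:

```bash
brew tap adoptopenjdk/openjdk
```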

After adding tap, let’s install OpenJDK using brew.
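At the time of writing, the JDK 8 cask is named adoptopenjdk8:

```bash
brew cask install adoptopenjdk8
```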

To install the JDK on Debian, Ubuntu, etc.:
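```bash
sudo apt-get update
sudo apt-get install openjdk-8-jdk
```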

To install the JDK on Fedora, Oracle Linux, Red Hat Enterprise Linux, etc.:
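```bash
sudo yum install java-1.8.0-openjdk
```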

So, now that we have installed JDK 8, we can check the installation with the following command.
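```bash
java -version
```

You should see a version string starting with 1.8.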

Now Java is installed properly. Let’s check the Java path.
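```bash
echo $JAVA_HOME
```

On a Mac you can also ask the system directly:

```bash
/usr/libexec/java_home
```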

You can also see which versions of Java are already installed.
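On macOS, for example:

```bash
/usr/libexec/java_home -V
```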

Here you see two versions. To set your Java path, you should edit your path settings. You’ll need to add the following line to your .bash_profile (or just type it in the terminal and then source the file) to let the other apps know which version of Java should be used.
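```bash
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
```

(On Linux, point JAVA_HOME at your JDK 8 directory instead, e.g. /usr/lib/jvm/java-8-openjdk-amd64 on Debian-based systems.)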

Ok, now let’s install Apache Spark, again with Homebrew.
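```bash
brew install apache-spark
```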

Or you can just use pip or conda for that:
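Pinning the Spark version that Spark NLP expects is a good idea:

```bash
pip install pyspark==2.4.3
```

or, with the conda-forge channel:

```bash
conda install -c conda-forge pyspark=2.4.3
```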

If you are not able to use Homebrew, you need to download Spark manually; follow the steps below. Go to the Apache Spark website (link below), choose a Spark release, package type, and download type, and then download the latest tgz file.

http://spark.apache.org/downloads.html

Extract the archive into your home directory using the following command (the file name below assumes the Spark 2.4.3 / Hadoop 2.7 build).
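```bash
tar -xvzf spark-2.4.3-bin-hadoop2.7.tgz
```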

Next, we will edit our .bash_profile (macOS) so we can open a Spark notebook from any directory.

Don’t remove anything in your .bash_profile. Only add the following.
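Assuming the same folder name as above:

```bash
export SPARK_HOME=~/spark-2.4.3-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
```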

If you’re going to run Spark within Jupyter Notebook, you should also add these lines.
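```bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
```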

After saving and exiting, run this.
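```bash
source ~/.bash_profile
```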

Now you can just type “pyspark” in your terminal. If you did not set the Jupyter variables above, this will drop you into the PySpark shell; if you did, and Jupyter is already installed, it will open up a new Jupyter Notebook in your browser.

Here is some sample Python code with which you can test your Spark environment.
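A minimal sanity check along these lines is enough; it just builds a tiny DataFrame and prints it:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder \
    .appName("spark-test") \
    .master("local[*]") \
    .getOrCreate()

print(spark.version)  # should print 2.4.3 (or your installed version)

# Create a tiny DataFrame and show it to confirm Spark works end to end
df = spark.createDataFrame([("Hello",), ("Spark",)], ["word"])
df.show()
```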

(If you get an error saying “pyspark module not found”, just run “pip install pyspark”.)

In order to install Spark on Windows, you can just follow the steps detailed at this link.


2. Installing Spark NLP

Python

It’s as easy as follows:
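```bash
pip install spark-nlp==2.2.2
```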

or with conda
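At the time of writing, the package is published on the johnsnowlabs channel:

```bash
conda install -c johnsnowlabs spark-nlp
```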

The easiest way to get started is to run the following code in your favorite IDE.
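```python
import sparknlp

# Starts a SparkSession with the Spark NLP package attached
spark = sparknlp.start()

print("Spark NLP version:", sparknlp.version())
print("Apache Spark version:", spark.version)
```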

It will basically start a Spark session with Spark NLP support. After a few seconds, you should see the Spark NLP and Apache Spark versions printed in your notebook.

With these lines of code, you have successfully started a Spark Session and are ready to use Spark NLP.

If you want to start your Jupyter Notebook with pyspark from your terminal without installing Spark NLP, you can also activate Spark NLP like this:
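```bash
pyspark --packages JohnSnowLabs:spark-nlp:2.2.2
```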

Let’s see what’s going on under the hood when we run sparknlp.start(). This command basically creates a SparkSession with the necessary Spark NLP packages. So, instead of sparknlp.start(), or if you need more fine-tuning with some custom parameters, you can start the SparkSession in your Python program manually. Here is a sample that shows how to initiate a Spark session with custom memory and serializer max buffer size.
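The memory and buffer values below are just example settings; tune them for your machine:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "6G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "600M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.2.2") \
    .getOrCreate()
```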

Then you can test your installation with the following code block.
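For example, you can load a small pre-trained pipeline and annotate a sentence (explain_document_dl is one of the pre-trained pipelines; the first run downloads it):

```python
from sparknlp.pretrained import PretrainedPipeline

# Downloads the pipeline on first use, then loads it from the local cache
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

result = pipeline.annotate("Spark NLP is installed and running smoothly.")
print(result["token"])
print(result["pos"])
```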

Scala

You can start a Spark REPL with Scala by running spark-shell in your terminal with the JohnSnowLabs:spark-nlp:2.2.2 package:
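```bash
spark-shell --packages JohnSnowLabs:spark-nlp:2.2.2
```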

If you are running Scala on Jupyter Notebook through some 3rd party kernels (e.g. spylon), you can initialize Spark with Spark NLP support with the following code:
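With the spylon kernel, for instance, something like this should work (the %%init_spark magic and launcher object are spylon-specific; the packages line is an assumption based on spylon’s launcher options):

```
%%init_spark
launcher.master = "local[*]"
launcher.packages = ["JohnSnowLabs:spark-nlp:2.2.2"]
```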

Databricks cloud cluster & Apache Zeppelin

Add the following maven coordinates in the dependency configuration page:
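```
com.johnsnowlabs.nlp:spark-nlp_2.11:2.2.2
```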

For Apache Zeppelin, given that you already have Spark and Java installed properly as discussed above, you can install Zeppelin with the following command:
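```bash
brew install apache-zeppelin
```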

and you'll see Zeppelin installed here (the exact path depends on the version Homebrew picks):
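```
/usr/local/Cellar/apache-zeppelin/<version>
```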

Then, in conf/zeppelin-env.sh, you may need to set SPARK_SUBMIT_OPTIONS to use the --packages instruction shown above. Just add this line to conf/zeppelin-env.sh:
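```bash
export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:2.2.2"
```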

To start and stop Zeppelin (and your Zeppelin notebooks with it), run the following commands from anywhere:
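```bash
zeppelin-daemon.sh start    # start Zeppelin
zeppelin-daemon.sh stop     # stop Zeppelin
```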

Your Zeppelin Notebook should then be accessible at http://localhost:8080/.

If you're using Scala, I personally suggest working in Zeppelin rather than a Jupyter notebook. More information regarding the Zeppelin installation is here and here.

Google Colab

Google Colab is perhaps the easiest way to get started with Spark NLP. It requires no installation or setup other than having a Google account.

Run the following code in Google Colab notebook and start using Spark NLP right away.
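Something like the following cell installs Java 8, PySpark, and Spark NLP on a fresh Colab runtime (the JDK path below is the Ubuntu default on Colab):

```python
import os

# Install Java 8; Colab VMs don't ship with it by default
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

# Install PySpark and Spark NLP
!pip install -q pyspark==2.4.3 spark-nlp==2.2.2

import sparknlp
spark = sparknlp.start()
```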

Here is a live demo on Google Colab that performs sentiment analysis and NER using pre-trained Spark NLP models.

GPU support

If you want to start using Spark NLP on a GPU machine, here are the steps you need to follow:
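The CUDA and driver setup is machine-specific, but on the Spark side the key change is loading the GPU build of the package instead of the CPU one; at the time of writing, the coordinate looks like this:

```bash
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.11:2.2.2
```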

Spark NLP — OCR Module

If you are initializing the Spark context through Spark NLP, you can include the OCR module like this:
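In Python this is a flag on start() in the 2.x releases:

```python
import sparknlp

# include_ocr=True loads the OCR module together with Spark NLP
spark = sparknlp.start(include_ocr=True)
```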

Or you can add the packages manually:
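A sketch with a manual SparkSession, assuming the spark-nlp-ocr artifact for this release (the OCR module may also need extra repositories for its imaging dependencies):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP with OCR") \
    .master("local[*]") \
    .config("spark.jars.packages",
            "com.johnsnowlabs.nlp:spark-nlp_2.11:2.2.2,"
            "com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.2.2") \
    .getOrCreate()
```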

And if you start from the terminal, you will need to add the following to your startup command:
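```bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.2.2,com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.2.2
```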


3. Getting started with Spark NLP

Let’s get started with a straightforward example to get a sneak peek into Spark NLP and verify our installation.

We’re going to use a pre-trained pipeline that’s explained in the first article. The pipeline used here handles most of the NLP tasks that can be applied to a text, with several Spark NLP annotators.
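A minimal end-to-end run looks like this (explain_document_dl is the pipeline discussed in the first article):

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start Spark with Spark NLP support
spark = sparknlp.start()

# Download and load a pre-trained pipeline
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

result = pipeline.annotate(
    "John Snow Labs built Spark NLP on top of Apache Spark."
)

# The result is a dict of annotation lists: tokens, lemmas, POS tags,
# named entities, and more
print(list(result.keys()))
print(result["entities"])
```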


4. Conclusion

In this article, we tried to cover all the necessary steps to install Spark NLP on various platforms. It’s quite normal to have some issues at the beginning, but with just a little digging and googling, we are sure you’ll be able to get it running. Sometimes you may end up with a long list of errors, mostly related to your dev environment, and usually to Java. If you somehow get stuck and need help, you can look through the previous issues on GitHub or just join our Slack channel. You’ll always find someone who can answer your questions.

In the next article, we plan to start writing about annotators and transformers in Spark NLP with hands-on examples. Since you just installed Spark NLP, you are now ready to start using it. To see more use cases and samples, feel free to visit the Spark NLP workshop repository.

Resources

spark-nlp: Natural Language Understanding Library for Apache Spark (https://github.com/JohnSnowLabs/spark-nlp)
