Spark NLP: Installation on Mac and Linux (Part-II)

Veysel Kocaman
Oct 8

* This is the second article in a series of blog posts to help Data Scientists and NLP practitioners learn the basics of Spark NLP library from scratch and easily integrate it into their workflows.



In our first article, we gave a brief introduction to Spark NLP and its basic components and concepts. If you haven’t read the first part yet, please read it first.

Spark NLP is an open-source natural language processing library, built on top of Apache Spark and Spark ML. It is written in Scala, and it includes Scala and Python APIs for use from Spark. It has no dependency on any other NLP or ML library. For those who haven’t worked with Apache Spark itself, the installation can be a bit tricky. Let’s see how you can install Spark NLP in different environments. Most of the steps discussed here are already covered in the official GitHub repo and documentation.


1. Installing Spark and Java

Spark NLP is built on top of Apache Spark 2.4.3, and you should have at least this version installed. If you have already installed Spark, check its version. If all looks fine, you can skip this section.

$ spark-submit --version

If you cannot upgrade your existing Apache Spark installation and still want to try Spark NLP, there are some previous versions of Spark NLP that are compatible with Apache Spark 2.3.x at most.

Assuming that you haven’t installed Apache Spark yet, let’s start with the Java installation. There are many ways to install Java on your computer, and the most important thing is to pay attention to its version.

The latest version of Java at the time of writing this article is Java 12, and Apache Spark does not officially support Java 12! So do not blindly use Homebrew to install Java, as it will install the latest version, which causes many issues. And when we say Java, we mean the JDK (Java Development Kit), not just the runtime.

We suggest that you install JDK 8. To install it, go to the official website and, under “Java SE Development Kit 8u191”, choose the package that matches your OS, then install it locally.

available JDK packages for v8

Actually, if you are using a Mac or Linux, you can still use Homebrew to install JDK 8. Here are the steps.

First of all, we need to tap a brew repository. Execute the following command to add more repositories to brew.

$ brew tap AdoptOpenJDK/openjdk

After adding the tap, let’s install OpenJDK 8 using brew.

$ brew cask install adoptopenjdk8

JDK on Debian, Ubuntu, etc.

$ sudo apt-get install openjdk-8-jdk

JDK on Fedora, Oracle Linux, Red Hat Enterprise Linux, etc.

$ su -c "yum install java-1.8.0-openjdk"

Now that we have installed JDK 8, we can check the installation with the following command.

$ java -version
openjdk version "1.8.0_202"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_202-b08)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.202-b08, mixed mode)

Java is now installed properly. Let’s check the Java path.

$ which java
/usr/bin/java

You can also see which versions of Java are already installed.

$ /usr/libexec/java_home -V

If you have more than one JDK installed, you will see each version listed here. To set your Java path, you’ll need to add the following line to your .bash_profile (or just type it in the terminal and then source the file) so that other apps know which version of Java to use.

export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
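
After sourcing your profile, you can confirm that the variable now points at the JDK 8 home:

$ echo $JAVA_HOME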

OK, now let’s install Apache Spark, again with Homebrew.

brew install apache-spark

Or you can just use pip or conda for that:

$ pip install pyspark

or

$ conda install -c conda-forge pyspark

If you are not able to use Homebrew, you can download Spark manually: go to the Apache Spark downloads page, choose a Spark release, package type, and download type, and then download the latest tgz file.

http://spark.apache.org/downloads.html

Extract the archive into your home directory using the following command.

tar -zxvf spark-2.4.4-bin-hadoop2.7.tgz

Next, we will edit our .bash_profile (macOS) so we can launch a Spark notebook from any directory.

nano .bash_profile

Don’t remove anything in your .bash_profile. Only add the following.

export SPARK_HOME=~/spark-2.4.4-bin-hadoop2.7
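
You may also want to add Spark’s bin directory to your PATH so that the pyspark and spark-submit commands resolve from anywhere:

export PATH=$SPARK_HOME/bin:$PATH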

If you’re going to run Spark within Jupyter Notebook, you should also add these lines.

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

After saving and exiting, run this.

source .bash_profile

If Jupyter is already installed, you can just type pyspark in your terminal and you’ll see something like this:

$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_222
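
Inside the shell, the SparkSession is already available as spark, so you can double-check the version before moving on:

>>> spark.version
'2.4.4'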

If you set the Jupyter-related variables above, running pyspark will instead open a new Jupyter Notebook in your browser.

Here is a sample Python snippet you can use to test your Spark environment: a classic Monte Carlo estimation of Pi.

import pyspark
import random

# Create a local Spark context
sc = pyspark.SparkContext(appName="Pi")
num_samples = 1000000

def inside(p):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

# Count random points that fall inside the unit circle
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
print(4 * count / num_samples)  # should print something close to 3.14
sc.stop()

In order to install Spark on Windows, you can just follow the steps detailed at this link.


2. Installing Spark NLP

Python

It’s as easy as follows:

pip install spark-nlp==2.2.2

or with conda

conda install -c johnsnowlabs spark-nlp

The easiest way to get started is to run the following code in your favorite IDE.

import sparknlp

spark = sparknlp.start()

It basically starts a Spark session with Spark NLP support. After a few seconds, you should see something like this in your notebook:

Version: 2.4.4
Master:  local[*]
AppName: Spark NLP

With these lines of code, you have successfully started a Spark Session and are ready to use Spark NLP.

If you want to start your Jupyter Notebook with pyspark from your terminal without installing Spark NLP via pip, you can also pull in Spark NLP like this:

$ pyspark --packages JohnSnowLabs:spark-nlp:2.2.2

Let’s see what’s going on under the hood when we run sparknlp.start(). This command basically creates a SparkSession with the necessary Spark NLP packages. So, instead of sparknlp.start(), or if you need more fine-tuning with custom parameters, you can start the SparkSession in your Python program manually. Here is sample code showing how to initiate a Spark session with custom executor memory and serializer max buffer size (the values below are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.executor.memory", "6g") \
    .config("spark.kryoserializer.buffer.max", "1000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.2.2") \
    .getOrCreate()

Then you can test your installation with the following code block.

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Create or get a Spark session with Spark NLP support
spark = sparknlp.start()

print(sparknlp.version())  # Spark NLP version
print(spark.version)       # Apache Spark version

Scala

You can start a Spark REPL with Scala by running spark-shell in your terminal, including the JohnSnowLabs:spark-nlp:2.2.2 package:

$ spark-shell --packages JohnSnowLabs:spark-nlp:2.2.2

If you are running Scala on Jupyter Notebook through some 3rd party kernels (e.g. spylon), you can initialize Spark with Spark NLP support with the following code:

%%init_spark
// Configure Spark to use a local master
launcher.master = "local[*]" // optional
launcher.driver_memory = "4G" // optional
launcher.packages = ["JohnSnowLabs:spark-nlp:2.2.2"]

Databricks cloud cluster & Apache Zeppelin

Add the following maven coordinates in the dependency configuration page:

com.johnsnowlabs.nlp:spark-nlp_2.11:2.2.2

For Apache Zeppelin, assuming you already have Spark and Java installed properly as discussed above, you can install it with the following command:

brew install apache-zeppelin

and you'll see Zeppelin installed here:

/usr/local/Cellar/apache-zeppelin

Then, in conf/zeppelin-env.sh, you may need to set SPARK_SUBMIT_OPTIONS using the --packages option shown above. Just add this line to conf/zeppelin-env.sh:

export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:2.2.2"

To start, stop, or restart the Zeppelin daemon, run the following commands from anywhere:

$ zeppelin-daemon.sh start -> To start the Daemon
$ zeppelin-daemon.sh stop -> To stop the Daemon
$ zeppelin-daemon.sh restart -> To restart the Daemon

Your Zeppelin notebook should be accessible at the following link: http://localhost:8080/

If you're using Scala, I personally suggest working in Zeppelin rather than a Jupyter notebook. More information regarding Zeppelin installation can be found here and here.

Google Colab

Google Colab is perhaps the easiest way to get started with Spark NLP. It requires no installation or setup other than having a Google account.

Run code along these lines in your Google Colab notebook and start using Spark NLP right away (the JDK path below is the usual location on Colab’s Ubuntu VMs):

import os

# Install OpenJDK 8 (Colab VMs do not ship with Java)
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

# Install PySpark and Spark NLP
! pip install --ignore-installed -q pyspark==2.4.4 spark-nlp==2.2.2

Here is a live demo on Google Colab that performs sentiment analysis and NER using pre-trained Spark NLP models.

GPU support

If you want to start using Spark NLP on a GPU machine, here are the steps you need to follow:

# Install the necessary Nvidia drivers, CUDA, etc
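
Once the drivers and CUDA are in place, you can pull in the GPU build of Spark NLP, which is published as a separate artifact. A sketch, assuming the spark-nlp-gpu Maven coordinate for this release (check the repo for the exact coordinate matching your version):

$ pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.11:2.2.2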

Spark NLP — OCR Module

If you are initializing the Spark context through Spark NLP, you can include the OCR module like this:

spark = sparknlp.start(include_ocr=True)

Or you can add the packages manually when building the session yourself (the coordinates are the same ones used in the terminal command below):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP with OCR") \
    .master("local[*]") \
    .config("spark.jars.repositories", "http://repo.spring.io/plugins-release") \
    .config("spark.jars.packages",
            "JohnSnowLabs:spark-nlp:2.2.2,"
            "com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.2.2,"
            "javax.media.jai:com.springsource.javax.media.jai.core:1.1.3") \
    .getOrCreate()

And if you start from the terminal, you will need to add the following to your startup command:

$ pyspark --repositories http://repo.spring.io/plugins-release \
    --packages JohnSnowLabs:spark-nlp:2.2.2,com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.2.2,javax.media.jai:com.springsource.javax.media.jai.core:1.1.3

3. Getting started with Spark NLP

Let’s get started with a straightforward example to get a sneak peek into Spark NLP and to double-check the installation steps.

We’re going to use a pre-trained pipeline that was explained in the first article. The pipeline used here handles most of the NLP tasks that can be applied to a text, with several Spark NLP annotators; a usage sketch follows the imports below.

# Import Spark NLP 
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.embeddings import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp
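
Continuing from those imports, a minimal sketch with the explain_document_dl pre-trained pipeline (the sample sentence is just an illustration) would be:

# Start a Spark session with Spark NLP support
spark = sparknlp.start()

# Download and load the pre-trained pipeline (fetched on first run)
pipeline = PretrainedPipeline('explain_document_dl', lang='en')

# Annotate a sample sentence and inspect the output annotations
result = pipeline.annotate("Spark NLP is developed by John Snow Labs, a company based in Delaware.")
print(result.keys())
print(result['ner'])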

4. Conclusion

In this article, we tried to cover all the necessary steps to install Spark NLP on various platforms. It’s quite normal to run into some issues at the beginning, but with just a little bit of digging and googling, we’re sure you’ll be able to get it running. Sometimes you may end up with a long list of errors, probably related to your dev environment and mostly to Java. If you somehow get stuck and need help, you can look through previous issues on GitHub or just join our Slack channel. You’ll always find someone who can answer your questions.

In the next article, we plan to start writing about annotators and transformers in Spark NLP with hands-on examples. Now that you have Spark NLP installed, you are ready to start using it. To see more use cases and samples, feel free to visit the Spark NLP workshop repository.

Resources

spark-nlp

Natural Language Understanding Library for Apache Spark.
