Spark NLP: Installation on Mac and Linux

Veysel Kocaman
Published in spark-nlp
10 min read · Oct 8, 2019

This is the second article in a series of blog posts to help Data Scientists and NLP practitioners learn the basics of Spark NLP library from scratch and easily integrate it into their workflows.

This blog post covers installation on macOS and Linux machines, but I have also shared links below for Windows and Docker. If you want to install and set up Spark NLP in high-compliance environments with no internet connection, we have another blog post for that.

Updated: May 15, 2021 (for Spark NLP 3.x)

Photo by Kevin Horvat on Unsplash

In our first article, we gave a nice intro to Spark NLP and its basic components and concepts. If you haven’t read the first part yet, please read it first.

Spark NLP is an open-source natural language processing library built on top of Apache Spark and Spark ML. It is written in Scala, and it includes Scala and Python APIs for use from Spark. It has no dependency on any other NLP or ML library. For those who haven’t worked with Apache Spark itself, the installation part can be a little tiresome and tricky. Let’s see how you can install Spark NLP in different environments. Most of the steps discussed here are already covered in the official GitHub repo and documentation.

1. Installing Spark and Java

Spark NLP 3.0.0 is built on top of Apache Spark 3.0.0 and above, so you should have at least that version installed. We suggest using Spark NLP 3.0.0 with Spark 3.0.2, which is the most stable version released so far. If you have already installed Spark, please check its version. If all looks fine, you can skip this section.

$ spark-submit --version

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.2
      /_/

Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292

If you cannot upgrade your existing Apache Spark installation and still want to try Spark NLP, there are previous versions of Spark NLP that are compatible with Apache Spark 2.3.x and 2.4.x.
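For instance, here is a minimal sketch for a Spark 2.4.x environment; the exact version pair below is only illustrative, so check the compatibility table in the official GitHub repo before pinning anything:

# Illustrative only: pin an older PySpark / Spark NLP pair for Spark 2.4.x
pip install pyspark==2.4.7 spark-nlp==2.7.5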

Assuming that you haven’t installed Apache Spark yet, let’s start with the Java installation first. There are many ways to install Java on your computer, and the most important thing is to pay attention to its version.

The latest version of Java at the time of writing this article is Java 16, and Apache Spark does not officially support Java 16 yet! So do not just install the latest Java, as that will cause many issues. And when we say Java, we mean the JDK (Java Development Kit), not just the Java runtime.

We suggest that you install JDK 8. To install it, go to the official website and, under “Java SE Development Kit 8u292”, choose the package that works for your OS and install it locally.

Available JDK packages for v8

Actually, if you are using Mac or Linux, you can also use Homebrew to install JDK 8. Here are the steps.

First of all, we need to tap a brew repo. Execute the following command and it will add more repositories to brew.

$ brew tap AdoptOpenJDK/openjdk

After adding tap, let’s install OpenJDK using brew.

$ brew cask install adoptopenjdk8
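Note that recent Homebrew releases dropped the separate brew cask subcommand; if the command above fails on your Homebrew version, the equivalent is likely:

$ brew install --cask adoptopenjdk8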

OpenJDK 8 on Debian, Ubuntu, etc.

$ sudo apt-get install openjdk-8-jre

OpenJDK 8 on Fedora, Oracle Linux, Red Hat Enterprise Linux, etc.

$ su -c "yum install java-1.8.0-openjdk"

Now that we have installed JDK 8, we can check the installation with the following command.

$ java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)

Now Java is installed properly. Let’s check the Java path.

$ which java
/usr/bin/java

You can also see which versions of Java are already installed.

$ /usr/libexec/java_home -V
Matching Java Virtual Machines (2):
    1.8.0_292, x86_64: "AdoptOpenJDK 8" /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home
    1.8.0_191, x86_64: "Java SE 8" /Library/Java/JavaVirtualMachines/jdk1.8.0_191.jdk/Contents/Home

Here you see two versions. To set your Java path, you should edit your path settings. Add the following line to your .bash_profile (or just type it in your terminal and then source the file) to let the other apps know which version of Java will be used.

export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
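You can quickly verify that the variable points to a JDK 8 installation; the exact path depends on your machine, but it should match one of the JDK 8 homes listed above:

$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home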

Ok, now let’s install Apache Spark, again with Homebrew.

brew install apache-spark

Or you can just use pip or conda for that:

>> pip install pyspark==3.0.2
or
>> conda install -c conda-forge pyspark
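Either way, a quick check that PySpark is importable from your Python environment (it should print whichever version you installed, 3.0.2 in this example):

$ python3 -c "import pyspark; print(pyspark.__version__)"
3.0.2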

If you are not able to use Homebrew, you need to download Spark manually; follow the steps below. Go to the Apache Spark website (link), choose a Spark release, package type, and download type, and then download the latest tgz file.

http://spark.apache.org/downloads.html

Extract the archive in your home directory using the following command.

tar -zxvf spark-3.0.2-bin-hadoop2.7.tgz

Next, we will edit our .bash_profile (macOS) so we can open a Spark notebook in any directory.

nano .bash_profile

Don’t remove anything in your .bash_profile. Only add the following.

export SPARK_HOME=~/spark-3.0.2-bin-hadoop2.7

If you’re going to run Spark within Jupyter Notebook, you should also add these lines.

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

After saving and exiting, run this.

source .bash_profile
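As with JAVA_HOME, you can confirm that the new variables took effect; the path shown is just the directory we exported above, expanded under your home folder:

$ echo $SPARK_HOME
/Users/yourname/spark-3.0.2-bin-hadoop2.7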

If Jupyter is already installed, you can just type “pyspark” in your terminal and you’ll see something like this:

$ pyspark

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.2
      /_/

Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
SparkSession available as 'spark'.

If you set the Jupyter driver options above, running pyspark will open up a new Jupyter Notebook in your browser.

Here is a sample Python script you can use to test your Spark environment.

import pyspark
import random

num_samples = 100000000

# reuse the existing SparkContext if one is already running (e.g. in a pyspark notebook)
sc = pyspark.SparkContext.getOrCreate(pyspark.SparkConf().setAppName("Pi"))

def inside(p):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()

(If you get an error saying that “pyspark module not found”, you can just do “pip install pyspark”)

In order to install Spark NLP and PySpark on Windows, you can just follow the steps detailed at this link. For Docker setup, see this one.

2. Installing Spark NLP

Python

It’s as easy as follows:

pip install spark-nlp==3.0.3

or with conda

conda install -c johnsnowlabs spark-nlp

The easiest way to get started is to run the following code in your favorite IDE.

import sparknlp

sparknlp.start()

It will basically start a Spark session with Spark NLP support. After a few seconds, you should see something like this in your notebook:

Version: v3.0.2
Master: local[*]
AppName: Spark NLP

With these lines of code, you have successfully started a Spark Session and are ready to use Spark NLP.

If you want to start your Jupyter Notebook with pyspark from your terminal without installing Spark NLP, you can also activate Spark NLP like this:

$ pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3

But you usually don’t need to do this. You can just start your Jupyter notebook as you have been before, with no additional params, and then start typing the following:

import sparknlp

sparknlp.start()

Let’s see what’s going on under the hood when we run sparknlp.start(). This command basically creates a SparkSession with the necessary Spark NLP packages. So, instead of sparknlp.start(), or if you need more fine-tuning with some custom parameters, you can start the SparkSession in your Python program manually. Here is a sample snippet showing how to initiate a Spark session with custom driver memory and serializer max buffer size.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3") \
    .getOrCreate()

Then you can test your installation with the following code block.

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# create or get Spark Session
spark = sparknlp.start()

sparknlp.version()
spark.version

# download, load, and annotate a text with a pre-trained pipeline
pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
result = pipeline.annotate('Harry Potter is a great movie')

You can also watch this video to learn more :

How to Setup Spark OCR on UBUNTU and Write your first code

Scala

You can start a Spark REPL with Scala by running spark-shell in your terminal and including the Spark NLP package:

$ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3

If you are running Scala on Jupyter Notebook through some 3rd party kernels (e.g. spylon), you can initialize Spark with Spark NLP support with the following code:

%%init_spark
// Configure Spark to use a local master
launcher.master = "local[*]" // optional
launcher.driver_memory = "4G" // optional
launcher.packages = ["com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3"]

Databricks cloud cluster & Apache Zeppelin

Add the following maven coordinates in the dependency configuration page:

com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3

For Apache Zeppelin, given that you already have Spark and Java installed properly as discussed above, you can install it with the following command:

brew install apache-zeppelin

and you'll see Zeppelin installed here :

/usr/local/Cellar/apache-zeppelin

Then, in conf/zeppelin-env.sh, you may need to set SPARK_SUBMIT_OPTIONS using the --packages instruction shown above. Just add this line to conf/zeppelin-env.sh:

export SPARK_SUBMIT_OPTIONS="--packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3"

To start, stop, or restart the Zeppelin daemon, run the following commands from anywhere:

$ zeppelin-daemon.sh start -> To start the Daemon
$ zeppelin-daemon.sh stop -> To stop the Daemon
$ zeppelin-daemon.sh restart -> To restart the Daemon

Your Zeppelin notebook should be accessible at the following link: http://localhost:8080/

If you're using Scala, I personally suggest using Zeppelin rather than a Jupyter notebook. More information regarding Zeppelin installation is here and here.

Databricks Cloud Cluster

Create a cluster if you don’t have one already. On a new or existing cluster, add the following to the Advanced Options -> Spark tab, in the Spark Config box:

spark.local.dir /var
spark.kryoserializer.buffer.max 1000M
spark.serializer org.apache.spark.serializer.KryoSerializer

In the Libraries tab inside your cluster, you need to follow these steps:

  • Install New -> PyPI -> spark-nlp==3.0.3 -> Install
  • Install New -> Maven -> Coordinates -> com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3 -> Install

Now you can attach your notebook to the cluster and use Spark NLP! For more information, see here.
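As a quick sanity check, here is a minimal sketch of a first cell on that cluster, assuming the two libraries above were installed. Note that Databricks already exposes a SparkSession as spark, so there is no need to call sparknlp.start():

import sparknlp

# the cluster's SparkSession is already available as `spark` on Databricks
print(sparknlp.version())
print(spark.version)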

Google Colab

Google Colab is perhaps the easiest way to get started with Spark NLP. It requires no installation or set up other than having a Google account.

If you are using Spark 3.x, run the following code in Google Colab notebook and start using Spark NLP right away.

!pip install pyspark
!pip install spark-nlp
import sparknlp

spark = sparknlp.start()

For Spark 2.x, the procedure is a little bit different due to Java dependencies.

# This is only to setup PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

import sparknlp

spark = sparknlp.start()
# for GPU training >> sparknlp.start(gpu=True)

The wget line above sets up your Colab notebook with the latest versions of PySpark and Spark NLP. If you want to use other versions, you can use this script instead.

!wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/scripts/colab/colab_setup.sh
!bash colab_setup.sh -p 3.1.1 -s 3.0.3

# -p is for pyspark
# -s is for spark-nlp
# by default they are set to the latest

Here is a live demo on Google Colab that performs sentiment analysis and NER using pre-trained Spark NLP models.

You can also watch this video to learn more :

How to setup Spark NLP on Colab and write your first code

GPU support

If you want to start using Spark NLP on a GPU machine, you can just start the Spark session with this:

sparknlp.start(gpu=True)

or build a GPU-enabled fat JAR and start spark-shell with it manually:

# Install the necessary NVIDIA drivers, CUDA, etc.
$ sbt -Dis_gpu=true assembly
# Get the FAT-JAR from the repo
# Start a spark-shell with --jars pointing to that JAR
$ spark-shell --jars spark-nlp.jar

3. Getting started with Spark NLP

Let’s get started with a straightforward example to get a sneak peek into Spark NLP and to make sure the installation steps worked.

We’re going to use a pre-trained pipeline that’s explained in the first article. The pipeline used here handles most of the NLP tasks that could be applied to a text with several Spark NLP annotators.

# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.embeddings import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Start Spark Session with Spark NLP
spark = sparknlp.start()

# Download a pre-trained pipeline
pipeline = PretrainedPipeline('explain_document_dl', lang='en')

# Your testing dataset
text = """
The Mona Lisa is a 16th century oil painting created by Leonardo.
It's held at the Louvre in Paris.
"""

# Annotate your testing dataset
result = pipeline.annotate(text)

# What's in the pipeline
list(result.keys())
>> Output: ['entities', 'stem', 'checked', 'lemma', 'document',
>>          'pos', 'token', 'ner', 'embeddings', 'sentence']

# Check the results
result['entities']
>> Output: ['Mona Lisa', 'Leonardo', 'Louvre', 'Paris']
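Since annotate() returns plain Python lists, you can also combine outputs from different annotators. As a small sketch, using the 'token' and 'ner' keys listed above, you can pair each token with its predicted tag:

# Pair each token with its predicted NER tag (IOB format)
list(zip(result['token'], result['ner']))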

4. Conclusion

In this article, we tried to cover all the necessary steps to install Spark NLP on various platforms. It’s quite normal to run into some issues at the beginning, but with just a little digging and googling, we are sure you’ll get it running. Sometimes you may end up with a long list of errors, probably related to your dev environment, and mostly to Java. If you get stuck and need help, you can browse the previous issues on GitHub or just join our Slack channel. You’ll always find someone who can answer your questions.

In the next article, we plan to start writing about annotators and transformers in Spark NLP with hands-on examples. Since you have just installed Spark NLP, you are now ready to start using it. To see more use cases and samples, feel free to visit the Spark NLP workshop repository.
