Installing Spark NLP and Spark OCR in air-gapped networks (offline mode)

Veysel Kocaman
Published in spark-nlp
7 min read · May 4, 2021

How to install and set up Spark NLP in high-compliance environments with no internet connection, and how to download and load the pretrained models locally.

Introduction

Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 1,100+ pretrained pipelines and models in more than 192 languages, and it supports nearly all common NLP tasks and modules, which can be used seamlessly in a cluster. Downloaded more than 5 million times and growing 16x over the last 16 months, Spark NLP is used by 54% of healthcare organizations, making it the most widely used NLP library in the enterprise. It has an active community and rich resources where you can find more information and code samples.

Data Science projects in high-compliance industries, like healthcare and life science, often require processing Protected Health Information (PHI). This may happen because the nature of the projects does not allow full de-identification in advance. In such scenarios, the alternative is to create an "AI cleanroom": an isolated, hardened, air-gapped environment where the work happens.

Online mode

(* all of the steps below were tested on Linux-based operating systems, namely Ubuntu, RHEL, and CentOS)

First, we install the Spark NLP libraries as follows:

$ pip install spark-nlp==3.0.2
$ python -m pip install --upgrade spark-nlp-jsl==3.0.2 --user --extra-index-url https://pypi.johnsnowlabs.com/$secret

The installation may require a few additional steps depending on your OS; you can find more information in this article. After installing the packages, starting the Spark session with Spark NLP is as easy as this:

import sparknlp

spark = sparknlp.start()

If you are using the licensed version, then you can do this:

import sparknlp_jsl

spark = sparknlp_jsl.start(secret_key)
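If you keep your trial keys in a JSON license file, a minimal sketch for reading them and starting the licensed session might look like the following. The file name (license_keys.json) and the field names (SECRET, SPARK_NLP_LICENSE) are assumptions and may differ in the license file you received:

import json
import os

import sparknlp_jsl

# Hypothetical license file and field names; adjust to match your own license JSON
with open("license_keys.json") as f:
    license_keys = json.load(f)

# Expose the license through an environment variable (assumed variable name)
os.environ["SPARK_NLP_LICENSE"] = license_keys["SPARK_NLP_LICENSE"]

# Start the licensed session with the secret from the license file
spark = sparknlp_jsl.start(license_keys["SECRET"])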

This start() function basically runs the following code block under the hood and prepares the Spark session with the packages you need to work with Spark NLP.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP Licensed") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:3.0.2") \
    .config("spark.jars", "https://pypi.johnsnowlabs.com/" + secret + "/spark-nlp-jsl-3.0.2.jar") \
    .getOrCreate()
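Once the session is up, a quick sanity check (a sketch, nothing Spark NLP-specific) confirms which versions were actually loaded:

import sparknlp

# Print the versions resolved by the session to confirm the right jars were picked up
print("Spark version:", spark.version)
print("Spark NLP version:", sparknlp.version())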

Offline mode

As you can see, the relevant jar packages are pulled from Maven and PyPI over the internet. What if we have no internet connection and have to do all of this in an air-gapped network? We can download all the relevant packages outside of our network and copy them over with a mounted disk.

Let's start with the installation steps, assuming that you already have Python and JDK 8 properly installed in your secure network.

Here are the steps to follow on a machine with an internet connection:

  1. In order to get trial keys for Spark NLP for Healthcare, fill in the form at https://www.johnsnowlabs.com/spark-nlp-try-free/ and you will receive your keys by email in a few minutes.
  2. Install AWS CLI to your local computer following the steps here for Linux and here for MacOS.
  3. Then configure your AWS credentials as follows (set the AWS keys you got via free trial):
$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalXUtnFEMI/K7MENG/bPxRfiCYEXAMPLEKEY

4. Download the following files with the AWS CLI to your local computer (given that the AWS keys you got through step #1 are properly set in your environment as explained above):

# find the whl file for your version at https://pypi.org/project/spark-nlp/3.0.2/#files
$ wget https://files.pythonhosted.org/packages/44/b9/d74e5a2f0cf6bd19467f4509bc56d981230e1c1ffe81416b4da20eddfaf6/spark_nlp-3.0.2-py2.py3-none-any.whl
$ aws s3 cp --region us-east-2 s3://auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.0.2.jar ~/spark-nlp-3.0.2.jar
$ aws s3 cp --region us-east-2 s3://pypi.johnsnowlabs.com/$jsl_secret/spark-nlp-jsl-3.0.2.jar ~/spark-nlp-jsl-3.0.2.jar
$ aws s3 cp --region us-east-2 s3://pypi.johnsnowlabs.com/$jsl_secret/spark-nlp-jsl/spark_nlp_jsl-3.0.2-py3-none-any.whl ~/spark_nlp_jsl-3.0.2-py3-none-any.whl

$ aws s3 cp --region us-east-2 s3://pypi.johnsnowlabs.com/$jsl_ocr_secret/jars/spark-ocr-assembly-3.0.0-spark30.jar ~/spark-ocr-assembly-3.0.0-spark30.jar

$ aws s3 cp --region us-east-2 s3://pypi.johnsnowlabs.com/$jsl_ocr_secret/spark-ocr/spark_ocr-3.0.0.spark30-py3-none-any.whl ~/spark_ocr-3.0.0.spark30-py3-none-any.whl
$ wget -q https://downloads.apache.org/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz

For more information, see the official installation links here and here.

Now, move all these files to your air-gapped network and do the following:

$ pip install ~/spark_nlp-3.0.2-py2.py3-none-any.whl
$ pip install ~/spark_nlp_jsl-3.0.2-py3-none-any.whl
$ pip install ~/spark_ocr-3.0.0.spark30-py3-none-any.whl --no-dependencies
$ tar -xvf ~/spark-3.0.2-bin-hadoop2.7.tgz
$ export SPARK_HOME="~/spark-3.0.2-bin-hadoop2.7"
$ export PATH=$SPARK_HOME/bin:$PATH

Now you can start your Spark session locally from these jars as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP Licensed") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars", "~/spark-nlp-3.0.2.jar,~/spark-nlp-jsl-3.0.2.jar,~/spark-ocr-assembly-3.0.0-spark30.jar") \
    .getOrCreate()
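To confirm the offline session works end to end without reaching out to the internet, you can run a minimal pipeline that uses only annotators with no pretrained weights, so nothing needs to be downloaded. This is just a sanity-check sketch:

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

# A tiny pipeline with no pretrained components, so no model download is triggered
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, tokenizer])

# Run it on a one-row DataFrame created inside the air-gapped cluster
data = spark.createDataFrame([["Spark NLP runs fully offline."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("token.result").show(truncate=False)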

Deep learning-based models in Spark NLP use the /tmp folder to extract themselves. If you get an error related to this folder (limited access or low size), try remounting it as follows:

mount /tmp -o remount,exec

You may also define an overflow space to increase the capacity of /tmp folder as follows:

mount -t tmpfs -o size=5G,mode=1777 overflow /tmp

If your system admin has restricted the usage of the /tmp folder as well, you can reconfigure your Spark session to use another local path as the tmp directory. In that case, add the following two lines to the config in your SparkSession.builder code block.

.config('spark.driver.extraJavaOptions','-Djava.io.tmpdir=/mylocal/folder')\
.config('spark.executor.extraJavaOptions','-Djava.io.tmpdir=/mylocal/folder')
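Putting it together, a sketch of the full offline builder with the custom tmp directory would look like this (the /mylocal/folder path is just a placeholder):

from pyspark.sql import SparkSession

# Same offline session as above, with java.io.tmpdir redirected away from /tmp
spark = SparkSession.builder \
    .appName("Spark NLP Licensed") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars", "~/spark-nlp-3.0.2.jar,~/spark-nlp-jsl-3.0.2.jar,~/spark-ocr-assembly-3.0.0-spark30.jar") \
    .config("spark.driver.extraJavaOptions", "-Djava.io.tmpdir=/mylocal/folder") \
    .config("spark.executor.extraJavaOptions", "-Djava.io.tmpdir=/mylocal/folder") \
    .getOrCreate()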

Alternative approach: Virtual environment

Due to its dependencies, Spark OCR may need a few more steps to be installed properly. But we can still handle the offline installation with a virtual environment (venv).

# on the machine with internet access
$ pip freeze >> requirements.txt
$ python -m venv mySparkNLPEnv
$ source mySparkNLPEnv/bin/activate
$ cd mySparkNLPEnv

# install the packages one by one (skip if any of them fails)
$ cat ../requirements.txt | xargs -n 1 pip install

$ tar cvfhz ../mySparkNLPEnv.tgz ./
$ cp ../mySparkNLPEnv.tgz /mounted_disk/mySparkNLPEnv.tgz

# inside the air-gapped environment
$ cp /mounted_disk/mySparkNLPEnv.tgz ./mySparkNLPEnv.tgz
$ mkdir mySparkNLPEnv
$ tar -C mySparkNLPEnv -zxvf ./mySparkNLPEnv.tgz
$ source mySparkNLPEnv/bin/activate
$ export PATH="/local/mySparkNLPEnv/lib/python3.6/site-packages:$PATH"

# if the export is not persisted, do the same inside your python script
>> import sys
>> sys.path.append("/local/mySparkNLPEnv/lib/python3.6/site-packages")
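A short sketch to verify that the relocated virtual environment actually resolves the Spark NLP packages; the site-packages path below is the one exported above and will differ on your machine (Python version, mount point):

import sys

# Hypothetical path from the export above; adjust to your Python version and location
sys.path.append("/local/mySparkNLPEnv/lib/python3.6/site-packages")

import sparknlp
import sparknlp_jsl

# If both imports succeed, the venv was transferred correctly
print("Spark NLP:", sparknlp.version())
print("Spark NLP for Healthcare:", sparknlp_jsl.version())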

Download & load the pretrained models & pipelines locally

Spark NLP has more than 4,000 pretrained models and pipelines that can be used locally with no training. You can find the details and code snippets for all of them at the John Snow Labs Models Hub. Normally, these models are loaded into your environment with no effort as follows:

ner_onto = NerDLModel.pretrained("onto_bert_base_cased", "en") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

This code snippet basically says: load the Onto NER model that was trained with Bert embeddings. The ResourceDownloader in Spark NLP will then look for the most up-to-date version of this model in the S3 bucket, compare it with what you already have in your local cache_pretrained folder (under your home directory), and download it if your local copy is outdated for the Spark NLP version you have installed.

If you want to change the default folder for cache_pretrained or the cluster tmp directory, you can do the following while starting your Spark session:

(see here for more details regarding Spark NLP configurations)

# option-1:
.config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") \
.config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")

# option-2: as the first cell in your notebook (and after restarting your kernel)
with open("application.conf", "w") as f:
    f.write("sparknlp.settings.pretrained.cache_folder = /path/to/my/cache\n")
Spark NLP Models Hub

You will need an active internet connection for this sync. What if you don't have an internet connection, or you want to load all the models locally and skip this sync process (which is much faster)? Here is how you can do that.

  • Go to the Models Hub and look for the model you need.
  • Select the model and you will see the model card that shows all the details about that model.
  • Hover over the Download button on that page and you will see the download link from the S3 bucket. Either download it manually or just use the AWS CLI with your credentials as follows:
$ aws s3 cp --region us-east-2 s3://auxdata.johnsnowlabs.com/public/models/bert_base_cased_en_2.6.0_2.4_1598340336670.zip .
$ mkdir bert_base_cased_en_2.6.0_2.4_1598340336670
$ unzip bert_base_cased_en_2.6.0_2.4_1598340336670.zip -d bert_base_cased_en_2.6.0_2.4_1598340336670
  • Unzip and load this model into the respective annotator by pointing to your local path (see the fuller pipeline sketch after the figure below).
ner_onto = NerDLModel.load("~/home/local_folder/bert_base_cased_en_2.6.0_2.4_1598340336670") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")
Model card page for a pretrained model at Spark NLP Models Hub
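To see how locally loaded models fit into a complete pipeline, here is a sketch that wires locally downloaded Bert embeddings together with a local NER model. The folder names are placeholders for wherever you unzipped the models, and the NER folder stands for an NER model downloaded the same way that was trained on the same embeddings:

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings, NerDLModel

# Placeholder local paths; point these to wherever you unzipped the downloaded models
embeddings = BertEmbeddings.load("/models/bert_base_cased_en_2.6.0_2.4_1598340336670") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner_onto = NerDLModel.load("/models/my_local_onto_ner_folder") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, ner_onto])

# Everything runs from local folders; no internet connection is needed
data = spark.createDataFrame([["John Snow Labs is based in Delaware."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("ner.result").show(truncate=False)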

See this notebook or this link to learn more about local usage.

Conclusion

In this blog post, we walked you through how to install Spark NLP, Spark NLP Enterprise, and Spark OCR in air-gapped networks with no internet connection at all.

Now you can enjoy the power of Spark NLP with no internet connection without worrying about any sensitive data breach. You can even have an offline translation between hundreds of languages using MarianNMT in Spark NLP or deploy your own clinical document understanding pipeline using all the pre-trained models locally.
