Set up Spark NLP on Databricks in 2 Minutes and Get a Taste of Scalable NLP

The latest research in multiple refreshing flavors like XLNet, ALBERT, BERT, ELMo, and more!

What will we learn in this article?

Pumping value out of data efficiently with Spark NLP

What is Spark NLP and who uses it?

Winning awards every year since its release
Widely used by Fortune 500 companies

Why is Spark NLP so widely used?

Easy and fast state-of-the-art (SOTA) results with Spark NLP
Batteries included! An arsenal of NLP weaponry

0. Log in to Databricks or create an account

1. Create a cluster with the latest Spark version

Setup Spark cluster in Databricks for Spark NLP

2. Install Python dependencies on the cluster

Install Spark NLP Python dependencies to Databricks Spark cluster
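On Databricks, Python libraries are attached from the cluster's Libraries tab (Install New → PyPI). For reference, a sketch of the equivalent from a notebook cell, assuming the PyPI package name `spark-nlp` and pinning the same version as the Java package in the next step:

```
%pip install spark-nlp==2.5.0
```

Keep the Python and Java package versions in sync to avoid hard-to-debug mismatches.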

3. Install Java dependencies on the cluster

com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.0
Install Spark NLP Java dependencies to Databricks Spark cluster
Cluster all ready for NLP, Spark and Python or Scala fun!
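The Maven coordinate above is attached through the cluster's Libraries tab (Install New → Maven). For reference, the same coordinate works with a local PySpark session via `--packages`; a sketch, assuming Spark built against Scala 2.11:

```
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.0
```

The `_2.11` suffix is the Scala version of the artifact; it must match the Scala build of your cluster's Spark runtime.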

4. Let's test out our cluster real quick

from pyspark.sql.types import StringType
from pyspark.ml import Pipeline
# Spark NLP
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.base import *
# Optional: set any Spark config you need, e.g. the Kryo serializer
spark.conf.set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
# Create a DataFrame with sample data
dfTest = spark.createDataFrame([
    "Spark-NLP would you be so nice and cook up some state of the art embeddings for me?",
    "Tensorflow is cool but can be tricky to get running. With Spark-NLP, you save yourself a lot of trouble.",
    "I save so much time using Spark-NLP and it is so easy!"
], StringType()).toDF("text")
# Basic Spark NLP pipeline
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
elmo = ElmoEmbeddings.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("elmo") \
    .setPoolingLayer("elmo")
nlpPipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    elmo,
])
nlp_model = nlpPipeline.fit(dfTest)
processed = nlp_model.transform(dfTest)
processed.show()
After running the pipeline, your result should look like this:
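Each row of the `elmo` column now carries one vector per token. A common next step is comparing such vectors, for example with cosine similarity. Here is a minimal plain-Python sketch; the 3-dimensional vectors are made up for illustration — real ELMo vectors are much larger (e.g. 512 or 1024 dimensions, depending on the pooling layer):

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for two token embeddings
v1 = [0.2, 0.1, 0.9]
v2 = [0.3, 0.0, 0.8]
print(round(cosine_similarity(v1, v2), 3))  # close to 1.0 → similar direction
```

Values near 1.0 mean the two embeddings point in nearly the same direction, i.e. the tokens are used in similar contexts.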

Congratulations!

We are done! Quick and easy.

Conclusion

What we have learned

Next Steps:

Problems?

