Spark NLP 101: Document Assembler

Veysel Kocaman
Oct 14 · 6 min read

In this series, we are going to write a separate article for each annotator in the Spark NLP library and this is the first one.

As you may remember from our first article, each annotator accepts and outputs columns of certain types. So what do we do if our DataFrame doesn’t have columns of those types? This is where transformers come in.

In Spark NLP, we have five different transformers that are mainly used for getting the data in or transforming the data from one AnnotatorType to another.

That is, any column of your DataFrame that will be fed into an annotator needs to be of one of these types; otherwise, you need to use one of the Spark NLP transformers first. Here is the list of transformers: DocumentAssembler, TokenAssembler, Doc2Chunk, Chunk2Doc, and the Finisher.

So, let’s start with DocumentAssembler(), an entry point to Spark NLP annotators.


Document Assembler

As discussed before, each annotator in Spark NLP accepts columns of certain types and outputs new columns of another type (we call this AnnotatorType). In Spark NLP, we have the following types: document, token, chunk, pos, word_embeddings, date, entity, sentiment, named_entity, dependency, labeled_dependency.

To get through the process in Spark NLP, we first need to transform the raw data into the Document type. DocumentAssembler() is a special transformer that does this for us; it creates the first annotation, of type Document, which can then be used by annotators down the road.

DocumentAssembler() comes from the sparknlp.base module and has the following settable parameters. See the full list here and the source code here.

  • setInputCol() -> the name of the column that will be converted. We can specify only one column here. It can read either a String column or an Array[String] column.
  • setOutputCol() -> optional: the name of the generated column of Document type. We can specify only one column here. Default is ‘document’.
  • setIdCol() -> optional: String type column with id information.
  • setMetadataCol() -> optional: Map type column with metadata information.
  • setCleanupMode() -> optional: cleaning up options. Possible values:
      • disabled: source kept as original. This is the default.
      • inplace: removes new lines and tabs.
      • inplace_full: removes new lines and tabs, including those that were converted to strings (i.e. \n).
      • shrink: removes new lines and tabs, and merges multiple spaces and blank lines into a single space.
      • shrink_full: removes new lines and tabs, including stringified values, and shrinks spaces and blank lines.
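
To get a feel for how the cleanup modes differ, here is a rough plain-Python approximation of the descriptions above. It is only a sketch and not Spark NLP's actual implementation: the function name cleanup, the regular expressions, and the choice to replace new lines and tabs with a space (rather than deleting them) are all assumptions made for illustration.

```python
import re

# Illustrative approximation of the cleanup modes described above.
# NOT Spark NLP's implementation: the function name and the exact
# regular expressions are assumptions made for this sketch.
def cleanup(text: str, mode: str = "disabled") -> str:
    if mode == "disabled":
        # source kept as original
        return text
    if mode == "inplace":
        # replace each real new line, carriage return or tab with a space
        return re.sub(r"[\n\r\t]", " ", text)
    if mode == "inplace_full":
        # same, plus stringified escapes (a literal backslash followed by n, r or t)
        return re.sub(r"[\n\r\t]|\\[nrt]", " ", text)
    if mode == "shrink":
        # collapse any run of whitespace into one space and trim the ends
        return re.sub(r"\s+", " ", text).strip()
    if mode == "shrink_full":
        # same, also treating stringified escapes as whitespace
        return re.sub(r"(?:\s|\\[nrt])+", " ", text).strip()
    raise ValueError(f"unknown cleanup mode: {mode}")


# A sample with a real new line, a real tab, a double space,
# and a stringified new line (backslash + n) at the end.
sample = "Spark NLP\n\tis  fun\\n"

for mode in ["disabled", "inplace", "inplace_full", "shrink", "shrink_full"]:
    print(f"{mode:>12}: {cleanup(sample, mode)!r}")
```

Printing the results with repr makes the remaining whitespace and escape characters visible, which is the easiest way to compare the modes side by side.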

Here is the simplest way to use it.

Python

from sparknlp.base import *

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

doc_df = documentAssembler.transform(spark_df)

Scala

import com.johnsnowlabs.nlp._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

val doc_df = documentAssembler.transform(spark_df)

First, we define DocumentAssembler with the desired parameters and then transform the DataFrame with it. The most important point here is that the column passed to .setInputCol() must be of type String or Array[String]. It doesn’t have to be named text; just use the column’s actual name.

Example

Let’s see how it works in action. Assume that we have the following pandas data frame.

Now, we convert it into a Spark DataFrame and print the first 20 rows and its schema.

spark_df = spark.createDataFrame(df.astype(str))
spark_df.show()
>>
+--------------------+
|                text|
+--------------------+
|Genomic structure...|
|Late phase II cli...|
|Preoperative atri...|
|A method for remo...|
|Kohlschutter synd...|
|Selection of an a...|
|Conjugation with ...|
|Comparison of thr...|
|Salvage chemother...|
|State of the art ...|
|Monomeric sarcosi...|
|Visual recognitio...|
|A unified nomencl...|
|Novel synthesis a...|
|Medical roles in ...|
|Quantitative dete...|
|Circulating Level...|
|Ipsilateral head ...|
|Do subspecialized...|
|Expectant managem...|
+--------------------+
only showing top 20 rows

spark_df.printSchema()
>>
root
 |-- text: string (nullable = true)

Now we call DocumentAssembler() to create another column in the Document type.

import sparknlp
spark = sparknlp.start() # start spark session

from sparknlp.base import *

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

doc_df = documentAssembler.transform(spark_df)
doc_df.show()
>>
+--------------------+--------------------+
|                text|            document|
+--------------------+--------------------+
|Genomic structure...|[[document, 0, 62...|
|Late phase II cli...|[[document, 0, 14...|
|Preoperative atri...|[[document, 0, 88...|
|A method for remo...|[[document, 0, 88...|
|Kohlschutter synd...|[[document, 0, 33...|
|Selection of an a...|[[document, 0, 10...|
|Conjugation with ...|[[document, 0, 96...|
|Comparison of thr...|[[document, 0, 17...|
|Salvage chemother...|[[document, 0, 13...|
|State of the art ...|[[document, 0, 12...|
|Monomeric sarcosi...|[[document, 0, 10...|
|Visual recognitio...|[[document, 0, 70...|
|A unified nomencl...|[[document, 0, 12...|
|Novel synthesis a...|[[document, 0, 11...|
|Medical roles in ...|[[document, 0, 47...|
|Quantitative dete...|[[document, 0, 92...|
|Circulating Level...|[[document, 0, 82...|
|Ipsilateral head ...|[[document, 0, 78...|
|Do subspecialized...|[[document, 0, 10...|
|Expectant managem...|[[document, 0, 66...|
+--------------------+--------------------+
only showing top 20 rows

As you can see, a new column, ‘document’, of Document type has been created. Now let’s print the schema again to see the contents of this new column.

doc_df.printSchema()
>>
root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |    |    |-- sentence_embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)

The new column is an array of structs with the fields shown above. All annotators and transformers come with this same universal structure, whose fields are filled in down the road depending on the annotators being used. Unless you want to append other Spark NLP annotators to DocumentAssembler(), you don’t need to know what all these fields mean for now; we will talk about them in the following articles. You can access any of these fields with {column name}.{field name}.
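
To make this structure more tangible, the schema can be mirrored in plain Python, with no Spark required. This is only a sketch: the class Annotation and the helper assemble_document are hypothetical names, and setting end to the inclusive index of the last character is an assumption that the truncated output above does not confirm.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """Plain-Python mirror of one element of the 'document' array (illustrative only)."""
    annotatorType: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: List[float] = field(default_factory=list)
    sentence_embeddings: List[float] = field(default_factory=list)


def assemble_document(text: str) -> List[Annotation]:
    # Hypothetical helper: wrap the whole text in a single 'document' annotation.
    # Assumption: 'end' is the inclusive index of the last character.
    return [Annotation("document", 0, len(text) - 1, text)]


# One row of a DataFrame, modelled as a dict of column name -> value.
text = "The human KCNJ9 is a member of the GIRK channel family."
row = {"text": text, "document": assemble_document(text)}

# Equivalent in spirit to selecting "document.result": because the column is an
# array of structs, picking one field yields one value per array element.
results = [annotation.result for annotation in row["document"]]
print(results)
```

Note that selecting a field from an array-of-structs column returns a list with one entry per annotation, which is why the result of a document column comes back wrapped in a list.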

Let’s print out the first item’s result.

doc_df.select("document.result").take(1)
>>
[Row(result=['The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.'])]

If we would like to flatten the document column, we can do as follows.

Python

import pyspark.sql.functions as F

doc_df.withColumn("tmp", F.explode("document")) \
    .select("tmp.*") \
    .show()

Scala

import org.apache.spark.sql.functions._

doc_df.withColumn("tmp", explode(col("document"))).select("tmp.*").show()
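
To see what this flattening does without running Spark, the same idea can be sketched in plain Python. The function explode_and_flatten and the sample rows are hypothetical and exist only for illustration: each element of the array column becomes its own row, and only the fields of that element are kept, as with select("tmp.*").

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def explode_and_flatten(rows: List[Row], column: str) -> List[Row]:
    """Emit one output row per element of `column`, keeping only that element's fields."""
    flat: List[Row] = []
    for row in rows:
        elements: Optional[List[Row]] = row.get(column)
        # Like Spark's explode, rows whose array is empty or null produce no output.
        for element in elements or []:
            flat.append(dict(element))
    return flat


# Hypothetical sample rows shaped like the DataFrame above.
rows = [
    {
        "text": "First text",
        "document": [
            {"annotatorType": "document", "begin": 0, "end": 9, "result": "First text"}
        ],
    },
    {"text": "", "document": []},
]

flat_rows = explode_and_flatten(rows, "document")
for flat_row in flat_rows:
    print(flat_row)
```

Since DocumentAssembler() produces a single annotation per row, flattening its output gives one row per input text; the step becomes more interesting with annotators that emit many annotations per row.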

Output

Conclusion

In this article, we introduced you to DocumentAssembler(), one of the most essential transformers in the Spark NLP library. It is the entry point for getting your data in so that it can be processed further by annotators. On its own, however, its output is of little use; it only becomes meaningful once it is linked to annotators in a pipeline. In the following articles, we will talk about how you can apply certain NLP tasks on top of DocumentAssembler().

We hope that you already read the previous articles on our official Medium page, and started to play with Spark NLP. Here are the links for the other articles. Don’t forget to follow our page and stay tuned!

Introduction to Spark NLP: Foundations and Basic Components (Part-I)

Introduction to Spark NLP: Installation and Getting Started (Part-II)

spark-nlp

Natural Language Understanding Library for Apache Spark.

Written by Veysel Kocaman, Senior Data Scientist and PhD Researcher in ML
