Cleaning and extracting text from HTML/XML documents using Spark NLP

Spark NLP is an open-source text processing library for advanced natural language processing, available for the Python, Java, and Scala programming languages. The library has delivered best-performing, academically peer-reviewed results for two years in a row, and its community is growing fast (2.5M downloads and 9x growth in 2020).


Today I’m going to talk about a new annotator that was added in the latest release: the DocumentNormalizer.

Why do we need a document normalizer?

The Spark NLP community expressed the need for an annotator capable of directly processing input HTML/XML documents to clean them or extract specific contents.

Imagine you have a collection of raw HTML documents you just crawled from a given data source with your preferred crawler library, and you want to remove all the tags to focus on their text content.

Please don’t call the ghostbusters, just use the brand new Spark NLP DocumentNormalizer annotator! :D

But wait, what is an annotator? o.O
Let’s look at the definition to get an idea.

In Spark NLP, all Annotators are either Estimators or Transformers as we see in Spark ML. An Estimator in Spark ML is an algorithm that can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator that trains on a DataFrame and produces a model. A Transformer is an algorithm that can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.
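To make the distinction concrete, here is a minimal Spark ML sketch (not Spark NLP specific, and the Tokenizer here is Spark ML’s, not the Spark NLP Tokenizer used later; text_df is an assumed DataFrame with a string column “text”):

from pyspark.ml.feature import Tokenizer, CountVectorizer

# A Transformer maps one DataFrame directly to another.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokens_df = tokenizer.transform(text_df)

# An Estimator must first be fit on a DataFrame; fitting produces a Transformer (a model).
cv = CountVectorizer(inputCol="words", outputCol="features")
cv_model = cv.fit(tokens_df)            # Estimator -> Transformer
features_df = cv_model.transform(tokens_df)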

Let’s load some data to a text column in your input Spark SQL DataFrame:

path = "html-docs"

data = spark.sparkContext.wholeTextFiles(path)
df = data.toDF(schema=["filename", "text"]).select("text")

df.show()
...+--------------------+
| text|
+--------------------+
|<div class='w3-co...|
|<span style="font...|
|<!DOCTYPE html> <...|
+--------------------+

Once your input DataFrame is loaded, you can define the next pipeline stages:

from sparknlp.base import *
from sparknlp.annotator import *

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

inputColName = "document"
outputColName = "normalizedDocument"

action = "clean"               # remove whatever matches the patterns
cleanUpPatterns = ["<[^>]*>"]  # regex matching any HTML/XML tag
replacement = " "              # string put in place of each match
removalPolicy = "pretty_all"
encoding = "UTF-8"

documentNormalizer = DocumentNormalizer() \
    .setInputCols(inputColName) \
    .setOutputCol(outputColName) \
    .setAction(action) \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(replacement) \
    .setPolicy(removalPolicy) \
    .setLowercase(True) \
    .setEncoding(encoding)

Let’s go over the parameters set in this example.

setInputCols and setOutputCol define the annotation columns the DocumentNormalizer reads from and writes to. setAction selects the operation to apply: "clean" removes whatever matches the given patterns, while "extract" keeps only the matched contents. setPatterns takes the list of regular expressions to apply; here "<[^>]*>" matches any HTML/XML tag. setReplacement defines the string that replaces each match when cleaning, setPolicy sets the removal policy ("pretty_all" in this example), setLowercase lowercases the resulting text, and setEncoding declares the character encoding of the input.

Thanks to the new DocumentNormalizer annotator, we can apply the regular expressions we have chosen to normalize the document and prepare it for the following stages. In this example, our goal is therefore to extract the text contained within the HTML tags.

With the DataFrame input column “text” containing the documents, let’s build and execute a simple pipeline with the following code:

from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

cleanUpPatterns = ["<[^>]*>"]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)

docPatternRemoverPipeline = Pipeline() \
    .setStages([
        documentAssembler,
        documentNormalizer
    ])

ds = docPatternRemoverPipeline.fit(df).transform(df)

# show the normalized document to visualize the annotator's action
ds.select("normalizedDocument").show(1, False)

As the resulting output shows, the content of the HTML tags has been extracted, lowercased, and written to the output column “normalizedDocument”, and we can now use it as the input of the next stages in a Spark NLP pipeline.
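If you only need the cleaned string rather than the full annotation struct, one option (a minimal sketch, assuming the standard Spark NLP annotation schema with a “result” field) is to explode the result field of the annotation column:

# the annotation column is an array of structs; "result" holds the cleaned text
ds.selectExpr("explode(normalizedDocument.result) as cleanedText") \
    .show(1, truncate=80)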

To provide a more complex example, the DocumentNormalizer annotator can be used as a text preparation step before the SentenceDetector, followed by the Tokenizer, as shown in the following pipeline:

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

cleanUpPatterns = ["<[^>]*>"]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)

sentenceDetector = SentenceDetector() \
    .setInputCols(["normalizedDocument"]) \
    .setOutputCol("sentence")

# the Tokenizer is fitted by the Pipeline below, so no explicit fit(df) is needed here
regexTokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

docPatternRemoverPipeline = Pipeline() \
    .setStages([
        documentAssembler,
        documentNormalizer,
        sentenceDetector,
        regexTokenizer])

ds = docPatternRemoverPipeline.fit(df).transform(df)

ds.select("normalizedDocument").show(10)
...

This pipeline processes your HTML documents, applies the document normalization according to your parameter settings, and chains the cleaning action with the sentence detector and regex tokenizer, so that the output contains clean tokens ready for further processing.
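To hand those tokens over to code outside Spark NLP, one convenient option (a minimal sketch using Spark NLP’s Finisher; the output column name “token_features” is just an illustrative choice) is to convert the annotations back into plain string arrays:

from sparknlp.base import Finisher

# convert the "token" annotations into a plain array-of-strings column
finisher = Finisher() \
    .setInputCols(["token"]) \
    .setOutputCols(["token_features"]) \
    .setCleanAnnotations(True)

finishedPipeline = Pipeline().setStages([
    documentAssembler,
    documentNormalizer,
    sentenceDetector,
    regexTokenizer,
    finisher])

finishedPipeline.fit(df).transform(df) \
    .select("token_features").show(10, False)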

Other interesting use cases in which the DocumentNormalizer can be useful:

Cleaning a specific tag and its contents, e.g. removing every <p>...</p> block:

action = "clean"
tag = "p"
patterns = ["<"+tag+"(.+?)>(.+?)<\\/"+tag+">"]

Obfuscating PII, e.g. replacing e-mail addresses:

action = "clean"
patterns = ["([^.@\\s]+)(\\.[^.@\\s]+)*@([^.@\\s]+\\.)+([^.@\\s]+)"]
replacement = "***OBFUSCATED PII***"

Extracting the contents of a specific XML tag, e.g. <name>:

action = "extract"
tag = "name"
patterns = [tag]

A complete pipeline for the extraction case is sketched below.
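As an illustration of that last case, here is a minimal sketch, assuming the df built above now holds XML documents containing <name> tags (the column name “extractedName” is just an illustrative choice):

# extract the text enclosed in <name>...</name> tags from XML documents
xmlExtractor = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("extractedName") \
    .setAction("extract") \
    .setPatterns(["name"])

xmlPipeline = Pipeline().setStages([documentAssembler, xmlExtractor])

xmlPipeline.fit(df).transform(df) \
    .selectExpr("explode(extractedName.result) as name") \
    .show(truncate=False)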

Hope this article was useful! Have fun with the brand new Spark NLP release!


Tech Lead Data and AI, Senior Data Scientist in Fintech and Spark NLP contributor.