Text Cleaning: Standard Text Normalization with Spark NLP
An Introduction to Normalization Techniques in Spark NLP
TL;DR: The Normalizer annotator in Spark NLP performs text normalization on data. It is used to remove all dirty characters from text following a regex pattern, convert text to lowercase, remove special characters, punctuation, and transform words based on a provided dictionary.
The Normalizer annotator in Spark NLP is often used as part of a preprocessing step in NLP pipelines to improve the accuracy and quality of downstream analyses and models. It is a configurable component that can be customized based on specific use cases and requirements.
Normalization refers to the process of converting text to a standard format, which can help improve the accuracy and efficiency of subsequent natural language processing (NLP) tasks. Normalization typically involves removing or replacing certain characters or words, converting text to lowercase or uppercase, removing punctuation etc.
The Normalizer Annotator works by taking a token, and then applying a series of text transformations to it. These transformations can be configured by the user to meet their specific needs. Once the transformations have been applied, the output is returned as a normalized token.
P.S.: in Spark NLP, a token is defined as a basic unit of text that has been segmented from a larger body of text. Tokens are typically words, but they can also be punctuation, numbers, or other meaningful units of text. Tokenization is the process of breaking down text into tokens.
In this post, we will discuss how to use Spark NLP’s Normalizer
annotator to perform normalization efficiently.
Let us start with a short Spark NLP introduction and then discuss the details of the normalizing techniques with some solid results.
Introduction to Spark NLP
Spark NLP is an open-source library maintained by John Snow Labs. It is built on top of Apache Spark and Spark ML and provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment.
Since its first release in July 2017, Spark NLP has grown in a full NLP tool, providing:
- A single unified solution for all your NLP needs
- Transfer learning and implementing the latest and greatest SOTA algorithms and models in NLP research
- The most widely used NLP library in industry (5 years in a row)
- The most scalable, accurate and fastest library in NLP history
Spark NLP comes with 17,800+ pretrained pipelines and models in more than 250+ languages. It supports most of the NLP tasks and provides modules that can be used seamlessly in a cluster.
Spark NLP processes the data using Pipelines
, structure that contains all the steps to be run on the input data:
Each step contains an annotator that performs a specific task such as tokenization, normalization, and dependency parsing. Each annotator has input(s) annotation(s) and outputs new annotation.
An annotator in Spark NLP is a component that performs a specific NLP task on a text document and adds annotations to it. An annotator takes an input text document and produces an output document with additional metadata, which can be used for further processing or analysis. For example, a named entity recognizer annotator might identify and tag entities such as people, organizations, and locations in a text document, while a sentiment analysis annotator might classify the sentiment of the text as positive, negative, or neutral.
Setup
To install Spark NLP in Python, simply use your favorite package manager (conda, pip, etc.). For example:
pip install spark-nlp
pip install pyspark
For other installation options for different environments and machines, please check the official documentation.
Then, simply import the library and start a Spark session:
import sparknlp
# Start Spark Session
spark = sparknlp.start()
Normalizer
In NLP, a Normalizer is a component or module that is used to transform input text into a standard or normalized form. This is typically done to make text more consistent and easier to process.
Normalizer
annotator in Spark NLP expects TOKEN
as input, and then will provide TOKEN
as output. This means that, input data will be in the form of individual tokens, and the output (the predictions) will be in the form of individual tokens (normalized) as well.
Thus, we need the previous steps to generate the tokens that will be used as an input to our annotator.
# Import the required modules and classes
from sparknlp.base import DocumentAssembler, Pipeline, LightPipeline
from sparknlp.annotator import (
Tokenizer,
Normalizer
)
import pyspark.sql.functions as F
# Step 1: Transforms raw texts to `document` annotation
document_assembler = (
DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
)
# Step 2: Gets the tokens of the text
tokenizer = (
Tokenizer()
.setInputCols(["document"])
.setOutputCol("token")
)
# Step 3: Normalizes the tokens
normalizer = (
Normalizer()
.setInputCols(["token"])
.setOutputCol("normalized")
)
pipeline = Pipeline(stages=[document_assembler, tokenizer, normalizer])
We use a sample text to see the performance of the Normalizer
annotator, then fit and transform the dataframe to get the results:
# Create a dataframe from the sample_text
text = "John and Peter are brothers.
However they don't support each other that much.
John is 20 years old and Peter is 26"
data = spark.createDataFrame([[text]]) \
.toDF("text")
# Fit and transform to get a prediction
model = pipeline.fit(data)
result = model.transform(data)
# Display the results
result.selectExpr("normalized.result").show(truncate = False)
The default settings of the Normalizer
annotator will only keep alphabet letters ([^A-Za-z]), so the numbers in the sample text are removed.
Using LightPipeline
LightPipeline is a Spark NLP specific Pipeline class equivalent to Spark ML Pipeline. The difference is that its execution does not hold to Spark principles, instead it computes everything locally (but in parallel) in order to achieve fast results when dealing with small amounts of data. This means, we do not input a Spark Dataframe, but a string or an Array of strings instead, to be annotated.
We can show the results in a Pandas DataFrame by running the following code:
import pandas as pd
light_model = LightPipeline(model)
light_result = light_model.fullAnnotate(text)
results_tabular = []
for res in light_result[0]["normalized"]:
results_tabular.append(
(
res.begin,
res.end,# res.result.token,
res.result,
)
)
pd.DataFrame(results_tabular, columns=["begin", "end", "chunk"])
Additional Parameters
We can include additional parameters to the Normalizer in order to improve the results. For example, CleanupPatterns is a feature that allows you to specify a set of regular expressions to clean and preprocess your text data. By default, the annotator keeps only the alphabet letters, but you can create your own set of CleanupPatterns to include specific characters or patterns that you want.
In the code snippet below, (["""[^\w\d\s]"""])
removes all non-word, non-digit and non-space characters.
Also, the code will make sure that the normalized tokens will be lower case, and only the words with 3 or 4 characters will be selected.
normalizer = (
Normalizer()
.setInputCols(["token"])
.setOutputCol("normalized")
.setCleanupPatterns(["""[^\w\d\s]"""])
.setLowercase(True)
.setMaxLength(4)
.setMinLength(3) #Default = 0
)
pipeline = Pipeline(stages=[document_assembler, tokenizer, normalizer])
# Run the same dataframe as above and display the results
model = pipeline.fit(data)
result = model.transform(data)
result.selectExpr("normalized.result").show(truncate = False)
And the Light Pipeline results will be produced by:
light_model = LightPipeline(model)
light_result = light_model.fullAnnotate(text)
results_tabular = []
for res in light_result[0]["normalized"]:
results_tabular.append(
(
res.begin,
res.end,
res.result,
))
pd.DataFrame(results_tabular, columns=["begin", "end", "chunk"])
Adding a Slang Dictionary
Another useful tool, especially when working with informal language or text from social media, is the option to use the setSlangDictionary parameter. It allows you to provide a list of words that are commonly used in a specific context or domain, but may not be recognized by standard language models. By providing this list, you can ensure that the model is able to accurately process and understand the text.
Here, we first create a csv file using the slangs list defined below, save the file:
slang_text = """Hey, what's good fam? I'm just chillin with my squad.
We're about to get to this dope party."""
#Create a Demo Slang CSV File
import csv
field_names = ['Slang', 'Correct_Word']
slangs = [
{'Slang': "fam", 'Correct_Word': 'friends'},
{'Slang': "chillin", 'Correct_Word': 'relaxing'},
{'Slang': "squad", 'Correct_Word': 'group of friends'},
{'Slang': "dope", 'Correct_Word': 'excellent'}
]
with open('slangs.csv', 'w') as csvfile:
writer = csv.DictWriter(csvfile, fieldnames = field_names)
writer.writeheader()
writer.writerows(slangs)
Now, use the setSlangDictionary parameter to define the file including the slang words:
normalizer = Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normalized") \
.setSlangDictionary("/content/slangs.csv" ,",")
pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
normalizer
])
data = spark.createDataFrame([[slang_text]]) \
.toDF("text")
model = pipeline.fit(data)
result = model.transform(data)
result.selectExpr("normalized.result").show(truncate = False)
And the Light Pipeline results will be produced by:
light_model = LightPipeline(model)
light_result = light_model.fullAnnotate(slang_text)
results_tabular = []
for res in light_result[0]["normalized"]:
results_tabular.append(
(
res.begin,
res.end,
res.result,
))
pd.DataFrame(results_tabular, columns=["begin", "end", "chunk"])
One-liner alternative
In October 2022, John Snow Labs released the open-source johnsnowlabs
library that contains all the company products, open-source and licensed, under one common library. This simplified the workflow especially for users working with more than one of the libraries (e.g., Spark NLP + Healthcare NLP). This new library is a wrapper on all John Snow Lab’s libraries, and can be installed with pip
:
pip install johnsnowlabs
Please check the official documentation for more examples and usage of this library.
To normalize a text with one line of code, we can simply run the following code
from johnsnowlabs import nlp
text = '@CKL_IT says that #normalizers are pretty useful to clean
#structured_strings in #NLU like tweets'
nlp.load('norm').predict(text)
The one-liner is based on default models for each NLP task. Depending on your requirements, you may want to use the one-liner for simplicity or customizing the pipeline to choose specific models that fit your needs.
NOTE: when using only the
johnsnowlabs
library, make sure you initialize the spark session with the configuration you have available. Since some of the libraries are licensed, you may need to set the path to your license file. If you are only using the open-source library, you can start the session withspark = nlp.start(nlp=False)
. The default parameters for the start function includes using the licensed Healthcare NLP library withnlp=True
, but we can set that toFalse
and use all the resources of the open-source libraries such as Spark NLP, Spark NLP Display, and NLU.
For additional information, please check the following references.
- Documentation : Normalizer
- Python Docs : Normalizer
- Scala Docs : Normalizer
- For extended examples of usage, see the Spark NLP Workshop repository.
Conclusion
Normalizer
in Spark NLP is a useful annotator that performs text normalization on data. It is an important preprocessing step in NLP pipelines that can help improve the accuracy and quality of downstream analyses and models. By standardizing the text data, the Normalizer
annotator can reduce the noise and variability in the text, making it easier to analyze and extract meaningful information.
The Normalizer
annotator in Spark NLP is a configurable component that can be customized based on specific use cases and requirements. It offers a variety of options for text normalization, such as converting text to lowercase, removing special characters, punctuation, and stop words, and even defining a slang dictionary.