
Matching Regex Patterns Using Spark NLP

Halil Saglamlar · 7 min read · Jan 26, 2023

Introduction

In this post, we will delve into the power of the Spark NLP RegexMatcher annotator. The RegexMatcher annotator is a powerful tool for extracting information from text using regular expressions: it lets you define custom patterns to match against the text, making it a versatile and flexible option for a wide range of use cases.

One of the key benefits of using the RegexMatcher annotator is that it allows you to extract information that may not be covered by other built-in annotators. For example, you could use it to extract specific types of information, such as email addresses or phone numbers, that the built-in named entity recognition annotators do not extract (we sketch such rules after the first example below).

In this post, we will explore the various parameters of the RegexMatcher annotator, and provide examples of how to use it to extract various types of information from text. We will cover how to set one or more regex rules and assign an identifier for each regex rule, and also how to create and use an external regex rules file. Additionally, we will show how to change the matching strategy of the RegexMatcher annotator to suit the specific needs of your use case.

📜 Background

RegexMatcher uses rules to match a set of regular expressions and associate them with a provided identifier.

A rule consists of a regex pattern and an identifier, delimited by a character of choice. An example is \d{4}/\d\d/\d\d,date, which matches strings like 1970/01/01 to the identifier date.

Rules must be provided by either setRules() (followed by setDelimiter()) or an external rules file.

To use an external file, a dictionary of predefined regular expressions must be provided with setExternalRules().
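Put differently, the same rule can be supplied either inline or from a file. Here is a minimal sketch (rules.txt is a hypothetical file name):

# 1) Inline rules with an explicit delimiter:
RegexMatcher() \
    .setRules(["\d{4}/\d\d/\d\d,date"]) \
    .setDelimiter(",")

# 2) The same rule read from a file whose lines look like "\d{4}/\d\d/\d\d,date":
RegexMatcher() \
    .setExternalRules(path="rules.txt", delimiter=",")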

🎬 Setup

pip install spark-nlp

import sparknlp
from sparknlp.annotator import DocumentAssembler, RegexMatcher
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
from pyspark.ml import Pipeline

# Start Spark Session
spark = sparknlp.start()

🖨️ Input/Output Annotation Types:

  • Input Annotation types: DOCUMENT
  • Output Annotation type: CHUNK

🔎 Parameters

  • rules: (StringArrayParam) Rules with regex pattern and identifiers for matching.
  • externalRules: (StringArrayParam) External resource containing the rules; needs 'delimiter' in options.
  • delimiter: (String) Delimiter for rules provided with setRules.
  • strategy: (String) Strategy for which to match the expressions. (Default: "MATCH_ALL")

Setting the Regex Patterns with the setRules Parameter

Here \d{4}\/\d\d\/\d\d,date is a date rule. In this rule, the regex pattern and the identifier are separated by the delimiter ,.

We need to add this rule with setRules() and provide the delimiter with setDelimiter().

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

regex_matcher = RegexMatcher() \
    .setRules(["\d{4}\/\d\d\/\d\d,date", "\s\d{2}\/\d\d\/\d\d,short_date"]) \
    .setDelimiter(",") \
    .setInputCols(["document"]) \
    .setOutputCol("matched_text") \
    .setStrategy("MATCH_ALL")


nlpPipeline = Pipeline(stages=[documentAssembler, regex_matcher])

text_list = ["Today is 2010/10/10.",
             "She was born on 1966/02/03.",
             "The project started on 89/01/01 and ended on 89/04/25."]

spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

result = nlpPipeline.fit(spark_df).transform(spark_df)
result.select('text','matched_text.result').show(truncate=120)
+------------------------------------------------------+----------------------+
| text| result|
+------------------------------------------------------+----------------------+
| Today is 2010/10/10.| [2010/10/10]|
| She was born on 1966/02/03.| [1966/02/03]|
|The project started on 89/01/01 and ended on 89/04/25.|[ 89/01/01, 89/04/25]|
+------------------------------------------------------+----------------------+

result.select('matched_text').show(truncate=False)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|matched_text |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 9, 18, 2010/10/10, {identifier -> date, sentence -> 0, chunk -> 0}, []}] |
|[{chunk, 16, 25, 1966/02/03, {identifier -> date, sentence -> 0, chunk -> 0}, []}] |
|[{chunk, 22, 30, 89/01/01, {identifier -> short_date, sentence -> 0, chunk -> 0}, []}, {chunk, 44, 52, 89/04/25, {identifier -> short_date, sentence -> 0, chunk -> 1}, []}]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Showing the results with their identifiers:

result.select(F.explode(F.arrays_zip(result.matched_text.result,
                                     result.matched_text.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("Matches Found"),
            F.expr("cols['1']['identifier']").alias("identifier")).show()

+-------------+----------+
|Matches Found|identifier|
+-------------+----------+
| 2010/10/10| date|
| 1966/02/03| date|
| 89/01/01|short_date|
| 89/04/25|short_date|
+-------------+----------+
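
As mentioned in the introduction, the same mechanism covers patterns such as email addresses or phone numbers. Below is a minimal sketch; the rules, the identifiers email and phone, and the ~ delimiter are our own illustrative choices, not built-in definitions. A delimiter other than , is handy whenever a pattern itself contains a comma.

# Hypothetical rules for emails and US-style phone numbers.
# "~" is chosen as the delimiter to avoid clashing with characters in the patterns.
contact_matcher = RegexMatcher() \
    .setRules(["[\w.+-]+@[\w-]+\.[\w.]+~email",
               "\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}~phone"]) \
    .setDelimiter("~") \
    .setInputCols(["document"]) \
    .setOutputCol("contacts") \
    .setStrategy("MATCH_ALL")

contact_pipeline = Pipeline(stages=[documentAssembler, contact_matcher])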

Providing an External Regex Patterns File Using the setExternalRules Parameter

To use an external file that includes a dictionary of predefined regular expressions, we must use setExternalRules(). The dictionary is a delimited text file whose lines follow the same rule format as setRules().

rules = '''
Quantum\s\w+, started with 'Quantum'
\w+\smillion, followed with 'million'
[A-Z]{2}[A-Z]*, all capital words
\w*ly\b, ending with 'ly'
\S*\d+\S*, match any word that contains numbers
\$\d+, money related numbers
'''

with open('regex_rules.txt', 'w') as f:
    f.write(rules)

text_list = ["""Quantum computing is the use of quantum-mechanical phenomena such as superposition and entanglement to perform computation.
Computers that perform quantum computations are known as quantum computers.
Quantum computers are believed to be able to solve certain computational problems, such as integer factorization (which underlies RSA encryption), substantially faster than classical computers.
The study of quantum computing is a subfield of quantum information science. Quantum computing began in the early 1980s, when physicist Paul Benioff proposed a quantum mechanical model of the Turing machine.
Richard Feynman and Yuri Manin later suggested that a quantum computer had the potential to simulate things that a classical computer could not.
In 1994, Peter Shor developed a quantum algorithm for factoring integers that had the potential to decrypt RSA-encrypted communications.
Despite ongoing experimental progress since the late 1990s, most researchers believe that "fault-tolerant quantum computing is still a rather distant dream."
In recent years, investment into quantum computing research has increased in both the public and private sector.
On 23 October 2019, Google AI, in partnership with the U.S. National Aeronautics and Space Administration (NASA), published a paper in which they claimed to have achieved quantum supremacy.
While some have disputed this claim, it is still a significant milestone in the history of quantum computing.""",

"""Instacart has raised a new round of financing that makes it one of the most valuable private companies in the U.S., leapfrogging DoorDash, Palantir and Robinhood.
Amid surging demand for grocery delivery due to the coronavirus pandemic, Instacart has raised $225 million in a new funding round led by DST Global and General Catalyst.
The round increases Instacart’s valuation to $13.7 billion, up from $8 billion when it last raised money in 2018.""",

"""Quantum computing"""
]

spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

Below is the pipeline. We need to define the path and delimiter parameters in the setExternalRules() function.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

regex_matcher = RegexMatcher() \
    .setInputCols('document') \
    .setStrategy("MATCH_ALL") \
    .setOutputCol("matched_text") \
    .setExternalRules(path='regex_rules.txt', delimiter=',')


nlpPipeline = Pipeline(stages=[documentAssembler, regex_matcher])

result = nlpPipeline.fit(spark_df).transform(spark_df)
result.select('text','matched_text.result').show(truncate=100)
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
| text| result|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|Quantum computing is the use of quantum-mechanical phenomena such as superposition and entangleme...|[Quantum computing, Quantum computers, Quantum computing, RSA, RSA, AI, NASA, substantially, earl...|
|Instacart has raised a new round of financing that makes it one of the most valuable private comp...| [225 million, DST, Cataly, $225, $13.7, $8, 2018., $225, $13, $8]|
| Quantum computing| [Quantum computing]|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+

Displaying the results with their identifiers:

result.select(F.explode(F.arrays_zip(result.matched_text.result,
                                     result.matched_text.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("Matches Found"),
            F.expr("cols['1']['identifier']").alias("matching_regex/string")).show(25, truncate=False)
+-----------------+------------------------------------+
|Matches Found |matching_regex/string |
+-----------------+------------------------------------+
|Quantum computing|started with 'Quantum' |
|Quantum computers|started with 'Quantum' |
|Quantum computing|started with 'Quantum' |
|RSA |all capital words |
|RSA |all capital words |
|AI |all capital words |
|NASA |all capital words |
|substantially |ending with 'ly' |
|early |ending with 'ly' |
|1980s, |match any word that contains numbers|
|1994, |match any word that contains numbers|
|1990s, |match any word that contains numbers|
|23 |match any word that contains numbers|
|2019, |match any word that contains numbers|
|225 million |followed with 'million' |
|DST |all capital words |
|Cataly |ending with 'ly' |
|$225 |match any word that contains numbers|
|$13.7 |match any word that contains numbers|
|$8 |match any word that contains numbers|
|2018. |match any word that contains numbers|
|$225 |money related numbers |
|$13 |money related numbers |
|$8 |money related numbers |
|Quantum computing|started with 'Quantum' |
+-----------------+------------------------------------+

Setting Matching Behaviour with setStrategy

setStrategy() sets the matching strategy; the default is MATCH_ALL.

It can be one of MATCH_FIRST, MATCH_ALL, or MATCH_COMPLETE:

  • MATCH_FIRST: gets the first match of each rule
  • MATCH_ALL: gets all matches of each rule
  • MATCH_COMPLETE: gets a match only if the pattern matches the entire input

We can inspect the defaults with extractParamMap():

RegexMatcher().extractParamMap()

{Param(parent='RegexMatcher_e4ecde7dc69c', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
Param(parent='RegexMatcher_e4ecde7dc69c', name='strategy', doc='MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE'): 'MATCH_ALL'}

Let’s compare MATCH_FIRST & MATCH_ALL

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

regex_matcher_first = RegexMatcher() \
    .setInputCols('document') \
    .setStrategy("MATCH_FIRST") \
    .setOutputCol("matched_text_first") \
    .setExternalRules(path='regex_rules.txt', delimiter=',')

regex_matcher_all = RegexMatcher() \
    .setInputCols('document') \
    .setStrategy("MATCH_ALL") \
    .setOutputCol("matched_text_all") \
    .setExternalRules(path='regex_rules.txt', delimiter=',')


nlpPipeline = Pipeline(stages=[documentAssembler,
                               regex_matcher_first,
                               regex_matcher_all])

result = nlpPipeline.fit(spark_df).transform(spark_df)
result.select(result.matched_text_first.result, result.matched_text_all.result).show(truncate=120)

+-----------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| matched_text_first.result| matched_text_all.result|
+-----------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|[Quantum computing, RSA, substantially, 1980s,]|[Quantum computing, Quantum computers, Quantum computing, RSA, RSA, AI, NASA, substantially, early, 1980s,, 1994,, 19...|
| [225 million, DST, Cataly, $225, $225]| [225 million, DST, Cataly, $225, $13.7, $8, 2018., $225, $13, $8]|
| [Quantum computing]| [Quantum computing]|
+-----------------------------------------------+------------------------------------------------------------------------------------------------------------------------+

With MATCH_FIRST, only the first match of each rule is returned, while MATCH_ALL returns all matches.

Now let's look at MATCH_COMPLETE. With this strategy, the regex pattern must match the entire input.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

regex_matcher_complete = RegexMatcher() \
    .setInputCols('document') \
    .setStrategy("MATCH_COMPLETE") \
    .setOutputCol("matched_text_complete") \
    .setRules(["\d{4},year"]) \
    .setDelimiter(",")

regex_matcher_all = RegexMatcher() \
    .setInputCols('document') \
    .setStrategy("MATCH_ALL") \
    .setOutputCol("matched_text_all") \
    .setRules(["\d{4},year"]) \
    .setDelimiter(",")


nlpPipeline = Pipeline(stages=[documentAssembler,
                               regex_matcher_complete,
                               regex_matcher_all])

text_list = ["2010",
             "She was born on 1966/02/03.",
             "The project started in 2001."]

spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

result = nlpPipeline.fit(spark_df).transform(spark_df)
result.select('text','matched_text_all.result','matched_text_complete.result').show(truncate=120)
+----------------------------+------+------+
| text|result|result|
+----------------------------+------+------+
| 2010|[2010]|[2010]|
| She was born on 1966/02/03.|[1966]| []|
|The project started in 2001.|[2001]| []|
+----------------------------+------+------+

With MATCH_COMPLETE, only the first document has a regex match. This is because the first document consists solely of a year, and MATCH_COMPLETE requires the pattern to match the entire input.
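
For intuition, MATCH_COMPLETE behaves roughly like Python's re.fullmatch, while MATCH_ALL behaves like re.findall. This is only an analogy, not a statement about Spark NLP's internals:

import re

# MATCH_COMPLETE analogue: the pattern must cover the whole string.
print(re.fullmatch(r"\d{4}", "2010"))                          # <re.Match ... '2010'>
print(re.fullmatch(r"\d{4}", "The project started in 2001."))  # None

# MATCH_ALL analogue: every occurrence anywhere in the string.
print(re.findall(r"\d{4}", "The project started in 2001."))    # ['2001']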

Conclusion

The Spark NLP RegexMatcher annotator is a powerful tool for extracting information from text using regular expressions. It allows you to define custom patterns to match against the text, making it a versatile and flexible option for a wide range of use cases.

In this post, we have covered the various parameters of the RegexMatcher annotator and provided examples of how to use it to extract various types of information from text. We have seen how to set one or more regex rules and assign an identifier for each regex rule, and also how to create and use an external regex rules file. Additionally, we have shown how to change the matching strategy of the RegexMatcher annotator to suit the specific needs of your use case.

Whether you are working on a text classification task or a data analytics project, the RegexMatcher annotator is a powerful and valuable tool for extracting the information you need from your text data.
