Extracting Dates From Text Using Spark NLP
Introduction
In this post, we will dive into the world of date extraction using the Spark NLP DateMatcher and MultiDateMatcher annotators. These powerful tools allow us to easily extract dates from text and offer a wide range of options for customizing the extraction process.
One of the key features of these annotators is the ability to extract dates in multiple languages. This makes them ideal for use in international applications, or for dealing with text in multiple languages.
Another important aspect of these annotators is their ability to deal with relative dates. This means that you can extract dates like “next Wednesday” or “last week” in addition to absolute dates like “January 1, 2022”.
In addition to these advanced features, the DateMatcher and MultiDateMatcher annotators also allow you to change the input and output date formats and even set a default day for dates that lack one.
Throughout this post, we will explore the differences between the DateMatcher and MultiDateMatcher annotators and describe all of their parameters in detail, complete with examples to help you get started using them.
By the end of this post, you will have a solid understanding of how to use these powerful Spark NLP annotators to extract dates from text, and you will be able to apply them to your own projects confidently.
So, let’s get started and dive into the world of date extraction with Spark NLP!
📜 Background
DateMatcher and MultiDateMatcher extract exact and normalized dates from relative date-time phrases and convert these dates to a provided date format. DateMatcher can extract only one date per input document, while MultiDateMatcher can extract multiple dates.
Here are examples of some date entities that DateMatcher and MultiDateMatcher can match:
"1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008", "Fri, 21 Nov 1997", "Jan 21, '97", "Sun", "Nov 21", "jan 1st", "next thursday", "last wednesday", "today", "tomorrow", "yesterday", "next week", "next month", "next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.", "at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"
For example, "The 31st of April in the year 2008" will be converted into 2008/04/31.
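As a rough illustration of this normalization step (using only Python's standard library, not the annotator itself), parsing one of the example strings above and re-emitting it in the default "yyyy/MM/dd" format looks like:

```python
from datetime import datetime

# Illustration only: Spark NLP does this internally via regex patterns.
# Parse an RFC-822-style date and re-emit it in the default "yyyy/MM/dd" format.
parsed = datetime.strptime("Fri, 21 Nov 1997", "%a, %d %b %Y")
normalized = parsed.strftime("%Y/%m/%d")
print(normalized)  # 1997/11/21
```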
🎬 Setup
pip install spark-nlp
import sparknlp
from sparknlp.annotator import DocumentAssembler, DateMatcher, MultiDateMatcher
from pyspark.sql.types import StringType
from pyspark.ml import Pipeline
spark = sparknlp.start()
spark
🖨️ Inputs/Output annotation types:
- Input Annotation types:
DOCUMENT
- Output Annotation type:
DATE
🔎 Parameters
A list of parameters that these annotators can take:
- inputFormats (StringArrayParam): Date Matcher regex patterns.
- outputFormat (String): Output format of the parsed date. (Default: "yyyy/MM/dd")
- anchorDateYear (Int): Anchor year for relative dates. (Default: -1, which means the current year)
- anchorDateMonth (Int): Anchor month for relative dates. (Default: -1, which means the current month)
- anchorDateDay (Int): Anchor day for relative dates. (Default: -1, which means the current day)
- defaultDayWhenMissing (Int): Which day to set when it is missing from the parsed input. (Default: 1)
- readMonthFirst (Boolean): Whether to interpret dates as "MM/DD/YYYY" instead of "DD/MM/YYYY". (Default: True)
- sourceLanguage (String): Source language for explicit translation. (Default: "en")
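The effect of readMonthFirst can be pictured with plain Python (a sketch, not the annotator's implementation): the same string yields two different dates depending on whether the month or the day is read first.

```python
from datetime import datetime

s = "02/03/1966"
month_first = datetime.strptime(s, "%m/%d/%Y")  # like readMonthFirst=True (default)
day_first = datetime.strptime(s, "%d/%m/%Y")    # like readMonthFirst=False

print(month_first.strftime("%Y/%m/%d"))  # 1966/02/03
print(day_first.strftime("%Y/%m/%d"))    # 1966/03/02
```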
Comparing DateMatcher and MultiDateMatcher annotators
The below pipeline demonstrates the difference between DateMatcher
and MultiDateMatcher
annotators.
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
date = DateMatcher() \
.setInputCols("document") \
.setOutputCol("date") \
.setOutputFormat("yyyy/MM/dd")
multiDate = MultiDateMatcher() \
.setInputCols("document") \
.setOutputCol("multi_date") \
.setOutputFormat("MM/dd/yy")
pipeline = Pipeline().setStages([
documentAssembler,
date,
multiDate
])
text_list = ["See you on next monday.",
"She was born on 02/03/1966.",
"The project started yesterday and will finish next year.",
"She will graduate by July 2023.",
"She will visit doctor tomorrow and next month again."]
spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")
result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text","date.result as date", "multi_date.result as multi_date").show(truncate=False)
+--------------------------------------------------------+------------+--------------------+
|text |date |multi_date |
+--------------------------------------------------------+------------+--------------------+
|See you on next monday. |[2023/01/09]|[01/09/23] |
|She was born on 02/03/1966. |[1966/02/03]|[02/03/66] |
|The project started yesterday and will finish next year.|[2024/01/05]|[01/05/24, 01/04/23]|
|She will graduate by July 2023. |[2023/07/01]|[07/01/23] |
|She will visit doctor tomorrow and next month again. |[2023/02/05]|[02/05/23, 01/06/23]|
+--------------------------------------------------------+------------+--------------------+
As seen in the results above, DateMatcher returns only one date per input document, while MultiDateMatcher can return multiple dates.
We also provided different output formats for the two date matchers in the pipeline, so the dates appear in correspondingly different formats in the output.
result.select("date","multi_date").show(truncate=False)
+-------------------------------------------------+----------------------------------------------------------------------------------------------+
|date |multi_date |
+-------------------------------------------------+----------------------------------------------------------------------------------------------+
|[{date, 11, 18, 2023/01/09, {sentence -> 0}, []}]|[{date, 11, 18, 01/09/23, {sentence -> 0}, []}] |
|[{date, 16, 25, 1966/02/03, {sentence -> 0}, []}]|[{date, 16, 25, 02/03/66, {sentence -> 0}, []}] |
|[{date, 46, 54, 2024/01/05, {sentence -> 0}, []}]|[{date, 46, 54, 01/05/24, {sentence -> 0}, []}, {date, 20, 28, 01/04/23, {sentence -> 0}, []}]|
|[{date, 21, 29, 2023/07/01, {sentence -> 0}, []}]|[{date, 21, 29, 07/01/23, {sentence -> 0}, []}] |
|[{date, 35, 44, 2023/02/05, {sentence -> 0}, []}]|[{date, 35, 44, 02/05/23, {sentence -> 0}, []}, {date, 22, 29, 01/06/23, {sentence -> 0}, []}]|
+-------------------------------------------------+----------------------------------------------------------------------------------------------+
Handling relative dates
The DateMatcher and MultiDateMatcher annotators return relative dates as actual dates, but to do so they need a reference point. An anchor date should be set so that the actual date can be calculated; the reference date can be set with setAnchorDateDay(), setAnchorDateMonth(), and setAnchorDateYear().
If an anchor date parameter is not set, the current day, month, or year is used as the default value.
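Conceptually, resolving a phrase like "next monday" against an anchor date can be sketched with the standard library (this is an illustration of the idea, not Spark NLP's internal logic; the helper name next_weekday is ours):

```python
from datetime import date, timedelta

def next_weekday(anchor: date, weekday: int) -> date:
    """Return the next occurrence of `weekday` (Monday=0) strictly after `anchor`."""
    days_ahead = (weekday - anchor.weekday()) % 7
    if days_ahead == 0:
        days_ahead = 7
    return anchor + timedelta(days=days_ahead)

anchor = date(2001, 1, 17)      # same anchor as in the pipeline below
print(next_weekday(anchor, 0))  # 2001-01-22, i.e. "next monday" from that anchor
```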
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
multiDate = MultiDateMatcher() \
.setInputCols("document") \
.setOutputCol("multi_date") \
.setOutputFormat("MM/dd/yyyy")\
.setAnchorDateYear(2001)\
.setAnchorDateMonth(1)\
.setAnchorDateDay(17)
multiDate_no_day = MultiDateMatcher() \
.setInputCols("document") \
.setOutputCol("multi_date_no_day") \
.setOutputFormat("MM/dd/yyyy")\
.setAnchorDateYear(2001)\
.setAnchorDateMonth(1)
pipeline = Pipeline().setStages([
documentAssembler,
multiDate,
multiDate_no_day
])
text_list = ["See you on next monday.",
"She was born on 02/03/1966.",
"The project started yesterday and will finish next year.",
"She will graduate by July 2023.",
"She will visit doctor tomorrow and next month again."]
spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")
result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text", "multi_date.result as date", "multi_date_no_day.result as date_no_day_anchor").show(truncate=False)
+--------------------------------------------------------+------------------------+------------------------+
|text |date |date_no_day_anchor |
+--------------------------------------------------------+------------------------+------------------------+
|See you on next monday. |[01/22/2001] |[01/08/2001] |
|She was born on 02/03/1966. |[02/03/1966] |[02/03/1966] |
|The project started yesterday and will finish next year.|[01/17/2002, 01/16/2001]|[01/05/2002, 01/04/2001]|
|She will graduate by July 2023. |[07/01/2023] |[07/01/2023] |
|She will visit doctor tomorrow and next month again. |[02/17/2001, 01/18/2001]|[02/05/2001, 01/06/2001]|
+--------------------------------------------------------+------------------------+------------------------+
In the date column, relative dates are referenced from the anchor date 01/17/2001. In the date_no_day_anchor column, the anchor day is not set; when the anchorDateDay parameter is not set, it defaults to the current day of the month, so the relative dates there are calculated from that reference instead.
Setting date formats
Input and output date formats can be set with setInputFormats, setOutputFormat, and setReadMonthFirst. You can find more information on how to use date formatting strings here.
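The format strings follow Java SimpleDateFormat-style patterns. As a rough (assumed) correspondence, they map onto Python's strftime directives like this:

```python
from datetime import date

d = date(1966, 2, 3)
# Java pattern "MM/dd/yy"       ~ Python "%m/%d/%y"
# Java pattern "MMMM dd, yyyy"  ~ Python "%B %d, %Y"
print(d.strftime("%m/%d/%y"))   # 02/03/66
print(d.strftime("%B %d, %Y"))  # February 03, 1966
```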
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
multiDate_1 = MultiDateMatcher() \
.setInputCols("document") \
.setOutputCol("multi_date_1") \
.setOutputFormat("MM/dd/yy")
multiDate_2 = MultiDateMatcher() \
.setInputCols("document") \
.setOutputCol("multi_date_2") \
.setOutputFormat("MMMM dd, yyyy")
multiDate_3 = MultiDateMatcher() \
.setInputCols("document") \
.setOutputCol("multi_date_3") \
.setInputFormats(["dd/MM/yyyy"])\
.setOutputFormat("EEEE, MM/dd/yyyy")
pipeline = Pipeline().setStages([
documentAssembler,
multiDate_1, multiDate_2, multiDate_3
])
text_list = ["See you on 1st December 2004.",
"She was born on 02/03/1966.",
"The project started on yesterday and will finish next year.",
"She will graduate by July 2023.",
"She will visit doctor tomorrow and next month again."]
spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")
result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text", "multi_date_1.result as date_1", "multi_date_2.result as date_2", "multi_date_3.result as date_3").show(truncate=False)
+-----------------------------------------------------------+--------------------+-------------------------------------+-----------------------+
|text |date_1 |date_2 |date_3 |
+-----------------------------------------------------------+--------------------+-------------------------------------+-----------------------+
|See you on 1st December 2004. |[12/01/04] |[December 01, 2004] |[] |
|She was born on 02/03/1966. |[02/03/66] |[February 03, 1966] |[Wednesday, 03/02/1966]|
|The project started on yesterday and will finish next year.|[01/05/24, 01/04/23]|[January 05, 2024, January 04, 2023] |[] |
|She will graduate by July 2023. |[07/01/23] |[July 01, 2023] |[] |
|She will visit doctor tomorrow and next month again. |[02/05/23, 01/06/23]|[February 05, 2023, January 06, 2023]|[] |
+-----------------------------------------------------------+--------------------+-------------------------------------+-----------------------+
If the day information is not in the text
Sometimes the day is not specified in a date expression, for example "She will graduate by July 2023". In this situation, you can set a default day value for missing days using setDefaultDayWhenMissing. If it is not set, the default value is 1.
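The idea can be sketched in plain Python (an illustration, not the annotator's code): strptime already defaults the day to 1 when the pattern has no day field, and a different default can be substituted afterwards.

```python
from datetime import datetime

parsed = datetime.strptime("July 2023", "%B %Y")  # day defaults to 1
print(parsed.strftime("%Y/%m/%d"))                # 2023/07/01

with_default = parsed.replace(day=15)             # mimics setDefaultDayWhenMissing(15)
print(with_default.strftime("%Y/%m/%d"))          # 2023/07/15
```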
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
multiDate = MultiDateMatcher() \
.setInputCols("document") \
.setOutputCol("date")
multiDate_missing_day_set = MultiDateMatcher() \
.setInputCols("document") \
.setOutputCol("date_missing_day_set") \
.setDefaultDayWhenMissing(15)
pipeline = Pipeline().setStages([
documentAssembler,
multiDate,
multiDate_missing_day_set
])
text_list = ["See you on December 2004.",
"She was born on 02/03/1966.",
"The project started on yesterday and will finish next year.",
"She will graduate by July 2023.",
"She will visit doctor tomorrow and next month again."]
spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")
result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text", "date.result as date", "date_missing_day_set.result as date_missing_day_set").show(truncate=False)
+-----------------------------------------------------------+------------------------+------------------------+
|text |date |date_missing_day_set |
+-----------------------------------------------------------+------------------------+------------------------+
|See you on December 2004. |[2004/12/01] |[2004/12/15] |
|She was born on 02/03/1966. |[1966/02/03] |[1966/02/03] |
|The project started on yesterday and will finish next year.|[2024/01/05, 2023/01/04]|[2024/01/05, 2023/01/04]|
|She will graduate by July 2023. |[2023/07/01] |[2023/07/15] |
|She will visit doctor tomorrow and next month again. |[2023/02/05, 2023/01/06]|[2023/02/05, 2023/01/06]|
+-----------------------------------------------------------+------------------------+------------------------+
As seen in the results above, the missing days in rows 1 and 4 are set to 15 in the date_missing_day_set column, but to 1 in the date column.
Using other languages
Date matchers can be used with languages other than English. The source language is set with setSourceLanguage; its default value is "en" (English).
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
multiDate = MultiDateMatcher() \
.setInputCols("document") \
.setOutputCol("multi_date") \
.setOutputFormat("yyyy/MM/dd")\
.setSourceLanguage("de")
pipeline = Pipeline().setStages([
documentAssembler,
multiDate
])
spark_df = spark.createDataFrame([["Das letzte zahlungsdatum dieser rechnung ist der 4. mai 1998."],
["Wir haben morgen eine prüfung."]]).toDF("text")
result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text", "multi_date.result as date").show(truncate=False)
+-------------------------------------------------------------+------------+
|text |date |
+-------------------------------------------------------------+------------+
|Das letzte zahlungsdatum dieser rechnung ist der 4. mai 1998.|[1998/05/04]|
|Wir haben morgen eine prüfung. |[2023/01/06]|
+-------------------------------------------------------------+------------+
Date matchers can extract dates from other languages. In the above German example, the first row contains an actual date while the second one has a relative date (morgen means tomorrow in English). They are formatted in the desired output format.
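One way to picture multi-language support (purely a conceptual sketch; Spark NLP ships its own per-language dictionaries, and the helper below is hypothetical) is as a lookup of language-specific date vocabulary before normalization:

```python
from datetime import datetime

# Hypothetical one-entry month map for illustration only;
# Spark NLP's actual German support is built into the annotator.
DE_MONTHS = {"mai": 5}

def parse_german_date(text: str) -> str:
    """Parse a 'd. monat yyyy' phrase and normalize it to yyyy/MM/dd."""
    day, month, year = text.split()
    d = datetime(int(year), DE_MONTHS[month.lower()], int(day.rstrip(".")))
    return d.strftime("%Y/%m/%d")

print(parse_german_date("4. mai 1998"))  # 1998/05/04
```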
You can find the supported languages here.
Conclusion
The Spark NLP DateMatcher and MultiDateMatcher annotators are powerful tools for extracting dates from text. These annotators make it easy to extract dates in multiple languages, handle relative dates, change input/output date formats, and even set a default day for dates that lack one.
Throughout this post, we have described all of their parameters in detail, complete with examples to help you get started using them.
We hope that this post has provided you with a solid understanding of how to use these powerful Spark NLP annotators to extract dates from text.
🔗 Call to action:
- Documentation: DateMatcher, MultiDateMatcher
- Python Doc: DateMatcher, MultiDateMatcher
- Scala Doc: DateMatcher, MultiDateMatcher
- For extended examples of usage, see the Spark NLP Workshop.