EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Deep Dive into Apache Spark DateTime Functions

Catalog of DateTime functions in Apache Spark

Neeraj Bhadani
Expedia Group Technology



DateTime functions are tricky but important, irrespective of language or framework. In this blog post, we review the DateTime functions available in Apache Spark™️, showing examples in an easy-to-comprehend, standardized format that augments the brief examples in the Apache Spark docs.

You may also be interested in my other posts on Apache Spark.

Let’s begin the fun with DateTime functions in Apache Spark!


Setup

Here are a few key setup steps before trying the examples further down.

Import functions in PySpark

from pyspark.sql.functions import *
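
If you prefer to avoid a wildcard import (it shadows Python built-ins such as sum and max), an equivalent namespaced import works just as well; the examples below would then call, for example, F.add_months instead of add_months:

from pyspark.sql import functions as F  # then use F.add_months(...), F.current_date(), etc.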

Create Sample DataFrame

Let’s try to create a sample DataFrame so that we can use it for the rest of this blog to understand the various DateTime functions.

emp = [(1, "AAA", "dept1", 1000, "2019-02-01 15:12:13"),
       (2, "BBB", "dept1", 1100, "2018-04-01 5:12:3"),
       (3, "CCC", "dept1", 3000, "2017-06-05 1:2:13"),
       (4, "DDD", "dept1", 1500, "2019-08-10 10:52:53"),
       (5, "EEE", "dept2", 8000, "2016-01-11 5:52:43"),
       (6, "FFF", "dept2", 7200, "2015-04-14 19:32:33"),
       (7, "GGG", "dept3", 7100, "2019-02-21 15:42:43"),
       (8, "HHH", "dept3", 3700, "2016-09-25 15:32:33"),
       (9, "III", "dept3", 4500, "2017-10-15 15:22:23"),
       (10, "JJJ", "dept5", 3400, "2018-12-17 15:14:17")]
empdf = spark.createDataFrame(emp, ["id", "name", "dept", "salary", "date"])
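
Note that the “date” column in this DataFrame is a plain string; the functions below accept it and cast it internally. You can confirm the inferred types with printSchema:

empdf.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- dept: string (nullable = true)
#  |-- salary: long (nullable = true)
#  |-- date: string (nullable = true)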

add_months

This function adds months to a date, returning a new date the given number of months after the start date. In the statement below, we add 1 month to the column “date” and generate a new column, “next_month”.

df = (empdf
      .select("date")
      .withColumn("next_month", add_months("date", 1)))
df.show(2)

Output

+-------------------+----------+
| date|next_month|
+-------------------+----------+
|2019-08-10 10:52:53|2019-09-10|
| 2018-04-01 5:12:3|2018-05-01|
+-------------------+----------+
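
add_months also accepts a negative value to step backwards; for instance, passing -1 returns the date one month earlier (a small variation on the example above):

df = (empdf
      .select("date")
      .withColumn("previous_month", add_months("date", -1)))
df.show(2)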

current_date

This function returns the current date.

df = (empdf
      .withColumn("current_date", current_date())
      .select("id", "current_date"))
df.show(2)

Output

+---+------------+
| id|current_date|
+---+------------+
| 5| 2019-10-07|
| 10| 2019-10-07|
+---+------------+

current_timestamp

This function returns the current timestamp.

df = (empdf
      .withColumn("current_timestamp", current_timestamp())
      .select("id", "current_timestamp"))
df.show(2, False)

Output

+---+-----------------------+
|id |current_timestamp |
+---+-----------------------+
|1 |2019-10-07 14:40:42.471|
|2 |2019-10-07 14:40:42.471|
+---+-----------------------+

date_add

This function returns a date x days after the start date passed to it. In the example below, it returns a date 5 days after “date” in a new column, “next_date”. For example, for 1st Feb 2019 it returns 6th Feb 2019.

df = (empdf
      .select("date")
      .withColumn("next_date", date_add("date", 5)))
df.show(2)

Output

+-------------------+----------+
| date| next_date|
+-------------------+----------+
|2019-02-01 15:12:13|2019-02-06|
| 2018-04-01 5:12:3|2018-04-06|
+-------------------+----------+

date_format

This function converts the date into a string in the specified format. For example, we can convert a date from “yyyy-MM-dd” to “dd/MM/yyyy” format.

df = (empdf
      .select("date")
      .withColumn("new_date", date_format("date", "dd/MM/yyyy")))
df.show(2)

Output

+-------------------+----------+
| date| new_date|
+-------------------+----------+
|2019-02-01 15:12:13|01/02/2019|
| 2018-04-01 5:12:3|01/04/2018|
+-------------------+----------+

date_sub

This function returns a date a given number of days before the date passed to it; it is the opposite of date_add. In the example below, it returns a date 5 days earlier in a new column, “new_date”. For example, 1st Feb 2019 returns 27th Jan 2019.

df = (empdf
      .select("date")
      .withColumn("new_date", date_sub("date", 5)))
df.show(2)

Output

+-------------------+----------+
| date| new_date|
+-------------------+----------+
|2019-02-01 15:12:13|2019-01-27|
| 2018-04-01 5:12:3|2018-03-27|
+-------------------+----------+

date_trunc

This function returns a timestamp truncated to the specified unit. It could be a year, month, day, hour, minute, second, week or quarter.

Let’s truncate the date by year. We can use “year”, “yyyy”, or “yy” to specify year. For timestamp “2019-02-01 15:12:13”, truncating by year returns “2019-01-01 00:00:00”.

df = (empdf
      .select("date")
      .withColumn("new_date", date_trunc("year", "date")))
df.show(2)

Output

+-------------------+-------------------+
| date| new_date|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-01-01 00:00:00|
| 2018-04-01 5:12:3|2018-01-01 00:00:00|
+-------------------+-------------------+

Let’s truncate the date by month. We can use “month”, “mon”, or “mm” to specify month. For timestamp “2019-02-01 15:12:13”, truncating by month returns “2019-02-01 00:00:00”.

df = (empdf
      .select("date")
      .withColumn("new_date", date_trunc("month", "date")))
df.show(2)

Output

+-------------------+-------------------+
| date| new_date|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-02-01 00:00:00|
| 2018-04-01 5:12:3|2018-04-01 00:00:00|
+-------------------+-------------------+

Let’s truncate the date by day. We can use “day” or “dd” to specify day. For timestamp “2019-02-10 15:12:13”, truncating by day returns “2019-02-10 00:00:00”.

df = (empdf
      .select("date")
      .withColumn("new_date", date_trunc("day", "date")))
df.show(2)

Output

+-------------------+-------------------+
| date| new_date|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-02-01 00:00:00|
| 2018-04-01 5:12:3|2018-04-01 00:00:00|
+-------------------+-------------------+

Note: we can truncate the date at many different levels using these formats: ‘year’, ‘yyyy’, ‘yy’, ‘month’, ‘mon’, ‘mm’, ‘day’, ‘dd’, ‘hour’, ‘minute’, ‘second’, ‘week’, ‘quarter’.
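
For instance, truncating by week returns the timestamp of the first day (Monday) of that week (a short sketch using the same DataFrame):

df = (empdf
      .select("date")
      .withColumn("week_start", date_trunc("week", "date")))
df.show(2)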

datediff

This function returns the difference between two dates in days. Let’s add the current date in another column and then take the difference between “current_date” and “date”.

df = (empdf
      .select("date")
      # Add another date column as the current date.
      .withColumn("current_date", current_date())
      # Take the difference between current_date and date column.
      .withColumn("date_diff", datediff("current_date", "date")))
df.show(2)

Output

+-------------------+------------+---------+
| date|current_date|date_diff|
+-------------------+------------+---------+
|2019-02-01 15:12:13| 2019-10-07| 248|
| 2018-04-01 5:12:3| 2019-10-07| 554|
+-------------------+------------+---------+

dayofmonth

This function returns the day of the month. For 5th Jan 2019 (2019-01-05) it will return 5.

df = (empdf
      .select("date")
      .withColumn("dayofmonth", dayofmonth("date")))
df.show(2)

Output

+-------------------+----------+
| date|dayofmonth|
+-------------------+----------+
|2019-02-01 15:12:13| 1|
| 2018-04-01 5:12:3| 1|
+-------------------+----------+

dayofweek

This function returns the day of the week as an integer, counting Sunday as 1 and Saturday as 7. For 1st Feb 2019 (2019-02-01), which is a Friday, it will return 6. Similarly, for 1st April 2018 (2018-04-01), which is a Sunday, it will return 1.

df = (empdf
      .select("date")
      .withColumn("dayofweek", dayofweek("date")))
df.show(2)

Output

+-------------------+---------+
| date|dayofweek|
+-------------------+---------+
|2019-02-01 15:12:13| 6|
| 2018-04-01 5:12:3| 1|
+-------------------+---------+

dayofyear

This function returns the day of the year as an integer. For 1st Feb it will return 32 (31 days of Jan + 1 day of Feb). For 1st April 2018, it will return 91 (31 days of Jan + 28 days of Feb, as 2018 is a non-leap year, + 31 days of Mar + 1 day of April).

df = (empdf
      .select("date")
      .withColumn("dayofyear", dayofyear("date")))
df.show(2)

Output

+-------------------+---------+
| date|dayofyear|
+-------------------+---------+
|2019-02-01 15:12:13| 32|
| 2018-04-01 5:12:3| 91|
+-------------------+---------+

from_utc_timestamp

This function converts a UTC timestamp to a timestamp in any specified timezone; it assumes the input date is a UTC timestamp. Let’s convert a UTC timestamp to “PST” time.

df = (empdf
      .select("date")
      .withColumn("pst_timestamp", from_utc_timestamp("date", "PST")))
df.show(2)

Output

+-------------------+-------------------+
| date| pst_timestamp|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-02-01 07:12:13|
| 2018-04-01 5:12:3|2018-03-31 22:12:03|
+-------------------+-------------------+
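
Three-letter zone abbreviations such as “PST” can be ambiguous; region-based IDs are unambiguous and generally preferred, and from_utc_timestamp accepts them too (a minimal sketch):

df = (empdf
      .select("date")
      .withColumn("la_timestamp", from_utc_timestamp("date", "America/Los_Angeles")))
df.show(2)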

unix_timestamp

This function converts timestamp strings in the given format to Unix timestamps (in seconds). The default format is “yyyy-MM-dd HH:mm:ss”. (Note: you can use the Spark property “spark.sql.session.timeZone” to set the timezone; see the snippet after the output below.)

df = (empdf
      .select("date")
      .withColumn("unix_timestamp", unix_timestamp("date", "yyyy-MM-dd HH:mm:ss")))
df.show(2)

Output

+-------------------+--------------+
| date|unix_timestamp|
+-------------------+--------------+
|2019-02-01 15:12:13| 1549033933|
| 2018-04-01 5:12:3| 1522559523|
+-------------------+--------------+
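
For instance, to make the Unix timestamp results reproducible across environments, you can pin the session timezone before running the example above:

spark.conf.set("spark.sql.session.timeZone", "UTC")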

from_unixtime

This function converts a number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string in the given format. You can set the timezone and format as well (note: you can use the Spark property “spark.sql.session.timeZone” to set the timezone). For demonstration purposes, we convert the timestamp to a Unix timestamp and then back to a timestamp.

df = (empdf
      .select("date")
      # Convert timestamp to unix timestamp.
      .withColumn("unix_timestamp", unix_timestamp("date", "yyyy-MM-dd HH:mm:ss"))
      # Convert unix timestamp back to timestamp.
      .withColumn("date_from_unixtime", from_unixtime("unix_timestamp")))
df.show(2)

Output

+-------------------+--------------+-------------------+
| date|unix_timestamp| date_from_unixtime|
+-------------------+--------------+-------------------+
|2019-02-01 15:12:13| 1549033933|2019-02-01 15:12:13|
| 2018-04-01 5:12:3| 1522559523|2018-04-01 05:12:03|
+-------------------+--------------+-------------------+
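
from_unixtime also takes an optional second argument for the output format; for example, the sketch below renders the Unix timestamp as “dd/MM/yyyy HH:mm”:

df = (empdf
      .select("date")
      .withColumn("unix_timestamp", unix_timestamp("date", "yyyy-MM-dd HH:mm:ss"))
      .withColumn("formatted", from_unixtime("unix_timestamp", "dd/MM/yyyy HH:mm")))
df.show(2)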

hour

This function will return the hour part of the date.

df = (empdf
      .select("date")
      .withColumn("hour", hour("date")))
df.show(2)

Output

+-------------------+----+
| date|hour|
+-------------------+----+
|2019-02-01 15:12:13| 15|
| 2018-04-01 5:12:3| 5|
+-------------------+----+

last_day

This function will return the last date of the month for a given date. For 5th Jan 2019, it will return 31st Jan 2019, since this is the last date for the month.

df = (empdf
      .select("date")
      .withColumn("last_date", last_day("date")))
df.show(2)

Output

+-------------------+----------+
| date| last_date|
+-------------------+----------+
|2019-02-01 15:12:13|2019-02-28|
| 2018-04-01 5:12:3|2018-04-30|
+-------------------+----------+

minute

This function will return the minute part of the date.

df = (empdf
      .select("date")
      .withColumn("minute", minute("date")))
df.show(2)

Output

+-------------------+------+
| date|minute|
+-------------------+------+
|2019-02-01 15:12:13| 12|
| 2018-04-01 5:12:3| 12|
+-------------------+------+

month

This function will return the month part of the date.

df = (empdf
      .select("date")
      .withColumn("month", month("date")))
df.show(2)

Output

+-------------------+-----+
| date|month|
+-------------------+-----+
|2019-02-01 15:12:13| 2|
| 2018-04-01 5:12:3| 4|
+-------------------+-----+

months_between

This function returns the difference between two dates in months. If the first date is greater than the second, the result is positive; otherwise it is negative. For example, between 6th Feb 2019 and 6th Jan 2019 it will return 1.0.

df = (empdf
      .select("date")
      # Add another date column as the current date.
      .withColumn("current_date", current_date())
      # Take the difference between current_date and date column in months.
      .withColumn("months_between", months_between("current_date", "date")))
df.show(2)

Note: from Spark 2.4.0 onwards, you can pass a third argument, roundOff, to control whether the result is rounded off; it defaults to True. See the sketch after the output below.

Output

+-------------------+------------+--------------+
| date|current_date|months_between|
+-------------------+------------+--------------+
|2019-02-01 15:12:13| 2019-10-09| 8.23762955|
| 2018-04-01 5:12:3| 2019-10-09| 18.25107415|
+-------------------+------------+--------------+
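
To see the unrounded value, pass roundOff=False (available from Spark 2.4.0, as noted above):

df = (empdf
      .select("date")
      .withColumn("current_date", current_date())
      .withColumn("months_between_exact",
                  months_between("current_date", "date", roundOff=False)))
df.show(2)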

next_day

This function returns the first date later than the given date that falls on the day of the week specified by the dayOfWeek argument; accepted values include “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat” and “Sun” (case-insensitive). For example, for 1st Feb 2019 (a Friday), asking for the next Sunday returns 3rd Feb 2019.

df = (empdf
      .select("date")
      .withColumn("next_day", next_day("date", "sun")))
df.show(2)

Output

+-------------------+----------+
| date| next_day|
+-------------------+----------+
|2019-02-01 15:12:13|2019-02-03|
| 2018-04-01 5:12:3|2018-04-08|
+-------------------+----------+

quarter

This function will return the quarter of the given date as an integer.

df = (empdf
      .select("date")
      .withColumn("quarter", quarter("date")))
df.show(2)

Output

+-------------------+-------+
| date|quarter|
+-------------------+-------+
|2019-02-01 15:12:13| 1|
| 2018-04-01 5:12:3| 2|
+-------------------+-------+

second

This function will return the second part of the date.

df = (empdf
      .select("date")
      .withColumn("second", second("date")))
df.show(2)

Output

+-------------------+------+
| date|second|
+-------------------+------+
|2019-02-01 15:12:13| 13|
| 2018-04-01 5:12:3| 3|
+-------------------+------+

to_date

This function converts a String or Timestamp to a Date.

df = (empdf
      .select("date")
      .withColumn("to_date", to_date("date")))
df.show(2)

Note: check the data types of the columns “date” and “to_date”.

If the string format is ‘yyyy-MM-dd HH:mm:ss’ then we need not specify the format; otherwise, specify the format as the second argument to the to_date function.

Output

+-------------------+----------+
| date| to_date|
+-------------------+----------+
|2019-02-01 15:12:13|2019-02-01|
| 2018-04-01 5:12:3|2018-04-01|
+-------------------+----------+

Here we convert a string of format ‘dd/MM/yyyy HH:mm:ss’ to the “date” data type. Note the default format is ‘yyyy-MM-dd’.

df1 = spark.createDataFrame([('15/02/2019 10:30:00',)], ['date'])
df2 = (df1
       .withColumn("new_date", to_date("date", 'dd/MM/yyyy HH:mm:ss')))
df2.show(2)

Output

+-------------------+----------+
| date| new_date|
+-------------------+----------+
|15/02/2019 10:30:00|2019-02-15|
+-------------------+----------+

to_timestamp

This function converts a String to a Timestamp. Here we convert a string of format ‘dd/MM/yyyy HH:mm:ss’ to the “timestamp” data type. The default format is ‘yyyy-MM-dd HH:mm:ss’.

df1 = spark.createDataFrame([('15/02/2019 10:30:00',)], ['date'])
df2 = (df1
       .withColumn("new_date", to_timestamp("date", 'dd/MM/yyyy HH:mm:ss')))
df2.show(2)

Output

+-------------------+-------------------+
| date| new_date|
+-------------------+-------------------+
|15/02/2019 10:30:00|2019-02-15 10:30:00|
+-------------------+-------------------+

to_utc_timestamp

This function converts a timestamp in the given timezone to a UTC timestamp. Let’s convert a “PST” timestamp to a “UTC” timestamp.

df = (empdf
      .select("date")
      .withColumn("utc_timestamp", to_utc_timestamp("date", "PST")))
df.show(2)

Output

+-------------------+-------------------+
| date| utc_timestamp|
+-------------------+-------------------+
|2019-02-01 15:12:13|2019-02-01 23:12:13|
| 2018-04-01 5:12:3|2018-04-01 12:12:03|
+-------------------+-------------------+

weekofyear

This function will return the week of the year for the given date.

df = (empdf
      .select("date")
      .withColumn("weekofyear", weekofyear("date")))
df.show(2)

Output

+-------------------+----------+
| date|weekofyear|
+-------------------+----------+
|2019-02-01 15:12:13| 5|
| 2018-04-01 5:12:3| 13|
+-------------------+----------+

year

This function will return the year part of the date.

df = (empdf
      .select("date")
      .withColumn("year", year("date")))
df.show(2)

Output

+-------------------+----+
| date|year|
+-------------------+----+
|2019-02-01 15:12:13|2019|
| 2018-04-01 5:12:3|2018|
+-------------------+----+

I hope you have enjoyed learning about the DateTime functions in Apache Spark. For your easy reference, a notebook containing the examples above is available on GitHub.

Apache Spark, Spark, Apache, the Apache feather logo, and the Apache Spark project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
