Create Time Dimension or Calendar DataFrame in Apache PySpark and Save to Delta Lake Parquet File

Syarif Z
Nov 8, 2023


Before reading this story, please read my previous story:
Schedule and Automate Batch Ingestion Apache PySpark Script using Apache Spark Submit and Cronjob

In this story we will build a calendar (time) dimension table: generate one row per day with sequence() and explode(), derive date attributes such as day name, month, week of year, and quarter, and save the result as a Delta table.

Create a new script named calendar.py in the ingestions directory:

$ cd ~/ingestions
$ nano calendar.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) the SparkSession; the Delta Lake package and configs
# are supplied by spark-submit (see the previous story)
spark = SparkSession.builder.appName("calendar").getOrCreate()

# Seed DataFrame with a single row to explode the date sequence from
calendarDf = spark.createDataFrame([(1,)], ["id"])

# Generate one row per day from 2009-01-01 through 2025-12-31
calendarDf = calendarDf \
    .withColumn("date_timestamp", F.explode(F.expr("sequence(to_timestamp('2009-01-01 00:00:00'), to_timestamp('2025-12-31 00:00:00'), interval 1 day)")))

# Derive the calendar attributes from each generated timestamp
calendarDf = calendarDf \
    .withColumn("date", F.to_date("date_timestamp")) \
    .withColumn("day", F.date_format("date_timestamp", "d")) \
    .withColumn("day_2", F.date_format("date_timestamp", "dd")) \
    .withColumn("day_name", F.date_format("date_timestamp", "EEEE")) \
    .withColumn("day_name_abbr", F.date_format("date_timestamp", "EE")) \
    .withColumn("month", F.month("date_timestamp")) \
    .withColumn("month_2", F.date_format("date_timestamp", "MM")) \
    .withColumn("month_name", F.date_format("date_timestamp", "MMMM")) \
    .withColumn("month_name_abbr", F.date_format("date_timestamp", "MMM")) \
    .withColumn("year", F.year("date_timestamp")) \
    .withColumn("week_of_year", F.weekofyear("date_timestamp")) \
    .withColumn("date_description", F.date_format("date_timestamp", "MMMM d, yyyy")) \
    .withColumn("quarter", F.quarter("date_timestamp"))

# Replace any previous version of the table, then save as a Delta table
spark.sql("DROP TABLE IF EXISTS calendar")

calendarDf.write.mode("overwrite").format("delta").saveAsTable("calendar")

Press Ctrl+O to save and Ctrl+X to exit nano.

Run the script with the spark-submit wrapper from the previous story:

$ /home/username/ingestions/spark-submit-individual.sh /home/username/ingestions/calendar.py
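The wrapper spark-submit-individual.sh is covered in the previous story. If you don't have it handy, a minimal sketch of what it might contain is below; the delta-core version, Scala version, and configuration values are assumptions, so match them to your Spark installation:

#!/bin/bash
# Hypothetical wrapper: submits a single PySpark script with Delta Lake support.
# Assumes Spark 3.4 with Delta Lake 2.4; adjust versions for your install.
spark-submit \
  --packages io.delta:delta-core_2.12:2.4.0 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  "$1"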

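Once the job finishes, you can sanity-check the table from a pyspark shell started with the same Delta configuration. The 6,209-row count below comes from 17 years of daily rows, four of them leap years:

# Preview a few rows of the calendar dimension
spark.sql("SELECT date, day_name, month_name, quarter FROM calendar ORDER BY date LIMIT 5").show()

# Expect 6,209 rows: one per day from 2009-01-01 through 2025-12-31
spark.sql("SELECT COUNT(*) AS total_days FROM calendar").show()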
