Calculating Date Differences and Months Between with PySpark in Databricks: A Comprehensive Guide

Naveen Sorout
Towards Data Engineering
2 min readJun 2, 2023

The date diff() function in Pyspark is popularly used to get the difference of dates and the number of days between the dates specified. The “datediff(date, date)” is the syntax of the datediff() function where the first argument specifies the input of the Date and the Second argument specifies an additional Date argument from which the first argument is differentiated. The months_between() function in Apache PySpark is popularly used to get the difference of dates and the number of months between the dates specified. The “months_between(date, date)” is the syntax of the months_between() function where the first argument specifies the input of the Date and the Second argument specifies an additional Date argument from which the first argument is differentiated.

Implementing the datediff() and months_between() function in Databricks in PySpark

# Importing package
import pyspark.sql
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

The SparkSession and all packages are imported in the environment to perform usage of datediff() and months_between() functions in PySpark

# Implementing CSV file in PySpark 

spark = SparkSession.builder.appName('PySpark Read CSV').getOrCreate()

# Reading csv file
dataframe = spark.read.csv("/FileStore/tables/zipcodes-2.csv")
dataframe.printSchema()

# Using header record for the column names
dataframe2 = spark.read.option("header",True) \
.csv("/FileStore/tables/zipcodes-2.csv")
dataframe2.printSchema()

# Using delimiter
dataframe3 = spark.read.options(header='True', delimiter=',') \
.csv("/FileStore/tables/zipcodes-2.csv")
dataframe3.printSchema()

# Writing pyspark dataframe to csv file
dataframe3.write.option("header",True) \
.csv("/FileStore/tables/zipcodes-345.csv")

The “dataframe” value is created in which the data is defined. Using the date diff() function to get the difference of dates is getting the number of days between the dates specified. Further, alias like “datediff” is defined to identify the datediff() function output column easily. Using the months_between() function, getting the difference of dates is getting the number of months between the dates specified. The alias like “Dates_Diff”, “Months_Diff”, “Months_Diff_round”, “Years_Diff” and “Years_Diff_round” are defined to define the output of months_between() function. The “dataframe2” value is created in which the data is defined not in their default format of “YYYY-MM-DD.” The input dates are converted to their default format of “YYYY-MM-DD” using the to_date() function on the dataframe2 else; the output will be NULL.

--

--