Mastering Timestamp to Date Conversion in PySpark: Unlocking Time-Based Insights with Databricks

Naveen Sorout
Towards Data Engineering
3 min read · May 31, 2023

The to_date() function in Apache PySpark is commonly used to convert a Timestamp to a Date by truncating the time portion of the Timestamp column. When called with a single argument, to_date() expects the timestamp string in the default format of “yyyy-MM-dd HH:mm:ss.SSS”; a TimestampType column can also be passed to it directly. The full syntax is to_date(col: Column, format: str), where the first argument is the timestamp string (or timestamp) column of the dataframe and the optional second argument is a pattern string describing the format of the input, which lets a timestamp in any custom format be parsed into the default DateType in PySpark.
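As a quick illustration of the format argument, the snippet below is a minimal sketch assuming a hypothetical DataFrame events_df with a string column event_ts stored as “MM-dd-yyyy HH:mm:ss”; the pattern passed as the second argument tells to_date() how to parse it:

from pyspark.sql.functions import to_date, col

# Parse an "MM-dd-yyyy HH:mm:ss" string column into a DateType column.
# events_df and event_ts are placeholder names used only for illustration.
events_df = events_df.withColumn("event_date", to_date(col("event_ts"), "MM-dd-yyyy HH:mm:ss"))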

Implementing The to_date() Function In PySpark Using Databricks

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

The SparkSession and all packages are imported into the environment to convert Timestamp to Date in PySpark.

# Implementing the to_date() function in Databricks in PySpark

spark = SparkSession.builder \
    .appName('PySpark to_date()') \
    .getOrCreate()

dataframe = spark.createDataFrame(
    data=[("1", "2021-08-26 11:30:21.000")],
    schema=["id", "input_timestamp"])
dataframe.printSchema()

# Converting Timestamp String to DateType
dataframe.withColumn("date_type",to_date("input_timestamp")) \
.show(truncate=False)

# Converting TimestampType to DateType
dataframe.withColumn("date_type",to_date(current_timestamp())) \
.show(truncate=False)

# Converting TimestampType (timestamp) to DateType
dataframe.withColumn("ts",to_timestamp(col("input_timestamp"))) \
.withColumn("datetype",to_date(col("ts"))) \
.show(truncate=False)

# Using Cast to convert the Timestamp String to DateType
dataframe.withColumn('date_type', col('input_timestamp').cast('date')) \
.show(truncate=False)

# Using Cast to convert the TimestampType to DateType
dataframe.withColumn('date_type', to_timestamp('input_timestamp').cast('date')) \
.show(truncate=False)

In the “dataframe” created above, the to_date() function converts the timestamp string to DateType, and to_timestamp() converts the time string to TimestampType before it is truncated to a date. The cast('date') approach works when the timestamp string is already in the default format; if it is not, the string is first parsed into a proper timestamp with to_timestamp() and then cast to DateType.
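For reference, the first conversion above (timestamp string to DateType) should produce output roughly like the following, with the time portion truncated:

+---+-----------------------+----------+
|id |input_timestamp        |date_type |
+---+-----------------------+----------+
|1  |2021-08-26 11:30:21.000|2021-08-26|
+---+-----------------------+----------+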

How To Convert The Timestamp Datatype In PySpark?

In PySpark, the TimestampType is used to represent date and time values. To convert a timestamp string into a proper timestamp, you can use the to_timestamp function provided by PySpark. This function takes two arguments: the column containing the timestamp string and a pattern describing the format of that input string.

For example, if you have a string column called my_timestamp in the format ‘yyyy-MM-dd HH:mm:ss’ and you only want to keep the ‘yyyy-MM-dd’ part, you can parse it with the matching pattern and then cast the result to a date:

from pyspark.sql.functions import to_timestamp

df = df.withColumn("new_date", to_timestamp("my_timestamp", "yyyy-MM-dd HH:mm:ss").cast("date"))

Here, we first import the to_timestamp function and apply it to the my_timestamp column, passing “yyyy-MM-dd HH:mm:ss” as the pattern that matches the input string. The result is a TimestampType column, which is then cast to DateType, dropping the time part. The resulting column is added to the dataframe as new_date.
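If the goal is instead to display the timestamp as a string in a different pattern, such as ‘yyyy-MM-dd’, a minimal sketch (reusing the hypothetical df and my_timestamp names from the example above) could use date_format():

from pyspark.sql.functions import date_format, to_timestamp

# Parse the input string, then render it back out as a 'yyyy-MM-dd' string.
# df and my_timestamp are the placeholder names from the example above.
df = df.withColumn(
    "formatted_date",
    date_format(to_timestamp("my_timestamp", "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd"))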
