Dealing with Dates in PySpark

somanath sankaran · Published in Analytics Vidhya · 3 min read · Aug 23, 2020

This is one of my stories in my Spark deep dive series.

https://medium.com/@somanathsankaran

As a Data Engineer, one of the common use cases is to standardize dates across various sources.

In this blog we are going to cover the below:

  1. Converting from one date format to another
  2. Converting from one timestamp format to another
  3. Aggregating data at a monthly/weekly level

Spark by default assumes dates are in the “yyyy-MM-dd” format (e.g. 2020-08-22).
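As a quick check, here is a minimal sketch (assuming a local SparkSession) showing that to_date parses this default layout without needing an explicit pattern:

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

# With no explicit pattern, to_date expects the default "yyyy-MM-dd" layout
spark.range(1).select(f.to_date(f.lit("2020-08-22")).alias("d")).show()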

Converting from one date format to another

Step 1: Converting the date into standard format

It is common that the data we get from a source system is not in the required format, as shown below. In such cases it is wise to load the data as a string, since loading it directly as a date results in loss of data.

Non-formatted Date
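For illustration, here is a minimal sketch of such data, assuming a hypothetical column dt holding “dd-MM-yyyy” strings (the spark session and functions import from the snippet above are reused):

# Hypothetical source data: dates arrive as "dd-MM-yyyy" strings
df = spark.createDataFrame(
    [("23-08-2020",), ("24-08-2020",), ("25-08-2020",)],
    ["dt"],
)
df.show()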

So in order to convert it to the standard date format, we have to use the to_date function, which accepts the string column and the format our date is in,

which is “dd-MM-yyyy” in our case:

df_date = df.select(f.to_date("dt", "dd-MM-yyyy").alias("date_formatted"))

Converted to date using to_date

Step 2: Converting to the desired format

We can convert to the desired format using the date_format function, which accepts 2 arguments: the date field and the format it needs to be displayed in, which is “yyyy/MM/dd HH:mm:ss” in our case.

Note: since this output is not in Spark’s date format, the resulting column will have the string datatype.

converted date as string
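A minimal sketch of this step, continuing from the df_date built above:

# date_format renders the date in the requested layout; the result is a string
df_str = df_date.select(
    f.date_format("date_formatted", "yyyy/MM/dd HH:mm:ss").alias("date_str")
)
df_str.printSchema()  # date_str is of string type
df_str.show(truncate=False)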

Converting from one timestamp format to another

Similar to to_date, Spark has to_timestamp to convert a string into a timestamp, and from there we can leverage date_format to produce any timestamp format we need.

Converting timestamp
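Here is a small sketch under the same assumptions, using a hypothetical “dd-MM-yyyy HH:mm:ss” input:

# Parse a non-default timestamp string, then re-render it in another layout
df_ts = spark.createDataFrame(
    [("23-08-2020 10:30:00",)], ["ts"]
).select(f.to_timestamp("ts", "dd-MM-yyyy HH:mm:ss").alias("ts_parsed"))

df_ts.select(
    f.date_format("ts_parsed", "yyyy/MM/dd HH:mm").alias("ts_str")
).show()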

Aggregating data at a monthly/weekly level

It is a common use case that we might need to do aggregations at a monthly or weekly level, and for that we can use the trunc function in Spark.

For example, we will consider 3 dates, Aug 23, 24 and 25, of which 24 and 25 fall in the week of Aug 24 (assuming the week starts on Monday, Aug 23 falls in the previous week), and we will do a sum at a weekly level.

Convert date to start of week using trunc function
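A minimal sketch of the truncation, assuming a Spark version where trunc accepts “week” (Spark 3.0+), with a hypothetical qty column added for the upcoming aggregation:

# Hypothetical sales data for Aug 23-25, 2020
sales = spark.createDataFrame(
    [("2020-08-23", 10), ("2020-08-24", 20), ("2020-08-25", 30)],
    ["date", "qty"],
).withColumn("date", f.to_date("date"))

# trunc(..., "week") maps each date to the Monday that starts its week
sales = sales.withColumn("week_start", f.trunc("date", "week"))
sales.show()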

Now we will aggregate the data and create a report at a monthly and weekly level.

sold quantities at a weekly and monthly level
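Continuing the sketch above, the weekly and monthly rollups might look like this (weekly_qty and monthly_qty are assumed names):

# Weekly total of sold quantities
sales.groupBy("week_start").agg(f.sum("qty").alias("weekly_qty")).show()

# Monthly total, truncating to the first day of the month
monthly = (
    sales.withColumn("month_start", f.trunc("date", "month"))
    .groupBy("month_start")
    .agg(f.sum("qty").alias("monthly_qty"))
)
monthly.show()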

So in this blog we have learned how to convert dates from one format to another and how to do aggregations at a weekly and monthly level.

Github Link: https://github.com/SomanathSankaran/spark_medium/tree/master/spark_csv

Please suggest topics in Spark which I should cover, and share suggestions for improving my writing :)
