Dealing with Dates in PySpark
This is one of the stories in my Spark deep dive series.
https://medium.com/@somanathsankaran
As a Data Engineer, one of the common use cases is to align dates across various sources.
In this blog we are going to cover the below:
- Converting from one date format to another
- Converting from one timestamp format to another
- Aggregating data at a monthly and weekly level
Spark by default assumes dates are in the "yyyy-MM-dd" format (e.g. 2020-08-22). Note that in Spark's date patterns, year is lowercase "yyyy"; uppercase "YYYY" denotes the week-based year and can produce surprising results near year boundaries.
Converting from one date format to another
Step 1: Converting the date into a standard format
It is common that the data we get from a source system is not in the required format, as shown below. In that case it is wise to load the column as a string, since loading it directly as a date results in loss of data (values that fail to parse become null).
So in order to convert it to a standard date format we use the to_date function, which accepts the string column and the format in which the date is stored, which is "dd-MM-yyyy" in our case.
df_date = df.select(f.to_date("dt", "dd-MM-yyyy").alias("date_formatted"))
Step 2: Converting to the desired format
We can convert to the desired format using the date_format function, which accepts two arguments: the date column and the pattern in which it should be displayed, which is "yyyy/MM/dd HH:mm:ss" in our case.
Note: since the result is no longer in Spark's date format, the column will have the string datatype.
Converting from one timestamp format to another
Similar to to_date, Spark has to_timestamp to convert a string into a timestamp, and from there we can leverage date_format to produce any timestamp format we need.
Aggregating data at a monthly and weekly level
It is a common use case that we need to do aggregations at a monthly or weekly level; for that we can use the trunc function in Spark (date_trunc is the equivalent for timestamp columns).
For example, we will consider three dates: Aug 23, 24, and 25, 2020. Since Aug 24, 2020 is a Monday and weeks start on Monday, the 24th and 25th fall in the week of Aug 24, while the 23rd falls in the previous week. We will then do a sum at the weekly level.
Now we will aggregate the data and create a report at the monthly and weekly level.
So in this blog we have learned how to convert dates and timestamps from one format to another and how to do aggregations at weekly and monthly levels.
Github Link: https://github.com/SomanathSankaran/spark_medium/tree/master/spark_csv
Please suggest Spark topics you would like me to cover, and share any suggestions for improving my writing :)