Spark CSV Deep Dive - Part II
This is a continuation of my previous post.
This post covers:
1. Cleaning date and timestamp columns
2. The compression option
First we are going to deal with date and timestamp formats.
I have prepared a bad data set that fits all our use cases.
Now we will specify the schema in database style (a DDL string), which is much easier, and verify the resulting schema using dtypes.
Viewing Schema Data
Step 1: Cleaning Date and Timestamp
As shown above, everything is null due to an issue with the date and timestamp format. By default, Spark expects timestamps and dates in `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`.
In Spark we have options to specify the date and timestamp format while loading the data, e.g. dateFormat="dd-MM-yyyy", timestampFormat="dd/MM/yy HH:mm" (note lowercase yyyy: uppercase YYYY means week-based year in Java date patterns and silently misparses).
Now we can see that the date and timestamp columns have been fixed.
Step 3: Compression Option
The Spark CSV reader can read files compressed with the codecs below:
bzip2, deflate, uncompressed, lz4, gzip, snappy
But the catch is that an entire compressed file is loaded as a single task (most of these codecs are not splittable), so fine-tune executor memory to hold all the data.
Step 4: Exploring Other Options
We can also explore the other options, provided we have a basic understanding of Scala syntax; they are all defined in the Spark source file below.
CSVOptions.scala
Next Tutorial : https://medium.com/@somanathsankaran/spark-select-and-select-expr-deep-dive-d63ef5e04c87
Github Link: https://github.com/SomanathSankaran/spark_medium/tree/master/spark_csv
Please suggest Spark topics you would like me to cover, and share any suggestions for improving my writing :)
Learn and let others Learn!!