Spark CSV Deep Dive - Part II

somanath sankaran
Published in Analytics Vidhya
2 min read · Nov 30, 2019

This is a continuation of my previous post.

This post covers:

1. Cleaning date and timestamp columns

2. The compression option

In this part we are going to deal with the date format and timestamp format.

I have prepared a bad dataset that fits all our use cases.

Now we will specify the schema in DDL (database) style, which is much easier, and verify the schema using dtypes.
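Here is a minimal PySpark sketch of what this looks like (the file name and column names are hypothetical stand-ins for the bad dataset):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-deep-dive").getOrCreate()

# DDL-style (database-style) schema string instead of a StructType
df = spark.read.csv(
    "emp_data.csv",  # hypothetical file name
    header=True,
    schema="id INT, name STRING, joining_date DATE, last_login TIMESTAMP",
)

# Verify the schema using dtypes
print(df.dtypes)
```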

Viewing Schema Data

Step 1: Cleaning Date and Timestamp

As shown above, we have nulls everywhere due to issues with the date and timestamp format. By default, Spark expects dates in `yyyy-MM-dd` and timestamps in `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`.

In Spark we have options to specify the date format and timestamp format while loading the data, using dateFormat="dd-MM-yyyy" and timestampFormat="dd/MM/yy HH:mm" (note the lowercase yyyy/yy; uppercase Y means "week year" in the Java date API and will give wrong results).
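A sketch of passing both options (the format strings come from the text above; the file and column names are the same hypothetical ones):

```python
# dateFormat/timestampFormat tell Spark how to parse the raw strings
df = spark.read.csv(
    "emp_data.csv",
    header=True,
    schema="id INT, name STRING, joining_date DATE, last_login TIMESTAMP",
    dateFormat="dd-MM-yyyy",
    timestampFormat="dd/MM/yy HH:mm",
)
df.show()
```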

Now we can see that the date and timestamp columns have been fixed.

Step 2: Compression Option

The Spark CSV reader can read files compressed with the following codecs:

bzip2, deflate, uncompressed, lz4, gzip, snappy

But the problem is that non-splittable codecs such as gzip are read by a single task, so please fine-tune executor memory so that one executor can hold all the data.
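For reading, Spark infers the codec from the file extension, so no extra option is needed; a sketch with a hypothetical gzipped file:

```python
# The .gz file is read directly; gzip is not splittable,
# so the whole file lands in a single task/partition
gz_df = spark.read.csv("emp_data.csv.gz", header=True, inferSchema=True)
print(gz_df.rdd.getNumPartitions())  # typically 1 for a single gzipped file
```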

Step 3: Exploring Other Options

We can also explore the other options, provided we have a basic understanding of Scala syntax; they are all defined in CSVOptions.scala in the Spark source.

CSVOptions.scala
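For example, a few of the options defined there can be passed via .option(); the values below are illustrative, not prescriptive:

```python
df = (
    spark.read
    .option("header", "true")
    .option("nullValue", "NA")        # treat the string "NA" as null
    .option("mode", "DROPMALFORMED")  # silently drop rows that fail to parse
    .csv("emp_data.csv")
)
```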

Next tutorial: https://medium.com/@somanathsankaran/spark-select-and-select-expr-deep-dive-d63ef5e04c87

Github link: https://github.com/SomanathSankaran/spark_medium/tree/master/spark_csv

Please comment with Spark topics you would like me to cover, and share suggestions for improving my writing :)

Learn and let others Learn!!
