Convert CSV / JSON files to Apache Parquet using AWS Glue

Bhuvanesh
Jul 23, 2018 · 4 min read
[Cover image source: aws.amazon.com]

Why Parquet?

Parquet is a columnar storage format. Compared with row-oriented CSV, it compresses far better and lets query engines such as Athena, Redshift Spectrum, and Spark scan only the columns a query needs, which cuts both scan time and cost. To demonstrate the conversion, we'll use the NYC TLC For-Hire Vehicle trip data for 2015. Download the twelve monthly CSV files below and upload them to an S3 bucket (a scripted version of this step follows the list):

https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-04.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-05.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-06.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-07.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-08.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-09.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-10.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-11.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-12.csv
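If you'd rather script the staging step, here is a minimal Python sketch; the bucket and prefix names are assumptions for illustration, not from the original post:

```python
# Stage the 2015 FHV trip files in S3 for the crawler to pick up.
# NOTE: bucket and prefix are hypothetical placeholders; use your own.
import boto3
import requests

BUCKET = "searce-bigdata"      # assumed bucket (matches the target bucket used later)
PREFIX = "etl-job/csv_files"   # assumed prefix the crawler will scan

s3 = boto3.client("s3")
base = "https://s3.amazonaws.com/nyc-tlc/trip+data"

for month in range(1, 13):
    name = f"fhv_tripdata_2015-{month:02d}.csv"
    # Stream each monthly file from the public dataset straight into our bucket
    with requests.get(f"{base}/{name}", stream=True) as resp:
        resp.raise_for_status()
        s3.upload_fileobj(resp.raw, BUCKET, f"{PREFIX}/{name}")
    print(f"uploaded {name}")
```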

Create the crawler:

In the Glue console, add a crawler that points at the S3 path holding the CSV files. The crawler's IAM role needs:

AWSGlueServiceRole (the managed policy)
S3 read/write access to your bucket.
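The crawler can also be created programmatically. A minimal boto3 sketch, with hypothetical names for the crawler, role, database, and S3 path:

```python
# Create the crawler with boto3 instead of the console.
# NOTE: all names below are hypothetical placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="csv-files-crawler",           # assumed crawler name
    Role="AWSGlueServiceRole-demo",     # role carrying the permissions above
    DatabaseName="etl_demo",            # catalog database (create it beforehand)
    Targets={"S3Targets": [{"Path": "s3://searce-bigdata/etl-job/csv_files/"}]},
)
```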

Run the Crawler

Table added to the Data Catalog.
csv_files table created in the database (the table schema matches the CSV files).
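To run the crawler from code and confirm the inferred schema, a sketch reusing the hypothetical names above:

```python
# Start the crawler, wait for it to finish, then print the inferred schema.
import time
import boto3

glue = boto3.client("glue")
glue.start_crawler(Name="csv-files-crawler")

# Poll until the crawler is back in the READY state
while glue.get_crawler(Name="csv-files-crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)

table = glue.get_table(DatabaseName="etl_demo", Name="csv_files")["Table"]
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
```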

Create the Parquet conversion job:

In the Glue console, add a job that reads the csv_files table as its source and writes Parquet with these settings:

DataStore: S3
Format: Parquet
TargetPath: s3://searce-bigdata/etl-job/parquet_files
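Glue generates a PySpark script for a job like this. Here is a sketch of what it looks like for our source table and target path; the database and table names carry over the assumptions from the earlier sketches:

```python
# Glue ETL job: read the crawled CSV table, write it out as Parquet.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the csv_files table the crawler registered in the Data Catalog
source = glueContext.create_dynamic_frame.from_catalog(
    database="etl_demo", table_name="csv_files"
)

# Target: Parquet files under the TargetPath configured above
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://searce-bigdata/etl-job/parquet_files"},
    format="parquet",
)

job.commit()
```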

Let's check the files in S3.
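A quick boto3 sketch to list the output (same bucket and target prefix as above):

```python
# Verify the conversion by listing the Parquet objects under the target prefix.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="searce-bigdata", Prefix="etl-job/parquet_files/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```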

Written by Bhuvanesh
BigData | Database & Cloud Architect | blogger thedataguy.in

Searce Engineering: We identify better ways of doing things!