Convert CSV / JSON files to Apache Parquet using AWS Glue


AWS Glue is a fully managed, serverless ETL service from AWS. In our recent projects we have been working with the Parquet file format to reduce both file size and the amount of data scanned. Of course, I'm a CSV lover; I can play with it using Athena, BigQuery, and so on. But customers are not like me: at the end of the day they want to reduce cost, and Parquet is the perfect solution for that. To convert these files we used to run an AWS EMR cluster or a GCP Dataproc cluster, but those clusters keep incurring charges until the conversion is done. We wanted a solution that needs zero administration, and now we use Glue for this. Yes, we can convert CSV/JSON files to Parquet using AWS Glue, but that is not the only use case: you can convert to any of the formats below.

  • CSV
  • JSON
  • Parquet
  • Avro
  • ORC
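In a Glue job script, the output format is just a parameter on the write call. The fragment below is a sketch rather than a complete job: it assumes it runs inside a Glue job where a GlueContext (glue_context) and a DynamicFrame (dyf) already exist, and the S3 path is illustrative. A full job script sketch appears later in this post.

```python
# Fragment, not a complete job: assumes glue_context (GlueContext) and dyf
# (DynamicFrame) already exist inside a Glue job. The S3 path is illustrative.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://searce-bigdata/etl-job/output"},
    format="parquet",  # or "csv", "json", "avro", "orc"
)
```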

Why Parquet?

Parquet is a columnar file format that provides efficient storage: because data is stored column by column, better compression and encoding algorithms can be applied. We mostly query large files in Athena, and BigQuery also supports the Parquet file format, so Parquet gives us better control over both performance and cost.
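If you want a quick feel for the difference, here is a small local sketch (outside Glue) that writes the same made-up data as CSV and as Parquet and compares file sizes. It assumes pandas and pyarrow are installed; the columns are loosely modelled on the FHV data and are purely illustrative.

```python
# Quick local illustration (not part of the Glue job): write the same data
# as CSV and as Parquet and compare the resulting file sizes.
import os
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pickup_datetime": pd.date_range("2015-01-01", periods=1_000_000, freq="min"),
    "dispatching_base_num": np.random.choice(["B00001", "B00013", "B00053"], 1_000_000),
})

df.to_csv("trips.csv", index=False)
df.to_parquet("trips.parquet", index=False)  # Snappy-compressed by default (pyarrow)

print("CSV size:    ", os.path.getsize("trips.csv"))
print("Parquet size:", os.path.getsize("trips.parquet"))
```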

Let's start converting the files to Parquet.

  • CSV file location: s3://searce-bigdata/etl-job/csv_files
  • Parquet file location: s3://searce-bigdata/etl-job/parquet_files
  • Sample CSV files (from the NYC taxi data; a small download/upload sketch follows this list):
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-04.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-05.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-06.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-07.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-08.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-09.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-10.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-11.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-12.csv
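Here is a rough sketch for pulling the sample files and pushing them into the CSV folder above. It assumes the requests and boto3 libraries, AWS credentials with write access to the bucket, and enough local disk space; adjust the bucket and prefix to your own.

```python
# Sketch: download the sample FHV files and upload them to the CSV folder.
import boto3
import requests

BASE = "https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2015-{:02d}.csv"
BUCKET = "searce-bigdata"        # replace with your own bucket
PREFIX = "etl-job/csv_files"

s3 = boto3.client("s3")
for month in range(1, 13):
    url = BASE.format(month)
    filename = url.rsplit("/", 1)[-1]
    # Stream the download to disk, then upload to S3.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    s3.upload_file(filename, BUCKET, f"{PREFIX}/{filename}")
```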

Create the crawlers:

We need to create and run a crawler to identify the schema of the CSV files. The console steps are below, and a boto3 sketch of the same crawler follows the list.

  • Go to AWS Glue home page.
  • From the Crawlers → add crawler.
  • Give a name for your crawler.
  • Choose S3 as the data source, and set the include path to your CSV files folder.
  • The next step asks whether to add another data source; just click No.
  • The next step is to select the IAM role. The crawler needs read access to S3, but to save the Parquet files later the job needs write access too, so create a role with the following policies:
AWSGlueServiceRole
S3 Read/Write access for your bucket.
  • In the next step, leave the crawler's schedule as Run on demand.
  • Then it'll ask for a database name in which to create the table schema for the CSV files.
  • Give the path name as: s3://searce-bigdata/etl-job/csv_files
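If you prefer doing this from code instead of the console, roughly the same crawler can be created with boto3. The crawler name, role ARN, and database name below are illustrative placeholders; the role must carry AWSGlueServiceRole plus S3 access, as described above.

```python
# Sketch: create the crawler programmatically with boto3.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="csv-files-crawler",                           # hypothetical crawler name
    Role="arn:aws:iam::123456789012:role/GlueETLRole",  # hypothetical role ARN
    DatabaseName="etl_job_db",                          # hypothetical database name
    Targets={"S3Targets": [{"Path": "s3://searce-bigdata/etl-job/csv_files"}]},
)
```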

Run the Crawler

Once it's created, the console will ask whether to run it now. Click Run and wait a few minutes; you'll see that it has created a new table in the Data Catalog with the same schema as your CSV files. (A boto3 sketch for triggering the crawler from code follows below.)

Table added to the Data Catalog

The table is named after the CSV file location: [csv_files]

csv_files table created in the database (the CSV files and the table schema match)
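The same run can be triggered from code. The sketch below starts the crawler, waits for it to return to the READY state, and lists the tables in the database; the crawler and database names are the illustrative ones from the previous sketch.

```python
# Sketch: run the crawler with boto3 and check the table it created.
import time
import boto3

glue = boto3.client("glue")
glue.start_crawler(Name="csv-files-crawler")

# Poll until the crawler has finished its run and is READY again.
while glue.get_crawler(Name="csv-files-crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)

tables = glue.get_tables(DatabaseName="etl_job_db")["TableList"]
print([t["Name"] for t in tables])   # expect something like ['csv_files']
```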

Create Parquet conversion Job:

  • In the ETL section, go to Jobs → Add job.
  • Give a name for your job and select the IAM role (the one we created in the previous step).
  • Choose data source: select the data source (the table) created by the crawler.
  • In Choose your data target:
Data store: S3
Format: Parquet
Target path: s3://searce-bigdata/etl-job/parquet_files
  • The next window is for column mapping. If you need to remap or drop any columns from the CSV, you can do it here.
  • The next screen shows the diagram and the generated source code for the job (it looks roughly like the sketch after this list). Just click the Run job button.
  • Wait a few minutes (it depends on your total amount of data) for the job to complete. You can see the logs at the bottom.
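For reference, the script Glue generates for such a job looks roughly like the sketch below (PySpark with the awsglue library, runnable only inside Glue). The database, table, and column names here are illustrative; your generated mappings will reflect the columns the crawler actually found.

```python
# Sketch of an auto-generated-style Glue job: Data Catalog table in, Parquet out.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler created in the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="etl_job_db", table_name="csv_files"
)

# Optional column mapping step: rename, drop, or re-type columns here.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("dispatching_base_num", "string", "dispatching_base_num", "string"),
        ("pickup_date", "string", "pickup_datetime", "timestamp"),
    ],
)

# Write the result back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://searce-bigdata/etl-job/parquet_files"},
    format="parquet",
)

job.commit()
```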

Let's check the files in S3.

Go to s3://searce-bigdata/etl-job/parquet_files and check the converted files and their sizes.

Let's kick-start your ETL skills with Glue. Try it out, then query the data with Athena and compare the amount of data scanned from CSV with the amount scanned from Parquet; a sketch of that comparison follows.
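A rough sketch of that comparison with boto3 and Athena is below. It assumes the Parquet folder has also been crawled into a table (here called parquet_files) and that you have an S3 location for Athena query results; all names are illustrative.

```python
# Sketch: run the same count over the CSV and Parquet tables in Athena
# and compare the bytes scanned by each query.
import time
import boto3

athena = boto3.client("athena")

def scanned_bytes(sql):
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "etl_job_db"},
        ResultConfiguration={"OutputLocation": "s3://searce-bigdata/athena-results/"},
    )["QueryExecutionId"]
    # Poll until the query finishes, then report how much data it scanned.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        state = execution["Status"]["State"]
        if state == "SUCCEEDED":
            return execution["Statistics"]["DataScannedInBytes"]
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {state}: {sql}")
        time.sleep(2)

print("CSV scanned:    ", scanned_bytes("SELECT count(*) FROM csv_files"))
print("Parquet scanned:", scanned_bytes("SELECT count(*) FROM parquet_files"))
```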

Happy ETL :)