Convert CSV / JSON files to Apache Parquet using AWS Glue

Jul 23, 2018 · 4 min read

AWS Glue is a fully managed, serverless ETL service from AWS. In our recent projects we have been using the Parquet file format to reduce file sizes and the amount of data scanned per query. Personally, I'm a CSV lover; I can play with CSV files using Athena, BigQuery, and so on. But customers are not like me: at the end of the day they want to reduce cost, and Parquet is a perfect fit for that. Previously we converted these files using an AWS EMR cluster or a GCP Dataproc cluster, but those clusters keep billing until the conversion is done. We wanted a solution that required zero administration, and now we use Glue for this. Yes, AWS Glue can convert CSV/JSON files to Parquet, but that is not the only use case. It can convert to any of the formats below.

  • CSV
  • JSON
  • Parquet
  • Avro
  • ORC

Why Parquet?

Parquet is a columnar file format that provides efficient storage. The columnar layout compresses better, and efficient encoding algorithms are built in. We mostly query large files in Athena, and BigQuery also supports the Parquet format, so Parquet gives us better control over both performance and cost.
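To make the cost angle concrete, here is a rough back-of-the-envelope calculation. Athena bills per amount of data scanned (around $5 per TB at the time of writing; check current pricing), so if Parquet's column pruning lets a query scan, say, 10 GiB instead of 100 GiB, the saving is proportional. The numbers below are illustrative assumptions, not measurements.

```python
# Rough Athena cost model: you pay per TB of data scanned.
# PRICE_PER_TB is an assumption (~$5/TB at the time of writing); check current pricing.
PRICE_PER_TB = 5.0

def athena_scan_cost(scanned_bytes, price_per_tb=PRICE_PER_TB):
    """Cost in USD for scanning `scanned_bytes` of data."""
    return scanned_bytes / 1024**4 * price_per_tb

# Hypothetical query: full CSV scan vs. Parquet reading only the needed columns.
csv_scanned = 100 * 1024**3       # 100 GiB: a CSV query scans everything
parquet_scanned = 10 * 1024**3    # 10 GiB: Parquet scans only the selected columns

print(f"CSV:     ${athena_scan_cost(csv_scanned):.3f}")
print(f"Parquet: ${athena_scan_cost(parquet_scanned):.3f}")
```

The per-query difference looks small, but it multiplies across every dashboard refresh and ad-hoc query your team runs.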

Let's start converting the files to Parquet.

  • CSV file location: s3://searce-bigdata/etl-job/csv_files
  • Parquet file location: s3://searce-bigdata/etl-job/parquet_files
  • Sample CSV files (from the NYC taxi dataset)

Create the crawlers:

We need to create and run the Crawlers to identify the schema of the CSV files.

  • Go to AWS Glue home page.
  • From Crawlers, click Add crawler.
  • Give a name for your crawler.
  • Choose S3 as the data source, and set the Include path to your CSV files folder.
  • The next step asks whether to add another data source; just click No.
  • Next, select an IAM role. The crawler needs read access to S3, and to save the Parquet files later it also needs write access, so create a role with S3 read/write access to your bucket.
  • In the next step, leave the schedule as Run on demand.
  • Then it asks for a database in which to create the table schema for the CSV files.
  • Give the include path as: s3://searce-bigdata/etl-job/csv_files
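The same crawler can also be created programmatically. Below is a minimal boto3 sketch; the crawler name, role ARN, and database name are placeholders you would replace with your own (the role must carry the S3 read/write policy described above).

```python
def crawler_config(name, role_arn, s3_path, database):
    """Arguments for glue.create_crawler(); all values here are placeholders."""
    return {
        "Name": name,
        "Role": role_arn,                       # IAM role with S3 read/write access
        "DatabaseName": database,               # catalog database for the table schema
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

if __name__ == "__main__":
    import boto3  # needs AWS credentials; runs only against a real account

    glue = boto3.client("glue")
    cfg = crawler_config(
        name="csv-files-crawler",                                    # placeholder
        role_arn="arn:aws:iam::123456789012:role/glue-etl-role",     # placeholder
        s3_path="s3://searce-bigdata/etl-job/csv_files",
        database="etl_job_db",                                       # placeholder
    )
    glue.create_crawler(**cfg)
    glue.start_crawler(Name=cfg["Name"])  # equivalent to clicking "Run"
```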

Run the Crawler

Once it's created, Glue asks you to run it. Click Run and wait a few minutes; you will see that it has created a new table in the Data Catalog with the same schema as your CSV files.

Table added to the Data Catalog

The table is named after the CSV file location folder: csv_files.

The csv_files table created in the database (the table schema matches the CSV files)

Create the Parquet conversion job:

  • In the ETL section, go to Jobs → Add job.
  • Give a name for your job and select the IAM role (the one we created in the previous step).
  • Choose data source: select the data source (table) that the crawler created.
  • In Choose your data target, set:
Data store: S3
Format: Parquet
Target path: s3://searce-bigdata/etl-job/parquet_files
  • The next window is for column mapping. If you need to remap or drop any columns from the CSV, you can do it here.
  • The next screen shows the diagram and the generated source code for the job. Just click the Run job button.
  • Wait a few minutes for the job to complete (the time depends on your total amount of data). You can follow the logs at the bottom.
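If you prefer to script this step too, the console job corresponds roughly to the boto3 calls below. The job name, role ARN, and script location are placeholders; in practice Glue generates the ETL script for you when you create the job in the console.

```python
def job_config(name, role_arn, script_s3_path):
    """Arguments for glue.create_job(); all values here are placeholders."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                 # Spark-based Glue ETL job
            "ScriptLocation": script_s3_path,  # the generated PySpark script
        },
    }

if __name__ == "__main__":
    import boto3  # needs AWS credentials; runs only against a real account

    glue = boto3.client("glue")
    cfg = job_config(
        name="csv-to-parquet",                                          # placeholder
        role_arn="arn:aws:iam::123456789012:role/glue-etl-role",        # placeholder
        script_s3_path="s3://searce-bigdata/etl-job/scripts/job.py",    # placeholder
    )
    glue.create_job(**cfg)
    glue.start_job_run(JobName=cfg["Name"])  # same as clicking "Run job"
```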

Let's check the files in S3.

Go to s3://searce-bigdata/etl-job/parquet_files and inspect the converted files and their sizes.

Let's kick-start your ETL skills with Glue now. Try it, query the data with Athena, and compare the amount of data scanned from CSV versus Parquet.

Update: 2019–10–08

By default, Glue generates a large number of output files, sometimes 500 or more. If you want to limit the number of files, you can do it in two ways. Read those steps in the below link.
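One of those ways is repartitioning the data before writing, so Spark emits fewer, larger Parquet files. A minimal sketch, assuming you aim for roughly 128 MiB per file (a commonly used target; tune it for your workload):

```python
def target_partitions(total_bytes, target_file_bytes=128 * 1024**2):
    """Number of output files needed for ~128 MiB Parquet files (ceiling division)."""
    return max(1, -(-total_bytes // target_file_bytes))

if __name__ == "__main__":
    # Inside a Glue job (pyspark/awsglue are only available on Glue workers):
    #
    # df = dynamic_frame.toDF()                       # DynamicFrame -> Spark DataFrame
    # n = target_partitions(estimated_total_bytes)    # estimate your dataset size
    # df.repartition(n).write.parquet(output_path)    # fewer, larger output files
    pass
```

Repartitioning adds a shuffle, so it costs some job time in exchange for a tidier output layout and better downstream scan performance.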

Happy ETL :)

Searce Engineering

We identify better ways of doing things!
