Effortless Efficiency: Converting CSV to ORC with Athena CTAS

5 min readAug 9, 2024

Introduction

Athena CTAS (Create Table As) is a powerful feature of Amazon Athena, a serverless query service for analyzing data in Amazon S3 using standard SQL. CTAS allows you to create a new table by executing a SQL query and specifying the desired format, compression, and partitioning options. This feature is particularly useful for transforming, filtering, and optimizing data as it’s being queried.

Pre-requisites

An AWS Account you can use.
An S3 bucket holding CSV data. (You can use https://github.com/kalikirisrikarreddy/amazon-athena to generate and store data in S3 in the way you want)
An S3 bucket to hold Athena query results.
An S3 bucket to store ORC data.

Execution steps

Let’s assume you already have Gzipped CSV data sitting in one of your S3 buckets.

Lets assume you already have an Athena table created over this partitioned data.

CREATE EXTERNAL TABLE `employeestablebackedbycsvins3`(
  `id` string,
  `firstname` string,
  `lastname` string,
  `company` string,
  `paddress` string,
  `facttheylike` string
)
PARTITIONED BY ( `year` int, `month` int, `day` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar'=',',
  'quoteChar'='"'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://athena-datasets-2/'
TBLPROPERTIES (
  'classification'='csv',
  'projection.day.range'='1,31',
  'projection.day.type'='integer',
  'projection.enabled'='true',
  'projection.month.range'='1,12',
  'projection.month.type'='integer',
  'projection.year.range'='2001,2020',
  'projection.year.type'='integer',
  'write.compression'='GZIP'
)

Now let’s use CTAS to convert the Gzipped CSV data to Snappy compressed ORC data.

An important limitation to using CTAS is that CTAS can’t handle more than 100 partitions per query.

We can get around it by using CTAS to first create Snappy compressed ORC data from a small subset of Gzipped CSV data which does not exceed 100 partition like below.

CREATE TABLE IF NOT EXISTS employeestablebackedbyorcins3 WITH (
  format = 'ORC',
  parquet_compression = 'SNAPPY',
  partitioned_by = ARRAY [ 'year', 'month', 'day' ],
  external_location = 's3://athena-datasets-3/'
) AS
SELECT id,
  firstname,
  lastname,
  company,
  paddress,
  facttheylike,
  year,
  month,
  day
FROM employeestablebackedbycsvins3
WHERE year = 2001 and month = 1;

And then load data into the Snappy compressed ORC data table incrementally without exceeding 100 partitions. for example,

INSERT INTO employeestablebackedbyorcins3
SELECT id,
  firstname,
  lastname,
  company,
  paddress,
  facttheylike,
  year,
  month,
  day
FROM employeestablebackedbycsvins3
WHERE year = 2002 and month >= 1 and month <= 3;

If you use the Athena query editor to run these queries one-by-one, it is going to eat up a lot of your time. Instead you can use step functions to load the data in parallel. It could look something like below.

Lets look at the results.

Conclusion

You might be thinking that you could’ve used hive to do the same. Well you could, but Amazon Athena CTAS is a serverless, simple, and cost-effective choice for converting CSV data to ORC format and loading it into an S3 bucket. It requires no cluster management and is pay-per-query. In contrast, Apache Hive on EMR involves setting up and managing clusters(optional if you use EMR serverless), requires HiveQL queries, offers more control but is more complex and potentially costly due to cluster provisioning and S3 storage costs. Choose Athena for simplicity and cost-efficiency, and Hive on EMR for more control and customization, considering your specific needs and expertise.

Disclaimer

Disclaimer: The information provided in this blog post is for general informational purposes only. It is not intended to provide professional advice or substitute for consultation with qualified experts. The author(Srikar Reddy Kalikiri) make no representations as to the accuracy, completeness, suitability, or validity of any information on this post. The use or reliance on any content is solely at your own risk.

References to Amazon Web Services (AWS) products, services, or features are made in the context of sharing knowledge and experiences. However, the author(Srikar Reddy Kalikiri) is not affiliated with AWS, and the opinions expressed in these blog posts are those of the author(Srikar Reddy Kalikiri) alone.

Readers are advised to independently verify the accuracy and applicability of any information presented here with official AWS documentation and resources. The author(Srikar Reddy Kalikiri) will not be liable for any errors, omissions, or delays in this information or any losses, injuries, or damages arising from its display or use.

When making decisions related to AWS services, it is recommended to consult official AWS documentation and seek advice from certified AWS professionals or legal experts as needed.

The use of any trademarks, names, or brands mentioned in this blog post does not imply endorsement or affiliation with the author(Srikar Reddy Kalikiri). All trademarks, product names, and company names or logos mentioned herein are the property of their respective owners.