Connect to AWS Athena using DataGrip

Kris Peeters · datamindedbe · May 1, 2018

DataGrip is a great database IDE, just like the other IDEs from JetBrains: PyCharm, IntelliJ IDEA, and so on. In this blog, I describe how to connect DataGrip to an AWS Athena database, for my own reference and hopefully helpful for someone else as well. This assumes that you already have data on S3, that it is connected to Athena, and that you can already query it in the AWS console.

Step 1: download and install the JDBC driver

You can download the Athena JDBC driver from the AWS documentation (search for "Amazon Athena JDBC driver"). Store it somewhere on your hard drive. Then, open DataGrip and go to File -> Data Sources. Click the + sign in the top left and select Driver. Point the new driver to the .jar file you just downloaded, pick the driver class that DataGrip detects in the jar, and give the driver a recognizable name such as Athena Driver.

Step 2: Configure an IAM user with the right permissions

Go to the AWS IAM console and create a user with the following inline policy (or attach the policy to an existing user):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:DeleteDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:UpdateDatabase",
        "glue:CreateTable",
        "glue:DeleteTable",
        "glue:BatchDeleteTable",
        "glue:UpdateTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:BatchCreatePartition",
        "glue:CreatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition",
        "glue:UpdatePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload",
        "s3:CreateBucket",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::THE-S3-BUCKET-TO-STORE-QUERY-RESULTS",
        "arn:aws:s3:::THE-S3-BUCKET-TO-STORE-QUERY-RESULTS/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::THE-S3-BUCKET-TO-QUERY",
        "arn:aws:s3:::THE-S3-BUCKET-TO-QUERY/*"
      ]
    }
  ]
}

Don’t forget to replace THE-S3-BUCKET-TO-QUERY with the S3 bucket that holds the data you query through Athena, and THE-S3-BUCKET-TO-STORE-QUERY-RESULTS with an S3 bucket that Athena can use to store its query results. (Note that the object-level actions such as s3:GetObject and s3:PutObject need the /* variant of the bucket ARN, as in the policy above.) When you query Athena from the AWS console, the results bucket looks something like s3://aws-athena-query-results-1234567890-eu-west-1. Make sure to create and download an access key for that user. You can do that by going to that user, and then to the Security credentials tab.

Once there, click the Create access key button.

Copy the Access key ID and the Secret access key somewhere. Those will be the username and password you need to connect to Athena in Step 3.
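If you want to check that the new user can actually run queries before you touch DataGrip, a quick sanity test with boto3 works well. This is only a sketch: the database name and the result bucket below are placeholders you would replace with your own values.

import time
import boto3

# Placeholders: replace with your own values.
REGION = "eu-west-1"
RESULT_BUCKET = "s3://THE-S3-BUCKET-TO-STORE-QUERY-RESULTS/"
DATABASE = "my_database"  # hypothetical database name

# Uses the access key ID / secret access key created above.
athena = boto3.client(
    "athena",
    region_name=REGION,
    aws_access_key_id="YOUR-ACCESS-KEY-ID",
    aws_secret_access_key="YOUR-SECRET-ACCESS-KEY",
)

# Start a trivial query and poll until it finishes.
execution = athena.start_query_execution(
    QueryString="SELECT 1",
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": RESULT_BUCKET},
)
query_id = execution["QueryExecutionId"]

while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print("Query finished with state:", state)

If this prints SUCCEEDED, the IAM policy and the access key are fine, and any remaining trouble is on the DataGrip side.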

Step 3: Add an Athena data source in Datagrip

Open DataGrip again and go to Data Sources again (File -> Data Sources). Click the + symbol again, but this time select Athena Driver, the driver you created in step 1. In the General tab, add the following URL: jdbc:awsathena://athena.eu-west-1.amazonaws.com:443/. Change the region if needed. Also set the user and password; as you could have guessed, those are the Access key ID and the Secret access key you created in step 2. Next, go to Advanced and set s3_staging_dir to the S3 query-results bucket you configured above. Now go back to the General tab and test your connection. It should work.
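If the connection test fails, it helps to try the same settings outside DataGrip so you can tell driver problems apart from credential problems. Here is a small sketch using the pyathena library (not part of the original setup); the region and staging directory are the same values you entered in the driver settings.

from pyathena import connect

# Placeholders: replace with your own values.
conn = connect(
    aws_access_key_id="YOUR-ACCESS-KEY-ID",
    aws_secret_access_key="YOUR-SECRET-ACCESS-KEY",
    s3_staging_dir="s3://THE-S3-BUCKET-TO-STORE-QUERY-RESULTS/",  # same value as s3_staging_dir in the Advanced tab
    region_name="eu-west-1",  # same region as in the JDBC URL
)

cursor = conn.cursor()
cursor.execute("SHOW DATABASES")  # lists the schemas Athena knows about
for row in cursor:
    print(row)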

If you don’t see all the tables you expected, you can go to the Schemas tab and select the schemas you need.

That’s it. Happy querying massive amounts of data on S3 from the convenience of DataGrip. Do be wary of the costs. In Athena, you pay $5 per TB scanned, and in DataGrip you don’t see how much data each query scanned. Some general tips:

  • Use partitioning as much as possible (combined with Parquet in the sketch after this list)
  • Use columnar formats such as Parquet or ORC
  • Keep track of your spending
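To make the first two tips concrete, here is a hedged sketch of the Athena SQL involved, sent through the same pyathena connection as above. The table, columns, and S3 locations are made up for illustration.

# Assumes the `conn` object from the earlier pyathena sketch.
cursor = conn.cursor()

# Hypothetical table: events stored as Parquet, partitioned by day.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        user_id  bigint,
        action   string,
        amount   double
    )
    PARTITIONED BY (dt string)
    STORED AS PARQUET
    LOCATION 's3://THE-S3-BUCKET-TO-QUERY/events/'
""")

# Register the partitions that already exist on S3 (assumes dt=... folders).
cursor.execute("MSCK REPAIR TABLE events")

# Filtering on the partition column means Athena only scans that day's
# Parquet files, and only the columns used in the query.
cursor.execute("""
    SELECT action, count(*) AS cnt
    FROM events
    WHERE dt = '2018-05-01'
    GROUP BY action
""")
print(cursor.fetchall())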

In reality, we’ve observed that Athena costs very little for what it delivers. Remember that a small RDS database quickly costs about $35 per month. An example aggregation query on a table of 1 billion rows, stored in Parquet format, scanned about 1 GB and took 6 seconds to execute; Athena only scans the columns the aggregation needs, and Parquet compresses really well. At $5 per TB scanned, $35 buys about 7 TB of scanning, so with Athena we can run roughly 7,000 of those 1 GB queries for the price of one month of a small RDS instance. Asking a small RDS database to run 7,000 aggregation queries on 1 billion rows would probably take several months to execute, and would need a lot more storage than 1 GB. :-)

Kris Peeters · Data geek at heart. Founder and CEO of Data Minded.