Connect to AWS Athena using DataGrip
DataGrip is a great database IDE, just like the other IDEs from JetBrains: PyCharm, IntelliJ IDEA, and so on. In this blog post, I describe how to connect DataGrip to an AWS Athena database, for my own reference and hopefully helpful for someone else as well. This assumes that you already have data on S3, that it is connected to Athena, and that you can already query it in the AWS console.
Step 1: download and install the JDBC driver
You can find the JDBC driver here. Store it somewhere on your hard drive. Then open DataGrip and go to File -> Data Sources. Click the + sign in the top left and select Driver. Complete the form as shown below:
Step 2: Configure an IAM user with the right permissions
Go to the AWS IAM console and create a user with the following inline policy (or attach the policy to an existing user):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:*"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:BatchDeleteTable",
                "glue:UpdateTable",
                "glue:GetTable",
                "glue:GetTables",
                "glue:BatchCreatePartition",
                "glue:CreatePartition",
                "glue:DeletePartition",
                "glue:BatchDeletePartition",
                "glue:UpdatePartition",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload",
                "s3:CreateBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::THE-S3-BUCKET-TO-STORE-QUERY-RESULTS"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::THE-S3-BUCKET-TO-QUERY"
            ]
        }
    ]
}
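If you prefer to script your IAM setup, a small Python sketch like the following can substitute the bucket names into the policy and check that the result is still valid JSON before you paste it into the console. The template below is an abbreviated version of the policy above (the Glue statement is omitted for brevity), and the function name is just illustrative:

```python
import json

# Abbreviated version of the inline policy above, with the two
# bucket placeholders still in place. The Glue statement from the
# full policy is omitted here for brevity.
POLICY_TEMPLATE = """
{
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["athena:*"], "Resource": ["*"]},
        {"Effect": "Allow",
         "Action": ["s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket",
                    "s3:ListBucketMultipartUploads", "s3:ListMultipartUploadParts",
                    "s3:AbortMultipartUpload", "s3:CreateBucket", "s3:PutObject"],
         "Resource": ["arn:aws:s3:::THE-S3-BUCKET-TO-STORE-QUERY-RESULTS"]},
        {"Effect": "Allow",
         "Action": ["s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket"],
         "Resource": ["arn:aws:s3:::THE-S3-BUCKET-TO-QUERY"]}
    ]
}
"""

def render_policy(results_bucket: str, data_bucket: str) -> str:
    """Substitute the placeholder bucket names and validate the JSON."""
    policy = (POLICY_TEMPLATE
              .replace("THE-S3-BUCKET-TO-STORE-QUERY-RESULTS", results_bucket)
              .replace("THE-S3-BUCKET-TO-QUERY", data_bucket))
    json.loads(policy)  # raises ValueError if the policy is not valid JSON
    return policy
```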
Don’t forget to replace THE-S3-BUCKET-TO-QUERY with the actual S3 bucket that Athena queries, and THE-S3-BUCKET-TO-STORE-QUERY-RESULTS with an S3 bucket that Athena can use to store its query results. When you query Athena using the AWS console, that results bucket is something like s3://aws-athena-query-results-1234567890-eu-west-1.
Make sure to create and download an Access Key for that user. You can do that by going to that user and opening the Security Credentials tab.
Once there, click the Access Key button.
Copy the Access key ID and the Secret access key somewhere safe. Those will be the username and password you need to connect to Athena in step 3.
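Before wiring up DataGrip, you can sanity-check the new key pair outside the IDE. Here is a minimal sketch using boto3 (an assumption on my part; DataGrip itself talks to Athena through the JDBC driver). The Athena client is passed in as a parameter, and the function names are just illustrative:

```python
def make_athena_client(access_key_id: str, secret_access_key: str,
                       region: str = "eu-west-1"):
    """Build an Athena client from the access key created in step 2."""
    import boto3  # imported lazily so the rest of the sketch runs without boto3
    return boto3.client(
        "athena",
        region_name=region,
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
    )

def run_test_query(athena_client, staging_dir: str) -> str:
    """Submit a trivial query; a successful call confirms the key pair works.

    `staging_dir` is the s3://... query-results bucket from the policy above.
    Returns the query execution id.
    """
    response = athena_client.start_query_execution(
        QueryString="SELECT 1",
        ResultConfiguration={"OutputLocation": staging_dir},
    )
    return response["QueryExecutionId"]
```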
Step 3: Add an Athena data source in Datagrip
Open DataGrip again and go to the data sources again (File -> Data Sources). Click the + symbol again and now select Athena Driver. In the General tab, add the following URL: jdbc:awsathena://athena.eu-west-1.amazonaws.com:443/. Change the region if needed. Also set the user and password: as you could’ve guessed, those are the Access key ID and the Secret access key you created in step 2. Next, go to the Advanced tab and set s3_staging_dir to the S3 query-results bucket you configured above. Now go back to the General tab to test your connection. It should work.
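The fields from steps 2 and 3 can be summarized in one small helper. The URL format and the s3_staging_dir key mirror the values above; the function and parameter names are just illustrative:

```python
def athena_connection_settings(access_key_id: str, secret_access_key: str,
                               staging_dir: str,
                               region: str = "eu-west-1") -> dict:
    """Collect the DataGrip connection fields from steps 2 and 3 in one place."""
    return {
        # General tab: JDBC URL on port 443, with a trailing slash
        "url": f"jdbc:awsathena://athena.{region}.amazonaws.com:443/",
        "user": access_key_id,          # General tab: the Access key ID
        "password": secret_access_key,  # General tab: the Secret access key
        "s3_staging_dir": staging_dir,  # Advanced tab: s3://... results bucket
    }
```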
If you don’t see all the tables you expected, go to the Schemas tab and select the schemas you need.
That’s it. Happy querying massive amounts of data on S3 from the convenience of DataGrip. Do be wary of the costs: in Athena, you pay $5 per TB scanned, and DataGrip doesn’t show you how much data each query scanned. As general tips:
- Use partitioning as much as possible
- Use columnar formats such as Parquet or ORC
- Keep track of your spending
In practice, we’ve observed that Athena costs very little for what it delivers. Remember that even a small RDS database quickly costs about $35 per month. An example aggregation query on a table of 1 billion rows, stored in Parquet format, scanned about 1 GB: Athena only scans the columns needed by the aggregation, and Parquet compresses really well. That query took 6 seconds to execute. To compare with a small RDS database: for the same $35, Athena can run 7000 of those queries. Asking a small RDS database to run 7000 aggregation queries on 1 billion rows would probably take several months to execute, and would need a lot more storage than 1 GB. :-)
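The back-of-the-envelope numbers above can be checked in a few lines. This sketch uses the $5/TB price and the roughly 1 GB per query figure from the text, and treats 1 TB as 1000 GB for a rough estimate:

```python
PRICE_PER_TB = 5.0  # Athena price in USD per TB scanned
GB_PER_TB = 1000    # decimal units, good enough for a rough estimate

def query_cost(gb_scanned: float) -> float:
    """Cost in USD of a single query scanning `gb_scanned` gigabytes."""
    return gb_scanned * PRICE_PER_TB / GB_PER_TB

def queries_per_budget(budget: float, gb_scanned: float) -> int:
    """How many such queries fit in a budget (e.g. a $35 monthly RDS bill)."""
    # round rather than truncate, to absorb floating-point noise
    return round(budget / query_cost(gb_scanned))
```

With these numbers, a 1 GB scan costs half a cent, and $35 buys 7000 such queries.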