AWS Athena: Into R, Python & Tableau (Part 2)

Anh Dang
6 min readJun 30, 2019

--

Note: I use ‘Love-Matching’ to illustrate how we configure and ensure the compatibility (Quite relevant I think)

It’s not new to connect databases to R or to Python, and write SQL to pull data and feed into the working pipeline: further data processing, feature engineering, model training, etc. on R or Python. This practice is efficient as it makes the logic of pulling data centralised and consistent, recorded in the body of R / Python scripts, rather than pulling multiple datasets separately and saved it somewhere in folders.

In Part 1, we talk about interacting with Athena via browser. In this final part about AWS Athena, I would like to show how to connect Athena with R / Python, further than that Tableau. Finally, SQL Workbench to query Athena in local machine is also introduced.

0a. Install & Set-up

  1. Download & Install Java Development Kit. To check if it’s installed sucessfully, run in Terminal java -version. This is how it is shown up (in Mac Terminal)
  1. Download JDBC
  2. I would assume that you already have R (RStudio) and Python (Jupyter Notebook) in your machine
  3. It’s able to ‘play’ with Athena with local application of SQL Workbench. If you are Mac user, it’s available to install by brew, type in Terminal: brew cask install sqlworkbenchj

It might worth to talk a little bit about ‘why-you-need-what’. To keep it very simple,

  • JDBC is the Java program enables your local machine to be connected with Athena. Once you downloaded, it would have the file name as AthenaJDBC42_2.0.7.jar. You would need to keep in mind the path to this .jar file in your laptop, to instruct the applications (R, Python, Tableau, SQL Workbench, etc.) later.
  • Java Development Kit: Intuitively speaking, you need something to read the .jar file. In fact, JDBC .jar file is run by Java Runtime Environment (included in Java Development Kit.

0b. Get AWS Credentials (Auth. Tokens)

In reality, you could ask the admin (root user) of your system to provide the credentials for you. In case, you have the permission (considering that we work by indivudal accounts in Free Tier), the credentials could be retrieve in IAM.

  1. Starting with AWS Console, we can access to provided services

2. Go to Users, choose the User name corresponding with the account you use to connect with Athena.

3. In Security credentials tab, click Create access key

4. You would be provided Access key ID and Secret access key. These two credentials should be provided for applications (R, Python, Tableau, etc.) to be allowed to connect with Athena (sometimes in more secured settings, token could be required, but the way to set-up is similar). Notice that, you can only see/save credentials one time. So, keep it at some places safe (I would recommend to set credentials as Environment Variables, as in next sections). Or, you could use AWS CLI (Command-Line Interface) to configure AWS credentials (I would write about it sometime later, as AWS CLI is cool to open up more conveniences).

1. Athena in R

  1. Configure Java in R, in your Terminal, type: R CMD javareconf (if your are Mac Users). For Win, it is R COMMAND javareconf(?, please correct me if I’m wrong).

2. Set up AWS credentials in file .Renviron. As I’m lazy to find that file in my machine, I usually use very handy function: usethis::edit_r_environ() (Hint: :: means the function is called from package usethis) and .Renviron would be opened for you, put these two lines in the file. Close and Restart RStudio.
ATHENA_USER = '<put-the-string-of-your-access-key-id>'
ATHENA_PASSWORD = '<put-the-string-of-your-secret-key-id>'

Why we set the Environment Variables? Cause later you can paste your credentials as below in your R Scripts, and freely share your codes with collaborators without revealing your actual credentials string:
Sys.getenv('ATHENA_USER')
Sys.getenv('ATHENA_PASSWORD')

3. Install packages: RJDBC, rJava, DBI (if it throws you errors in Java, check the path and version of Java)

4. A typical code to query from Athena would be as:

2. Athena in Python

  1. For Python, I use PyAthena. Install via conda: conda install -c conda-forge pyathena
  2. Similar to R, I set up AWS credentials into Environmental variables as below (Don’t keep these on the scripts you send to others):
    import os
    os.environ['ATHENA_USER'] = '<put-the-string-of-your-access-key-id>'
    os.environ['ATHENA_PASSWORD] = '<put-the-string-of-your-secret-key-id>'

    So, later we can call them as:
    os.environ['ATHENA_USER']
    os.environ['ATHENA_PASSWORD']
  3. Now, I could run this typical code to query table from Athena:

3. Athena in Tableau

  1. Copy the JDBC .jar file to: ~/Library/Tableau/Drivers (Mac), or C:\Program Files\Tableau\Drivers (Win)
  2. Open Tableau, in Connections, choose Amazon Athena, input AWS credentials
  3. SQL as usual

4. Athena in SQL Workbench

  1. In SQL Workbench, click Manage Drivers

2. Choose the path to JDBC we have downloaded

3. Set up in New Profile: URL (same), Username, Password, S3 query results in Extended Properties

4. Explore the Databases

5. In Statement, you can write SQL to query data. Workbench also supports writing DDL

5. Troubleshooting

  1. Java might need to be configured in your laptop. In human language, the machine knows where Java locates. If it fails to do so, you might need to instruct the machine to recognize the path. In Mac, I would touch and open .bash_profile. In Win, you could access to Settings and set-up the Global Environment Variables Details follows instructions in the Internet (AWS documentations, Java documentations, StakeOverFlow, etc.)
  2. If you have problem with Java, check if the version you downloaded compatible with:
  3. JDBC from AWS
  4. Package 'RJDBC' and 'rJava' in RStudio

This is the end of my series about AWS Athena. As the purpose is introducing, I attempt to keep things as simple as possible. It’s the perk of anyone to explore things furthermore, and if you can, please share your knowledge with us. I hope you like the concepts and the potential of Athena. Personally, I think it is a well-done service, contributing significantly to the spreading of data-driven approaches in multiple aspects of life.

Or, at least, in much smaller scale, it’s a new thing that triggers me to learn with the huge excitement.

I’m not an extremely brilliant learner. So, perhaps my struggle to understand new things would make the learning curve less steep for someone else. Then, I wrote these.

References: As in Hyperlinks

--

--