How to collect data like a spy — Part 5

Connecting rStudio to Amazon Athena

Prerequisites

Before you get started, complete the following steps.

  • Have your AWS account administrator give your AWS account the required permissions to access Athena via Amazon’s Identity and Access Management (IAM) console.
  • Provide a staging directory in the form of an Amazon S3 bucket. Athena will use this to query datasets and store results. We’ll call this staging bucket s3://athena-staging in the instructions that follow.

Set up R and RStudio on EC2

Create yourself a machine with at least T2.medium.

When you launch an instance in EC2, you can pass in user data that can be used to perform common automated configuration tasks and even run scripts for installation after the instance starts. In the EC2 launch wizard, you can add this at the Configure Instance Details step by expanding the Advanced Details pane.

Use the following script, which will install RStudio.

#!/bin/bash
#install R
yum install -y R
#install RStudio-Server
wget https://download2.rstudio.org/rstudio-server-rhel-1.0.136-x86_64.rpm
yum install -y — nogpgcheck rstudio-server-rhel-1.0.136-x86_64.rpm
#add user(s)
useradd rstudio
echo rstudio:rstudio | chpasswd

Install Java 8

SSH into this EC2 instance.
Remove older versions of Java.
Install Java 8. This is required to work with Athena.
Run the following commands on the command line.

#install Java 8, select ‘y’ from options presented to proceed with installation
sudo yum install java-1.8.0-openjdk-devel
#remove version 7 of Java, select ‘y’ from options to proceed with removal
sudo yum remove java-1.7.0-openjdk
#configure java, choose 1 as your selection option for java 8 configuration
sudo /usr/sbin/alternatives — config java
#run command below to add Java support to R
sudo R CMD javareconf
#following libraries are required for the interactive application we build later
sudo yum install -y libpng-devel
sudo yum install -y libjpeg-turbo-devel

Set up .Renviron

You need to configure the R environment variable .Renviron with the required Athena credentials.

Get the required credentials from your AWS Administrator in the form of AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
Type the following command from the Linux command prompt and bring up the nano editor.

sudo nano /home/rstudio/.Renviron
Provide your Athena credentials in the following form into the editor:
ATHENA_USER=< AWS_ACCESS_KEY_ID >
ATHENA_PASSWORD=< AWS_SECRET_ACCESS_KEY>

Log in to RStudio

Next, you’ll log in to RStudio on your EC2 instance.

Get the public IP address of your instance from the EC2 dashboard and paste it followed by :8787 (port number for RStudio) into your browser window.

Confirm that your IP address has been whitelisted for inbound access to port 8787 as part of the configuration for the security group associated with your EC2 instance.

Log in to RStudio with the username and password you provided previously.

Install R packages

Time to install the R packages.

# — following R packages are required for connecting R with Athena
install.packages(“rJava”)
install.packages(“RJDBC”)
library(rJava)
library(RJDBC)
# — following R packages are required for the interactive application we build later
# — steps below might take several minutes to complete
install.packages(c(“plyr”,”dplyr”,”png”,”RgoogleMaps”,”ggmap”))
library(plyr)
library(dplyr)
library(png)
library(RgoogleMaps)
library(ggmap)

Connect to Athena

The following steps in R download the Athena driver and set up the required connection. Use the JDBC URL associated with your region.

#verify Athena credentials by inspecting results from command below
Sys.getenv()
#set up URL to download Athena JDBC driver
URL <- ‘https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
fil
#set up driver connection to JDBC
drv <- JDBC(driverClass=”com.amazonaws.athena.jdbc.AthenaDriver”, fil, identifier.quote=”’”)
#connect to Athena using the driver, S3 working directory and credentials for Athena 
#replace ‘athenauser’ below with prefix you have set up for your S3 bucket
con <- jdbcConnection <- dbConnect(drv, ‘jdbc:awsathena://athena.us-east-1.amazonaws.com:443/’,
s3_staging_dir=”s3://athenauser-athena-r”,
user=Sys.getenv(“ATHENA_USER”),
password=Sys.getenv(“ATHENA_PASSWORD”))
#in case of error or warning from step above ensure rJava and RJDBC packages have #been loaded 
#also ensure you have Java 8 running and configured for R as outlined earlier

Test Athena Connection

Run the following within RStudio, you should get a list of Athena databases.

# get a list of all tables currently in Athena 
dbListTables(con)

If this works, we’re ready to role onto the next part.

The Series

Part One — How to collect data like a spy
Part Two — Getting NiFi up and running
Part Three — How to collect social media data like a pro
Part Four — Creating a database with AWS Athena
Part Five — Connecting RStudio to Athena
Part Six — Creating Maps of the Data in RStudio
Part Seven — Creating an interactive dashboard for your data

Show your support

Clapping shows how much you appreciated Mark Craddock’s story.