How-I-Built-It: GitHub Traffic Data Pipeline

Data Technician
5 min read · Feb 5, 2024


I built a pipeline that collects GitHub traffic data and brings it into PowerBI. This is a technical deep dive into the steps I took.

For background on why I built it, see this page.

Note: PowerBI is mainly compatible with Windows operating systems, so once we start getting into local program instructions (e.g., AWS CLI), the syntax becomes Windows-specific.

Tools

Below I’ve listed each tool’s purpose & what it cost me during active development. Maintenance should hopefully stay under the free-tier limits! (Exception: the Simba ODBC driver.)

  1. GitHub Actions: To automate & schedule traffic data collection ($0)
  2. AWS DynamoDB: A NoSQL database for storage (free tier up to 25GB, $0)
  3. AWS Glue: An ETL tool that lets you “crawl” data across AWS services and apply data transformations (there is a free tier, and I could’ve been much more efficient here… but I wanted to play with the fancy UI, $35)
  4. AWS S3: Simple storage service (“S3”) — where I can keep transformed data in a format that’s interoperable with other systems (free tier up to 5GB, $0)
  5. Four tools that were necessary to get data from AWS into PowerBI: Amazon Athena, the Athena ODBC Driver, the Simba Apache Spark ODBC Driver, and the AWS CLI (see below for details)
  6. PowerBI Desktop: To create visualizations ($0).
  • Amazon Athena: PowerBI only connects to AWS through Redshift or Athena (Athena has a free tier, and you can stay under it with really low query volumes & by using S3 Select where possible, <$1)
  • Simba Apache Spark ODBC Driver: For PowerBI to access your data in Athena, you need a DSN or an Azure AD account. The ODBC Data Source Administrator is the Windows tool where you create DSNs. You can download & use Athena’s ODBC Driver for free; however, you need the Simba Driver in order to create the DSN. (free 21-day trial, $0)
  • AWS CLI: To generate temporary authentication credentials, which allow your DSN to access AWS. You cannot do this in the AWS Management Console. ($0)

Steps

Code, Links, and References, oh my!

  1. GitHub Actions: Follow the How It Works instructions (a sketch of the kind of collection script this automates appears after these steps).
  2. AWS DynamoDB: No action required. I did use “Explore Table Items” to confirm that I was grabbing the correct information (e.g., not double-counting).
  3. AWS Glue [1]: Make sure your region is aligned across AWS services! Otherwise, you won’t be able to view/transfer data.
  4. AWS Glue [2]: Create a crawler that reads your DynamoDB table into the Glue Data Catalog.
  5. AWS Glue [3]: Create an ETL job to transform your data. I used the UI, and this is the resulting PySpark script (a skeleton of the same shape is sketched after these steps). There are some limitations to the UI. Specifically, the Data Preview option (a) runs up your bill and (b) doesn’t preview when your output table exceeds a certain size (2MB!). More importantly, when the Data Preview option says “Data Preview Failure”, the columns in the Transform window won’t update, which prevents you from applying a transform (see image below). Finally, “PARTITION KEY” does not save, even though it’s in the script! You must re-add it every time you run the ETL job.
  6. AWS S3: Create an S3 folder and set it as the “Target” (Output) for your Glue ETL Job. You can save some $ on Athena queries by exploring your data with S3 Select (sketched after these steps). Finally, create a folder for your Amazon Athena queries (to be used in Step 11).
  7. AWS Glue [4]: Create a second crawler that catalogs your S3 data so Athena can query it.
  8. Amazon Athena: I mainly used this to monitor queries when I imported & transformed data in PowerBI.
  9. Amazon Athena ODBC Driver [1]: Install it. You need the Simba Driver in order to continue.
  10. Simba Driver: Install it & follow the instructions to copy the .lic file into the same folder that contains the driver .dll. Once you’ve done this, the Simba Athena driver should automatically appear in the “Drivers” tab of the ODBC Data Source Administrator.
  11. Amazon Athena ODBC Driver [2]: Create a System DSN (a User DSN does not work!). Click Configure. Enter your AWS region (remember, it must match!). Enter your Athena database name under “Schema”. Enter your “S3 Output Location”: this is the second S3 folder you created in Step 6. For the next steps, you’ll first need to generate credentials in AWS CLI.
  12. AWS IAM Management: You need to create a User Access Key & a Role. When you generate the User Access Key, you’ll also receive a Secret Access Key; copy both. The Role must have the following Policies: AdministratorAccess, S3, and Athena. Copy the Role ARN.
  13. AWS CLI: First, install AWS CLI on your computer. Open a Command Prompt and run aws configure. Enter your AWS Access Key ID, AWS Secret Access Key, and Default region name. Again, the region must match across your AWS services! Now you’re ready to create your DSN credentials. These expire based on your “Maximum session duration” settings. To generate the credentials, run aws sts assume-role --role-arn arn:aws:iam::{your AWS account ID}:role/{your AWS Role} --role-session-name "SessionName". This should return your credentials: AccessKeyId, SecretAccessKey, and SessionToken. (A boto3 version of this step is sketched after these steps.)
  14. Amazon Athena ODBC Driver [3]: Select your System DSN. Click “Configure”. Click “Authentication Options”. Select “IAM Credentials” as your Authentication Type. Copy AccessKeyId → User, SecretAccessKey → Password, and SessionToken → Session Token. Click “OK”. To confirm that your DSN is ready, click “Test”. A window should pop up saying “SUCCESS!”.
  15. PowerBI: Open a new file and select “Get Data”. Type “Athena” and click “Connect”. You’ll be prompted to enter a DSN: this is the “System DSN” name you created in Athena ODBC. You’ll also need to enter a SQL query to SELECT data from your Athena table (a quick way to dry-run that query is sketched after these steps). And voila! Note: the “Import” Data Connectivity Mode doesn’t let you use the data without a live ODBC connection to Athena. Your transforms & “Apply to/Refresh visuals” stop working after your ODBC/AWS CLI session times out, and those operations run Athena queries.
Beware of “Data Preview Failure”! When this occurs, the columns in Transform steps like “Change Schema” (pictured) may not be synced with the actual schema. For example, say you added a column in a previous step and want to transform it here; it may not show up if the output was too big at an earlier step. To work around this, apply an aggressive filter at an early step (e.g., to only show a few records), then re-initiate every “Data Preview” until you see the expected columns. After you’re done, go back, remove the aggressive filter, and Run the ETL Job.
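Before wrapping up, here are a few illustrative sketches of the moving parts above, for readers who prefer code to clicks. They are simplified placeholders rather than my exact scripts. First, the collection side (Steps 1–2): a scheduled GitHub Action boils down to a script like the one below, which pulls the repository’s traffic views and writes one item per day into DynamoDB. The repo name, token variable, table name, and key schema here are placeholders, not my actual setup.

```python
# Sketch of the kind of collection script a scheduled GitHub Action might run.
# Placeholders (not from the article): repo "owner/repo", a token in GITHUB_TOKEN,
# and a DynamoDB table "github_traffic" keyed on (repo, date).
import os

import boto3
import requests

GITHUB_API = "https://api.github.com/repos/owner/repo/traffic/views"


def collect_views():
    resp = requests.get(
        GITHUB_API,
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["views"]  # one entry per day: timestamp, count, uniques


def write_to_dynamodb(views):
    table = boto3.resource("dynamodb").Table("github_traffic")
    for day in views:
        # Keying on (repo, date) means re-runs overwrite the same item,
        # which helps avoid the double-counting checked for in Step 2.
        table.put_item(
            Item={
                "repo": "owner/repo",
                "date": day["timestamp"][:10],
                "count": day["count"],
                "uniques": day["uniques"],
            }
        )


if __name__ == "__main__":
    write_to_dynamodb(collect_views())
```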
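Step 5’s ETL job came out of the Glue Studio UI as a PySpark script. The skeleton below shows the general shape of such a job: read the crawled table from the Data Catalog, remap the schema, and write partitioned Parquet to S3. The database, table, bucket, and partition key names are placeholders, not my actual values.

```python
# Skeleton of a Glue Studio-style PySpark job (placeholder names throughout).
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: the table the first crawler created from DynamoDB (Step 4).
source = glue_context.create_dynamic_frame.from_catalog(
    database="github_traffic_db", table_name="github_traffic"
)

# Transform: the UI's "Change Schema" node boils down to an ApplyMapping.
mapped = source.apply_mapping(
    [
        ("repo", "string", "repo", "string"),
        ("date", "string", "traffic_date", "date"),
        ("count", "long", "views", "long"),
        ("uniques", "long", "unique_views", "long"),
    ]
)

# Target: partitioned Parquet in S3 (the partition key that must be re-added in the UI).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/traffic/", "partitionKeys": ["repo"]},
    format="parquet",
)

job.commit()
```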
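For Step 6, S3 Select lets you peek at the Glue job’s output without paying for an Athena query. A minimal boto3 sketch, assuming the job wrote Parquet; the bucket, key, and column names are placeholders.

```python
# Peek at the Glue job's S3 output with S3 Select instead of an Athena query.
# Bucket, key, and column names are placeholders.
import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="your-bucket",
    Key="traffic/part-00000.snappy.parquet",
    ExpressionType="SQL",
    Expression="SELECT s.repo, s.traffic_date, s.views FROM S3Object s LIMIT 20",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream; Records events carry the CSV bytes.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```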
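Step 13’s assume-role command can also be run from Python with boto3, which is handy if you end up regenerating credentials often. The role ARN and session name below are placeholders, and the duration is capped by the role’s “Maximum session duration” setting.

```python
# Generate the temporary DSN credentials from Step 13 with boto3 instead of the CLI.
# Role ARN and session name are placeholders for your own values.
import boto3

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/YourAthenaRole",
    RoleSessionName="powerbi-session",
    DurationSeconds=3600,  # capped by the role's "Maximum session duration"
)["Credentials"]

# These three values go into the DSN's Authentication Options (Step 14).
print("AccessKeyId:    ", creds["AccessKeyId"])
print("SecretAccessKey:", creds["SecretAccessKey"])
print("SessionToken:   ", creds["SessionToken"])
```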
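Finally, for Step 15: every refresh in PowerBI fires an Athena query, so it can pay to dry-run your SELECT statement first, either in the Athena console or with boto3 as sketched below. The database, table, and output location are placeholders; the output location is the second S3 folder from Step 6.

```python
# Dry-run the SELECT you plan to paste into PowerBI (Step 15) against Athena.
# Database, table, and output location are placeholders.
import time

import boto3

athena = boto3.client("athena")

query = "SELECT repo, traffic_date, views, unique_views FROM github_traffic LIMIT 10"

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "github_traffic_db"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)

query_id = execution["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
else:
    print(f"Query ended in state {state}")
```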

There you have it!

It turns out it’s tough to stay within free tiers when you’re on a learning curve. Admittedly, I ran a few trials (and errors) in Glue, and a more experienced user could optimize for costs. However, I learned from the Glue UI and came away with a functional PySpark script that’ll serve me well on my next project!

I hope you learned something new.

And I’d love to learn any tips from your successes & challenges with these tools!

