Building a Data Pipeline with Riot Games API and AWS: Gathering and Ingesting Data
--
Unlocking the Power of Gaming Data: A Journey from the Fields of Justice to the Clouds of AWS
Hey there, fellow data nerds and League of Legends enthusiasts! So, have you ever wondered what kind of stories the massive piles of gaming data are hiding? If the answer is a resounding ‘yes’, then you’re in the right place!
Picture this: we’re taking the Riot Games API, pulling out some cool stats about our favorite champions, and then digging even deeper with a bit of web crawling to get information like win rates and ban rates. Pretty neat, right?
But wait, it gets better! We’re not just stopping at gathering data; oh no, we’re going all in, building a data pipeline using the beast that is Amazon Web Services (AWS). Think Firehose for data ingestion, S3 buckets for storage, and a whole arsenal of other AWS tools that we’ll unleash in the coming stages of the project.
This isn’t just about gaming stats; it’s a real-life, hands-on adventure through the AWS ecosystem. We’ll see how the different pieces of the puzzle, like Firehose, Glue, Athena, Redshift, and QuickSight, fit together to make data magic happen.
By the end of this ride, we’ll have a slick data pipeline that transforms raw LoL stats into insights we can use, contributing to the way we understand and strategize in the game. And trust me, it’s going to be an epic journey.
Whether you’re here for the data, the gaming, the AWS, or all of the above, I’ve got you covered. So, strap in, get comfortable, and let’s dive headfirst into the riveting world of gaming data!
1. Setting Up Your Environment
Before we dive deep into the data pool, we’ve got some groundwork to cover. Think of it as setting up base camp before climbing Everest. It’s not the most glamorous part, but it’s crucial to make the upcoming journey smooth sailing. So, grab your favorite brew and let’s get down to it.
Your Local Setup
Whether you’re a Windows fan, a macOS enthusiast, or a Linux loyalist, you’ll need a local dev environment set up. You’ll want Python installed — we’re gonna use it to interact with Riot’s API and do our web crawling. You’ll also need some additional libraries, namely requests for sending HTTP requests and beautifulsoup4 for parsing HTML. Oh, and don’t forget cassiopeia, the Python library for Riot’s API. You might want to set up a virtual environment to keep things tidy, but that’s up to you.
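If you want to follow along exactly, here’s a minimal setup sketch, assuming Python 3 and pip are already installed (the environment name is just an example, and the virtual environment itself is optional):
# Optional: create and activate a virtual environment to keep dependencies isolated
python3 -m venv lol-pipeline-env
source lol-pipeline-env/bin/activate

# Install the libraries used in this project
pip install requests beautifulsoup4 cassiopeia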
Our game plan here is pretty straightforward. We’re going to snag CSV files brimming with juicy data parsed from web crawling and Riot’s API, and these files will chill on your local computer first. Once we’ve got our hands on the data, we’ll ship it off to our EC2 instance.
You might be scratching your head, wondering why we’re taking the scenic route. Why not just haul the data directly using our EC2 instance? Here’s the deal: these CSV files are not behemoths. We’re not lugging mountains of data around, so it’s all good to grab everything locally before sending it up to the cloud.
Now, if we were wrestling with a mammoth app that’s always gobbling up data from an API, sure, we could totally set it up to run directly on the EC2 instance. Heck, we could even go the extra mile and toss it into a Kubernetes pod if we felt like flexing our tech muscles a bit. But let’s be real. Our main squeeze here is AWS and its stellar services, so there’s no need to make things more complex than they need to be. Let’s keep the bells and whistles to a minimum and stick to the basics, shall we?
AWS: Our Playground
Next stop, AWS. We’ll be using an EC2 instance here — basically, a virtual server in Amazon’s cloud. If you’re new to AWS, you’ll need to create an account first. Don’t worry, there’s a free tier you can use, so you won’t have to break the bank for this project.
Once you’ve got your account set up, head over to the AWS Management Console and click on “Services”, then “EC2”. Click the “Launch Instance” button and follow the wizard. You’ll need to choose an Amazon Machine Image (AMI) — think of this as the blueprint for your instance. You’ll also select an instance type, configure the instance details, add storage, and finally, review and launch. Boom, your EC2 instance is up and running!
But we’re not done yet. We also need to create an S3 bucket. This is where we’ll store the CSV files generated from our data. To do this, head back to the AWS Management Console, click on “Services”, and then “S3”. Click “Create bucket” and follow the prompts. Make sure to keep the bucket private to ensure your data is secure.
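If you prefer the command line, here’s a rough AWS CLI equivalent (the bucket name and region are placeholders; bucket names must be globally unique, so pick your own):
# Create the bucket (replace the name and region with your own)
aws s3 mb s3://my-lol-data-bucket --region eu-north-1

# Block all public access to keep the bucket private
aws s3api put-public-access-block \
  --bucket my-lol-data-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true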
One more thing to note: make sure to download the key pair file (.pem) during the setup process. You’ll need this to connect to your instance.
And there you have it — the nuts and bolts of our setup. With our local environment and EC2 instance ready to go, we’re all set.
2. Gathering Data
Now that we’ve got our local and cloud environments all set, it’s time to dive headfirst into the data pool. This is where our Python skills come into play.
The Power of Cassiopeia
Before we go any further, there’s an important step you need to take if you’re following along. To access Riot’s API, you need an API key. To get that, you’ll have to head over to the Riot Developer Portal, sign up, and request an API key. Once you have your key, keep it handy. We’ll be using it shortly.
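The script below expects the key to live in a plain text file called api_key.txt in the same directory as the script. One quick way to create it (the key value shown is obviously a placeholder for your own key):
echo "RGAPI-your-key-goes-here" > api_key.txt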
Now, let’s start by diving into the code that we’ll use to access Riot’s API through cassiopeia. Essentially, we’re using cassiopeia to tap into the treasure trove of data that the API offers. Here is the code to gather data using the cassiopeia API:
import csv
import logging
from typing import Dict, List, Tuple

from cassiopeia import Champions
from cassiopeia import configuration


def get_api_key(filepath: str) -> str:
    '''Read the API key from a file.'''
    try:
        with open(filepath, 'r') as file:
            api_key = file.read().strip()
        return api_key
    except FileNotFoundError as e:
        logging.error(f'Error opening API key file: {e}')
        return None


def configure_cassiopeia(api_key: str):
    '''Configure cassiopeia with the API key.'''
    configuration.settings.set_riot_api_key(api_key)


def get_champion_data(champions: Champions) -> Tuple[List[Dict], List[Dict]]:
    '''Get core data and stats for all champions.'''
    champion_core_data = []
    champion_stats = []
    for champion in champions:
        core_data = {
            'champion_id': champion.id,
            'name': champion.name,
            'title': champion.title,
            'difficulty': champion.info.difficulty,
        }
        champion_core_data.append(core_data)
        stats_data = {
            'champion_id': champion.id,
            'armor': champion.stats.armor,
            'armor_per_level': champion.stats.armor_per_level,
            'attack_damage': champion.stats.attack_damage,
            'attack_damage_per_level': champion.stats.attack_damage_per_level,
            'attack_range': champion.stats.attack_range,
            'attack_speed': champion.stats.attack_speed,
            'critical_strike_chance': champion.stats.critical_strike_chance,
            'critical_strike_chance_per_level': champion.stats.critical_strike_chance_per_level,
            'health': champion.stats.health,
            'health_per_level': champion.stats.health_per_level,
            'health_regen': champion.stats.health_regen,
            'health_regen_per_level': champion.stats.health_regen_per_level,
            'magic_resist': champion.stats.magic_resist,
            'magic_resist_per_level': champion.stats.magic_resist_per_level,
            'mana': champion.stats.mana,
            'mana_per_level': champion.stats.mana_per_level,
            'mana_regen': champion.stats.mana_regen,
            'mana_regen_per_level': champion.stats.mana_regen_per_level,
            'movespeed': champion.stats.movespeed,
            'percent_attack_speed_per_level': champion.stats.percent_attack_speed_per_level
        }
        champion_stats.append(stats_data)
    return champion_core_data, champion_stats


def write_to_csv(data: List[Dict], fields: List[str], filename: str) -> None:
    '''Write data to a CSV file.'''
    try:
        with open(filename, 'w', newline='') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fields)
            writer.writeheader()
            writer.writerows(data)
    except Exception as e:
        logging.error(f'Error writing to CSV file: {e}')


def main():
    logging.basicConfig(level=logging.INFO)
    api_key = get_api_key('api_key.txt')
    if not api_key:
        logging.error('No API key found. Exiting...')
        return
    configure_cassiopeia(api_key)
    champions = Champions(region='EUW')

    # Get champions data and write to CSV
    champion_core_data, champion_stats_data = get_champion_data(champions)
    if not champion_core_data:
        logging.error('No champion core data found. Exiting...')
        return
    fields_core = list(champion_core_data[0].keys())
    write_to_csv(data=champion_core_data, fields=fields_core, filename='champion_core.csv')

    if not champion_stats_data:
        logging.error('No champion stats data found. Exiting...')
        return
    fields_stats = list(champion_stats_data[0].keys())
    write_to_csv(data=champion_stats_data, fields=fields_stats, filename='champion_stats.csv')


if __name__ == '__main__':
    main()
First up, we need to provide cassiopeia with our API key. The get_api_key function opens a file where we've stored our API key and reads it. If there's an error, it logs the issue and returns None.
Once we have our key, we feed it into the configure_cassiopeia function, which sets up cassiopeia with our API key.
Now comes the fun part — fetching the data. We have a single function, get_champion_data, which receives the list of champions and returns both core information and stats for them. The core information includes their id, name, title, and difficulty. The stats go deeper, pulling out a heap of statistics for each champion, like armor, attack damage, health, and more.
The function loops over all the champions in the game, building two dictionaries per champion: one with the core data and one with the stats. These dictionaries are appended to their respective lists, giving us two lists of dictionaries, one entry per champion, each filled with useful data.
The last piece of the puzzle is writing this data to CSV files. The write_to_csv function does just this, taking our list of dictionaries and a list of field names, and writing everything to a CSV file. If there's an issue, it logs an error message.
Finally, we have the main function. This is where everything comes together. It starts by fetching our API key and configuring cassiopeia. Then it calls our data-fetching function to get our data, checks to make sure we got something, and finally writes everything to CSV files.
And that’s it! We’ve now got a neat script that uses cassiopeia to pull tons of useful data from Riot's API, storing it in CSV files ready for further analysis.
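Assuming you saved the script as, say, riot_champion_data.py (the filename is up to you), running it and sanity-checking the output looks like this:
python riot_champion_data.py

# Peek at the first rows of the generated files
head -n 3 champion_core.csv
head -n 3 champion_stats.csv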
Web Crawling: The Other Half of the Data Story
While the API gives us some amazing data, we’ll need to supplement it with a dash of web crawling to get a more comprehensive view of the game’s stats. This is where the beautifulsoup4 library comes into play.
Running our second Python script will send our web crawler on a mission to pull out additional stats and details from various websites and pages. The aim here is to dig out the data gems that aren’t available through Riot’s API.
Before we dive into the code, let’s get some context about what we’re trying to do. We are interested in scraping additional stats from a website called u.gg. This website provides a wealth of information on individual champions, such as their roles, matches played, win rate, pick rate, and ban rate. For each champion, we’re going to visit their individual page on this website and extract this information.
Alright, enough talk, let’s see some action!
import csv
import logging
import os
import time

import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)


def get_names_from_csv(file_path):
    '''Get the list of champion names from the existing CSV file. If the file doesn't exist, return an empty list.'''
    if os.path.exists(file_path):
        names = []
        with open(file_path, 'r') as file:
            reader = csv.reader(file)
            next(reader, None)  # skip the headers
            for row in reader:
                names.append(row[0])
        return names
    else:
        return []


def fetch_and_parse_page(url):
    '''Request a page and return its parsed HTML.'''
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup


def fetch_champion_data(champion, base_url):
    '''Scrape role, matches, win rate, pick rate, and ban rate from a champion's build page.'''
    # Visit the champion's individual page
    champion_url = base_url + champion.lower() + "/" + "build"
    champion_soup = fetch_and_parse_page(champion_url)
    try:
        role_value = champion_soup.find('span', class_='champion-title').text.strip().split('Build for ')[1].split(', ')[0]
        matches_value = champion_soup.find('div', class_='matches').find('div', class_='value').text.strip()
        win_rate_value = champion_soup.find('div', class_='win-rate').find('div', class_='value').text.strip()
        pick_rate_value = champion_soup.find('div', class_='pick-rate').find('div', class_='value').text.strip()
        ban_rate_value = champion_soup.find('div', class_='ban-rate').find('div', class_='value').text.strip()
    except Exception as e:
        logging.error(f'Error fetching data for {champion}: {str(e)}')
        return None
    logging.info(f'{champion} done.')
    time.sleep(2)  # Be polite to the server; consider making this configurable
    return [champion, role_value, matches_value, win_rate_value, pick_rate_value, ban_rate_value]


def main():
    names = get_names_from_csv('champions_stats.csv')
    base_url = "https://u.gg/lol/champions/"
    soup = fetch_and_parse_page(base_url)

    # Assuming that the champions are listed on the overview page
    champions_list = soup.find_all('div', class_='champion-name')

    # Write to CSV, appending to the results of any previous run
    write_header = not os.path.exists('champions_stats.csv')
    with open('champions_stats.csv', 'a', newline='') as file:
        writer = csv.writer(file)
        if write_header:
            writer.writerow(["Name", "Role", "Matches", "Win Rate", "Pick Rate", "Ban Rate"])  # Header
        for champion in champions_list:
            champion_name = champion.text.strip()
            if champion_name not in names:
                champion_data = fetch_champion_data(champion_name, base_url)
                if champion_data:
                    writer.writerow(champion_data)


if __name__ == '__main__':
    main()
Let’s break down how the code works:
- Import necessary libraries: We import the necessary libraries, including csv, os, requests, time, BeautifulSoup from bs4, and logging. The logging library is included to monitor the progress of our script and catch any errors that may occur.
- get_names_from_csv function: This function checks if there is already an existing ‘champions_stats.csv’ file. If it exists, it reads the file and collects all the champion names already fetched and stored. If not, it returns an empty list, which indicates that we have no previously fetched champions.
- fetch_and_parse_page function: This function takes a URL as an argument, makes a request to the website, and parses the HTML using BeautifulSoup.
- fetch_champion_data function: This function fetches and parses the champion’s individual page to collect specific data such as role, matches, win rate, pick rate, and ban rate.
- main function: Here’s where the main operation occurs. First, we get a list of champion names we already have from the CSV file. We fetch and parse the page containing all the champions, and then, for each champion not already in our CSV, we fetch their individual data and add it to our CSV file.
In the main function, we added a new line to check whether the CSV file exists before writing to it. If it does not exist, we write the header first before appending the champion data.
The benefit of this approach is that we are being considerate to the server we are scraping from by not sending too many requests at once, which can sometimes lead to being blocked. Also, if the script stops or breaks due to internet connection issues or the website going down, we can always rerun the script, and it will pick up where it left off instead of starting from the beginning.
Moving the Data Upstream
Once our data collection scripts have finished running and we have our CSV files ready, the next step is to move our data upstream, in this case, to an Amazon EC2 instance. EC2 is a web service from Amazon that provides resizable compute capacity in the cloud, making web-scale cloud computing easier for developers.
Before we start the data transfer, it’s essential to ensure that the EC2 instance is up and running. You can do this by logging into your AWS account and navigating to the EC2 management console. Here, you should be able to see the status of your instances. If your instance is stopped, you can simply click ‘Actions’ and then ‘Instance State’ to start it up again.
After confirming that your EC2 instance is running, we can use the scp command to copy files from our local machine to the EC2 instance. The scp (secure copy) command uses SSH to transfer data, so it requires a set of SSH credentials, specifically the .pem key file you get when you create your EC2 instance.
Here’s an example of how to use scp to copy all CSV files from your local directory to your EC2 instance:
scp -i /path/your-key-pair.pem /path_to_csv_files/*.csv ec2-user@your_ec2_public_ip:/home/ec2-user/
Let’s break this command down:
- -i /path/your-key-pair.pem: This specifies the path to the .pem key file you received when you set up your EC2 instance. You need this to authenticate your connection.
- /path_to_csv_files/*.csv: This represents the local file(s) you want to copy to the EC2 instance. The *.csv part is a wildcard that matches any CSV file in the specified directory.
- ec2-user@your_ec2_public_ip: This part is the username and the public IPv4 address of your EC2 instance. The username (ec2-user in this case) may vary based on the AMI that you have used for your instance.
- :/home/ec2-user/: After the colon, you specify the destination directory on the EC2 instance.
After running this command, all CSV files in your specified local directory will be securely copied to the specified directory on your EC2 instance. To verify that the files have been successfully transferred, you can SSH into your EC2 instance and navigate to the /home/ec2-user/ directory where the CSV files have been copied to.
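For example, a quick check over SSH might look like this (the key path and IP are the same placeholders as above):
ssh -i /path/your-key-pair.pem ec2-user@your_ec2_public_ip
ls -lh /home/ec2-user/*.csv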
Alternatively, you can connect to your EC2 instance directly from the AWS Management Console. Just navigate to your EC2 dashboard, select the instance you want to connect to, and then click on the ‘Connect’ button. A pop-up window will guide you through the necessary steps to establish a connection using EC2 Instance Connect.
If everything went according to plan, you should now see the three CSV files (champion_core.csv, champion_stats.csv, and champions_stats.csv) sitting in your EC2 instance’s home directory.
3. Understanding the AWS Kinesis Data Firehose
Meet AWS Firehose, or its full name — Amazon Kinesis Data Firehose. It’s like the express delivery service for your streaming data, getting it where it needs to go in near real-time. Whether it’s Amazon S3, Redshift, Elasticsearch Service, or Splunk — Firehose has got you covered. The best part? It automatically scales to meet your data’s size, so you don’t have to lift a finger.
So what’s Firehose’s role in our data journey? It’s going to be our trusty conveyor belt, chugging along the CSV files filled with our game data, and dropping them off into our S3 bucket. Here, our data can kick back, relax, and await further analysis.
Our task? Setting up three delivery streams — one for each CSV file. Think of a delivery stream as Firehose’s backbone, built to ingest, transform, and dispatch data to its destination.
But before we roll out these delivery streams, we’ve got a bit of housekeeping to do in our S3 bucket. We’re going to create three separate folders, one for each CSV file. This way, each data set gets its own private suite, making it a breeze for us to manage and analyze them later.
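You can create these folders from the S3 console or, if you like the CLI, by putting zero-byte objects with a trailing slash (the bucket and folder names are placeholders; match them to your own setup):
aws s3api put-object --bucket my-lol-data-bucket --key champion_core/
aws s3api put-object --bucket my-lol-data-bucket --key champion_stats/
aws s3api put-object --bucket my-lol-data-bucket --key champions_stats/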
To create a delivery stream, navigate to the Kinesis Firehose service in the AWS Management Console and click on “Create delivery stream”. Here, you will need to configure the source and destination for your delivery stream.
For the source, select ‘Direct PUT or other sources’. This option will allow us to use Kinesis Agent, installed on our EC2 instance, to push data directly into Firehose. For the destination, choose ‘Amazon S3’.
You’ll be prompted to select your S3 bucket and to input an S3 prefix; this is where the directory structure we set up earlier comes in.
This will help organize our data chronologically in the S3 bucket, facilitating future analysis. Remember to use the folder for the respective CSV file you’re ingesting.
If there are any issues with data delivery, Firehose supports error logging: set the ‘S3 bucket error output prefix’ to a dedicated folder, and Firehose will write detailed error logs there to aid in troubleshooting.
Lastly, we’ll set our buffer interval to 60 seconds. This configuration determines how long Firehose buffers incoming data before delivering it to S3. Both the buffer size and the buffer interval can be adjusted based on the speed and volume of incoming data.
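For reference, here’s roughly what one of these delivery streams looks like when created with the AWS CLI. This is a sketch only: the role ARN, account ID, and bucket name are placeholders (the Firehose console can create the delivery role for you).
aws firehose create-delivery-stream \
  --delivery-stream-name ChampionCoreData \
  --delivery-stream-type DirectPut \
  --extended-s3-destination-configuration '{
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    "BucketARN": "arn:aws:s3:::my-lol-data-bucket",
    "Prefix": "champion_core/",
    "ErrorOutputPrefix": "errors/champion_core/",
    "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60}
  }'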
After creating and configuring the delivery streams, Firehose will automatically ingest our game data and deliver it to our S3 bucket. This fully managed, scalable, and automated process reduces much of the heavy lifting, allowing us to focus on extracting insights from our data.
After creating and configuring the delivery streams, we’re ready for the next important step — setting up the Kinesis Agent on our EC2 instance.
The Kinesis Agent is a stand-alone Java software application that offers an efficient way to collect and send data to Firehose. By installing and configuring this agent on our EC2 instance, we can facilitate the ingestion of our CSV data into the Firehose delivery streams we’ve just created.
4. Data Ingestion
At this point, we’ve set up our AWS resources, our local environment, and prepared our EC2 instance, as well as the Firehose delivery streams. Now, it’s time to get our data flowing. We’ll do this by configuring the Kinesis Agent on our EC2 instance. The agent is a powerful tool that enables secure and reliable data transfer to AWS Kinesis services, in our case, Firehose. Let’s get to it!
Configuring the Kinesis Agent
After you’ve connected to your EC2 instance (either via SSH or the AWS Management Console), install the Kinesis Agent. You can do this by running the following command:
sudo yum install -y aws-kinesis-agent
Once installed, we need to configure the agent. It uses a configuration file (agent.json) to understand where your source files are and which Kinesis Firehose delivery stream to put them into.
First, let’s create a new folder and move our files there:
sudo mkdir /var/log/league_data
sudo mv /home/ec2-user/*.csv /var/log/league_data/
Open the Kinesis Agent configuration file in a text editor:
sudo vim /etc/aws-kinesis/agent.json
The configuration file needs to be set up to watch the directory where we’ll be placing our CSV files, and to forward them to the corresponding Firehose delivery stream. The file will look something like this:
{
  "cloudwatch.emitMetrics": true,
  "kinesis.endpoint": "",
  "firehose.endpoint": "firehose.eu-north-1.amazonaws.com",
  "flows": [
    {
      "filePattern": "/var/log/league_data/champion_core.csv",
      "deliveryStream": "ChampionCoreData",
      "initialPosition": "START_OF_FILE",
      "dataProcessingOptions": [
        {
          "optionName": "CSVTOJSON",
          "customFieldNames": ["champion_id", "name", "title", "difficulty"]
        }
      ]
    },
    {
      "filePattern": "/var/log/league_data/champions_stats.csv",
      "deliveryStream": "ChampionStatisticsData",
      "initialPosition": "START_OF_FILE",
      "dataProcessingOptions": [
        {
          "optionName": "CSVTOJSON",
          "customFieldNames": ["name", "role", "matches", "Win Rate", "Pick Rate", "Ban Rate"]
        }
      ]
    },
    {
      "filePattern": "/var/log/league_data/champion_stats.csv",
      "deliveryStream": "ChampionGameStatsData",
      "initialPosition": "START_OF_FILE",
      "dataProcessingOptions": [
        {
          "optionName": "CSVTOJSON",
          "customFieldNames": ["champion_id", "armor", "armor_per_level", "attack_damage", "attack_damage_per_level", "attack_range", "attack_speed", "critical_strike_chance", "critical_strike_chance_per_level", "health", "health_per_level", "health_regen", "health_regen_per_level", "magic_resist", "magic_resist_per_level", "mana", "mana_per_level", "mana_regen", "mana_regen_per_level", "movespeed", "percent_attack_speed_per_level"]
        }
      ]
    }
  ]
}
Let’s break down the important parts of this configuration:
"firehose.endpoint"
: Here, you need to specify your Firehose endpoint. Remember to replace "eu-north-1" with your own AWS region."flows"
: This array contains configuration for each CSV file and its associated Firehose delivery stream. You'll define the source file path, the target delivery stream, and data processing options."filePattern"
: This is the file path on your EC2 instance where the agent will monitor for new data."deliveryStream"
: This should be the name of the delivery stream to which the data will be sent."initialPosition"
: This is the position in the file where the agent will start reading."dataProcessingOptions"
: Here, we are transforming the CSV data into JSON and specifying our CSV column names.
Once you’ve entered your configuration, save and close the file. If you’re not used to vim, just press Esc, type :wq! and hit Enter.
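Before moving on, it’s worth making sure the JSON is still valid after your edits. A quick check, assuming Python 3 is available on the instance (it is on recent Amazon Linux AMIs):
sudo python3 -m json.tool /etc/aws-kinesis/agent.json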
After finishing the Kinesis Agent configuration, it’s crucial to make sure that the CSV files have the correct permissions set for the Kinesis Agent to read them.
The Kinesis Agent runs as the aws-kinesis-agent-user and therefore needs read access to the files it is set to monitor. To provide these permissions, you can use the chown and chmod commands.
To change the owner of your CSV files to the aws-kinesis-agent-user, navigate to the directory containing your files and use the chown command:
sudo chown aws-kinesis-agent-user:aws-kinesis-agent-user /path_to_your_csv_files/*.csv
Next, use the chmod command to ensure the Kinesis Agent has read permissions:
sudo chmod 644 /path_to_your_csv_files/*.csv
Also, it’s essential to ensure that the aws-kinesis-agent-user has the necessary permissions to traverse into the directory containing the files. The aws-kinesis-agent-user needs to have execute permissions for all parent directories. You can set these permissions using:
sudo chmod o+x /path_to_your_parent_directory
In the above command, replace /path_to_your_parent_directory with the path of the directory containing your CSV files.
This should set the permissions correctly for the Kinesis Agent to read your CSV files. Now the agent will be able to monitor these files and send them to your Firehose delivery streams as intended.
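In our case, /path_to_your_csv_files is /var/log/league_data, so a quick verification pass might look like this:
# Confirm the directory is traversable and the files are readable
ls -ld /var/log/league_data
ls -l /var/log/league_data/*.csv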
Before we start the Kinesis Agent, we must ensure that the EC2 instance has the appropriate permissions to push data to Kinesis. This is done by creating an IAM role, assigning the necessary permissions, and attaching it to the EC2 instance.
But before we jump in, let’s take a step back and look at why we’re customizing the policy. Ever heard of the principle of least privilege? Well, it’s a fancy term for giving your services just the right amount of access they need to do their job and not a smidge more. It’s a best practice in the security world, and that’s what we’re adhering to.
Let’s kick things off by crafting our custom policy. The policy will look like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FirehoseAccess",
      "Effect": "Allow",
      "Action": [
        "firehose:PutRecord",
        "firehose:PutRecordBatch"
      ],
      "Resource": [
        "arn:aws:firehose:eu-north-1:343860449002:deliverystream/ChampionCoreData",
        "arn:aws:firehose:eu-north-1:343860649002:deliverystream/ChampionGameStatsData",
        "arn:aws:firehose:eu-north-1:343860149002:deliverystream/ChampionStatisticsData"
      ]
    }
  ]
}
This custom policy is doing us a favor by allowing our EC2 instance to put records into our Firehose streams, but nothing more, nothing less. Keep in mind that you need to replace the ARN values with your own.
Next, we’re going to attach this custom policy to an IAM Role. So, head over to the IAM dashboard on your AWS console and find ‘Roles’ in the left sidebar, then click on ‘Create Role’. When choosing the ‘type of trusted entity’, go for ‘AWS service’, then ‘EC2’, and hit ‘Next: Permissions’.
Now, we’re in the ‘Attach permissions policy’ page. Here, click on ‘Create policy’, paste the JSON above, review, and create the policy. You should be able to find your new policy by searching for its name. Select it and click ‘Next: Tags’. If you’re in a tagging mood, go ahead and add some. They’re like little post-its to help you remember what’s what. After adding your tags, click ‘Next: Review’.
In the final step, give your role a name and description that’s easy to remember, and click ‘Create role’.
Finally, it’s time to bring our EC2 instance into the mix. Jump over to the EC2 dashboard, select your instance, and click on ‘Actions’. In the dropdown that appears, go to ‘Security’, and then ‘Modify IAM role’. In the ‘IAM role’ dropdown, select the role you’ve just created and then click ‘Apply’.
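If you’d rather script this part, a rough CLI equivalent of the console steps above might look like the sketch below. The role, policy, profile, and file names are placeholders: firehose-policy.json is the JSON policy from above saved locally, and ec2-trust-policy.json is a standard trust policy that lets ec2.amazonaws.com assume the role.
# Create the custom policy and an EC2 role that trusts the EC2 service
aws iam create-policy --policy-name FirehosePutAccess --policy-document file://firehose-policy.json
aws iam create-role --role-name ec2-firehose-role --assume-role-policy-document file://ec2-trust-policy.json
aws iam attach-role-policy --role-name ec2-firehose-role \
  --policy-arn arn:aws:iam::123456789012:policy/FirehosePutAccess

# Wrap the role in an instance profile and attach it to the instance
aws iam create-instance-profile --instance-profile-name ec2-firehose-profile
aws iam add-role-to-instance-profile --instance-profile-name ec2-firehose-profile --role-name ec2-firehose-role
aws ec2 associate-iam-instance-profile --instance-id i-0123456789abcdef0 \
  --iam-instance-profile Name=ec2-firehose-profile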
And that’s it! You’ve just secured your Firehose streams using the principle of least privilege.
Starting the Agent Service
With the Kinesis Agent configured, the last step is to start it. Here’s how:
Run the following command to start the Kinesis Agent:
sudo service aws-kinesis-agent start
You can check the status of the Kinesis Agent at any time with this command:
sudo service aws-kinesis-agent status
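If you want the agent to come back up automatically after a reboot, you can also enable it at boot. On Amazon Linux this is typically done with chkconfig or systemctl, depending on your AMI:
sudo chkconfig aws-kinesis-agent on
# or, on systemd-based images:
sudo systemctl enable aws-kinesis-agent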
To keep tabs on the real-time activity of the service, you can use a helpful command that pulls information straight from the service logs. Here’s how you can do it:
tail -f /var/log/aws-kinesis-agent/aws-kinesis-agent.log
This command opens up a live view of the Kinesis Agent log file. It will provide you with an ongoing stream of log updates, which will allow you to track the Agent’s operations as they happen: the number of records it has processed, the data it has successfully transmitted, the uptime, and more.
Well, we’re dealing with technology here, folks! As much as we’d like everything to be smooth sailing, we might encounter a few hiccups on the journey. No worries, though. Let’s dive into some common roadblocks you might bump into and how to steer around them.
The Case of the Non-Parsing Agent
- Could it be that your agent.json config isn't exactly pointing the agent to the right spots? Check your file paths and names. This sneaky little issue often causes unparsed records.
- Double-check your CSV files. Do they look clean and well-structured? If your data resembles a Picasso painting more than a neat table, your agent might throw its hands up and refuse to parse.
The Silent Agent
- Is your agent giving you the silent treatment? Not sending data to your delivery stream? It might be a case of “no entry”. Make sure your EC2 instance is flashing the right ID card — that is, it has the correct permissions via your IAM roles.
- And hey, never forget your agent is a talker, logging its thoughts and actions at /var/log/aws-kinesis-agent/aws-kinesis-agent.log. Dive into the logs for insights into what's going wrong.
The Empty Bucket Syndrome
- Got your data delivery up and running, but your S3 bucket is still empty? Do a quick double-check (the CLI one-liner after this list helps here). Ensure your delivery stream and S3 bucket are on the same page and are ready to tango together.
- Also, remember, patience is a virtue. Your data might take a leisurely stroll from your agent to the S3 bucket, depending on your buffer settings and data volume.
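A quick way to peek into the bucket without clicking through the console, assuming the bucket and folder names from earlier:
aws s3 ls s3://my-lol-data-bucket/champion_core/ --recursive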
Keep in mind, AWS isn’t one to keep secrets. If something’s wrong, you’ll find clues in the logs. And remember, in the world of data pipelines, resilience is key. Hang in there, keep refining, and you’ll get your pipeline humming in no time!
Conclusion
Alright folks, if you’ve followed along and everything has gone according to plan, your precious data should now be nestled safely inside your S3 bucket. You’ve done it! You’ve set up a reliable pipeline to transfer your data from local CSV files, through the EC2 instance and Firehose, and finally into S3. It’s like you’ve built a highway for your data, ensuring a smooth and secure journey to its destination.
But hold on to your hats, because the ride isn’t over yet. Now that our data is securely stored, our next challenge is to transform this raw information into valuable insights — the gold in our data lake. In our upcoming articles, we’re going to continue this thrilling data journey. We’ll dive into how to refine and polish this raw data, to help you discover valuable nuggets of knowledge hidden within.
So stay tuned, keep exploring, and remember: every step you take in understanding your data is a step towards making more informed, insightful decisions. Until next time, happy data journeying!