Inside the AWS re:Invent Session Bot
Today, I posted the code that runs the re:Invent Bot session tracking service. At a high level, the service scrapes the re:Invent session catalog at regular intervals, stores session information in a database, and sends out a Tweet with the session information whenever a session is new or has changed.
The Code
The code for re:Invent Bot is available here: https://github.com/mda590/reinvent_bot
The files used in the ‘live’ implementation of the re:Invent Bot are:
- topic_lister/topic_lister.py
- single_session_topic.py
- utils.py
- config.py
The code works together as follows:
1. The Topic Listing function (topic_lister.py) runs in a Lambda function and uses Python’s requests and beautifulsoup modules to get the topic IDs from the re:Invent session catalog.
Topics are used because they divide the re:Invent catalog into the smallest groups of sessions. There appears to be a bug or a ‘quirk’ in the catalog which stops it from loading new sessions after you’ve scrolled down the session list a certain number of times. I haven’t been able to find an exact pattern, but fetching sessions in the smallest possible batches works best.
2. For each topic identified, the Lambda function makes a ‘run_task’ API call to launch a Fargate task, setting the REINVENT_TOPIC_ID environment variable to tell that task which topic it is responsible for.
3. The Fargate task performs the following steps:
- Starts an instance of headless Chrome and loads the session catalog for a specific topic’s sessions.
- Chrome scrolls through the “Load More Sessions” option as many times as needed until all of the sessions are loaded on the page.
- Chrome passes the HTML for all of the loaded sessions into beautifulsoup.
- Beautifulsoup loops through each of the sessions and checks whether we already have the session information stored:
  - If a session is new, its information is stored in DynamoDB as version 1 of the session and a Tweet is sent with the session information.
  - If a session is found in the database and all of the fields match, nothing happens.
  - If a session is found in the database but some of the fields are different, we store a new version of the session (incrementing the version number) and send a Tweet saying that the session has been updated. We attempt to identify which information changed and list the specific changes in the Tweet.
- After all of the sessions have been looped through, we store an Item in a separate DynamoDB table indicating the time of the invocation, topic ID, and whether the execution was successful or not.
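The new / unchanged / updated decision in step 3 can be sketched roughly as below. This is a minimal illustration and not the bot's actual code: the field names in `TRACKED_FIELDS`, the shape of the session dictionaries, and the function names are all assumptions made for the example, and the DynamoDB reads and writes are left out.

```python
from typing import Optional

# Hypothetical field names; the fields the real bot compares may differ.
TRACKED_FIELDS = ("title", "abstract", "speakers")


def changed_fields(stored: dict, scraped: dict) -> list:
    """Return the tracked fields whose values differ between the two versions."""
    return [f for f in TRACKED_FIELDS if stored.get(f) != scraped.get(f)]


def classify_session(stored: Optional[dict], scraped: dict) -> dict:
    """Decide what to do with one scraped session.

    `stored` is the latest saved version of the session, or None if the
    session has never been seen before.
    """
    if stored is None:
        # New session: store it as version 1 and tweet it.
        return {"action": "new", "version": 1, "changes": []}

    changes = changed_fields(stored, scraped)
    if not changes:
        # Every tracked field matches: nothing to store, nothing to tweet.
        return {"action": "unchanged", "version": stored["version"], "changes": []}

    # Something changed: store a new version and tweet the list of changes.
    return {"action": "updated", "version": stored["version"] + 1, "changes": changes}
```

Returning the list of changed field names alongside the action is what would let the update Tweet say which details changed, rather than only that something did.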
The Infrastructure
The re:Invent Bot service uses Lambda, Fargate, and DynamoDB.
1: Every 10 minutes, CloudWatch Events invokes a Lambda function.
2: The Lambda function scrapes the topic list from the re:Invent session catalog and gets a list of topic IDs.
3: For each topic ID, a Fargate ‘run_task’ call is made, passing in the topic ID as an environment variable.
4: The Fargate task is responsible for scraping the actual sessions for the passed-in topic ID:
4a: Load the re:Invent session catalog session list for the specified topic ID and get the HTML for all of the sessions.
4b: For each session, check within the DynamoDB table whether we already have the session, and if we do, whether it has changed at all.
4c: For new and updated sessions, send out a Tweet via the @reinvent_bot account with the session’s information.
4d: Once complete, put a log into DynamoDB indicating what the session topic ID was, and whether the execution was successful or not.
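As a rough sketch of step 3, the per-topic ‘run_task’ call could look something like the following. The cluster, task definition, container name, and subnet values are placeholders supplied by the caller, the choice of `assignPublicIp` is an assumption, and the helper names are hypothetical; only the `REINVENT_TOPIC_ID` variable comes from the description above. The ECS client is passed in (for example, one created with `boto3.client("ecs")`) so that the request-building logic can be read and tested on its own.

```python
def build_run_task_kwargs(topic_id, cluster, task_definition, container_name, subnets):
    """Build the keyword arguments for one ECS run_task call on Fargate."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                # Assumption: the task needs a public IP to reach the catalog.
                "assignPublicIp": "ENABLED",
            }
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": container_name,
                    # Tell this task which topic it is responsible for.
                    "environment": [
                        {"name": "REINVENT_TOPIC_ID", "value": str(topic_id)}
                    ],
                }
            ]
        },
    }


def launch_topic_tasks(ecs_client, topic_ids, **settings):
    """Start one Fargate task per topic ID and return the API responses."""
    return [
        ecs_client.run_task(**build_run_task_kwargs(topic_id, **settings))
        for topic_id in topic_ids
    ]
```

Environment variable values must be strings, which is why the topic ID is converted with `str()` before being passed to the container override.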