Automated Shutdown of Amazon SageMaker Studio Applications

Nick Gardner
My Journey Towards Data Science
6 min read · Jun 21, 2022
£ $ Time is money friend! £$

Problem Statement & Article Intention

This article addresses a gap in Amazon SageMaker: it has no native functionality to automatically retire inactive Studio applications. Manually shutting down applications for a large team soon becomes tiresome.

There are many solutions for terminating inactive SageMaker Studio applications, like this one by Sofian Hamiti, an AWS Solutions Architect. That said, few of them retire resources after a given period of inactivity while taking details such as the resource type and purpose into account. For example, it is inconvenient to shut down every user’s Jupyter server after an extended period of inactivity, e.g. overnight. In that scenario each user has to wait for a new Jupyter server to be allocated the next morning, and the costs associated with the Jupyter server instance are negligible to begin with. A better solution is to ‘skip over’ these instances while still shutting down inactive applications of other types.

This solution closes down inactive Amazon SageMaker applications via a Lambda function triggered by EventBridge. It shuts down every inactive application that is not a Jupyter server after approximately 8 hours of inactivity. Amazon EventBridge runs the script every 30 minutes to check whether the applications are still active.

This article assumes an understanding of Amazon SageMaker, AWS Lambda and Amazon EventBridge. If you have not already done so, please consider familiarizing yourself with the boto3 API, as this will aid understanding of the Python part of the solution.

Solution Architecture

Architecture

The architecture is extremely simple, consisting of three parts:

  1. Amazon EventBridge: An event-driven scheduler
  2. AWS Lambda: Serverless place to run scripts
  3. Amazon SageMaker: A collection of Data Science services

In summary, Amazon EventBridge triggers an ‘event’ every 30 minutes, as specified by a cron expression. AWS Lambda then runs a script that uses boto3 to fetch SageMaker app data (including statuses) and the CloudWatch logs for those applications. The script interprets both and shuts down any idle app that has had no log activity for more than 8 hours. Simple, right? OK, let me show you how to set this up and provide a working script.
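The shutdown rule described above can be sketched as a small pure function. This is only an illustration under stated assumptions: the name `should_shut_down` is hypothetical, the status and app type strings are the values the SageMaker API reports (`InService`, `JupyterServer`), and keeping an app alive when no log activity is found at all is a conservative choice made for this sketch, not something taken from the original script.

```python
from datetime import datetime, timedelta, timezone

IDLE_THRESHOLD = timedelta(hours=8)     # inactivity window used in this article
SKIPPED_APP_TYPES = {"JupyterServer"}   # app types that are never shut down


def should_shut_down(app_type, status, last_log_time, now):
    """Return True when an app is running, not skipped, and idle for too long."""
    if app_type in SKIPPED_APP_TYPES:
        return False
    if status != "InService":
        return False
    if last_log_time is None:
        # Assumption: with no log activity to go on, leave the app running.
        return False
    return now - last_log_time > IDLE_THRESHOLD


now = datetime.now(timezone.utc)
print(should_shut_down("KernelGateway", "InService", now - timedelta(hours=9), now))
```

Keeping the decision separate from the AWS calls makes it easy to check the rule locally before anything is deployed.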

Step 1) AWS Lambda

First, set up the Lambda function. Go to AWS Lambda within your console and select ‘Create Function’.

Select ‘Create Function’

Select ‘Author from scratch’ and the following settings within ‘Basic Information’. You may need to configure ‘Advanced Settings’; however, I cannot illustrate exactly what to implement here for security reasons. If you are operating within an organisation’s VPC, you may need to work with your organisation’s administrator to understand the VPC’s configuration. The VPC is where your Lambda function will run, so it requires adequate access controls. To attach a VPC to your Lambda function, select Advanced settings → tick ‘Enable VPC’ and select the VPC from the dropdown.

Basic Settings — Lambda Function
Advanced Settings — Adding a VPC

Once you have added your VPC, you will need to select the subnets and security groups within that VPC for your Lambda function. This is so your Lambda function can communicate with other AWS services. If your Lambda function cannot communicate with other services:

  1. Check permissions of the Lambda function
  2. Check the security group’s inbound and outbound rules/protocols (see VPC within the services menu). Ensure ports required are open.

Note that if timeout errors occur, they can be resolved within the Configuration tab → General configuration → Edit → Basic settings → adjust the ‘Timeout’ field → select Save. If you have permission restrictions, your Lambda function’s role may require additional rights to access SageMaker and CloudWatch. Check out ‘Permissions’ within the Configuration tab to investigate this.
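As a rough guide to those additional role rights, the policy document below lists the IAM actions that a script of this kind would plausibly need. It is an assumption based on the API calls the article describes (listing and deleting Studio apps, reading CloudWatch log data), not the policy used by the original function, and `"Resource": "*"` is used only to keep the sketch short; narrow it in practice.

```python
import json

# Hypothetical least-privilege starting point for the Lambda execution role.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Read Studio app metadata and shut apps down.
            "Effect": "Allow",
            "Action": ["sagemaker:ListApps", "sagemaker:DeleteApp"],
            "Resource": "*",
        },
        {
            # Read log stream metadata and events for the Studio log group.
            "Effect": "Allow",
            "Action": ["logs:DescribeLogStreams", "logs:FilterLogEvents"],
            "Resource": "*",
        },
    ],
}

# Print the JSON that can be pasted into the IAM policy editor.
print(json.dumps(policy_document, indent=2))
```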

Assuming the creation of a Lambda function with correct permissions and timeout settings, lambda_function.py should contain something similar to the following:

As mentioned above, this code creates a SageMaker client using boto3, which allows you to interpret SageMaker application data. There is a function to get active SageMaker apps and a function to get CloudWatch logs from the /aws/sagemaker/studio log group. There is also a function to delete the applications that require deletion. lambda_handler is the ‘main’ function that is triggered by EventBridge.

Side note: one might ask “why not just use the SageMaker API?”. Well, during development I noticed a counterintuitive implementation by Amazon. Regarding the SageMaker API, Amazon’s documentation states:
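For readers without access to the original script, here is a minimal sketch of the structure just described. It is not the author's code. The helper names are hypothetical, and it assumes that log streams in `/aws/sagemaker/studio` are named `<domain id>/<user profile>/<app type>/<app name>`, that apps in shared spaces (which have no user profile) can be ignored, and that an app with no log streams should be left running. The boto3 calls used are `list_apps`, `delete_app` and `describe_log_streams`.

```python
from datetime import datetime, timedelta, timezone

LOG_GROUP = "/aws/sagemaker/studio"
IDLE_THRESHOLD = timedelta(hours=8)
SKIPPED_APP_TYPES = {"JupyterServer"}


def get_in_service_apps(sagemaker):
    """Return running Studio apps that belong to a user and are not skipped."""
    apps = []
    for page in sagemaker.get_paginator("list_apps").paginate():
        for app in page["Apps"]:
            if "UserProfileName" not in app:
                continue  # shared-space apps are out of scope for this sketch
            if app["Status"] == "InService" and app["AppType"] not in SKIPPED_APP_TYPES:
                apps.append(app)
    return apps


def get_last_log_time(logs, app):
    """Return the newest log event time for an app, or None if none is found."""
    # Assumed stream naming: <domain id>/<user profile>/<app type>/<app name>
    prefix = "/".join(
        [app["DomainId"], app["UserProfileName"], app["AppType"], app["AppName"]]
    )
    response = logs.describe_log_streams(
        logGroupName=LOG_GROUP, logStreamNamePrefix=prefix
    )
    stamps = [
        s["lastEventTimestamp"]
        for s in response["logStreams"]
        if "lastEventTimestamp" in s
    ]
    if not stamps:
        return None
    # CloudWatch reports timestamps in milliseconds since the epoch.
    return datetime.fromtimestamp(max(stamps) / 1000, tz=timezone.utc)


def delete_idle_apps(sagemaker, logs, now):
    """Delete apps whose newest log event is older than the idle threshold."""
    deleted = []
    for app in get_in_service_apps(sagemaker):
        last_seen = get_last_log_time(logs, app)
        if last_seen is not None and now - last_seen > IDLE_THRESHOLD:
            sagemaker.delete_app(
                DomainId=app["DomainId"],
                UserProfileName=app["UserProfileName"],
                AppType=app["AppType"],
                AppName=app["AppName"],
            )
            deleted.append(app["AppName"])
    return deleted


def lambda_handler(event, context):
    """Entry point invoked by the EventBridge schedule."""
    import boto3  # imported here so the helpers can be exercised without AWS

    now = datetime.now(timezone.utc)
    deleted = delete_idle_apps(boto3.client("sagemaker"), boto3.client("logs"), now)
    return {"deleted": deleted}
```

Passing the clients in as arguments means the helpers can be run locally against stub objects. Note that CloudWatch describes `lastEventTimestamp` as eventually consistent, so treat the 8-hour cut-off as approximate.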

LastUserActivityTimestamp is also updated when SageMaker performs health checks without user activity. As a result, this value is set to the same value as LastHealthCheckTimestamp .

This is unhelpful since the timestamp does not behave as its name suggests. A feature request has been made to Amazon for a new timestamp containing only the latest user activity (fingers crossed).

AWS has confirmed a bug with this field, so check back in the future regarding LastUserActivityTimestamp; perhaps parsing CloudWatch logs will no longer be required.

“Unfortunately at this moment there is a bug which causes LastUserActivityTimestamp to be set to the same value as lastHealthCheckTimestamp.” — AWS Support

Step 2) Amazon EventBridge (CloudWatch Events)

Within the Lambda function configuration screen select ‘Add trigger’. The below shows a curtailed view of the result. I think you can probably see where this is going by now…

Curtailed Image of ‘Add Trigger’ Screen

A menu will appear to add the trigger. Select EventBridge within the dropdown, select ‘Create a new rule’, then type in a rule name and description. Finally, enter your schedule expression, which in this case is a cron expression that runs every 30 minutes. For more details on creating cron expressions within AWS, here is a good resource.
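For the 30-minute schedule described in the overview, the schedule expression would be `cron(0/30 * * * ? *)`. EventBridge cron expressions have six fields (minutes, hours, day of month, month, day of week, year) and are evaluated in UTC. The sketch below shows how the same rule could be created with boto3 instead of the console. The function name `create_schedule_rule` and the rule name are hypothetical, and the sketch only creates the rule; the console's ‘Add trigger’ flow also registers the Lambda function as the target and grants EventBridge permission to invoke it.

```python
# Fires at minute 0 and minute 30 of every hour (UTC).
SCHEDULE_EXPRESSION = "cron(0/30 * * * ? *)"


def create_schedule_rule(events, rule_name, description):
    """Create or update a scheduled EventBridge rule and return its ARN."""
    response = events.put_rule(
        Name=rule_name,
        ScheduleExpression=SCHEDULE_EXPRESSION,
        State="ENABLED",
        Description=description,
    )
    return response["RuleArn"]


# Usage, assuming AWS credentials are configured:
#   import boto3
#   arn = create_schedule_rule(
#       boto3.client("events"), "studio-idle-check", "Shut down idle Studio apps"
#   )
```

`rate(30 minutes)` is an equivalent schedule expression if you do not need runs aligned to the hour and half hour.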

EventBridge Trigger

Select ‘Add’ and that’s it, you are done! But wait, test it first before you claim victory.

Step 3) Testing

You can test by creating a SageMaker app instance and reversing the logic within the Lambda function so that it closes down all applications opened within the last 8 hours. Redeploy the code within Lambda and the Lambda function should shut down your newly created app instance. Then revert the logic and redeploy to achieve the intended functionality.
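An alternative to editing the comparison by hand, offered here only as a suggestion and not as part of the original script, is to read the inactivity window from a Lambda environment variable. The variable name `IDLE_HOURS` and the helper below are hypothetical. Setting the variable to `0` in a test environment makes every app with any past log activity count as idle, and removing it restores the 8-hour default without a code change.

```python
import os
from datetime import timedelta


def idle_threshold_from_env(default_hours=8.0):
    """Read the inactivity window from IDLE_HOURS, defaulting to 8 hours."""
    raw = os.environ.get("IDLE_HOURS")
    if raw is None or raw.strip() == "":
        return timedelta(hours=default_hours)
    hours = float(raw)  # raises ValueError on non-numeric input
    if hours < 0:
        raise ValueError("IDLE_HOURS must not be negative")
    return timedelta(hours=hours)


print(idle_threshold_from_env())
```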

Conclusions

It would be beneficial if Amazon provided native functionality to terminate idle instances; however, I assume it is beneficial for them not to fix this problem. This solution appears to work, although it is highly customized and required some engineering effort to ‘figure out’. It would be fantastic to get some feedback on improvements or thoughts on other appropriate methodologies.

Thank you for reading and best of luck with your AWS / Data Science journey!
