AWS Lambda — serverless — Python — DEVOPS

10 Recommendations for writing pragmatic AWS Lambdas in Python

This article presents a collection of design patterns and best practices to implement AWS Lambdas in Python. It’s the first of two articles about writing and testing DevOps-y Lambdas.

Jan Groth
The Startup


Much like software applications, infrastructure provisioning has moved away from monolithic solutions, shifting focus towards decoupled components and utilisation of public service APIs. Tasks that traditionally would have required a great deal of orchestration and heavy tooling have transformed into lightweight event-driven services. Frameworks like the AWS Serverless Application Model (AWS SAM) have come a long way and make it easy to implement complex applications in a “microservice-style”, often with little more than a few Lambdas as building blocks.

In this series of two articles, I want to look at AWS Lambdas, written in Python, as the core elements of modern infrastructure stacks.

Taking a real-world application as an example, I will step through different aspects of source and test code and explain my approach and how it contributes to the overall goal: Implementing AWS Lambdas that are easy to understand, simple to monitor and fully unit-tested.

In this first article, I’ll introduce the example scenario and focus on the Lambda itself, looking at the source code and the patterns I used to increase readability and testability. Furthermore, I want to show how providing meaningful logging greatly reduces monitoring and debugging costs.

The second article will deep-dive into testing, explaining techniques for writing effective unit tests and demonstrating how they help reduce development effort.

I have been a happy Java developer for many years, and most of the patterns that I’m using are well-known software development patterns, sometimes slightly adapted for cloud computing. If you have a developer background yourself, you’ve probably already seen many of them.

So what this article is really about is treating DevOps code like a software development project.

Before we start, let’s look at the example used throughout the article:

A real-world example — restricting inbound and outbound traffic for default security groups

https://versent.com.au/

As a DevOps engineer at Versent, I’m helping clients with their cloud infrastructure. The example I picked for this article comes from a challenge we frequently face: making sure that AWS workloads comply with modern industry and security standards.

One aspect of this goal is setting guardrails for AWS resources so they don’t cause a security risk. This often requires deploying a sensible default configuration as well as putting controls in place that ensure that the system can’t drift at a later stage.

One approach to avoid drift is chaining reactive components to observe system events and take appropriate action if required:

[Figure: Observing system events with reactive components]

This could, for example, be used to ensure that every new S3 bucket has logging enabled: when a bucket is created (a CreateBucket API call recorded in CloudTrail), an Events rule triggers a Lambda that programmatically enables bucket logging (the put_bucket_logging() API call in boto3).
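
As a rough sketch of that reactive step, the Lambda logic could look like the following. The function name enable_bucket_logging, the log bucket name and the event shape (bucket name under detail.requestParameters.bucketName) are assumptions for illustration. In a deployed function, s3_client would be created with boto3.client("s3"); it is passed in here so the sketch runs without AWS access.

```python
def enable_bucket_logging(s3_client, event, log_bucket):
    """Enable server access logging on the bucket named in a CreateBucket event.

    Assumption: the CloudTrail event arrives via EventBridge and carries the
    bucket name under detail.requestParameters.bucketName.
    """
    bucket_name = event["detail"]["requestParameters"]["bucketName"]
    s3_client.put_bucket_logging(
        Bucket=bucket_name,
        BucketLoggingStatus={
            "LoggingEnabled": {
                "TargetBucket": log_bucket,  # hypothetical central log bucket
                "TargetPrefix": f"{bucket_name}/",
            }
        },
    )
    return bucket_name
```

Passing the client as an argument keeps the AWS dependency at the edge of the function, which is the same idea the recommendations below apply to the security group example.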

For this article I’ve picked a slightly more complex scenario:

AWS VPCs are provisioned with default security groups that allow all outbound and certain inbound traffic. While this helps new users get started quickly in their private accounts, it’s considered unsafe for production accounts and needs to be restricted to comply with the CIS AWS Foundations Benchmark.

As it’s not possible to delete default security groups, all we can do is revoke egress and ingress on the existing groups. Doing this once is not good enough: we also need to make sure that developers cannot simply re-open access at a later stage.

Following the layout suggested above, an implementation with AWS SAM can look like this:

[Figure: Automatically revoking ingress/egress for a security group]

(1) All API invocations are automatically logged to CloudTrail — per AWS account.
(2) CloudTrail events are automatically forwarded to EventBridge.
(3) An EventsRule listens to Ingress/Egress changes on security groups and triggers a Lambda.
(4) The Lambda then revokes traffic rules if the security group is a default security group; to not drive developers insane over magically disappearing traffic rules it also adds a tag that explains what just happened.

Example project

The complete working example can be found on GitHub:

github.com/jangroth/writing-aws-lambdas-in-python

All source code examples are taken from there. You are welcome to follow along by setting up the project yourself.

Making code readable and testable

Imagine yourself on a support call in the middle of the night, staring at a piece of code that’s not doing what it’s supposed to be doing. On the other end of the call: The client, asking for updates every 30 seconds.

What’s more important at this moment: code that uses a semi-documented programming trick to squeeze out a marginal performance gain, or a slightly less performant version of the same code that you still understand at 2 a.m.?

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

— Donald Knuth, author of “The Art of Computer Programming”

As a universal rule of thumb: Write code that is as readable as possible, and only optimize for performance if you really have to.

1. Linters are free as in beer

A quick recommendation before we get started with code. Integrating linters into your project costs no more than a few lines in the Makefile, but will almost certainly surface issues that you would otherwise have overlooked:

check: ## Run linters
	flake8
	yamllint -f parsable .
	cfn-lint -f parseable
	@echo '*** all checks passing ***'
[→ source]

Here, I use flake8 for Python, yamllint for YAML and cfn-lint for CloudFormation. This will provide good coverage for an AWS stack.

Now, let’s look at the actual Lambda implementation:

2. Separate the Lambda handler from the core logic

This is also the first piece of advice in AWS’s “Best Practices for Working with AWS Lambda Functions”: logic in handlers can be difficult to follow and is extremely hard to test. I’ll talk about unit testing in the second article, but it’s the basic layout of the code that sets the scene for later testing.

This is a pattern that works well for me:

class RevokeDefaultSg:

    def process_event(self, event):
        # extract security group name from event
        # if security group is default security group:
        #     revoke ingress
        pass  # placeholder so the outline is valid Python


def handler(event, context):
    return RevokeDefaultSg().process_event(event)
[→ source]

The core idea is to isolate the complete logic into a class, RevokeDefaultSg, with a single public entry point. This lays the foundation for a top-down implementation approach, where process_event() contains the high-level logic and delegates work to private helper methods.

An equally important aspect: Using a Python class for the logic creates an enclosure that makes it very easy to unit-test the Lambda. Actually, as we’ll see in the second article, not using a class makes it much harder to test the Lambda.

3. Implement separate responsibilities in separate methods

This is about readability as much as testability:

class RevokeDefaultSg:

    def _extract_sg_id(self, event):
        pass

    def _is_default_sg(self, sg_id):
        pass

    def _revoke_and_tag(self, sg_id):
        pass

    def process_event(self, event):
        sg_id = self._extract_sg_id(event)
        if self._is_default_sg(sg_id):
            self._revoke_and_tag(sg_id)
        return 'SUCCESS'
[→ source]

All of the logic in process_event, the only public method and entry point, is delegated to helper methods. This makes the method itself extremely easy to understand and explains everything the Lambda does in just three lines of code. Ideally, process_event should tell the whole story of what the Lambda does, without the need to look any further.

4. Push external dependencies into small, low-level methods

While it is possible to write unit tests for code that contains external dependencies, they still make testing life a little difficult.

In our example Lambda we need to tell whether a given security group id belongs to a default security group or not. This could be implemented as a straight boto call in process_event, but writing tests against third-party code at the entry-point level can quickly become painful. Also, checking for default security groups is a different task from processing the event itself and shouldn’t be part of the main method, as per the previous section on separating responsibilities.

It’s much better for both testability and readability to extract the check into a separate method:

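    def _is_default_sg(self, sg_id):
        # describe_security_groups() returns a dict; the groups are listed under 'SecurityGroups'
        sec_groups = self.ec2_client.describe_security_groups(...)['SecurityGroups']
        return sec_groups[0]["GroupName"] == "default"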
[→ source]

5. Initialise external dependencies in the constructor if you can

In the example above you’ve probably noticed that the boto client (ec2_client) is accessed through an instance attribute (self.ec2_client). This is only possible because the client — and all other dependencies — have already been initialised in the constructor:

    def __init__(self, region="ap-southeast-2"):
        self.ec2_client = boto3.client("ec2", region_name=region)
        self.ec2_resource = boto3.resource("ec2", region_name=region)
[→ source]

Moving the creation and setup of external dependencies into the constructor has a few advantages:

  • If there are any problems while creating dependencies, the code will fail immediately on object creation — this is much better than failing half-way through execution, potentially after data has already been manipulated.
  • Dependencies are often expensive to create. Sharing them across the code allows reuse, and constructors are a good location to keep them.
  • Spoiler alert: When unit-testing the Lambda we will create test objects without executing the constructor. Not having to deal with dependencies on object creation makes test setup much easier. This will be covered in the second article which focuses on testing aspects.
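
To make the last point a little more concrete, here is a minimal sketch of one possible way to create an object without running its constructor; it is not necessarily the approach taken in the second article. StubEc2Client and the GroupIds argument are assumptions for illustration, and boto3 is imported inside the constructor only so that the sketch runs without boto3 installed.

```python
class RevokeDefaultSg:
    def __init__(self, region="ap-southeast-2"):
        # Imported here only so this sketch runs without boto3 installed.
        import boto3
        self.ec2_client = boto3.client("ec2", region_name=region)

    def _is_default_sg(self, sg_id):
        response = self.ec2_client.describe_security_groups(GroupIds=[sg_id])
        return response["SecurityGroups"][0]["GroupName"] == "default"


class StubEc2Client:
    """Hypothetical stand-in for the boto3 EC2 client."""

    def describe_security_groups(self, GroupIds):
        return {"SecurityGroups": [{"GroupId": GroupIds[0], "GroupName": "default"}]}


# __new__ allocates the object without calling __init__, so no AWS client is created.
revoker = RevokeDefaultSg.__new__(RevokeDefaultSg)
revoker.ec2_client = StubEc2Client()
print(revoker._is_default_sg("sg-12345678"))  # prints True
```

Because every dependency lives in an instance attribute, swapping it for a stub is a one-line assignment.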

Making code easy to monitor and to debug

Lambdas are tiny snippets of code that run independently in containers somewhere in the cloud. User interaction is not supported, and output typically only goes as far as CloudWatch.

[Figure: Decision tree for error handling]

This removes all need to beautify output and translate error messages into a user-friendly format. Logs in CloudWatch only have to be understandable to developers.

In many cases, it also removes the need to recover from erroneous situations. It’s much better to fail hard and fast, trusting the monitoring, than to attempt to deal with a possible chain of problems at a rate of a thousand invocations per second.

6. Don’t handle errors just because you can

Revisiting the example from above:

    def _is_default_sg(self, sg_id):
        sec_groups = self.ec2_client.describe_security_groups(...)['SecurityGroups']
        return sec_groups[0]["GroupName"] == "default"
[→ source]

There are many implicit assumptions in this code:

  • ec2_client has been successfully created
  • describe_security_groups() returns a dict with the key 'SecurityGroups'
  • The value under that key is a non-empty list of dicts, each with the key 'GroupName'

What are the options if any of these assumptions fail?

Not many, as another party broke their contract with us. Maybe boto has changed the semantics of the API call, or the AWS EC2 service itself now returns different results. It’s impossible to know what has gone wrong and why, and certainly there’s no reasonable way to recover from it. All we can do is fail and investigate the logs at a later stage. Most likely, code will have to be changed to fix the problem.

It might look tempting to still wrap the code with try...except, just in case. But often this is adding more noise than providing benefits, as the standard Python stack trace should already contain the information required to start the investigation.
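
A small sketch illustrates how much the plain stack trace already says. ChangedApiClient is a hypothetical client that simulates a changed response shape, and the GroupIds argument is an assumption; the try/except at the bottom exists only to display the trace locally, whereas in a Lambda an unhandled exception ends up in the CloudWatch logs.

```python
import traceback


def is_default_sg(ec2_client, sg_id):
    # No try/except: if the response shape changes, let the error propagate.
    sec_groups = ec2_client.describe_security_groups(GroupIds=[sg_id])["SecurityGroups"]
    return sec_groups[0]["GroupName"] == "default"


class ChangedApiClient:
    """Hypothetical client whose response no longer has the expected key."""

    def describe_security_groups(self, GroupIds):
        return {"Groups": []}


try:
    is_default_sg(ChangedApiClient(), "sg-12345678")
except KeyError:
    # Caught here only to show the trace; a Lambda would simply fail.
    report = traceback.format_exc()
    print(report)
```

The trace names the function, the offending line and the missing key, which is usually enough to start the investigation without any custom error handling.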

7. Use custom exceptions for flow-specific errors

Sometimes it is good to communicate a certain error explicitly.

For example, a Lambda function can be part of a greater piece of orchestration and be invoked from different sources. In this case, it’s useful to validate the incoming event as a precondition to our logic and to fail with a clear error message if something is unexpected.

Custom exceptions only take up two lines and are a great way to communicate individual errors:

class UnknownEventException(Exception):
    pass

[...]

    def _extract_sg_id(self, event):
        is_correct_event_source = ...
        is_correct_event_type = ...
        event_sg_id = ...
        if not (is_correct_event_source and is_correct_event_type and event_sg_id):
            raise UnknownEventException(f"Cannot handle event: {event}")
        return event_sg_id
[→ source]

This sends a clear message to CloudWatch and makes it easy to understand the problem.
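
For illustration, the elided checks could be filled in along the following lines. This is a sketch under assumptions, not the project’s actual implementation: the event field names follow the usual CloudTrail layout as I understand it, and EXPECTED_EVENT_NAMES as well as the standalone function extract_sg_id are hypothetical.

```python
class UnknownEventException(Exception):
    pass


# Assumed event names; adjust to whatever the Events rule forwards.
EXPECTED_EVENT_NAMES = {
    "AuthorizeSecurityGroupIngress",
    "AuthorizeSecurityGroupEgress",
}


def extract_sg_id(event):
    """Return the security group id, or fail loudly on an unexpected event."""
    detail = event.get("detail", {})
    is_correct_event_source = detail.get("eventSource") == "ec2.amazonaws.com"
    is_correct_event_type = detail.get("eventName") in EXPECTED_EVENT_NAMES
    event_sg_id = (detail.get("requestParameters") or {}).get("groupId")
    if not (is_correct_event_source and is_correct_event_type and event_sg_id):
        raise UnknownEventException(f"Cannot handle event: {event}")
    return event_sg_id
```

Using .get() for the lookups means that a malformed event always surfaces as UnknownEventException with the full event in the message, rather than as a KeyError somewhere in the middle of the checks.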

8. Use a logger and pick the right logging level

All Lambda output ends up in CloudWatch — which makes it the central source for debugging and a general understanding of what’s going on. Not having enough information here can be problematic if something goes wrong. On the other hand, too much output makes it hard to see the forest for the trees.

print statements are easy to use and don’t require any imports or configuration. While this might already sound sufficient, loggers go a lot further: they add contextual information and allow us to bind the output to a logging level. This lets us control the amount of logging depending on, for example, the environment the Lambda runs in.

When writing output under a certain logging level, it’s essential to use the different levels consistently. Otherwise, the benefit of logging gets watered down and supporting services like CloudWatch Alarms are far less useful. The following convention has become established as a de facto standard:

error — something has gone seriously wrong. A developer (you!) needs to look at this, code will likely have to change. Raise an incident if this happens on a production system. Only use this level if you are ready to be woken up in the middle of the night about this.

warning — something is unexpected, but we are still cool. E.g. a configuration value should have been there but wasn’t found. So we continue with a default value and log a warning. Someone should investigate.

info — the normal level. Someone looking at the info-level output of a Lambda should easily understand what happened. It’s important to strike the balance between too much and not enough. Creating an if branch just to add a logging statement is acceptable if it makes the flow easier to follow:

        if self._is_default_sg(sg_id):
            self.logger.info(f"Revoking ingress/egress [...]")
            self._revoke_and_tag(sg_id)
        else:
            self.logger.info("[...] nothing to do.")
[→ source]

debug — fine-grained output that helps you develop the code and makes it possible to understand what’s going on at a low level. Too noisy for info level. In this example there’s additional information about the exact action that was taken, adding to the general information already provided at info level:

        should_tag = False  # added so the excerpt is well-defined when nothing is revoked
        if security_group.ip_permissions:
            security_group.revoke_ingress(...)
            should_tag = True
            self.logger.debug("Revoking ingress rules")
        if security_group.ip_permissions_egress:
            should_tag = True
            security_group.revoke_egress(...)
            self.logger.debug("Revoking egress rules")
        if should_tag:
            security_group.create_tags(...)
            self.logger.debug("Adding tag.")
        else:
            self.logger.debug("No ingress/egress rules found to revoke")
[→ source]
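
The effect of the minimum level can be seen in a small standalone sketch. ListHandler and the logger name are made up for this illustration; the handler collects messages in memory so the result is easy to inspect.

```python
import logging


class ListHandler(logging.Handler):
    """Collects messages in memory so the effect of the level is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}:{record.getMessage()}")


logger = logging.getLogger("RevokeDefaultSgDemo")
logger.propagate = False  # keep the demo output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

logger.setLevel("INFO")
logger.info("Revoking ingress/egress for default security group")
logger.debug("Revoking ingress rules")  # dropped: below the INFO threshold

logger.setLevel("DEBUG")
logger.debug("Revoking egress rules")  # kept: DEBUG is now enabled

print(handler.messages)
# ['INFO:Revoking ingress/egress for default security group', 'DEBUG:Revoking egress rules']
```

The same code produces a concise story at info level and the full detail at debug level, without touching any of the logging statements.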

9. Configure the minimum log level via CloudFormation

Now, how can deploys to “prod” have a different minimum log level than deploys to “test”?

It’s a good practice to separate configuration — everything that is likely to vary between deployments — from the application and to store it in the environment.

For the Lambda, let’s bring the configuration up to CloudFormation parameter level, so that different stacks can be deployed with different settings:

Parameters:
  minimumLogLevel:
    Type: String
    Default: DEBUG
...
Resources:
  ...
  RevokeDefaultSgLambda:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: ./src
      Handler: revokedefaultsg.app.handler
      Runtime: python3.7
      Role: !GetAtt RevokeDefaultSgLambdaRole.Arn
      Timeout: 900
      Environment:
        Variables:
          LOGGING: !Ref minimumLogLevel
[→ source]

All that’s left for the Lambda is reading out the environment variable:

class RevokeDefaultSg:
    def __init__(self):
        self.logger = logging.getLogger(self.__class__.__name__)
        self.logger.setLevel(os.environ["LOGGING"])
[→ source]
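
The excerpt above fails with a KeyError if the variable is missing, which is in line with failing hard and fast. If a missing value should not stop the function, a variation could fall back to a default and log a warning, matching the warning-level convention described earlier. The helper name configure_logger and the INFO default are assumptions of this sketch.

```python
import logging
import os


def configure_logger(name, env=os.environ):
    """Return a logger whose level comes from the LOGGING variable, if present."""
    logger = logging.getLogger(name)
    level = env.get("LOGGING")
    if level is None:
        # Assumed default for this sketch; someone should investigate the warning.
        logger.setLevel(logging.INFO)
        logger.warning("LOGGING not set, falling back to INFO")
    else:
        # An unknown level name raises ValueError, which is left to propagate.
        logger.setLevel(level)
    return logger
```

Accepting the environment as a parameter keeps the function easy to exercise with a plain dict, for example configure_logger("demo", {"LOGGING": "DEBUG"}).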

For further reading I recommend “The Twelve-Factor App”. “Store config in the environment” is one of the factors of the twelve-factor methodology for building software-as-a-service applications.

10. Decorators are an elegant way to log Lambda context

Did you ever find yourself implementing the same statements for Lambda invocation logging again and again? This is where Python decorators shine:

Simply define a decorator function somewhere in the Lambda code (or even better: in a Lambda layer):

from functools import wraps


def notify_cloudwatch(function):
    @wraps(function)
    def wrapper(event, context):
        logger.info(f"'{context.function_name}' - entry:'{event}'")
        result = function(event, context)
        logger.info(f"'{context.function_name}' - exit:'{result}'")
        return result
    return wrapper
[→ source]

And decorate the handler function:

@notify_cloudwatch
def handler(event, context):
return RevokeDefaultSg().process_event(event)
[→ source]

This becomes even more useful when using AWS Step Functions, where multiple Lambdas are often combined into one source file with multiple handler methods.

Also, this technique can easily be extended to serve different channels:

@notify_cloudwatch
@notify_slack
def handler(event, context):
    return RevokeDefaultSg().process_event(event)
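
When stacking decorators, it helps to know the order in which they run. The following sketch uses a hypothetical notify() decorator factory that records its calls in a list instead of talking to CloudWatch or Slack; the names are made up for illustration.

```python
from functools import wraps

calls = []  # stands in for CloudWatch/Slack output in this sketch


def notify(channel):
    """Hypothetical decorator factory; a real one would log or post a message."""
    def decorator(function):
        @wraps(function)
        def wrapper(event, context):
            calls.append(f"{channel}: entry")
            result = function(event, context)
            calls.append(f"{channel}: exit")
            return result
        return wrapper
    return decorator


@notify("cloudwatch")
@notify("slack")
def handler(event, context):
    return "SUCCESS"


result = handler({}, None)
print(calls)
# ['cloudwatch: entry', 'slack: entry', 'slack: exit', 'cloudwatch: exit']
```

The decorator listed first wraps everything below it, so it sees the invocation first and the result last.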

The End (for now)

There’s surely much more out there to be considered, but from my experience, these 10 recommendations are a good baseline for implementing code that’s readable, maintainable and testable.

Some suggestions, like initialising external dependencies in constructors and pushing their usage out into small methods, might still seem a little strange — hopefully, this will change when we start looking at unit-testing the Lambda:

No time for tests? — 12 Recommendations on unit-testing AWS Lambdas in Python

In the meantime, please leave a comment if you have questions or want to share your thoughts.

Happy Coding :)

Jan Groth
The Startup

DevOps Engineer at Versent in Sydney. Loves writing code.