High Availability NAT with SNS and Lambda

Published in Journey through the cloud

6 min read · Jan 3, 2016

AWS provides an EC2 image that lets you easily spin up an EC2 instance as a NAT (network address translation) device. The next question is how to set it up in a highly available mode. You have probably come across the classic solution in AWS's own article, which relies on Elastic IP (EIP) and a self-monitoring script running between two instances. What we want to do here is explore an alternative approach that takes advantage of a newer service, AWS Lambda.

AWS Lambda is a compute service where you upload your code and let the service run it on your behalf using AWS infrastructure. It allows you to deploy and run code quickly in the cloud without worrying about infrastructure, network, instances and high availability. That is all taken care of by AWS. Lambda lets you run code written in Java, Python or Node.js. We are going to use Lambda as "command & control": it receives an event and takes action based on that event.

To achieve high availability, you need at least two NAT instances, preferably in two different availability zones. When one instance fails, Lambda gets notified and fixes the problem.

Solution Overview

Get an Auto Scaling group (ASG) to manage the NAT instances
The ASG publishes an "instance termination" event to SNS when an instance goes down
Lambda gets notified of the event
Lambda replaces the route table of the affected subnet

How much does it cost?

The solution most likely won't cost you anything. Here is the breakdown:

Use of the Auto Scaling group is free
SNS notifications to a Lambda function are free
Lambda is free for the first 1 million calls per month (you are unlikely to hit more)

Get ASG to manage NAT instances

You either already have NAT instances or have not created any yet. If you have existing NAT instances, create a new Auto Scaling group and attach them to it. If you do not have any, simply create a new Auto Scaling group and let it create the instances for you. Either way, make sure the following setup is in place:

Auto Scaling min instances: 1, max: 2, desired: 2
Auto Scaling Terminate and Launch processes enabled
A launch configuration that creates NAT instances (e.g. ami-1a9dac48)
Both NAT instances are in public subnets in different availability zones
Both NAT instances have Source/Dest. check disabled
Private subnets that use one of the NAT instances as their gateway
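
The settings listed above can also be expressed as boto3 request parameters. The following is only a minimal sketch under stated assumptions: the helper names, the group name, the launch configuration name and the subnet IDs are all made up for illustration, and the code only builds the parameters rather than calling AWS.

```python
def build_asg_request(group_name, launch_config_name, public_subnet_ids):
    """Build keyword arguments for autoscaling.create_auto_scaling_group.

    All names passed in are placeholders chosen by the caller.
    """
    if len(public_subnet_ids) < 2:
        raise ValueError("use at least two public subnets in different availability zones")
    return {
        "AutoScalingGroupName": group_name,
        "LaunchConfigurationName": launch_config_name,
        "MinSize": 1,
        "MaxSize": 2,
        "DesiredCapacity": 2,
        # The API expects the subnet IDs as one comma-separated string.
        "VPCZoneIdentifier": ",".join(public_subnet_ids),
    }


def build_source_dest_check_request(instance_id):
    """Build keyword arguments for ec2.modify_instance_attribute
    that disable Source/Dest. check on one NAT instance."""
    return {"InstanceId": instance_id, "SourceDestCheck": {"Value": False}}


if __name__ == "__main__":
    request = build_asg_request(
        "nat-asg", "nat-launch-config", ["subnet-aaaa1111", "subnet-bbbb2222"]
    )
    print(request)
    # With credentials configured, the requests could be sent like this:
    # import boto3
    # boto3.client("autoscaling").create_auto_scaling_group(**request)
    # boto3.client("ec2").modify_instance_attribute(
    #     **build_source_dest_check_request("i-0123456789abcdef0"))
```

Keeping the parameter building separate from the AWS calls makes it easy to review the settings before anything is created.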

Configure ASG to publish event to SNS Topic

We need to configure the ASG to publish the instance termination event to an SNS topic. An ASG can emit four different events: Terminate, Launch, Failed to terminate, and Failed to launch. Terminate is the best place to hook in, as it fires right after the ASG terminates the instance. This way our Lambda function can detect any gap in the routing table left behind by the killed NAT instance.

Create a new SNS topic, say NAT_Failed.

Get the ASG to send notifications to the NAT_Failed topic.
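
For readers who prefer scripting over the console, the two steps could look roughly like this with boto3. This is a sketch, not the author's procedure: the function names are hypothetical, and the clients are passed in as arguments so the logic can be exercised without AWS access.

```python
# Notification type for the Terminate event of an Auto Scaling group.
TERMINATE_EVENT = "autoscaling:EC2_INSTANCE_TERMINATE"


def build_notification_request(group_name, topic_arn):
    """Build keyword arguments for autoscaling.put_notification_configuration."""
    return {
        "AutoScalingGroupName": group_name,
        "TopicARN": topic_arn,
        "NotificationTypes": [TERMINATE_EVENT],
    }


def configure_notifications(sns_client, autoscaling_client, group_name,
                            topic_name="NAT_Failed"):
    """Create the SNS topic and have the ASG publish Terminate events to it.

    sns_client and autoscaling_client are expected to behave like
    boto3.client("sns") and boto3.client("autoscaling").
    """
    topic_arn = sns_client.create_topic(Name=topic_name)["TopicArn"]
    autoscaling_client.put_notification_configuration(
        **build_notification_request(group_name, topic_arn)
    )
    return topic_arn
```

A call such as `configure_notifications(boto3.client("sns"), boto3.client("autoscaling"), "nat-asg")` would return the topic ARN, where "nat-asg" stands in for your own group name.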

Configure Lambda to listen to SNS Topic

The next thing we need to do is create a Lambda function and subscribe it to the NAT_Failed SNS topic.

Before creating a Lambda function, we need to create a role that the function will assume. For our purposes, we assign it EC2 Full Access and the Lambda role.
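
If you would rather not grant EC2 Full Access, a narrower role is possible. The policy documents below are an assumption on my part, not what the article uses: the trust policy lets Lambda assume the role, and the permissions policy lists only the EC2 actions the failover code appears to need plus CloudWatch Logs access. Verify them against your own function before relying on them.

```python
import json

# Trust policy: allows the Lambda service to assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permissions policy (assumed minimum): read route tables, move subnet
# associations, and write logs to CloudWatch Logs.
PERMISSIONS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeRouteTables",
                "ec2:ReplaceRouteTableAssociation",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": "arn:aws:logs:*:*:*",
        },
    ],
}

if __name__ == "__main__":
    # Print both documents as JSON, ready to paste into the IAM console.
    print(json.dumps(TRUST_POLICY, indent=2))
    print(json.dumps(PERMISSIONS_POLICY, indent=2))
```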

Now we can create a dummy Lambda function that simply prints the incoming event, to test the connectivity. Go to AWS Console > Compute > Lambda. Don't worry if the service is not available in your region; simply select the region nearest to you that supports Lambda. The first Lambda screen offers to create a function from an available blueprint (template). Just press skip and create it manually.

Set the handler to nat_failover.do and the role to the one you created earlier. Save to create the Lambda function.
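
A dummy function for this connectivity test might look like the sketch below. The handler name do matches the nat_failover.do setting; the helper extract_notifications is a hypothetical name introduced here. It assumes the usual SNS event shape, where each record carries the Auto Scaling notification as a JSON string under Sns and Message.

```python
import json


def extract_notifications(event):
    """Return the decoded Auto Scaling notifications found in an SNS event."""
    notifications = []
    for record in event.get("Records", []):
        body = record.get("Sns", {}).get("Message", "")
        try:
            parsed = json.loads(body)
        except ValueError:
            parsed = None
        # Keep anything that is not a JSON object as raw text.
        if not isinstance(parsed, dict):
            parsed = {"raw": body}
        notifications.append(parsed)
    return notifications


def do(event, context):
    # Print the whole event so it shows up in the CloudWatch logs.
    print("Received event: " + json.dumps(event))
    for notification in extract_notifications(event):
        print("Event: %s, instance: %s" % (
            notification.get("Event"), notification.get("EC2InstanceId")))
    return True
```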

Now go over to the SNS topic and create a subscription, pasting your Lambda function ARN as the endpoint.
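
The same subscription can be scripted. The sketch below is an illustration under assumptions: the function name and the statement ID are made up, and the clients are passed in so the code can be tried without AWS access. When you script the subscription yourself, SNS also needs explicit permission to invoke the function, which is why the sketch adds it first.

```python
def subscribe_lambda_to_topic(sns_client, lambda_client, topic_arn, function_arn):
    """Subscribe a Lambda function to an SNS topic.

    sns_client and lambda_client are expected to behave like
    boto3.client("sns") and boto3.client("lambda").
    """
    # Allow SNS to invoke the function, but only on behalf of this topic.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId="sns-nat-failed-invoke",  # hypothetical statement ID
        Action="lambda:InvokeFunction",
        Principal="sns.amazonaws.com",
        SourceArn=topic_arn,
    )
    response = sns_client.subscribe(
        TopicArn=topic_arn, Protocol="lambda", Endpoint=function_arn
    )
    return response["SubscriptionArn"]
```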

At this point you can test your Lambda function to make sure it receives events from the Auto Scaling group. Stop one of the NAT instances to simulate a failure. Within seconds, the Auto Scaling group should detect this and terminate the instance, and your Lambda function should get invoked. Go to the Lambda Monitoring tab and click the View logs in CloudWatch button to see the logs.

Write Lambda code to modify subnet

Now it's time to write the real code that fails over the NAT when one instance goes down. When the ASG kills a NAT instance because of a failure, the route table entry that pointed to it goes into the blackhole state. With this understanding, our Lambda code will:

list all route tables that have a NAT instance in them
find the route table that has an entry in the blackhole state
replace the route table association of the subnets associated with that route table
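
The three steps can be tried out on plain data before touching AWS. The sketch below works on dictionaries shaped like the RouteTables entries returned by the EC2 describe_route_tables call; the function names are hypothetical. Unlike a simple "pick the other table" rule, it only picks a replacement table that has no blackhole route itself.

```python
def find_blackholed_tables(route_tables):
    """Return the IDs of route tables that contain a blackhole route."""
    return [
        table["RouteTableId"]
        for table in route_tables
        if any(route.get("State") == "blackhole"
               for route in table.get("Routes", []))
    ]


def plan_replacements(route_tables):
    """Return (association_id, healthy_route_table_id) pairs to apply."""
    troubled = set(find_blackholed_tables(route_tables))
    healthy = [t["RouteTableId"] for t in route_tables
               if t["RouteTableId"] not in troubled]
    if not healthy:
        return []  # nothing to fail over to
    plan = []
    for table in route_tables:
        if table["RouteTableId"] not in troubled:
            continue
        for assoc in table.get("Associations", []):
            # Only subnet associations are moved; a main-table
            # association carries no SubnetId.
            if assoc.get("SubnetId"):
                plan.append((assoc["RouteTableAssociationId"], healthy[0]))
    return plan
```

Each pair in the plan could then be applied with the EC2 client call replace_route_table_association(AssociationId=..., RouteTableId=...).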

Go back to your Lambda function, and replace the code with:

import boto3
from boto3.session import Session

region = 'ap-southeast-1'
# Route tables that route through a NAT instance (one per NAT instance).
rts = {'<your-route-table1>', '<your-route-table2>'}


def get_healthy_rt(troubled_rt_id):
    # Return the first configured route table other than the troubled one.
    for rt_id in rts:
        if rt_id != troubled_rt_id:
            return rt_id


def do(event, context):
    session = Session(region_name=region)
    ec2 = session.resource('ec2')
    for rt_id in rts:
        rt = ec2.RouteTable(rt_id)
        # routes_attribute holds the raw route dicts, each with a 'State' key.
        for route in rt.routes_attribute:
            if route['State'] == 'blackhole':
                print('affected route table found: ' + rt_id)
                healthy_rt = get_healthy_rt(rt_id)
                for assoc_attr in rt.associations_attribute:
                    assoc = ec2.RouteTableAssociation(assoc_attr['RouteTableAssociationId'])
                    # Point the subnet at the route table of the surviving NAT instance.
                    assoc.replace_subnet(DryRun=False, RouteTableId=healthy_rt)
                    print('troubled route table: ' + rt_id + ' has been replaced with: ' + healthy_rt)
                # One blackhole route is enough; the associations have already been moved.
                break

    return True

Replace <your-route-table1> and <your-route-table2> with your own route table IDs. These are the route tables that have a NAT instance in them.

What is next?

The solution above is just the tip of the iceberg. You can easily extend it to address other common availability issues, such as HAProxy and SFTP servers. Above, we get Lambda to swap the route table of the affected subnet. In the HAProxy case, you can get Lambda to move the Elastic IP from the failed instance to a secondary instance. In the SFTP case, you can get Lambda to launch a new SFTP server and move the EBS volume from the main SFTP server to the new one.

What I really like about this solution is that it's seamless and streamlined. You don't have to deploy any monitoring script on your running instances, and you don't have to worry about keeping that script running all the time. The solution is built from AWS services themselves, which take care of reliability and availability.

That's it, readers. Do let me know what you think, and please share if you have other interesting ideas for making use of SNS and Lambda.