Route 53 DNS Failover with Lambda HealthChecks in Private Subnet

Žygimantas Čijunskis
Sep 14, 2023 · 10 min read

If you’re running a master/slave setup inside a private AWS VPC subnet and need automatic failover to the slave when the master fails, the Route 53 failover functionality could be a suitable solution for your scenario.

In this article, we will implement this solution using Lambda, CloudWatch metrics, alarms, and events, and Route 53 health checks, a private hosted zone, and DNS records.

Before proceeding, I assume you’ve already set up a VPC, configured AWS Access Keys, and installed the Terraform CLI, as we’ll be using these throughout the article.

If you need a reference or assistance, you can find all the Terraform configuration files related to this project in my GitHub repository:

At the end of this article, your directory structure will look like this:

.
└── route53-failover/
    ├── main.tf
    ├── iam.tf
    ├── cloudwatch.tf
    ├── lambda_function.py
    ├── lambda.tf
    └── route53.tf

Building the foundation:

Before diving into Route 53 details, we’ll first establish a foundation of metric, health check, and alarm resources that will serve as the basis for the Route 53 failover functionality. Here’s a breakdown of what we’ll be building:

  • We’ll set up a Lambda function to perform TCP or HTTP health checks.
  • A CloudWatch Event will trigger this Lambda function every minute.
  • After the Lambda health check function runs, it will send a value of 1 for success or 0 for failure to the CloudWatch “LambdaCustomHC” namespace.
  • We’ll configure a CloudWatch alarm that triggers if the custom metric value drops below 1.
  • Lastly, on the Route 53 side, we’ll create a health check that monitors the state of the configured CloudWatch alarm. This health check will be used by Route 53’s failover functionality.

This visual represents the state after completing the first step.

We’ll begin by setting up the IAM resources. The lambda_custom_hc_role role will have policies attached for ENI management and CloudWatch. The cloudwatch:PutMetricData permission lets the Lambda function publish metrics to CloudWatch, while the EC2-related permissions allow Lambda to create ENIs in the private subnets so it can reach the private nodes.

Here’s what the iam.tf configuration should look like:

# Define a trust policy document that allows Lambda to assume this role.
# This policy will be assigned to the lambda_role resource.
data "aws_iam_policy_document" "lambda_assume_role_policy"{
statement {
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["lambda.amazonaws.com"]
}
}
}

# Create an IAM role for Lambda and attach the trust policy to it.
resource "aws_iam_role" "lambda_role" {
name = "lambda_custom_hc_role"
assume_role_policy = data.aws_iam_policy_document.lambda_assume_role_policy.json
}

# Define an IAM policy that allows Lambda to send metrics to CloudWatch.
resource "aws_iam_policy" "iam_policy_for_lambda" {
name = "lambda_cloudwatch_putmetric_policy"
description = "Cloudwatch policy to allow Lambda to put Metric data"

policy = jsonencode({
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"cloudwatch:PutMetricData"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "arn:aws:logs:*:*:*",
},
{
"Effect": "Allow",
"Action": [
"ec2:CreateNetworkInterface",
"ec2:DeleteNetworkInterface",
"ec2:DescribeNetworkInterfaces"
],
"Resource": "*"
}
]
})
}

# Attach the IAM policy to the Lambda role.
resource "aws_iam_role_policy_attachment" "attachment" {
role = aws_iam_role.lambda_role.name
policy_arn = aws_iam_policy.iam_policy_for_lambda.arn
}
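
As a side note, the logs and ENI permissions above are also covered by the AWS-managed AWSLambdaVPCAccessExecutionRole policy. If you prefer, you could attach that managed policy instead of spelling the statements out yourself; here's a minimal sketch (keep the custom PutMetricData policy either way):

# Optional alternative: attach the AWS-managed policy that grants the
# CloudWatch Logs and ENI permissions required by a VPC-attached Lambda.
resource "aws_iam_role_policy_attachment" "vpc_access" {
  role       = aws_iam_role.lambda_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}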

Next, we’ll create the Python script for our Lambda function. In short, it checks whether it can establish a connection to a remote TCP socket; if it can, it publishes a 1 to the LambdaCustomHC CloudWatch namespace, and a 0 otherwise.

I assume you’re familiar with Python basics and can customize the IP addresses and ports and add more function invocations as needed. Below is what the lambda_function.py code with TCP checks should look like:

import socket
import boto3

def check_tcp_port(host, port):
    try:
        # Create a socket object
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(5)  # Set a timeout for the connection attempt (in seconds)

        # Attempt to connect to the remote host and port
        sock.connect((host, port))
        sock.close()

        # If the connection was successful, return True
        return True
    except (socket.timeout, ConnectionRefusedError):
        # If the connection attempt times out or is refused, return False
        return False
    except Exception:
        # Treat any other error as an unhealthy node as well
        return False

def send_cloudwatch_metric(metric_name, metric_value):
    cloudwatch = boto3.client('cloudwatch')
    namespace = 'LambdaCustomHC'

    # Send the custom metric to CloudWatch
    response = cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': metric_value,
                'Unit': 'Count',
                'StorageResolution': 1
            },
        ]
    )

def lambda_handler(event, context):
    port = 8888  # Replace with the TCP port you want to check

    # Checking test_node_1
    if check_tcp_port('10.0.0.1', port):
        send_cloudwatch_metric('test_node_1_availability', 1)
    else:
        send_cloudwatch_metric('test_node_1_availability', 0)

    # Checking test_node_2
    if check_tcp_port('10.0.0.2', port):
        send_cloudwatch_metric('test_node_2_availability', 1)
    else:
        send_cloudwatch_metric('test_node_2_availability', 0)

    return {
        'statusCode': 200,
        'body': 'Private endpoint is alive'
    }

If you prefer HTTP checks over TCP, you’ll need a different script. The script below verifies whether the HTTP response contains a specific target string; if it finds the string, it sends a 1 to the CloudWatch LambdaCustomHC metric namespace. Note that the requests library is not included in the Lambda Python runtime, so you’ll need to package it with the function archive or attach it as a layer.

This is what the lambda_function.py script for HTTP checks should look like:

import requests
import boto3

def send_cloudwatch_metric(metric_name, metric_value):
    cloudwatch = boto3.client('cloudwatch')
    namespace = 'LambdaCustomHC'

    # Send the custom metric to CloudWatch
    response = cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': metric_value,
                'Unit': 'Count',
                'StorageResolution': 1
            },
        ]
    )

def check_url(url, cloudwatch_target, target_string):
    try:
        # Make the HTTP request
        response = requests.get(url, timeout=10)
        response_text = response.text

        # Check if the target string is present in the response
        if target_string in response_text:
            send_cloudwatch_metric(cloudwatch_target, 1)
            return f"URL: {url} - OK"
        else:
            send_cloudwatch_metric(cloudwatch_target, 0)
            return f"URL: {url} - Unexpected Response."
    except Exception as e:
        send_cloudwatch_metric(cloudwatch_target, 0)
        return f"URL: {url} - An error occurred: {str(e)}"

def lambda_handler(event, context):

    # String to check in the HTTP response
    target_string = "<target string>"  # Replace with your target string

    results = []

    result = check_url("http://10.0.0.1:8888/", "test_node_1_availability", target_string)
    results.append(result)

    result = check_url("http://10.0.0.2:8888/", "test_node_2_availability", target_string)
    results.append(result)

    result = check_url("http://10.0.0.3:8888/", "test_node_3_availability", target_string)
    results.append(result)

    return {
        "statusCode": 200,
        "body": "\n".join(results)
    }

Once the Python script is ready, we’ll write the Lambda configuration that packages our lambda_function.py script into a zip archive, creates the Lambda function, and sets up a CloudWatch Events rule that schedules the function to run every minute.

The only modifications needed for it to work are in the local variables. Here's what the lambda.tf configuration should look like:

# Define local variables for subnet ID, security group, and lambda function name.
locals {
  subnet_id            = ["subnet-00000"]
  security_group       = ["sg-00000"]
  lambda_function_name = "test_node_hc"
}

# Create an archive file containing the Lambda function code.
data "archive_file" "lambda_hc_zip" {
  type        = "zip"
  source_file = "lambda_function.py"
  output_path = "healthcheck_function.zip"
}

# Define an AWS Lambda function resource.
resource "aws_lambda_function" "lambda" {
  function_name    = "${local.lambda_function_name}_function"
  filename         = data.archive_file.lambda_hc_zip.output_path
  source_code_hash = data.archive_file.lambda_hc_zip.output_base64sha256
  role             = aws_iam_role.lambda_role.arn
  handler          = "lambda_function.lambda_handler"
  runtime          = "python3.7"
  timeout          = 20

  vpc_config {
    subnet_ids         = local.subnet_id
    security_group_ids = local.security_group
  }
}

# Define an AWS CloudWatch event rule to schedule the Lambda function.
resource "aws_cloudwatch_event_rule" "lambda_event" {
name = "run_${local.lambda_function_name}_function"
description = "Schedule lambda function"
schedule_expression = "rate(1 minute)"
}

# Define an AWS CloudWatch event target to associate with the Lambda function.
resource "aws_cloudwatch_event_target" "lambda_function_target" {
target_id = "${local.lambda_function_name}_function_target"
rule = aws_cloudwatch_event_rule.lambda_event.name
arn = aws_lambda_function.lambda.arn
}

# Define an AWS Lambda permission to allow execution from CloudWatch events.
resource "aws_lambda_permission" "allow_cloudwatch" {
statement_id = "AllowExecutionFromCloudWatch"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.lambda.function_name
principal = "events.amazonaws.com"
source_arn = aws_cloudwatch_event_rule.lambda_event.arn
}
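
One assumption worth calling out: since the function is attached to private subnets, it can only reach the CloudWatch API if those subnets route out through a NAT gateway or if the VPC has an interface endpoint for the monitoring service. If you have no NAT gateway, a rough sketch of such an endpoint could look like this (the VPC ID below is a placeholder, and the service name assumes the eu-central-1 region used in this article):

# Optional: interface VPC endpoint so the Lambda function can reach the
# CloudWatch API from private subnets without a NAT gateway.
resource "aws_vpc_endpoint" "cloudwatch_monitoring" {
  vpc_id              = "vpc-000000" # replace with your VPC ID
  service_name        = "com.amazonaws.eu-central-1.monitoring"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = local.subnet_id
  security_group_ids  = local.security_group
  private_dns_enabled = true
}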

Next, we’ll set up CloudWatch alarms for the metrics in our CloudWatch LambdaCustomHC namespace. These alarms will play a crucial role in the Route 53 health check configuration, which we'll cover later.

If you haven't made any changes to the CloudWatch-related configuration in the lambda_function.py file, the cloudwatch.tf file should look like this:

resource "aws_cloudwatch_metric_alarm" "first_node_cw_alert" {
alarm_name = "test_node_1_lambda_hc_alert"
comparison_operator = "LessThanThreshold"
evaluation_periods = "1"
metric_name = "test_node_1_availability"
namespace = "LambdaCustomHC"
period = "60"
statistic = "Maximum"
threshold = "1"
alarm_description = "This metric monitors service availability"
}

resource "aws_cloudwatch_metric_alarm" "second_node_cw_alert" {
alarm_name = "test_node_2_lambda_hc_alert"
comparison_operator = "LessThanThreshold"
evaluation_periods = "1"
metric_name = "test_node_2_availability"
namespace = "LambdaCustomHC"
period = "60"
statistic = "Maximum"
threshold = "1"
alarm_description = "This metric monitors service availability"
}

resource "aws_cloudwatch_metric_alarm" "third_node_cw_alert" {
alarm_name = "test_node_3_lambda_hc_alert"
comparison_operator = "LessThanThreshold"
evaluation_periods = "1"
metric_name = "test_node_3_availability"
namespace = "LambdaCustomHC"
period = "60"
statistic = "Maximum"
threshold = "1"
alarm_description = "This metric monitors service availability"
}

For the final step in setting up health-check-related resources, we’ll create the Route 53 health checks. These health checks monitor the CloudWatch alarms we set up earlier: when an alarm goes into the ALARM state, the corresponding Route 53 health check becomes unhealthy, which initiates the failover functionality.

Include the resources below in the route53.tf file, modifying the local variables and the Route 53 health check resource tags as necessary:

# Define local variables for configuration settings.
locals {
  region = "eu-central-1"
}

# Create Route 53 health checks for the first, second, and third nodes.
resource "aws_route53_health_check" "first_route53_hc" {
type = "CLOUDWATCH_METRIC"
cloudwatch_alarm_name = aws_cloudwatch_metric_alarm.first_node_cw_alert.alarm_name
cloudwatch_alarm_region = "${local.region}"
insufficient_data_health_status = "LastKnownStatus"

tags = {
Name = "test_node_1_dns_hc"
}
}

resource "aws_route53_health_check" "second_route53_hc" {
type = "CLOUDWATCH_METRIC"
cloudwatch_alarm_name = aws_cloudwatch_metric_alarm.second_node_cw_alert.alarm_name
cloudwatch_alarm_region = "${local.region}"
insufficient_data_health_status = "LastKnownStatus"

tags = {
Name = "test_node_2_dns_hc"
}
}

resource "aws_route53_health_check" "third_route53_hc" {
type = "CLOUDWATCH_METRIC"
cloudwatch_alarm_name = aws_cloudwatch_metric_alarm.third_node_cw_alert.alarm_name
cloudwatch_alarm_region = "${local.region}"
insufficient_data_health_status = "LastKnownStatus"

tags = {
Name = "test_node_3_dns_hc"
}
}

With Metrics and HealthChecks in place, we can now proceed to set up DNS failover functionality.

Configuring Route 53 Resources:

This visual represents the state after completing the last step.

To finalize the configuration, we’ll set up Route 53 Private Zone and Failover Record resources. You’ll only need to adjust the local variables to make everything work seamlessly.

Include the resources below in the route53.tf file. After you adjust the zone_name, main_target_domain, and failover_target_domain local values, the FQDNs of the deployed failover records will look like this:

  • web.nodes.local — the entry point for your failover record.
  • web-failover.nodes.local — referenced only by the main failover record as its secondary alias target; don't use it as an entry point.

# Define local variables for configuration settings.
locals {
  ttl                    = 60
  zone_name              = "nodes.local"
  first_node_ip          = "10.0.0.1"
  second_node_ip         = "10.0.0.2"
  third_node_ip          = "10.0.0.3"
  main_target_domain     = "web"
  failover_target_domain = "web-failover"
  vpc_id                 = "vpc-000000"
}

# Create a private Route 53 zone associated with the specified VPC.
resource "aws_route53_zone" "private" {
  name = local.zone_name

  vpc {
    vpc_id = local.vpc_id
  }
}

# Create Route 53 records for main (primary and secondary) and failover (primary and secondary) targets.
resource "aws_route53_record" "main_primary" {
zone_id = aws_route53_zone.private.zone_id
name = "${local.main_target_domain}"
type = "A"
ttl = "${local.ttl}"

failover_routing_policy {
type = "PRIMARY"
}

set_identifier = "primary"
records = ["${local.first_node_ip}"]
health_check_id = aws_route53_health_check.first_route53_hc.id
}

resource "aws_route53_record" "main_secondary" {
zone_id = aws_route53_zone.private.zone_id
name = "${local.main_target_domain}"
type = "A"

failover_routing_policy {
type = "SECONDARY"
}

set_identifier = "secondary"

alias {
name = "${aws_route53_record.failover_primary.name}.${local.zone_name}"
zone_id = aws_route53_zone.private.zone_id
evaluate_target_health = false
}
}

resource "aws_route53_record" "failover_primary" {
zone_id = aws_route53_zone.private.zone_id
name = "${local.failover_target_domain}"
type = "A"
ttl = "${local.ttl}"

failover_routing_policy {
type = "PRIMARY"
}

set_identifier = "primary"
records = ["${local.second_node_ip}"]
health_check_id = aws_route53_health_check.second_route53_hc.id
}

resource "aws_route53_record" "failover_secondary" {
zone_id = aws_route53_zone.private.zone_id
name = "${local.failover_target_domain}"
type = "A"
ttl = "${local.ttl}"

failover_routing_policy {
type = "SECONDARY"
}

set_identifier = "secondary"
records = ["${local.third_node_ip}"]
health_check_id = aws_route53_health_check.third_route53_hc.id
}

To top it off, don’t forget to add the necessary providers in the main.tf file.

provider "aws" {
region = "eu-central-1"
}
provider "archive" {}

Applying and testing changes:

Run terraform init followed by terraform apply; once the resources are created, you should see output similar to this:

aws_route53_zone.private: Creating...
aws_cloudwatch_event_rule.lambda_event: Creating...
aws_iam_policy.iam_policy_for_lambda: Creating...
aws_iam_role.lambda_role: Creating...
aws_cloudwatch_metric_alarm.third_node_cw_alert: Creating...
aws_cloudwatch_metric_alarm.second_node_cw_alert: Creating...
aws_cloudwatch_metric_alarm.first_node_cw_alert: Creating...
aws_cloudwatch_metric_alarm.second_node_cw_alert: Creation complete after 0s [id=]
aws_cloudwatch_event_rule.lambda_event: Creation complete after 0s [id=]
aws_cloudwatch_metric_alarm.third_node_cw_alert: Creation complete after 0s [id=]
aws_route53_health_check.second_route53_hc: Creating...
aws_route53_health_check.third_route53_hc: Creating...
aws_cloudwatch_metric_alarm.first_node_cw_alert: Creation complete after 0s [id=]
aws_route53_health_check.first_route53_hc: Creating...
aws_iam_policy.iam_policy_for_lambda: Creation complete after 1s [id=arn:aws:iam:::policy/]
aws_iam_role.lambda_role: Creation complete after 1s [id=lambda_custom_hc_role]
aws_iam_role_policy_attachment.attachment: Creating...
aws_lambda_function.lambda: Creating...
aws_iam_role_policy_attachment.attachment: Creation complete after 0s [id=]
aws_route53_health_check.second_route53_hc: Creation complete after 2s [id=]
aws_route53_health_check.third_route53_hc: Creation complete after 2s [id=]
aws_route53_health_check.first_route53_hc: Creation complete after 2s [id=]
aws_route53_zone.private: Still creating... [10s elapsed]
aws_lambda_function.lambda: Still creating... [10s elapsed]
aws_lambda_function.lambda: Creation complete after 15s [id=]
aws_lambda_permission.allow_cloudwatch: Creating...
aws_cloudwatch_event_target.lambda_function_target: Creating...
aws_cloudwatch_event_target.lambda_function_target: Creation complete after 0s [id=]
aws_lambda_permission.allow_cloudwatch: Creation complete after 0s [id=]
aws_route53_zone.private: Still creating... [20s elapsed]
aws_route53_zone.private: Still creating... [30s elapsed]
aws_route53_zone.private: Creation complete after 40s [id=]
aws_route53_record.failover_secondary: Creating...
aws_route53_record.failover_primary: Creating...
aws_route53_record.main_primary: Creating...
aws_route53_record.failover_secondary: Still creating... [10s elapsed]
aws_route53_record.main_primary: Still creating... [10s elapsed]
aws_route53_record.failover_primary: Still creating... [10s elapsed]
aws_route53_record.main_primary: Still creating... [20s elapsed]
aws_route53_record.failover_secondary: Still creating... [20s elapsed]
aws_route53_record.failover_primary: Still creating... [20s elapsed]
aws_route53_record.failover_primary: Creation complete after 21s [id=]
aws_route53_record.main_secondary: Creating...
aws_route53_record.failover_secondary: Creation complete after 22s [id=]
aws_route53_record.main_primary: Creation complete after 29s [id=]
aws_route53_record.main_secondary: Still creating... [10s elapsed]
aws_route53_record.main_secondary: Creation complete after 16s [id=]

Apply complete! Resources: 18 added, 0 changed, 0 destroyed.

You should observe that web.nodes.local resolves to the correct IP address when queried from inside the VPC (for example, with dig web.nodes.local):

;; ANSWER SECTION:
web.nodes.local. 60 IN A 10.0.0.1

;; Query time: 4 msec
;; SERVER: 10.0.0.1#53(10.0.0.1)
;; WHEN: Mon Sep 11 09:10:14 UTC 2023
;; MSG SIZE rcvd: 59

When the 10.0.0.1 node becomes unhealthy, the DNS failover will be triggered, and you'll notice the following changes in the dig response:

;; ANSWER SECTION:
web.nodes.local. 60 IN A 10.0.0.2

;; Query time: 4 msec
;; SERVER: 10.0.0.1#53(10.0.0.1)
;; WHEN: Mon Sep 11 09:30:15 UTC 2023
;; MSG SIZE rcvd: 59

For enhanced observability, consider implementing additional functionality in your Python script to send messages to your monitoring platform. Additionally, setting up an SNS topic for email notifications would be a valuable addition.
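
As a rough sketch of the SNS idea (the topic name and e-mail address below are placeholders), you could create a topic with an e-mail subscription and point the CloudWatch alarms at it via alarm_actions:

# SNS topic and e-mail subscription for health check notifications.
resource "aws_sns_topic" "hc_notifications" {
  name = "lambda-hc-notifications"
}

resource "aws_sns_topic_subscription" "hc_email" {
  topic_arn = aws_sns_topic.hc_notifications.arn
  protocol  = "email"
  endpoint  = "ops@example.com" # replace with your address
}

# Then add to each aws_cloudwatch_metric_alarm resource:
#   alarm_actions = [aws_sns_topic.hc_notifications.arn]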

Thank you! 🙏
