Building IT Systems with Well-Architected Framework & Infrastructure as Code (With practical AWS App)

Lokesh Aggarwal
8 min read · May 13, 2024

--

Introduction

In today’s dynamic cloud computing landscape, organizations require efficient, scalable, and secure IT infrastructure. Implementing a Well-Architected Framework (WAF) with Infrastructure as Code (IaC) is a powerful strategy to achieve these goals. This article explores the key components of this approach and its benefits for businesses.

What is the Well-Architected Framework?

Developed by AWS, the Well-Architected Framework provides best practices across six pillars:

  • Operational Excellence: Automating processes, implementing continuous monitoring, and optimizing resources for efficiency.
  • Security: Enforcing encryption, access control, and compliance standards, along with regular security audits for proactive risk mitigation.
  • Reliability: Building fault-tolerant systems with redundancy, conducting performance testing, and establishing disaster recovery plans.
  • Performance Efficiency: Optimizing databases, using caching, and implementing dynamic scaling for smooth performance under varying loads.
  • Cost Optimization: Analyzing resource usage, selecting cost-effective solutions, and leveraging cloud services for maximum ROI.
  • Sustainability: Minimizing the environmental impact of workloads, for example by maximizing utilization and reducing idle resources.

By aligning with these pillars, organizations ensure their infrastructure meets business objectives while adhering to industry best practices.

Benefits of the Well-Architected Framework

  • Enhanced Security: WAF promotes a secure infrastructure through its focus on access control, encryption, and compliance.
  • Improved Reliability: By building fault-tolerant systems and implementing disaster recovery plans, WAF minimizes downtime and data loss.
  • Optimized Performance: WAF helps achieve optimal performance through database optimization, caching, and dynamic scaling.
  • Reduced Costs: By analyzing resource usage and leveraging cost-effective solutions, WAF promotes cost-efficiency in cloud infrastructure.

Infrastructure as Code (IaC) Supercharges WAF

IaC treats infrastructure as software, enabling management and provisioning through code. This automates deployment and scaling, and ensures consistency and repeatability.

Benefits of Infrastructure as Code

  • Faster Provisioning and Deployment: IaC automates resource provisioning, significantly speeding up deployments.
  • Improved Consistency and Reliability: IaC reduces manual errors, leading to more consistent and reliable infrastructure configurations.
  • Scalability and Flexibility: IaC allows for easy adaptation to changing business needs by scaling infrastructure up or down as required.

Popular IaC Tools

  • Terraform: A leading IaC tool using a declarative language to define infrastructure as code.
  • AWS CloudFormation: An AWS-specific IaC tool for provisioning and managing resources using templates.
  • Ansible: A configuration management tool that automates infrastructure setup through playbooks.
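
To make the declarative style concrete, here is a minimal Terraform sketch. The provider version constraint, the region, and the bucket name are assumptions chosen purely for illustration; they are not part of the architecture described later.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed version constraint
    }
  }
}

provider "aws" {
  region = "us-east-1" # assumed region
}

# Declarative: describe the desired end state and let Terraform
# work out which API calls are needed to reach it.
resource "aws_s3_bucket" "example" {
  bucket = "example-iac-demo-bucket" # hypothetical name; must be globally unique

  tags = {
    ManagedBy = "terraform"
  }
}
```

Running terraform plan against this file shows the changes Terraform would make, and terraform apply carries them out. Re-running apply with no edits should report nothing to change, which is the repeatability benefit described above.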

Practical Example

Architecture of the e-commerce application

This setup depicts a multi-region, highly available three-tier e-commerce application on AWS that implements the Well-Architected Framework.

  • Regions: The app is hosted in two geographically non-adjacent regions (e.g. US East and EU West).
  • Internet Traffic: Inbound and outbound internet traffic is controlled through an internet gateway and route tables.
  • DDoS Protection: AWS Shield Advanced protects the web and app layers from attacks such as DDoS.
  • Front End: EC2 web servers are hosted in their own VPC, behind a load balancer and a reverse proxy. A Web Security Group and its rules control traffic ingress and egress.
  • Application Servers: EC2 app servers are hosted in their own VPC, behind a load balancer. An App Security Group and its rules control traffic ingress and egress.
  • AutoScaling: Autoscaling is implemented for web and app servers with timeout or threshold hooks, e.g. heartbeat timeout = 300 seconds, requests per second = 100.
  • Application Data: Stored in S3 buckets with encryption enabled, with keys stored in AWS KMS. CloudFront is set up to cache the data at edge locations.
  • Database Servers: Data is stored in a high-availability RDS cluster and replicated between regions using AWS DMS. Indexing and caching are set up to improve database performance.
  • Encryption of data in transit: Security groups allow inbound traffic only on port 443; port 80 rules are removed. Load balancers use these security groups and listen only on port 443. The CloudFront viewer protocol policy is set to redirect-to-https so that any HTTP request is redirected to HTTPS.
  • Backup and Recovery: Set up and configure AWS Backup for full and incremental backups of all components, including RDS databases. Store the backups in a backup vault, which can be made tamper-resistant with AWS Backup Vault Lock.
  • Patching: Implement AWS Systems Manager for automated OS patching. Define the patching steps in an SSM document and create a Patch Baseline that sets which patches are approved. Patch groups let the baseline target specific groups of instances, and associations can be created to trigger patching on chosen instances.
  • Application Performance: Implement Amazon CloudWatch Logs or the ELK Stack for logging, AWS X-Ray or Datadog for APM, and Prometheus to store metrics.

How the above architecture implements the Well-Architected Framework

1. Reliability

  • Regions and VPCs: EC2 Web Servers and App Servers hosted in separate VPCs provide isolation and improve fault tolerance.
  • Auto Scaling: Automatic scaling for Web and App Servers ensures high availability by adjusting resources based on demand.
  • High Availability RDS Cluster: Replicated database with AWS DMS ensures data availability in case of regional outages.
  • CloudFront: Improves availability by caching static content on geographically distributed edge locations.

2. Security

  • VPCs with Security Groups: Isolate resources within VPCs and control inbound/outbound traffic with security groups.
  • AWS Shield Advanced: Protects against DDoS attacks and improves overall security posture.
  • Encryption: Encryption of data at rest (S3 with KMS) and in transit (HTTPS with security groups and CloudFront redirect).
  • AWS Systems Manager Patching: Automates OS patching to address vulnerabilities and improve security.

3. Performance Efficiency

  • Load Balancers: Distribute traffic across Web and App Servers for optimal performance.
  • CloudFront: Reduces latency for users by caching static content closer to their location.
  • Database Optimization: Indexing and caching in RDS improve database performance.
  • Application Performance Monitoring (APM) with X-Ray or Datadog: Identifies bottlenecks and optimizes application performance.

4. Cost Optimization

  • Auto Scaling: Scales resources up or down based on demand, avoiding unnecessary costs.

5. Operational Excellence

  • CloudWatch Logs or ELK Stack: Centralized logging for easier troubleshooting and operational visibility.
  • Prometheus for Metrics: Stores application metrics for analysis and performance optimization.
  • AWS Backup: Automates backups of all components for disaster recovery.
  • AWS Systems Manager Patching: Automates OS patching, reducing manual intervention.

6. Sustainability

  • Spot Instances: Utilizing unused EC2 capacity can be a more sustainable option compared to on-demand instances.
  • Auto Scaling: Ensure Auto Scaling selects a right-sized instance type so that idle capacity, and the energy and cost that go with it, are kept to a minimum.

Sample Terraform Code Snippets

# Define VPCs for web and app tiers in each region
variable "region" {
  type = string
}

# List the Availability Zones currently available in the provider's region
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "web_vpc" {
  cidr_block = "10.0.0.0/16"
  tags = {
    Name = format("web-vpc-%s", var.region)
  }
}

# Secondary CIDR block for potential future expansion
resource "aws_vpc_ipv4_cidr_block_association" "web_vpc_secondary" {
  vpc_id     = aws_vpc.web_vpc.id
  cidr_block = "10.1.0.0/16"
}

resource "aws_vpc" "app_vpc" {
  cidr_block = "10.2.0.0/16"
  tags = {
    Name = format("app-vpc-%s", var.region)
  }
}

resource "aws_vpc_ipv4_cidr_block_association" "app_vpc_secondary" {
  vpc_id     = aws_vpc.app_vpc.id
  cidr_block = "10.3.0.0/16"
}

# Public subnets use /20 blocks 0-7 of the VPC range, one per Availability Zone
resource "aws_subnet" "web_public_subnet" {
  count                   = length(data.aws_availability_zones.available.names)
  vpc_id                  = aws_vpc.web_vpc.id
  cidr_block              = cidrsubnet(aws_vpc.web_vpc.cidr_block, 4, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
  tags = {
    Name = format("web-public-subnet-%s-%d", var.region, count.index + 1)
  }
}

# Private subnets use /20 blocks 8-15 so they never overlap the public ones
resource "aws_subnet" "web_private_subnet" {
  count             = length(data.aws_availability_zones.available.names)
  vpc_id            = aws_vpc.web_vpc.id
  cidr_block        = cidrsubnet(aws_vpc.web_vpc.cidr_block, 4, count.index + 8)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  tags = {
    Name = format("web-private-subnet-%s-%d", var.region, count.index + 1)
  }
}

resource "aws_subnet" "app_private_subnet" {
  count             = length(data.aws_availability_zones.available.names)
  vpc_id            = aws_vpc.app_vpc.id
  cidr_block        = cidrsubnet(aws_vpc.app_vpc.cidr_block, 4, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  tags = {
    Name = format("app-private-subnet-%s-%d", var.region, count.index + 1)
  }
}
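
The snippet above takes a single region variable, while the architecture calls for two regions. One way to bridge that, sketched here under assumptions, is to package the network code as a module and instantiate it once per region with aliased providers. The module path ./modules/network and the two region names are hypothetical.

```hcl
# One aliased provider per region (region names are assumed examples)
provider "aws" {
  alias  = "us_east"
  region = "us-east-1"
}

provider "aws" {
  alias  = "eu_west"
  region = "eu-west-1"
}

# Hypothetical module containing the VPC and subnet resources shown above
module "network_us_east" {
  source = "./modules/network"
  region = "us-east-1"

  providers = {
    aws = aws.us_east
  }
}

module "network_eu_west" {
  source = "./modules/network"
  region = "eu-west-1"

  providers = {
    aws = aws.eu_west
  }
}
```

Each module instance then creates its own VPCs and subnets in the region of the provider it receives, so both regions are built from the same definition.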

Web Servers with Load Balancer and Reverse Proxy

resource "aws_instance" "web_server" {
  count                  = 4
  ami                    = var.web_server_ami
  instance_type          = var.web_server_instance_type
  vpc_security_group_ids = [aws_security_group.web_sg.id]
  subnet_id              = aws_subnet.web_private_subnet[count.index % length(data.aws_availability_zones.available.names)].id

  # User data script to install and configure Nginx as a reverse proxy
  user_data = <<-EOF
    #!/bin/bash
    yum update -y
    yum install -y nginx
    systemctl start nginx
    # Configure Nginx to forward traffic to app servers using the internal load balancer DNS name
    # … (Add specific configuration based on your application)
    systemctl enable nginx
  EOF

  tags = {
    Name = format("web-server-%s", count.index + 1)
  }
}

resource "aws_lb" "web_lb" {
  name               = format("web-lb-%s", var.region)
  internal           = false # Internet-facing, since it is placed in the public subnets
  load_balancer_type = "application"
  security_groups    = [aws_security_group.web_sg.id]
  subnets            = aws_subnet.web_public_subnet[*].id
  # Health checks are configured on the target group, not on the load balancer
}

resource "aws_lb_target_group" "web_tg" {
  name     = format("web-tg-%s", var.region)
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.web_vpc.id

  health_check {
    timeout  = 5
    interval = 30
    path     = "/health"
    port     = "80" # Adjust port if your health check uses a different port
    protocol = "HTTP"
  }
}

resource "aws_lb_target_group_attachment" "web_tg_attachment" {
  count            = length(aws_instance.web_server)
  target_group_arn = aws_lb_target_group.web_tg.arn
  target_id        = aws_instance.web_server[count.index].id # Instance ID, matching the default "instance" target type
  port             = 80 # Adjust port if your application listens on a different port
}
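
The snippets refer to aws_security_group.web_sg without defining it, and the architecture states that load balancers listen only on port 443. A minimal sketch of both pieces could look like the following. The variable lb_certificate_arn is a hypothetical input. Because the snippets attach the same security group to the load balancer and the web servers while the targets listen on port 80, this sketch admits port 80 only from members of the group itself, never from the internet; that choice is an assumption, not something the article specifies.

```hcl
variable "lb_certificate_arn" {
  type        = string
  description = "ARN of an ACM certificate for the load balancer (hypothetical input)"
}

resource "aws_security_group" "web_sg" {
  name   = format("web-sg-%s", var.region)
  vpc_id = aws_vpc.web_vpc.id

  ingress {
    description = "HTTPS from the internet"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "HTTP only between members of this group (load balancer to web servers)"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    self        = true
  }

  egress {
    description = "Allow all outbound traffic"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# HTTPS listener: terminates TLS and forwards to the web target group
resource "aws_lb_listener" "web_https" {
  load_balancer_arn = aws_lb.web_lb.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = var.lb_certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web_tg.arn
  }
}
```

Using separate security groups for the load balancer and the instances would be a stricter alternative; the single group is kept here only to match the existing snippets.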

S3 Buckets and CDN

resource "aws_s3_bucket" "static_content" {
  bucket = format("your-ecommerce-app-%s-static", var.region)
  # Buckets are private by default; the public access block below enforces it

  tags = {
    Name = "static-content"
  }
}

# In AWS provider v4 and later, bucket encryption is configured in its own resource
resource "aws_s3_bucket_server_side_encryption_configuration" "static_content" {
  bucket = aws_s3_bucket.static_content.id

  rule {
    apply_server_side_encryption_by_default {
      kms_master_key_id = aws_kms_key.content_encryption_key.arn
      sse_algorithm     = "aws:kms"
    }
  }
}

# Keys created with aws_kms_key are always customer managed
resource "aws_kms_key" "content_encryption_key" {
  description = "Customer managed key for S3 bucket encryption"
}

resource "aws_s3_bucket_public_access_block" "static_content_block" {
  bucket = aws_s3_bucket.static_content.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_cloudfront_distribution" "static_content_cdn" {
  enabled = true

  origin {
    # Use the regional REST endpoint; the website endpoint only works for public buckets
    domain_name = aws_s3_bucket.static_content.bucket_regional_domain_name
    origin_id   = "s3-origin"
  }

  restrictions {
    geo_restriction {
      restriction_type = "whitelist"
      locations        = ["US", "GB"] # Allow access from specific countries (optional)
    }
  }

  default_cache_behavior {
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]
    min_ttl                = 3600 # Minimum cache time for objects in seconds
    target_origin_id       = "s3-origin"
    viewer_protocol_policy = "redirect-to-https" # Enforce HTTPS

    # Static content: do not forward query strings or cookies to the origin
    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }
  }

  viewer_certificate {
    # CloudFront only accepts ACM certificates issued in us-east-1
    acm_certificate_arn = aws_acm_certificate.static_content_cert.arn
    ssl_support_method  = "sni-only"
  }

  logging_config {
    bucket = "your-cloudfront-access-logs.s3.amazonaws.com" # Replace with your log bucket's domain name
    prefix = format("cloudfront-%s", var.region)
  }

  tags = {
    Name = "static-content-cdn"
  }
}

resource "aws_acm_certificate" "static_content_cert" {
  domain_name       = format("*.your-ecommerce-app-%s.com", var.region)
  validation_method = "DNS"

  # Also cover the apex domain, which the wildcard does not match
  subject_alternative_names = [
    format("your-ecommerce-app-%s.com", var.region),
  ]
}
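
Because the bucket blocks all public access, CloudFront needs explicit permission to read from it. One common approach is an origin access control plus a bucket policy scoped to the distribution; the sketch below assumes that approach, and its resource names (static_content, static_content_read, AllowCloudFrontRead) are illustrative rather than taken from the article.

```hcl
# Lets CloudFront sign its requests to the S3 origin
resource "aws_cloudfront_origin_access_control" "static_content" {
  name                              = "static-content-oac"
  origin_access_control_origin_type = "s3"
  signing_behavior                  = "always"
  signing_protocol                  = "sigv4"
}

# Allow only this distribution to read objects from the bucket
data "aws_iam_policy_document" "static_content_read" {
  statement {
    sid       = "AllowCloudFrontRead"
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.static_content.arn}/*"]

    principals {
      type        = "Service"
      identifiers = ["cloudfront.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "AWS:SourceArn"
      values   = [aws_cloudfront_distribution.static_content_cdn.arn]
    }
  }
}

resource "aws_s3_bucket_policy" "static_content" {
  bucket = aws_s3_bucket.static_content.id
  policy = data.aws_iam_policy_document.static_content_read.json
}
```

For this to take effect, the distribution's origin block would also need origin_access_control_id set to aws_cloudfront_origin_access_control.static_content.id. Since the objects are encrypted with a customer managed KMS key, the key policy would additionally have to let the CloudFront service principal decrypt them.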

Auto-Scaling

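resource "aws_autoscaling_group" "web_asg" {
  name                = format("web-asg-%s", var.region)
  min_size            = 2
  max_size            = 4
  desired_capacity    = 2
  vpc_zone_identifier = aws_subnet.web_private_subnet[*].id

  health_check_type = "ELB"
  target_group_arns = [aws_lb_target_group.web_tg.arn] # Attach to the target group so ELB health checks apply

  # A launch template is required; aws_launch_template.web_lt is assumed to be defined elsewhere
  launch_template {
    id      = aws_launch_template.web_lt.id
    version = "$Latest"
  }

  initial_lifecycle_hook {
    name                    = "web-server-lifecycle-hook"
    lifecycle_transition    = "autoscaling:EC2_INSTANCE_LAUNCHING" # Assumed: hook runs at instance launch
    default_result          = "CONTINUE" # Valid values are CONTINUE and ABANDON
    heartbeat_timeout       = 300
    notification_target_arn = aws_sns_topic.asg_notifications.arn
    role_arn                = aws_iam_role.asg_lifecycle_role.arn
  }

  timeouts {
    delete = "300s"
  }

  # Auto Scaling groups use tag blocks rather than a tags map
  tag {
    key                 = "Name"
    value               = "web-asg"
    propagate_at_launch = true
  }
}

resource "aws_autoscaling_policy" "web_scaling_policy" {
  name                   = format("web-asg-request-scaling-policy-%s", var.region)
  autoscaling_group_name = aws_autoscaling_group.web_asg.name
  policy_type            = "TargetTrackingScaling"

  # Target tracking uses an instance warm-up period instead of a cooldown
  estimated_instance_warmup = 300

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      # Identifies the load balancer and target group the metric is read from
      resource_label = "${aws_lb.web_lb.arn_suffix}/${aws_lb_target_group.web_tg.arn_suffix}"
    }
    target_value = 100.0 # Adjust target value based on your application load
  }
}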
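
An Auto Scaling group needs a launch template (or launch configuration) that describes the instances it starts, and the snippets do not show one. The sketch below is one possible shape, reusing the AMI, instance type, and security group references from the web server snippet. The resource name web_lt and the shortened user data script are assumptions.

```hcl
resource "aws_launch_template" "web_lt" {
  name_prefix            = "web-lt-"
  image_id               = var.web_server_ami
  instance_type          = var.web_server_instance_type
  vpc_security_group_ids = [aws_security_group.web_sg.id]

  # Launch templates expect user data to be base64 encoded
  user_data = base64encode(<<-EOF
    #!/bin/bash
    yum install -y nginx
    systemctl enable --now nginx
  EOF
  )

  # Tag every instance launched from this template
  tag_specifications {
    resource_type = "instance"

    tags = {
      Name = "web-server"
    }
  }
}
```

The group would reference it through a launch_template block that sets id to aws_launch_template.web_lt.id and a version such as "$Latest".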

SSM Configuration for Patching

resource "aws_ssm_document" "os_patching" {
  name          = "OSPatchingDocument"
  document_type = "Command"

  # jsonencode keeps the content valid JSON while still allowing HCL comments
  content = jsonencode({
    schemaVersion = "2.2"
    description   = "Patch Operating System"
    mainSteps = [
      {
        action       = "aws:runShellScript"
        name         = "installMissingPackages"
        precondition = { StringEquals = ["platformType", "Linux"] }
        inputs = {
          runCommand = ["yum update -y"] # Adjust for your package manager
        }
      }
      # Command documents have no dedicated reboot action; trigger any reboot from the script itself
    ]
  })

  # … other configuration options for approvals and scheduling
}

resource "aws_ssm_patch_baseline" "os_patching_baseline" {
  name             = "os-patching-baseline"
  operating_system = "AMAZON_LINUX_2" # Assumed OS; adjust to match your instances

  approval_rule {
    approve_after_days = 1 # Adjust approval window if needed

    # At least one patch_filter is required; these values are illustrative
    patch_filter {
      key    = "CLASSIFICATION"
      values = ["Security"]
    }
  }

  # … optional source blocks for alternate patch sources (Linux only)
  # … other configuration options for global filters
}

# Target specific instance groups (optional) through a patch group.
# Instances join the group by carrying a "Patch Group" tag with this value.
resource "aws_ssm_patch_group" "web_servers" {
  baseline_id = aws_ssm_patch_baseline.os_patching_baseline.id
  patch_group = "Web Servers"
}

# Optional: run the patching document on specific instances
resource "aws_ssm_association" "os_patching_association" {
  name             = aws_ssm_document.os_patching.name
  association_name = "os-patching"

  targets {
    key    = "InstanceIds"
    values = aws_instance.web_server[*].id # Adjust instance selection
  }

  # … define schedule_expression and parameters for scheduling and approvals if needed
}
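
The Backup and Recovery item in the architecture has no snippet of its own. A minimal AWS Backup sketch, under stated assumptions, could look like the following. The vault and plan names, the schedule, the retention periods, the Backup = "true" tag, and the backup_role_arn variable are all hypothetical values chosen for illustration.

```hcl
variable "backup_role_arn" {
  type        = string
  description = "IAM role that AWS Backup assumes (hypothetical input)"
}

resource "aws_backup_vault" "main" {
  name = "ecommerce-backup-vault"
}

# Vault Lock constrains how recovery points in the vault can be deleted
resource "aws_backup_vault_lock_configuration" "main" {
  backup_vault_name  = aws_backup_vault.main.name
  min_retention_days = 7
  max_retention_days = 90
}

resource "aws_backup_plan" "daily" {
  name = "ecommerce-daily-backup"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 3 * * ? *)" # Every day at 03:00 UTC

    lifecycle {
      delete_after = 35 # Days to keep each recovery point
    }
  }
}

# Back up every supported resource tagged Backup = "true"
resource "aws_backup_selection" "tagged" {
  name         = "tagged-resources"
  plan_id      = aws_backup_plan.daily.id
  iam_role_arn = var.backup_role_arn

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "true"
  }
}
```

Selecting by tag means new resources such as RDS instances are picked up as soon as they carry the tag, without editing the plan.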

Conclusion

Combining a Well-Architected Framework with Infrastructure as Code empowers organizations to build robust, secure, and scalable IT environments. This approach fosters efficiency, agility, and innovation, allowing businesses to adapt to evolving technology trends and achieve their strategic goals.

Please comment with your inputs on how the architecture can be further improved.

Lokesh Aggarwal

Lokesh is a technology blogger with a passion for Artificial Intelligence (AI), Applications, Service and Program Management in the enterprise world.