Demystifying AWS NACL

Koushik Shom Choudhury
8 min read · Apr 18, 2020

--

Apart from EC2 Security Groups, a crucial distributed firewall component of an AWS VPC is the Network Access Control List, or Network ACL (NACL). The idea is dead simple: what Security Groups do for your individual EC2 instances, a NACL does (well, somewhat) for your VPC Subnets.

The AWS NACL documentation gets off to a dramatic start with:

A network access control list (ACL) is an optional layer of security for your VPC that acts as a firewall for controlling traffic in and out of one or more subnets.

Well, they also mention what implementing this optional layer actually brings to the architecture, namely a generic layer of security:

Network ACLs act as a firewall for associated subnets, controlling both inbound and outbound traffic at the subnet level.

What does it bring to the table that Security Groups don’t? Well, besides being a stateless firewall for Subnets, Network ACLs can explicitly block specific IP addresses, which Security Groups cannot do because they only support allow rules.
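To make the "block a specific IP" point concrete, here is a minimal sketch of how a NACL reads its rule list: lowest rule number first, first match wins, and anything unmatched is denied. It is a toy model rather than anything AWS ships; the rule numbers and both addresses (taken from documentation-only ranges) are made up for illustration.

```python
import ipaddress

# Hypothetical inbound rules: (rule number, CIDR, action).
# 203.0.113.7 is an invented "bad" address from a documentation-only range.
RULES = [
    (100, "203.0.113.7/32", "DENY"),
    (200, "0.0.0.0/0", "ALLOW"),
]

def evaluate(rules, source_ip):
    """Walk the rules in ascending rule-number order; the first match decides."""
    addr = ipaddress.ip_address(source_ip)
    for _number, cidr, action in sorted(rules):
        if addr in ipaddress.ip_network(cidr):
            return action
    return "DENY"  # the implicit catch-all rule at the end of every NACL

print(evaluate(RULES, "203.0.113.7"))    # DENY
print(evaluate(RULES, "198.51.100.20"))  # ALLOW
```

A Security Group has no equivalent of rule 100 here, since it has no DENY action to put in front of a broader ALLOW.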

Before we begin, it would be real bliss if you’re already familiar with the following:

  1. Launching an EC2 Instance into desired Subnet — To make it Public/Private.
  2. Using EC2 KeyPairs(.pem) to SSH into EC2 Instances
  3. SSH Agent Forwarding — To access other Instances without manually carrying the KeyPair
  4. Setting up a Custom VPC and Route Tables for routing traffic into and out of the VPC
  5. Concept of Ephemeral Ports (try not to skip this one)
  6. Using OpenSSH on Linux (that’s what I’ll be using throughout)
  7. Installing and setting up Apache — To serve Web content

NOTE: You can use PuTTY for Agent Forwarding and KeyPair registration. I’ll be using OpenSSH.

Tiny disclaimer

  1. I’ve installed Apache on the public EC2 instance and created a simple HTML page at /var/www/html as index.html with a single line of text: “Website accessed”
  2. I’ve created a security group named allow-all-sg which basically allows any sort of traffic from any IP on any Port, to keep things simple since we’re focusing on NACL and not Security Groups. allow-all-sg will serve as the Security Group of all instances in our architecture.
Inbound Rules of allow-all-sg
Outbound Rules of allow-all-sg

Setting Up the VPC

  1. A VPC named koushikVPC (yeah, that’s me :P)
  2. An Internet Gateway to forward traffic from Public Subnet
  3. A Public Subnet (CIDR: 192.168.0.0/26) in AZ us-east-1a.
  4. A Private Subnet (CIDR: 192.168.0.192/27) in AZ us-east-1b.
  5. A Route Table PrivateRT associated with the Private Subnet, with the IPv4 route (Destination, Target, Propagated): 192.168.0.0/24, local, No
  6. A Route Table PublicRT associated with the Public Subnet, with the IPv4 routes (Destination, Target, Propagated): 192.168.0.0/24, local, No; 0.0.0.0/0, igw-internet-gateway-id, No
  7. A NACL named StrictNACL that currently allows no traffic (Denies All)
PrivateRT & its routes
Inbound Rules of StrictNACL

Setting Up EC2 Instances

  1. Instance launched into Public Subnet, named WebServer. It received a Private IP: 192.168.0.42 and a Public IP: 54.89.154.80
  2. Instance launched into Private Subnet, named DBServer. It received a Private IP: 192.168.0.215
  3. Both are linked to an EC2 KeyPair called koushik.pem (yeah, that’s me!)
  4. Both are attached to Security Group ‘allow-all-sg’
EC2 instances
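The addressing above can be sanity-checked with Python’s standard ipaddress module. This is only a quick sketch using the CIDRs and IPs listed in this walkthrough; it derives the Private Subnet’s /27 block from DBServer’s address instead of hard-coding it.

```python
import ipaddress

vpc = ipaddress.ip_network("192.168.0.0/24")
public_subnet = ipaddress.ip_network("192.168.0.0/26")
# ip_interface accepts an address with host bits set and yields its /27 block.
private_subnet = ipaddress.ip_interface("192.168.0.215/27").network

web_server = ipaddress.ip_address("192.168.0.42")
db_server = ipaddress.ip_address("192.168.0.215")

print(private_subnet)                                                # 192.168.0.192/27
print(public_subnet.subnet_of(vpc), private_subnet.subnet_of(vpc))   # True True
print(web_server in public_subnet, db_server in private_subnet)      # True True
print(public_subnet.overlaps(private_subnet))                        # False
```

Both Subnets sit inside the VPC’s 192.168.0.0/24 range without overlapping, which is why the single local route is enough for WebServer and DBServer to find each other.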

Setting Up Local Machine

  1. I’m using a Linux Distribution with OpenSSH installed
  2. I’ve used the ssh-add command to register my KeyPair koushik.pem for the current session
  3. I’ve set up OpenSSH agent forwarding
SSH setup

That’s all! For the setup part though :P Up until now, we have a Security Group that allows everything to pass through and a NACL that allows nothing. And we get the following:

Initial Architecture

To test the NACL, we’re going to SSH into our WebServer and face an inevitable setback:

SSH into WebServer

Yeah, I’ve fairly cheated.

ssh <USER>@<IP> -o ConnectTimeout=<SEC>

waits only 10 seconds (the value I used for <SEC>) before giving up on the remote server. My internet connection is fast enough that a reachable server would answer well within 10 seconds, hence I gambled. If you’ve no problem waiting, you can remove the -o ConnectTimeout=<SEC> part and wait until your SSH client times out on its own (it then falls back to the system’s TCP connection timeout, which can be a minute or more).
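The same kind of check can be done without an SSH client at all. Below is a rough sketch of a TCP probe with a timeout, using only the standard socket module; the function name probe and its return labels are my own invention. A NACL that denies traffic drops packets silently, so the expected symptom is a timeout rather than a refused connection.

```python
import socket

def probe(host, port, timeout=10.0):
    """Attempt a TCP connect and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"       # nothing came back: packets dropped on the way
    except ConnectionRefusedError:
        return "refused"       # host reachable, but nothing listens on the port
    except OSError:
        return "unreachable"   # any other network-level failure

# Example call, using WebServer's Public IP from this walkthrough:
# print(probe("54.89.154.80", 22))
```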

So where did things go wrong in the VPC which led to timeout?

Traffic denied as per NACL Rules

As we can see, no traffic (Inbound or Outbound) is allowed to flow between the NACL and any of the Subnets. This is because the only rule in StrictNACL so far, Inbound as well as Outbound, is the default DENY ALL.

How to resolve?

We’re going to impose a balanced set of rules that allows the desired traffic but keeps unwanted traffic from passing through the NACL to the Public Subnet. So we add the Inbound Rule:

100 SSH(22) TCP(6) 22 0.0.0.0/0 ALLOW

to allow SSH traffic to flow in from NACL to associated Subnets and connect to Port 22 of an Instance in any of the associated Subnets. And the following Outbound Rule:

100 Custom TCP Rule TCP(6) 1024–65535 0.0.0.0/0 ALLOW

to allow Instances in associated Subnets to respond to Ephemeral Ports of requesting machines.
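Here is a small sketch of what these two rules mean for one SSH session. It models a rule as (rule number, first port, last port, CIDR, action) and uses 65535, the top of the TCP port range, as the upper bound of the Ephemeral range. The client IP and its Ephemeral Port are invented for the example.

```python
import ipaddress

INBOUND = [(100, 22, 22, "0.0.0.0/0", "ALLOW")]
OUTBOUND = [(100, 1024, 65535, "0.0.0.0/0", "ALLOW")]

def evaluate(rules, remote_ip, dest_port):
    """First matching rule wins, lowest rule number first; otherwise DENY."""
    addr = ipaddress.ip_address(remote_ip)
    for _number, low, high, cidr, action in sorted(rules):
        if low <= dest_port <= high and addr in ipaddress.ip_network(cidr):
            return action
    return "DENY"

client_ip, client_port = "198.51.100.20", 50000  # hypothetical laptop

# The request travels to port 22; the reply travels to the client's Ephemeral Port.
print(evaluate(INBOUND, client_ip, 22))             # ALLOW
print(evaluate(OUTBOUND, client_ip, client_port))   # ALLOW
# With no Outbound rule the reply is dropped, even though the request got in.
print(evaluate([], client_ip, client_port))         # DENY
```

The last line is the stateless part: a NACL does not remember that it let the request in, so the reply needs a rule of its own.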

NOTE: If the Public IP of your machine does not change (mine does; it is assigned dynamically by my ISP’s DHCP systems), you can go ahead and use your own Public IP instead of 0.0.0.0/0 to make this more secure.

Now let’s try to SSH into our WebServer again.

Successfully SSHed into WebServer

And there goes our first triumph. (Champagne, please :P)

That’s all fine. Let’s browse a Web page now (read #1 of the disclaimer) by simply navigating to the Public IP of the WebServer Instance.

Couldn’t browse our Website

That’s not good! But we allowed some traffic right?

Well, no! We allowed (as per Inbound Rule #100) only SSH traffic to flow in, and browsing a web page over plain (unsecured) HTTP requires allowing HTTP traffic on Port 80 of the host (in our case, the WebServer). So, we add a new Inbound Rule:

200 HTTP(80) TCP(6) 80 0.0.0.0/0 ALLOW

Let’s try again:

Website is now browsable

Wow! Victory!

Q: But, we didn’t add any new Outbound Rule, then….why…ummmm???

Yeah sure. Outbound Rule #100 says that, no matter what the source IP or source Port of the traffic is, anything leaving for an Ephemeral Port anywhere in the world (0.0.0.0/0) is allowed out. The HTTP reply from WebServer goes to the browser’s Ephemeral Port, so it is already covered. Inbound Rule #200, in turn, ensures that the HTTP request can reach any Instance within the associated Subnets in the first place.
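The answer can be checked with a toy rule evaluator. The sketch below uses rules of the form (rule number, first port, last port, CIDR, action) and an invented browser address and Ephemeral Port. The thing to notice is that the reply is matched on its destination Port, so the fact that it comes from Port 80 plays no role.

```python
import ipaddress

INBOUND = [
    (100, 22, 22, "0.0.0.0/0", "ALLOW"),
    (200, 80, 80, "0.0.0.0/0", "ALLOW"),
]
OUTBOUND = [(100, 1024, 65535, "0.0.0.0/0", "ALLOW")]

def evaluate(rules, remote_ip, dest_port):
    """First matching rule wins, lowest rule number first; otherwise DENY."""
    addr = ipaddress.ip_address(remote_ip)
    for _number, low, high, cidr, action in sorted(rules):
        if low <= dest_port <= high and addr in ipaddress.ip_network(cidr):
            return action
    return "DENY"

browser_ip, browser_port = "198.51.100.20", 51514  # hypothetical browser

print(evaluate(INBOUND[:1], browser_ip, 80))         # DENY  (before Rule #200)
print(evaluate(INBOUND, browser_ip, 80))             # ALLOW (after Rule #200)
print(evaluate(OUTBOUND, browser_ip, browser_port))  # ALLOW (reply, Rule #100)
```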

Completing the Architecture

Have a look at current state of our architecture

Current architecture

Notice the cut at the end of the line joining Subnets and NACL, closer to the NACL?

That’s our next move!

NOTE: I have SSH agent forwarding enabled, and I’m proceeding on the assumption that you do too.

Let’s try to SSH into the Private Instance, DBServer. Since DBServer has no public IP (being within a Private Subnet), we’ll SSH into it from WebServer and not directly from our local machine.

SSH request to DBServer timed out

Oops! That’s a setback.

Q: I’ve setup NACL correctly to allow SSH! Then why can’t I connect?

That’s exactly because when traffic originates from behind the NACL (that is, from the internal, Subnet-facing side), the Inbound & Outbound Surfaces of the NACL are reversed: the request has to leave the Subnet first, so it meets the Outbound rules before anything else. Given the position of the NACL in the VPC, that is natural, I guess. So we’re going to view a new and slightly modified exposure of the NACL:

NACL, exposed

As we can see, the Outbound Surface receives the requesting traffic from the Subnets, and it currently only allows traffic for Ephemeral Ports to pass through. Hence, we add the following Outbound Rule:

200 SSH(22) TCP(6) 22 192.168.0.0/24 ALLOW

to allow SSH connection traffic only towards addresses within the VPC (0.0.0.0/0 would make no difference in our architecture, but it is better to be precise when allowing IPs).

Now let’s try again to SSH into DBServer:

SSH request to DBServer timed out, again

:( But what’s wrong now?

Let us take a look at the present state of architecture once again

Notice that the Outbound Surface now allows SSH traffic, which means the VPC router will successfully route the SSH connection request to Port 22 of DBServer, and DBServer will eventually respond to the Ephemeral Port of the requesting machine (which is the WebServer). This response traffic is routed by the VPC router and falls onto the Inbound Surface of the NACL, where it gets discarded because no Inbound Rule allows traffic for Ephemeral Ports from any IP to pass through.

So we add the following Inbound Rule:

300 Custom TCP Rule TCP(6) 1024–65535 192.168.0.0/24 ALLOW

to allow traffic meant for Ephemeral Ports, coming from Instances within the VPC, to flow through (again, as per the present setup 0.0.0.0/0 would have made no difference).

And our architecture now looks like:

Allows Ephemeral ports on Inbound Surface

Now, we try again to SSH into DBServer:

Successfully SSHed into DBServer

Voila! And that’s how we NACL.
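The three attempts at reaching DBServer can be replayed in a few lines. This is a sketch under stated assumptions: StrictNACL is associated with both Subnets, every packet is checked against the Outbound rules of the Subnet it leaves and the Inbound rules of the Subnet it enters, and WebServer happens to use Ephemeral Port 40000 (an arbitrary pick). Rules are (rule number, first port, last port, CIDR, action).

```python
import ipaddress

def evaluate(rules, remote_ip, dest_port):
    """First matching rule wins, lowest rule number first; otherwise DENY."""
    addr = ipaddress.ip_address(remote_ip)
    for _number, low, high, cidr, action in sorted(rules):
        if low <= dest_port <= high and addr in ipaddress.ip_network(cidr):
            return action
    return "DENY"

WEB, DB, VPC = "192.168.0.42", "192.168.0.215", "192.168.0.0/24"

def ssh_hop(inbound, outbound, reply_port=40000):
    """The four NACL checks for WebServer -> DBServer SSH and its reply."""
    return {
        "request out of Public Subnet": evaluate(outbound, DB, 22),
        "request into Private Subnet": evaluate(inbound, WEB, 22),
        "reply out of Private Subnet": evaluate(outbound, WEB, reply_port),
        "reply into Public Subnet": evaluate(inbound, DB, reply_port),
    }

inbound = [(100, 22, 22, "0.0.0.0/0", "ALLOW"), (200, 80, 80, "0.0.0.0/0", "ALLOW")]
outbound = [(100, 1024, 65535, "0.0.0.0/0", "ALLOW")]

first = ssh_hop(inbound, outbound)
outbound.append((200, 22, 22, VPC, "ALLOW"))      # Outbound Rule #200
second = ssh_hop(inbound, outbound)
inbound.append((300, 1024, 65535, VPC, "ALLOW"))  # Inbound Rule #300
third = ssh_hop(inbound, outbound)

for name, result in (("first", first), ("second", second), ("third", third)):
    blocked = [step for step, verdict in result.items() if verdict == "DENY"]
    print(name, "attempt blocked at:", blocked or "nothing")
```

In this model the first attempt is stopped on the way out, the second on the way back in, and the third passes all four checks, which matches the two timeouts and the final success above.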

Loopholes of current setup/architecture

  1. Anyone can SSH into Public Instances because we’ve allowed 0.0.0.0/0 to send SSH requests from the Internet into our Public Subnet
  2. All rules that apply to the Public Subnet also apply to the Private Subnet. To avoid this, it is safer to use two NACLs, one per Subnet.
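The first loophole can be narrowed by pinning the SSH rule to a single address. The sketch below only builds the keyword arguments in the shape that boto3’s EC2 client expects for replace_network_acl_entry; it does not call AWS. The NACL id acl-EXAMPLE and the address 203.0.113.10 are placeholders, and the helper name ssh_rule_for is my own.

```python
def ssh_rule_for(acl_id, my_ip):
    """Keyword arguments for an inbound SSH rule limited to one address."""
    return {
        "NetworkAclId": acl_id,
        "RuleNumber": 100,
        "Protocol": "6",                      # TCP
        "RuleAction": "allow",
        "Egress": False,                      # False means an inbound rule
        "CidrBlock": f"{my_ip}/32",           # a single host instead of 0.0.0.0/0
        "PortRange": {"From": 22, "To": 22},
    }

params = ssh_rule_for("acl-EXAMPLE", "203.0.113.10")
print(params["CidrBlock"])  # 203.0.113.10/32

# With boto3 installed and AWS credentials configured, the existing
# Inbound Rule #100 could then be overwritten with:
#   import boto3
#   boto3.client("ec2").replace_network_acl_entry(**params)
```

This only helps if your Public IP stays put, as mentioned in the earlier NOTE.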

Conclusion

When the Traffic Origin is the internet (or at least external to the concerned VPC), the Inbound & Outbound surfaces work as labelled, but when the Traffic Origin is somewhere within the concerned VPC, the Outbound & Inbound surfaces reverse their roles.

PS: I’ll be upgrading the architecture by integrating NAT Instances & placing two separate NACLs, and thereby configuring them. So, stay tuned.
