How I Learned to Stop Worrying and Love the Access Control

Why are Access Control Lists so hard?!

The Maledictive Engineer
Aug 15, 2020

We’ve all been there. There’s a critical delivery, and application teams are in a panic, requesting resources at the last minute to meet a deadline they had months to plan for but are somehow still scrambling to hit. Project managers are whining to anyone in a position of power about how the infrastructure teams are standing in the way of meeting deadlines. Then the inevitable happens: a developer is unable to access a particular resource, and immediately the cry rises around the office. You read the indignation in the Slack messages, you hear the exasperation on the Zoom calls: someone failed to open up the ACL.

“How could this happen? I told you to copy the access from my other host!”

The subtext is clear. How could you be so incompetent? Why are Access Control Lists always wrong? Can’t you do your job?

And they are correct. We are terrible people. As network engineers, we do enjoy the power we hold over an organization. No engineering team is so reviled, yet so critical to the function of an organization (well, maybe security). No team has to understand so thoroughly how every aspect of the business operates, while at the same time having no fucking idea what anyone else is doing.

I need to understand how your application behaves and why it does what it does, but only at a very macro level. I need to know how your processes interact with other processes; I couldn’t give two fucks what the result of those processes is (unless, of course, they need to interact with a third process).

I digress. To understand why we suck at ACLs, we need to understand why ACLs are shit to begin with.

Problem #1:

ACLs are largely stored in device configurations. Routers, Firewalls, Load Balancers. Most of these things have configuration files that are best suited for very specific functions. Need to route traffic around your infrastructure, great, put in a router, establish your peers, set your route policies and steer that traffic baby. What that thing DOES is pretty clear. You can issue concise commands and get direct information about the operation of that device. Add policy enforcement to the mix and whoa there Bessy, you done fucked up.

Firewall configurations within the data center can (and will) quickly grow to hundreds of thousands of lines.

What if you have applications with huge latency dependencies, and someone at some point said, “OK great, let’s just use our switches to apply the access filters in TCAM at line speed”? Oh fuck you. Now you done fucked up twice. Hammering a square peg into a round hole with that one. Switches are not designed to manage massive ACLs; they often have restrictions on the amount of grouping they can accommodate (and trust me, you need to be able to implement a sane object-grouping strategy; more on this to come). Once a config starts to become unwieldy, the problems in managing your access policies are just going to snowball.

Through all my years of education, never once did a professor pause and say, “That is how an Access Control List functions. Now let’s talk about how to manage access control policies long-term, because if you don’t, hoo-boy, at some point in your career everyone is going to hate you.”

Problem #2:

Responsibility. Application owners are responsible for making the application work. They have a shit job too, so I don’t blame them on this one. The firewall rules that permit application communication are part of the application, yet I have never started as a network engineer at an organization that treats them that way. I’ve heard mythical stories of organizations that actually have a security team responsible for the administration of ACLs, but in practice it’s always been neteng’s responsibility. Security often gets to wield their giant random ban-hammer over the ability to implement a particular ACL, but they are always one step removed from the application teams and the network team. Having a security review and approval process exacerbates a bad ACL strategy. Being human beings, they are as shitty at their jobs as the rest of us. They will arbitrarily approve or deny ACLs depending on their mood, or whether their coffee was burnt. They carry their own air of superiority that comes with being the gatekeepers of pretty much everything anyone else is allowed to do (with the exclusion of their own team, of course).

Problem #3:

Risk aversion. Everyone is goddamn petrified of deleting firewall rules. You can check all the fucking netflow exports, packet captures, hit counters, whatever the fuck you want to prove to the application team, your boss, whoever, that an ACL is not being used. As soon as someone somewhere removes an ACL line and something breaks, suddenly no one will ever let you remove another firewall rule for any reason. So firewall configurations go stale. Often they are confusing as shit as well: obfuscated object groups with meaningless names, nested object groups that act like Russian dolls, mixed deny and permit rules spattered throughout the policy to make visual inspection of the configuration impossible. Compound that with generations of engineers adding rules with their own idiosyncrasies and bad habits and you have a recipe for disaster. You manage a firewall 1000 times with no impact, good for you, way to do your damn job. You fuck up once and miss a dependency or pull the wrong line, and that’s all anyone will remember. They don’t care how shitty and tedious managing ACLs is. Toughen up buttercup, don’t break my shit.

OK, so how do we fix it?

I’m glad you asked. Because it’s possible, and it’s actually not hard. It takes some buy-in from the organization, and there will be a period of pain to get to a sane policy. Often, if the existing rule set is already out of control, it turns into a years-long effort. Installing a new hardware platform is a great time to start a greenfield deployment of the new firewall ACL strategy. Let the legacy wither and die, and every time someone complains, steer them to the better way.

1. Understand that your firewall is going to manage access for lots of things for a very long time.

Physical network infrastructure often has expected lifecycles of 7+ years. Figure out a way to organize the chaos.

Develop an organization strategy that works and STICK TO IT, even if there is some pushback. Exceptions are the enemy here. If you ever want to get to where you are automating this rule set, think like a Mandalorian: “This is the way.”

Here are some general guidelines I have found effective:

  • Create object groups that have meaningful names. Don’t use change ticket numbers or other referential information that becomes impossible to decipher. If you look at large organizations, how many applications do you actually support…hundreds, thousands? In the scheme of things, the number of applications is not that large. I find that creating object groups referencing the application they support is best.
  • Name your object groups after the service they DIRECTLY support. Say application X (appX) consumes LDAP as part of its service. Even if the LDAP currently only supports appX, name the LDAP group for the LDAP service it is protecting, not the appX it is supporting by proxy. As soon as application Y (appY) comes along and also needs to talk to LDAP, you’ll understand why it was a good idea. Always think of an object group in relation to the simplest element the ACLs referencing that object group will be protecting.
  • Use object-group nesting effectively. Nests that are too deep are a nightmare to maintain. No nesting at all limits your ability to manage your ACL rule set effectively. I find that a 2-level nesting strategy works for 95% of situations. I never nest deeper than 3 levels.
  • Level 1 object group: contains groups of like things (e.g. appX web servers). This object group should contain no nested objects, and should contain every like entity the group represents. The objects within the group should be interchangeable in function for the scope of the object group.
  • Level 2 object group: an access group that only contains level 1 object groups. From our example above you would have the following structure:
Object Groups:
    appX_app_servers (level 1 object group)
        1.1.1.1
        1.1.1.2
        1.1.1.3
    appY_app_servers (level 1 object group)
        1.1.2.1
    LDAP_servers (level 1 object group)
        2.2.2.1
        2.2.2.2
    LDAP_server_sources (level 2 object group)
        appX_app_servers
        appY_app_servers
ACL:
    LDAP_server_sources > LDAP_servers TCP389, TCP636 Permit
  • Even if a level 1 object group only contains a single host, go ahead and follow the standard. Create the object group and then add that object group into any level 2 object group that will grant the access it requires.
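
To make the two-level convention concrete, here is a minimal Python sketch of the structure above. It is vendor-neutral and purely illustrative: the dictionary layout and the `validate` and `expand` helpers are assumptions of mine, not the syntax or API of any firewall platform. It checks that level 1 groups hold only host addresses and level 2 groups hold only level 1 groups, then flattens a level 2 group into the hosts it ultimately covers.

```python
from ipaddress import ip_address

# Level 1 object groups: interchangeable hosts only, no nesting.
LEVEL1 = {
    "appX_app_servers": ["1.1.1.1", "1.1.1.2", "1.1.1.3"],
    "appY_app_servers": ["1.1.2.1"],
    "LDAP_servers": ["2.2.2.1", "2.2.2.2"],
}

# Level 2 object groups: access groups that contain level 1 groups only.
LEVEL2 = {
    "LDAP_server_sources": ["appX_app_servers", "appY_app_servers"],
}


def validate(level1, level2):
    """Return a list of violations of the two-level convention."""
    problems = []
    for name, members in level1.items():
        for member in members:
            try:
                ip_address(member)
            except ValueError:
                problems.append(f"{name}: {member!r} is not a host address")
    for name, members in level2.items():
        for member in members:
            if member not in level1:
                problems.append(f"{name}: {member!r} is not a level 1 group")
    return problems


def expand(group, level1, level2):
    """Flatten a level 1 or level 2 group name into a sorted host list."""
    if group in level1:
        return sorted(level1[group], key=ip_address)
    hosts = set()
    for child in level2[group]:
        hosts.update(level1[child])
    return sorted(hosts, key=ip_address)


if __name__ == "__main__":
    print(validate(LEVEL1, LEVEL2))
    print(expand("LDAP_server_sources", LEVEL1, LEVEL2))
```

A check like `validate` is the kind of thing you could run before every change so that exceptions to the standard never sneak into the rule base.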

2. Work with the application teams to ensure they understand it is in their best interest to “own” the object-groups that represent their service.

Often application teams already have to create network diagrams of their application to use as a reference. Those diagrams are almost never detailed enough to include firewall object groups. If you can get the firewall object group names that represent the service nodes into a network diagram that the application team owns, you have won half the battle. Security will sign off on the design, and everyone is on the same page.

When the requests start coming in asking to “add 1.1.1.4 to appX_app_servers”, suddenly there is no ambiguity. This is a much better relationship with your application teams than getting a request that says “copy access from 1.1.1.2 for host 1.1.1.4”.

Which one of those conversations sounds easier to automate?

If you can get security sign-off on the firewall groups that represent an application during the POC and pre-prod deployments, there is never a question if the access will be permitted in the future. The access has already been permitted. Is 1.1.1.4 an appX_app_server? Yes, well then it automatically inherits all the access policies of the other appX_app_servers just by being placed into the object group that represents those servers.
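
As a rough illustration of why the group-membership request is the easy one to automate, here is a small Python sketch. Everything in it (the `GROUPS`, `ACCESS_GROUPS`, and `RULES` structures and the `add_host` and `inherited_access` helpers) is hypothetical and stands in for whatever your platform or tooling actually exposes. The point is only that the request becomes a single membership change, and the access the new host inherits can be read straight off the already-approved rules.

```python
# Level 1 object groups (hosts) and level 2 access groups (level 1 groups).
GROUPS = {
    "appX_app_servers": {"1.1.1.1", "1.1.1.2", "1.1.1.3"},
    "appY_app_servers": {"1.1.2.1"},
    "LDAP_servers": {"2.2.2.1", "2.2.2.2"},
}
ACCESS_GROUPS = {
    "LDAP_server_sources": ["appX_app_servers", "appY_app_servers"],
}

# Already-approved permit rules: (source group, destination group, services).
RULES = [
    ("LDAP_server_sources", "LDAP_servers", ("TCP389", "TCP636")),
]


def inherited_access(group):
    """List (destination, services) pairs granted to members of `group`."""
    granted = []
    for source, destination, services in RULES:
        if source == group or group in ACCESS_GROUPS.get(source, []):
            granted.append((destination, services))
    return granted


def add_host(host, group):
    """Handle a request of the form 'add <host> to <group>'."""
    if group not in GROUPS:
        raise KeyError(f"unknown object group: {group}")
    GROUPS[group].add(host)
    # No rule changes are needed; the host inherits what the group already has.
    return inherited_access(group)


if __name__ == "__main__":
    print(add_host("1.1.1.4", "appX_app_servers"))
```

Compare that with “copy access from 1.1.1.2”, where you first have to reverse-engineer every rule that host appears in before you can touch anything.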

3. Deny by default, permit by exception

The default access policy for security enforcement should be deny (edge ACLs are a different beast for a different conversation). You should fight tooth and nail to keep every line of the ACL a permit. There are situations where stacking a narrow permit over a broad deny can accomplish an ask without much thought, but it sets a bad precedent. If at all possible, stick to this rule. You can almost always accomplish the same access by only permitting the exact traffic, without having to resort to inserting a deny in the middle of your rule base. Having a strict but simple access policy will be much easier to automate down the road. Explicit permit, implicit deny all. That way the only thing in your ACL is the traffic you want, not the traffic you want to exclude.

A flip to this coin is negating access. Don’t do this either. Some firewalls allow you to use negation (NOT).

e.g. NOT appY_app_servers > LDAP_servers TCP389, TCP636 Permit

Check Point firewalls specifically come to mind, but I am sure there are others. This is just as bad as a deny buried in your rule base. You want to define good traffic, as opposed to excluding bad traffic. Let the default deny block everything.
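
If you want to enforce the permit-only rule mechanically, a simple lint pass over the rule base is one way to do it. The sketch below is a hypothetical Python model, not any vendor's rule format: the `Rule` dataclass, its `negate_source` flag, and the `lint` and `is_permitted` helpers are all names I made up for illustration. `lint` flags explicit denies and negated sources, and `is_permitted` shows the evaluation model for a clean rule base, where anything not explicitly permitted falls through to the implicit deny.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str                  # source object group name
    destination: str             # destination object group name
    service: str                 # e.g. "TCP389"
    action: str = "permit"
    negate_source: bool = False  # models a "NOT <group>" source


def lint(rules):
    """Flag rules that break 'explicit permit, implicit deny all'."""
    findings = []
    for number, rule in enumerate(rules, start=1):
        if rule.action != "permit":
            findings.append(f"rule {number}: explicit {rule.action}")
        if rule.negate_source:
            findings.append(f"rule {number}: negated source {rule.source}")
    return findings


def is_permitted(rules, source, destination, service):
    """Evaluate a lint-clean rule base: no matching permit means deny."""
    return any(
        (rule.source, rule.destination, rule.service)
        == (source, destination, service)
        for rule in rules
        if rule.action == "permit" and not rule.negate_source
    )


if __name__ == "__main__":
    rule_base = [
        Rule("LDAP_server_sources", "LDAP_servers", "TCP389"),
        Rule("LDAP_server_sources", "LDAP_servers", "TCP636"),
    ]
    print(lint(rule_base))
    print(is_permitted(rule_base, "LDAP_server_sources", "LDAP_servers", "TCP389"))
```

Because a clean rule base contains only permits, rule order stops mattering, which is a big part of why this style is easier to automate.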

Firewall ACL Tools:

There are many security products that attempt to do some of this work for you. I have used some of them with varying levels of success (and frustration): Cisco CSM, Cisco Tetration, Juniper Junos Space, Palo Alto Panorama. Each of these tools has its efficiencies, but build your solid foundation of access strategy, object-group membership, and shared ACL ownership first, and then allow these tools to handle some of the administration of implementing the policies in a sane way.

Just trying to do my part to make us network / security engineers a little less hated.

The Maledictive Engineer

Foul-mouthed Network Engineer focusing on Infrastructure Automation.