IPTables and Docker

Edouard Buschini
Feb 24, 2019


In this post I will be talking about the nightmare of all Ops people that have to deal with Docker.

We all know that Docker is awesome.

It makes our lives really easy, but there is one problem: it works with IPTables which, for those who don’t know, is the default firewall on Linux.

Docker creates IPTables rules for you, and they become really hard to manage once you install Docker in production and need to control what goes in and out of your server.

The issue

Let’s say you have a container that listens on port 443. You only want to allow traffic from your load balancers, as they handle some of the security for you. Nothing really hard, right?

The naive approach is to create a rule on the default INPUT chain that looks something like this:

iptables -A INPUT -p tcp --dport 443 -s 172.16.0.0/26 -m state --state NEW,ESTABLISHED -j ACCEPT

This rule says: allow new and established inbound traffic from the 172.16.0.0/26 network to the port 443 on the tcp protocol.

You put your iptables -A INPUT -j DROP at the end and then you are happy because you think it works! So you try from your machine and the port is still open for you. Hummm, weird?

Not that weird. The issue here is that Docker creates network interfaces for the container when you don’t specify --net=host. Each of those interfaces has an IP address, usually in the 172.17.0.0/16 network of the default docker0 bridge. And most important of all, those addresses are only routable from the host, not from the rest of the network. That’s why you use -p to publish the port: the host listens and forwards the traffic to the container.

Forwarding traffic 101:

Each container invocation will create a rule looking like this:

iptables -A DOCKER -d 172.17.0.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 443 -j ACCEPT

This rule corresponds to the published port. It says: accept everything that arrives on any interface other than docker0, leaves through docker0, and is addressed to the container’s IP on TCP port 443.

This DOCKER chain is referenced in the FORWARD chain like this: -A FORWARD -o docker0 -j DOCKER. The FORWARD chain is traversed when traffic is passed from one interface to another instead of being delivered to the host itself.

Chain DOCKER (1 references)
 pkts bytes target  prot opt in        out      source     destination
    0     0 ACCEPT  tcp  --  !docker0  docker0  0.0.0.0/0  172.17.0.2   tcp dpt:443

Well, as you can see, this is not it either, because no packet has matched that rule so far!
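A listing like the one above, with its packet and byte counters, comes from `iptables -L` with the `-n -v` flags. The snippet below is only a minimal sketch: the `IPTABLES` variable and the `show_chain` helper are illustrative names, not something from the original setup. By default the command is just printed; set `IPTABLES=iptables` and run as root to actually execute it.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# IPTABLES is an illustrative variable, left unquoted on purpose below
# so that "echo iptables" splits into two words.
IPTABLES="${IPTABLES:-echo iptables}"

# show_chain TABLE CHAIN: list one chain with numeric addresses (-n)
# and packet/byte counters (-v).
show_chain() {
  $IPTABLES -t "$1" -L "$2" -n -v
}

show_chain filter DOCKER
```

A `pkts` counter that stays at 0 while you hit the port is the hint that the packets are being handled somewhere else.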

We need to dig deeper. So let’s take a step back and understand how the iptables filtering and ordering works. A quick google search yields the following:

https://n0where.net/how-does-it-work-iptables

So the INPUT chain is only processed after deciding if the packet needed to be nat’ed or not.

It is clear that there is something else in the process.

What I did not know is that there are multiple tables in iptables! The default table is called filter and it’s the most used one.

But packets are processed by the nat table first!!

-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER

Dammit, everything is routed to the DOCKER chain in the nat table!!!!

-A DOCKER ! -i docker0 -p tcp -m tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443

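And here we go: the packet’s destination is changed to 172.17.0.2:443. Since the packet is no longer addressed to the host itself, it goes through the FORWARD chain instead of INPUT, so any filtering on INPUT will not work...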
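To make the consequence of the DNAT concrete, here is a toy model of the routing decision that happens right after the nat PREROUTING chain. It is deliberately simplified: `HOST_IP` is a made-up example address, `filter_chain_for` is a hypothetical helper, and a real host has several local addresses. It only illustrates why a rewritten destination ends up in FORWARD.

```shell
#!/bin/sh
# Toy model of the routing decision made after nat/PREROUTING.
# HOST_IP is a made-up example address (documentation range 203.0.113.0/24).
HOST_IP="203.0.113.10"

# filter_chain_for DEST_IP: print the filter chain a packet with that
# destination goes through.
filter_chain_for() {
  if [ "$1" = "$HOST_IP" ]; then
    echo "INPUT"    # still addressed to the host itself
  else
    echo "FORWARD"  # addressed to something else, e.g. a container
  fi
}

echo "without DNAT: $(filter_chain_for "$HOST_IP")"
echo "after DNAT:   $(filter_chain_for 172.17.0.2)"
```

In other words, a rule on INPUT never even sees the packets that Docker has already redirected to 172.17.0.2.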

How are we going to block the traffic without touching Docker?

Some people have talked about the DOCKER-USER chain, which would do the work, but you kind of have the same problem because of the NAT. Some other people suggest disabling Docker’s iptables management and maintaining the rules by hand. This is a really bad idea, as you don’t want to reinvent the logic that Docker already handles for you.

Remember? You only want to protect your server, not mess with the actual workflow. Well my friends, I have the solution for you.

It’s from a bag of tricks. We need to act in the nat table in order to block stuff.

The idea is quite simple:

  • We start by creating a chain called DOCKER-BLOCK: -t nat -N DOCKER-BLOCK
  • Then we insert a rule on top of the PREROUTING chain that blocks everything: -t nat -I PREROUTING -m addrtype --dst-type LOCAL -j RETURN
  • Then we insert another rule on top, this one jumps everything to DOCKER-BLOCK: -t nat -I PREROUTING -m addrtype --dst-type LOCAL -j DOCKER-BLOCK

At this point the flow is like this:

PREROUTING -> DOCKER-BLOCK -> RETURN -> (the rest is unreachable) DOCKER

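So everything is blocked by default! The RETURN stops the PREROUTING traversal before Docker’s rule is reached, so the packet is never DNAT’ed to a container and stays subject to the usual INPUT chain.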
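Written out as full commands, the three steps could look like the sketch below. The `IPTABLES` variable and the `setup_docker_block` function are illustrative names. By default the commands are only printed, so you can review them; set `IPTABLES=iptables` and run as root to apply them.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# IPTABLES is an illustrative variable, left unquoted on purpose below.
IPTABLES="${IPTABLES:-echo iptables}"

setup_docker_block() {
  # 1. Create the custom chain in the nat table.
  $IPTABLES -t nat -N DOCKER-BLOCK
  # 2. Insert RETURN on top of PREROUTING: locally-destined packets stop here.
  $IPTABLES -t nat -I PREROUTING -m addrtype --dst-type LOCAL -j RETURN
  # 3. Insert the jump above it, so DOCKER-BLOCK is evaluated first.
  $IPTABLES -t nat -I PREROUTING -m addrtype --dst-type LOCAL -j DOCKER-BLOCK
}

setup_docker_block
```

The order of the two inserts matters: `-I` always puts the new rule on top, so the rule inserted last (the jump to DOCKER-BLOCK) is the one evaluated first.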

Now the trick is to add rules one by one.

  • -t nat -A DOCKER-BLOCK -p tcp -m tcp --dport 443 -m state --state NEW -j DOCKER

Now the workflow is:

PREROUTING -> DOCKER-BLOCK -> DOCKER -> (unreachable) RETURN -> (even more unreachable) DOCKER

We successfully bypassed Docker by jumping back to it when we were allowing the connection.
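Going back to the original goal, the same kind of rule can be narrowed down to the load balancer network with `-s`. This is a sketch under assumptions: `allow_to_docker` is a hypothetical helper, `IPTABLES` is an illustrative variable, and 172.16.0.0/26 is the example network from the beginning of the post. Commands are only printed unless you set `IPTABLES=iptables`.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# IPTABLES is an illustrative variable, left unquoted on purpose below.
IPTABLES="${IPTABLES:-echo iptables}"

# allow_to_docker PORT SOURCE_CIDR: hand new TCP connections to PORT coming
# from SOURCE_CIDR over to Docker's own DOCKER chain, which applies the DNAT.
allow_to_docker() {
  $IPTABLES -t nat -A DOCKER-BLOCK -p tcp -m tcp --dport "$1" -s "$2" \
    -m state --state NEW -j DOCKER
}

allow_to_docker 443 172.16.0.0/26
```

Anything that does not match a rule in DOCKER-BLOCK falls off the end of the chain, goes back to PREROUTING, and hits the RETURN.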

Some people will ask me: how do you deal with flushing and persistence, Edouard?

Well my friend, I would tell you that everything is under control. My script works whether it runs before, while, or after Docker starts.

The idea is to create a shell script where you put your rules.

It is possible to flush only one chain from one table, but not possible to restore only one chain from one table. We have to improvise.

The danger here is threefold:

  • We don’t want to allow traffic while we are reloading
  • We don’t want to interrupt existing connections while we are reloading
  • We don’t want to mess with Docker

The code is on GitHub, but I’ll go over it quickly:

  • We create the DOCKER-BLOCK chain in case it does not already exist
  • We add our two custom rules to the PREROUTING chain; there are 4 rules now
  • We delete 2, resetting to only 2 rules (otherwise it adds 2 every time)
  • Then we create the DOCKER chain, as we need it to be referenced in case the Docker daemon did not create it yet
  • We let all the established connections go through
  • Then we flush the DOCKER-BLOCK chain: at this point no new connections can be made, but that’s OK since the client will try to send SYN packets multiple times
  • Here we add our custom rules, which should restore the traffic
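The steps above can be sketched roughly as follows. This is not the script from the gist: it swaps the insert-then-delete trick for `iptables -C` checks, which only tell you whether a rule exists, not whether it is still on top of the chain. The function names and the `IPTABLES` variable are illustrative. No explicit rule for established connections is added here, on the assumption that only the first packet of a connection traverses the nat table.

```shell
#!/bin/sh
# Sketch of an idempotent reload. IPTABLES is an illustrative variable so the
# binary can be swapped out; it is left unquoted on purpose below.
IPTABLES="${IPTABLES:-iptables}"

# Create a nat chain only if it is missing (-N fails when the chain exists).
ensure_chain() {
  $IPTABLES -t nat -N "$1" 2>/dev/null || true
}

# Insert a rule on top of PREROUTING unless an identical rule already exists.
ensure_prerouting_rule() {
  $IPTABLES -t nat -C PREROUTING "$@" 2>/dev/null ||
    $IPTABLES -t nat -I PREROUTING "$@"
}

reload_rules() {
  ensure_chain DOCKER-BLOCK
  ensure_chain DOCKER   # in case the Docker daemon has not created it yet
  # -I puts each rule on top, so RETURN goes in first and the jump second.
  ensure_prerouting_rule -m addrtype --dst-type LOCAL -j RETURN
  ensure_prerouting_rule -m addrtype --dst-type LOCAL -j DOCKER-BLOCK
  # From the flush until the rules are re-added, new connections are refused.
  $IPTABLES -t nat -F DOCKER-BLOCK
  $IPTABLES -t nat -A DOCKER-BLOCK -p tcp -m tcp --dport 443 \
    -m state --state NEW -j DOCKER
}

# Only touch the firewall when explicitly asked to: sh reload.sh apply
if [ "${1:-}" = "apply" ]; then
  reload_rules
fi
```

Running it twice in a row should leave exactly one copy of each PREROUTING rule, because the second run finds them with `-C` and skips the inserts.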

And here we go! Super clean iptables rules that will always be idempotent!!

Last point: this only works for containers that don’t use --net=host. If you are using the host networking stack, you will have to deny the traffic using the usual INPUT chain.
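For host-networked containers, the INPUT rules from the beginning of the post do apply, since nothing gets DNAT’ed. A minimal sketch, reusing the example 172.16.0.0/26 load balancer network. `host_network_rules` and `IPTABLES` are illustrative names, and the commands are only printed unless you set `IPTABLES=iptables`.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# IPTABLES is an illustrative variable, left unquoted on purpose below.
IPTABLES="${IPTABLES:-echo iptables}"

host_network_rules() {
  # Allow the load balancer network to reach port 443 on the host.
  $IPTABLES -A INPUT -p tcp --dport 443 -s 172.16.0.0/26 \
    -m state --state NEW,ESTABLISHED -j ACCEPT
  # Drop whatever else reaches INPUT. Keep your other ACCEPT rules
  # (SSH and friends) above this one, or you will lock yourself out.
  $IPTABLES -A INPUT -j DROP
}

host_network_rules
```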

I hope you enjoyed this exercise. Leave a comment or reach out to me at https://twitter.com/moonbocal if you need anything!

https://gist.github.com/tehmoon/b1c3ae5e9a67d66186361d4728bed799#file-iptables-reload-sh
