Reflecting (pun intended) on the Level3 Outage

Matt Levine
Sep 4, 2020


If you’ve found your way to this post, by now you’ve read the Level3 (AS3356 is Level3, sorry CenturyLink brand fans) RFO that blames a fat-fingered flowspec policy, and a lack of sanity checking in their software, for breaking the internet and ruining everyone’s Sunday.

Having said that, all of that can be excused. It happens. It was a bad day to be a Level3 customer. However, it really should’ve only been a bad day if you were a single-homed Level3 customer. You should’ve been able to shut down your BGP sessions with Level3 (or have the flowspec rule shut them down for you ;)), go make breakfast/lunch knowing that you were no longer seeing 80% packet loss, and re-enable them after the Level3 NOC gave the all clear.

That, however, was not what happened. Instead, if you shut off your Level3 sessions, that was *worse*: you went from 80% to 100% packet loss, because Level3 just kept announcing your routes to its peers even though it no longer carried them within its network. If and when you figured out what was happening, you could tag no-export to get back to 80% packet loss, but the only actual way to get your prefixes away from 3356 was to advertise more specifics to someone else.
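(If you want the one-line reason why more specifics win, here’s a minimal Python sketch of longest-prefix matching; the prefixes and provider labels are invented for illustration, and a stale /23 stuck on the broken transit loses to /24s announced anywhere else.)

```python
import ipaddress

# Toy routing table: (prefix, learned_via). Prefixes and provider labels
# are invented purely for illustration.
routes = [
    (ipaddress.ip_network("10.0.0.0/23"), "stuck-transit"),   # stale covering route
    (ipaddress.ip_network("10.0.0.0/24"), "other-transit"),   # more specific
    (ipaddress.ip_network("10.0.1.0/24"), "other-transit"),   # more specific
]

def best_route(dst):
    """Longest-prefix match: the most specific covering prefix wins."""
    matches = [(net, via) for net, via in routes if dst in net]
    return max(matches, key=lambda m: m[0].prefixlen, default=None)

# Traffic follows the /24s via the other provider, not the stale /23.
print(best_route(ipaddress.ip_address("10.0.0.10")))
# (IPv4Network('10.0.0.0/24'), 'other-transit')
```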

Full disclosure: I do not have any idea what it takes to run BGP route-reflection at the scale needed for the Level3 network. One of the nice things about running CDN infrastructure is that every pop is, or can be, an island.

Having said that:

IT IS FLAT OUT UNACCEPTABLE THAT ROUTES WITHDRAWN FROM WITHIN THE LEVEL3 NETWORK CONTINUED TO BE ANNOUNCED TO PEERS FOR *HOURS* AFTER THEY WERE REMOVED FROM THE FIB.

IT IS EVEN MORE UNACCEPTABLE THAT THIS IS OMITTED FROM THE RFO, WITH NO MENTION OF HOW IT WILL BE REMEDIATED IN THE FUTURE.

I understand that the issue was the network getting stuck with an almost infinite backlog of route convergence, and subsequently never withdrawing anything from peers.

When a customer turned off their session, the core of the network removed the routes, as expected; the edge of the network, however, continued carrying those routes, pointing them into the core, where the traffic had nowhere to go. The routers ‘inside’ the network itself were still able to reconverge quickly, but announcements to peers had still not been withdrawn more than 60 minutes after the prefixes became unreachable internally.
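To make that concrete, here is a toy model in Python (purely illustrative; the prefix, queue, and budget numbers are invented, and this is not a claim about how 3356’s route reflectors are actually built): the internal RIB converges immediately, while the withdrawal toward peers sits behind a backlogged update queue, so peers keep hearing a route the network can no longer deliver.

```python
from collections import deque

# Toy model of one edge router: internal RIB vs. peer-facing Adj-RIB-Out,
# with peer updates fed through a work queue. Illustrative only.
internal_rib = {"203.0.113.0/24": "customer-session"}
adj_rib_out = {"203.0.113.0/24": "announced"}
peer_update_queue = deque()

# Simulate the convergence backlog: a pile of unrelated work queued ahead.
for i in range(100_000):
    peer_update_queue.append(("noop", f"10.{i % 256}.0.0/16"))

def customer_withdraws(prefix):
    internal_rib.pop(prefix, None)                  # internal state converges fast
    peer_update_queue.append(("withdraw", prefix))  # peer withdrawal waits in line

def drain_peer_queue(budget):
    for _ in range(budget):
        if not peer_update_queue:
            return
        action, prefix = peer_update_queue.popleft()
        if action == "withdraw":
            adj_rib_out.pop(prefix, None)

customer_withdraws("203.0.113.0/24")
drain_peer_queue(budget=1_000)  # far less than the backlog

print("reachable internally:", "203.0.113.0/24" in internal_rib)     # False
print("still announced to peers:", "203.0.113.0/24" in adj_rib_out)  # True: blackhole
```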

This inconsistent state was the primary challenge for multi-homed customers. It has to be addressed to provide confidence to both existing and future customers… and it’s not even mentioned in the RFO?

This conversation is happening in private countless times this week as Level3 attempts to soothe irate customers; but given the scope of impact this had across many _good_ operators who attempted to do the right thing by shutting off their ports, the silence is deafening.

But, a quick reminder to those of us who think we’ve seen everything: we haven’t. And always leave yourself the option of advertising more specifics; you never know when you’ll need to ;)

PS: is BGP safe yet? Probably not, at least until you can sign upstreams in your AS path with RPKI and stop a Tier 1 from hijacking traffic to prefixes you stopped announcing.
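For context, here is a minimal sketch of RFC 6811-style origin validation against a hand-written ROA (the prefix and ASNs are made up). It only answers “who may originate this prefix”; nothing in it constrains the rest of the AS path, which is exactly the gap: a transit provider re-announcing stale routes with your ASN still as the origin looks perfectly “valid”.

```python
import ipaddress

# A hand-written ROA for illustration: (prefix, max_length, authorized origin ASN).
ROAS = [
    (ipaddress.ip_network("203.0.113.0/24"), 24, 64500),
]

def origin_validate(prefix_str, origin_asn):
    """RFC 6811-style origin validation: valid / invalid / not-found.

    Only checks who may ORIGINATE the prefix; it says nothing about
    which ASes may appear elsewhere in the AS path.
    """
    prefix = ipaddress.ip_network(prefix_str)
    covering = [
        (net, maxlen, asn)
        for net, maxlen, asn in ROAS
        if prefix.subnet_of(net)
    ]
    if not covering:
        return "not-found"
    for net, maxlen, asn in covering:
        if prefix.prefixlen <= maxlen and asn == origin_asn:
            return "valid"
    return "invalid"

# Stale announcements that still carry your ASN as the origin pass origin
# validation; nothing in the ROA constrains the path they travelled.
print(origin_validate("203.0.113.0/24", 64500))  # valid
print(origin_validate("203.0.113.0/24", 64666))  # invalid (wrong origin)
```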
