A Critique of Network Design

With the advent of datacenter-level computation, networks have become key. In datacenter networks, the concerns of network engineers still lie primarily at layers 1–3. Occasionally, layers 4–6 are inspected when transforming packets or making routing decisions. I’ll leave layer 1 out of this, as it’s well isolated, and well modeled at this point as asynchronous serial links. There is contention between layer 2 and layer 3: network builders are largely moving toward layer 3 on the fabric, as seen in draft-lapukhov-bgp-routing-large-dc. Yet there is a massive contingent still looking to build layer 2 networks, or layer 2 overlays atop these layer 3 fabrics with technologies like VXLAN. With the new ubiquity of layer 3 forwarding elements, and their reduced cost, building layer 3 topologies has now become practical. Here, I’ll dive into why these two camps exist, and what we can do to enable networking to move forward.


What do layer 2 only network designs look like?

Layer 2 network designs have been well studied, and are ubiquitous throughout datacenters at this point. As layer 2 is switched, and relies upon flooding to learn the location of actors throughout the network, the topology must be reduced to a single spanning tree, with no per-sender optimization. This was typically implemented using the Spanning Tree Protocol (STP), which has had many extensions to enable further scalability and faster convergence times. Because the network had to be reduced to a tree, it is largely limited to vertical scalability, leading to fat-tree topologies. Layer 2 addresses are tied to layer 3 IPv4 addresses via a protocol called ARP. This led to very effective host mobility protocols based upon gratuitous ARPs, which allow all network devices to learn the new IP-to-MAC association. Technologies like TRILL have been proposed to enable further scaling of layer 2 networks, but they aren’t widely implemented, and are still considered immature. Many vendors have proposed proprietary technologies to build large layer 2 networks without the use of STP, and to enable horizontal scalability, but interoperability of these protocols between vendors is highly limited. Security is typically implemented by mapping security zones to VLANs, and then filtering at the layer 3 gateway with ACLs or a stateful firewall.
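To make the gratuitous ARP mechanism concrete, here is a minimal sketch that builds the 28-byte ARP payload a host would broadcast after taking over an IP. The function name and the example addresses are hypothetical, and the sketch only constructs the bytes; it does not add the Ethernet header or send anything on the wire.

```python
import socket
import struct


def gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build the ARP payload of a gratuitous ARP request for (mac, ip)."""
    hw = bytes.fromhex(mac.replace(":", ""))
    if len(hw) != 6:
        raise ValueError("expected a 48-bit MAC address")
    addr = socket.inet_aton(ip)  # 4-byte packed IPv4 address
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6, 4,         # hardware and protocol address lengths
        1,            # opcode: request
        hw, addr,     # sender MAC and IP: the new location of the address
        b"\x00" * 6,  # target MAC: commonly zero, ignored by receivers
        addr,         # target IP equals sender IP, which makes it gratuitous
    )


# Hypothetical addresses, for illustration only.
payload = gratuitous_arp("02:00:00:aa:bb:cc", "10.0.1.10")
print(len(payload), payload.hex())
```

Every device in the broadcast domain that receives this frame can update its IP-to-MAC mapping, which is what makes the layer 2 mobility story so simple.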

What do layer 3 only network designs look like?

Layer 3 network designs typically take advantage of dynamic routing protocols like BGP or OSPF to determine the location of an endpoint. They then use some combination of protocol timers, BFD, and link-state detection to trigger reconvergence of the routing state. They also take advantage of load balancing technologies like ECMP to spread the load across multiple paths, which enables horizontal scalability and has made Clos topologies ubiquitous. These topologies are typically 3 or 5 stages tall, and they can be built from a single type of switching element.
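ECMP can be sketched as hashing a flow’s 5-tuple and using the result to pick one of the equal-cost next hops. The sketch below is illustrative only: real switches use hardware hash functions rather than SHA-256, and the `Flow` type, `pick_next_hop` function, and spine names are assumptions made for the example.

```python
import hashlib
from typing import NamedTuple, Sequence


class Flow(NamedTuple):
    """The 5-tuple that identifies a flow."""
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int


def pick_next_hop(flow: Flow, next_hops: Sequence[str]) -> str:
    """Hash the 5-tuple and use it to index into the equal-cost next hops."""
    if not next_hops:
        raise ValueError("no next hops available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


spines = ["spine1", "spine2", "spine3", "spine4"]
flow = Flow("10.0.1.10", "10.0.2.20", 6, 49152, 443)
# Packets of the same flow always hash to the same spine, so they stay in order,
# while different flows spread across all four spines.
print(pick_next_hop(flow, spines))
```

Adding a fifth spine simply adds another entry to the list, which is the horizontal scalability that a spanning tree cannot offer.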


What’s wrong with Layer 2 networks?

Although layer 2 networks are well understood, they’re highly limited in their capability. Many operators have experienced failures that consume an entire broadcast domain, and those sitting adjacent to it (http://queue.acm.org/detail.cfm?id=2655736). In addition, horizontally scaling a network that uses STP is very difficult, as added links will simply be blocked rather than carrying a share of the load. Spanning tree can also leave the path between two hosts non-optimal. Although vendors are developing technologies like M-LAG, these are still vulnerable to control plane faults that make the path unavailable. Layer 3 gateways that sit upon layer 2 networks also need a mechanism for maintaining availability, like VRRP. This requires both gateways to exchange heartbeats with one another, and to coordinate if the network is to be fully utilized. This is typically implemented through highly redundant links between the two actors that sit idle except during a failure. The cost of these additional east-west links can quickly run out of control. Lastly, the control plane of each element in the network must have an up-to-date view of every active host in every active VLAN, which makes it difficult to scale large datacenter topologies with tens of thousands of endpoints.
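The effect of spanning tree on redundant links can be illustrated with a small sketch. This is not STP itself, which elects a root bridge and compares path costs; it is a simplified union-find pass that keeps a link only if it does not close a loop. The topology and names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str]


def split_links(links: Iterable[Link]) -> Tuple[List[Link], List[Link]]:
    """Return (forwarding, blocked) links for a loop-free tree over the topology."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    forwarding: List[Link] = []
    blocked: List[Link] = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            blocked.append((a, b))  # this link would create a loop
        else:
            parent[root_a] = root_b
            forwarding.append((a, b))
    return forwarding, blocked


# Two aggregation switches, four dual-homed access switches.
links = [("agg1", "agg2")]
for i in range(1, 5):
    links += [(f"acc{i}", "agg1"), (f"acc{i}", "agg2")]

forwarding, blocked = split_links(links)
print(len(forwarding), "forwarding,", len(blocked), "blocked")
```

A tree over six switches can only use five links, so four of the nine links in this topology carry no traffic, and every additional uplink would be blocked as well.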

Tracing flows and debugging in layer 2 networks is difficult, as there exists no standardized, widely implemented protocol for tracing packets as they go from layer 2 hop to hop. In complex, multi-hop networks there exist tens, if not hundreds, of potential paths that a specific flow can take. In a layer 3 network, by contrast, techniques like traceroute let you easily determine which interfaces and links are potentially faulty.

What’s wrong with Layer 3 networks?

One of the primary critiques of layer 3 networks is the mobility of well-known IP addresses. Well-known IPs, like gateways, edge load balancers, DNS servers, and infrastructure components, are typically moved around a layer 2 network using technologies like VRRP, Heartbeat, Keepalived, and Pacemaker. They then flood the network with a gratuitous ARP to push the new location of the endpoint to the network. Unfortunately, in layer 3 networks, machines cannot move IPs to other network segments without a signaling protocol like BGP, and most machines are not capable of announcing IPs.

Although ToR availability is fairly good in most networks, according to the review that Microsoft Research conducted (http://research.microsoft.com/en-us/um/people/navendu/papers/sigcomm11netwiser.pdf), it’s convenient to be in an environment where you can still do maintenance on ToRs, even when the application isn’t built for environments where the network can be temporarily partitioned. Host high availability is a well-understood problem in layer 2 networks. The approach typically used is to put two network adapters in the same layer 2 broadcast domain, and then rely on a failure detector like ARP-based ping to determine whether the adapter has gateway reachability. When a link is detected to be down, the IP is moved to the other adapter. Unfortunately, this is impossible in a layer 3 network for the same reason as host mobility: there exists no ubiquitously deployed host mobility protocol.

Segregation in layer 2 networks is typically implemented using firewalls or ACLs at the gateway. Without the usage of a complex addressing scheme, rule counts can grow very quickly and exhaust the ACL limits of commodity hardware. Without VLANs, traditional segregation is difficult to achieve without choke points and end-to-end network segregation.


Given what we give up when moving to a layer 3 design, the question is often asked, “How do we build it?” and it is quickly followed up with “Is it worth it?” It is possible to build layer 3 networks with nearly all of the features of layer 2 networks on top of commodity technologies.


The simplest approach to segregation effectively mirrors the approach used in layer 2 networks: Virtual Routing & Forwarding instances, also known as VRFs. The approach is to map every zone to a VRF, and then use PBR plus route leaking to selectively forward traffic between VRFs. This approach is doable on commodity hardware such as Juniper’s QFX5100, a Trident II-based switch. VRFs can be extended between two nodes by using trunk configurations and multiple BGP adjacencies. Because FIB and RIB entries are replicated across VRFs, this approach can quickly exhaust the resources available on most commodity switches. If the number of zones is relatively high, this method can quickly become problematic.

Every IP packet has 6 bits that typically go unused. These bits are the differentiated services code point (DSCP) tag. Typically, they are used to mark packets to be handled by a specific QoS class. In most modern networks, DSCP isn’t actually needed, or there are few DSCP classes that actually need to be used. On ingress to the fabric, traffic can be marked with one of 63 code points, as one of the 64 must be reserved for control plane traffic. These DSCP code points can be mapped to security zones, as well as being mapped to their own queues for traffic segregation. ACLs are applied on the egress edge, and for a given host they only need to check the origin DSCP tag. This approach is doable on commodity hardware, such as Juniper’s QFX5100. Several downsides exist here. The number of security zones has an upper bound of 63. The complexity of such a setup, and of debugging it, is non-trivial. Lastly, packets must make it all the way to the ultimate hop before being filtered, as the origin and intermediary hops do not know about the destination port that the packet will land upon.
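The DSCP scheme can be sketched in a few lines. DSCP occupies the upper 6 bits of the IPv4 ToS byte, with the lower 2 bits used for ECN. The zone names, function names, policy, and the choice of CS6 (48) as the reserved control plane code point are assumptions made for this illustration, not a description of any vendor’s configuration.

```python
from typing import Dict, Iterable, Set

DSCP_BITS = 6
CONTROL_PLANE_DSCP = 48  # assumption: CS6 is the code point kept for control plane traffic


def set_dscp(tos: int, dscp: int) -> int:
    """Write a code point into the upper 6 bits of the ToS byte, keeping the ECN bits."""
    if not 0 <= dscp < 2 ** DSCP_BITS:
        raise ValueError("DSCP must fit in 6 bits")
    return (dscp << 2) | (tos & 0b11)


def get_dscp(tos: int) -> int:
    """Read the code point back out of the ToS byte."""
    return (tos >> 2) & 0b111111


def assign_zones(zones: Iterable[str]) -> Dict[str, int]:
    """Give each security zone its own code point, skipping the reserved one."""
    usable = [cp for cp in range(2 ** DSCP_BITS) if cp != CONTROL_PLANE_DSCP]
    names = list(zones)
    if len(names) > len(usable):
        raise ValueError(f"at most {len(usable)} zones fit in the DSCP field")
    return dict(zip(names, usable))


def egress_permits(tos: int, allowed: Set[int]) -> bool:
    """Egress ACL check: only the origin zone's code point is inspected."""
    return get_dscp(tos) in allowed


zone_dscp = assign_zones(["web", "app", "db"])
# Hypothetical policy: the port facing a db host accepts traffic only from the app zone.
db_port_acl = {zone_dscp["app"]}
packet_tos = set_dscp(0, zone_dscp["app"])  # marked on ingress to the fabric
print(egress_permits(packet_tos, db_port_acl))
```

Note that in this sketch unmarked traffic would carry code point 0 and be indistinguishable from the first zone, so a real deployment would need to re-mark or drop untrusted markings at the ingress edge.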

Overlays are in vogue right now. While technologies like VXLAN, NVGRE, and OTV are becoming ubiquitous for layer 2 overlays, there are also technologies that are applicable to layer 3. Unfortunately, most network OS vendors do not expose the layer 3 overlay capabilities of their switches. Technologies in development like E-VPN, though, will allow you to use a variety of encapsulation protocols on the edge of your fabric to enable overlays. This allows for service-provider style configurations that leverage existing technologies like MPLS, and extends the familiar concept of PEs to ToRs. Although layer 3 overlays are a promise for the future, and are within the capability set of underlying chipsets like Trident II, most vendors still don’t support them.

Lastly, host-level firewalls are still very much an option. Segregation primarily exists to preserve security, and prevent traffic from reaching a host. Once a system is brought to the point where it has a stable operating system provisioned, host-level firewalls can do most of the work. This approach has been taken by successful companies such as Instagram (http://instagram-engineering.tumblr.com/post/89992572022/migrating-aws-fb). Having a flat fabric, and relying upon the hosts themselves, typically provides the option that is easiest to debug and most scalable. In addition, it doesn’t introduce complexity to the fabric itself.

Mobility & HA

The observation that the locator and identifier are the same piece of information has been made before, but there has yet to be a ubiquitous solution to this problem. IP has had a mobility story for some time in the form of mobile IP, but it has proven complicated, and very few pieces of commodity hardware have implemented the control plane components necessary to actually use it. In addition, other proposals such as LISP have come up, but haven’t managed to gain wide adoption. The control plane complexity of these existing approaches has made them problematic to implement.

The observation that BGP is in fact a locator signaling protocol, and is already capable of being deployed as the fabric protocol, enables us to build host mobility. By extending BGP peering down from the ToR to the host itself, you allow the host to carve out IP space that is not tied to a real interface. By enabling the host to signal the location of a specific identifier using BGP, you gain the same capabilities that exist in traditional layer 2 networks. BGP can itself be used as a liveness detection protocol, or it can be paired with another liveness detection protocol like BFD. The same mechanism can be used for gateway mobility, as the default egress route can be propagated down from the fabric as well, and the host can make local routing decisions based on changes in the fabric.
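The idea of a host signaling the location of an identifier can be sketched with a toy routing table. The sketch does not speak BGP; it only models what the fabric ends up with after a host announces, and later re-announces, a /32 for a service address. The `Rib` class, the addresses, and the next-hop names are all hypothetical.

```python
import ipaddress
from typing import Dict, Optional


class Rib:
    """A toy routing table keyed by prefix, resolved by longest-prefix match."""

    def __init__(self) -> None:
        self.routes: Dict[ipaddress.IPv4Network, str] = {}

    def announce(self, prefix: str, next_hop: str) -> None:
        """Install or replace the route for a prefix."""
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix: str) -> None:
        """Remove the route for a prefix, if present."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address: str) -> Optional[str]:
        """Return the next hop of the most specific matching prefix."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda net: net.prefixlen)]


rib = Rib()
rib.announce("10.0.1.0/24", "tor1")  # rack subnets announced by the ToRs
rib.announce("10.0.2.0/24", "tor2")

# Host A, behind tor1, announces a service IP that is not tied to a real interface.
rib.announce("10.255.0.10/32", "10.0.1.11")
print(rib.lookup("10.255.0.10"))

# Host A fails; host B, behind tor2, announces the same identifier.
rib.announce("10.255.0.10/32", "10.0.2.12")
print(rib.lookup("10.255.0.10"))
```

The service address moved between racks without any change to the rack subnets, which is the layer 3 counterpart to a gratuitous ARP.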

Overlays can also be used to emulate host mobility, by extending the control plane down to the host. There are production-quality implementations of VXLAN, a layer 2 overlay protocol, that run on commodity Linux hosts. Unfortunately, tying the layer 2 overlay back to the traditional fabric typically requires VXLAN routing capabilities, which are not yet ubiquitously deployed across commodity hardware vendors’ network operating systems, and are not available in all chipsets.


Layer 3 networks provide a superset of the capabilities that layer 2 networks do, while allowing for more flexibility, scalability, better utilization, and lower capex and opex. Although the approaches for layer 3 networks may not be as mature as those in layer 2 networks, they can still emulate those in layer 2 networks, given certain tradeoffs. As layer 3 capabilities have become ubiquitous in modern datacenter switches, the cost is no longer prohibitive. Layer 3 networks offer a real alternative to fat-tree networks: they allow tiers of the network to scale out into Clos topologies, bringing lower cost, incremental growth, and better utilization. As new datacenter networks are implemented, they should start with a layer 3 fabric, and only implement native layer 2 networks as the exception, not the rule.