VXLAN explained

by Jacob Taylor

This article provides engineers with a practical overview of VXLAN technology and its applications. It refers to the VXLAN encapsulation itself rather than to any specific control-plane technology. Acronyms that are not expanded in the text can be found in the Glossary.

VXLAN, or Virtual Extensible LAN, is an approach to network virtualisation designed to address the challenges of scaling networks in large cloud computing deployments. We use it at NTT ICT to deliver reliable and scalable data centre networking services to our managed service customers.

At its most basic level, VXLAN is a tunnelling protocol, and the protocol being tunnelled is Ethernet. In other words, a VXLAN packet is simply an Ethernet frame placed inside a UDP packet, with an eight-byte VXLAN header in between. That makes it a transport mechanism on which the software controlling the participating devices can build.
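
To make the "Ethernet frame inside a UDP packet" description concrete, here is a minimal Python sketch (illustrative only, not production code) that builds the eight-byte VXLAN header defined in RFC 7348 and prepends it to an inner Ethernet frame; the result is what gets carried as the UDP payload:

    import struct

    def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
        """Prepend the 8-byte VXLAN header (RFC 7348) to an Ethernet frame.

        Layout: 8 flag bits (only the I bit, 0x08, set, meaning the VNI
        is valid), 24 reserved bits, the 24-bit VNI, then 8 reserved bits.
        """
        if not 0 <= vni < 2 ** 24:
            raise ValueError("VNI must fit in 24 bits")
        header = struct.pack("!B3xI", 0x08, vni << 8)  # network byte order
        return header + inner_frame  # this then rides inside UDP/IP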

So why is everyone talking about it? In network virtualisation and software-defined networking (SDN), VXLAN is often used as the basis for larger orchestration tooling that synchronises state and configuration across multiple network devices, and these systems typically expose APIs for integration and automation. NSX, Contrail and OpenStack Neutron are examples of systems that provide a logical front end for configuring VXLAN across multiple devices.

VXLAN is formally specified in RFC 7348. In terms of the OSI model, it sits above UDP like any other application-layer protocol, using destination port 4789.
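
Going the other way, a receiver listening on UDP port 4789 can validate the header and recover the inner frame. A minimal sketch under the same assumptions as above:

    import struct

    VXLAN_PORT = 4789  # IANA-assigned destination port for VXLAN

    def vxlan_decapsulate(udp_payload: bytes) -> tuple[int, bytes]:
        """Return (vni, inner_ethernet_frame) from a VXLAN UDP payload."""
        if len(udp_payload) < 8:
            raise ValueError("too short to carry a VXLAN header")
        flags, word = struct.unpack("!B3xI", udp_payload[:8])
        if not flags & 0x08:  # the I flag must be set for a valid VNI
            raise ValueError("I flag not set; not a valid VXLAN packet")
        return word >> 8, udp_payload[8:]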

Overlay or underlay?

Overlay and underlay are terms frequently used in SDN and network virtualisation. In terms of VXLAN, the underlay is the Layer 3 (L3) IP network that routes VXLAN packets as normal IP traffic. The overlay refers to the virtual Ethernet segment created by this forwarding.

For example, an L3 VXLAN switch (e.g. Cumulus), upon receiving a frame, may do any of the following (sketched in code after this list):

· switch it locally if it is destined for a locally learnt MAC address (traditional Ethernet switching)

· forward it through a local VTEP, thereby pushing it into the underlay encapsulated in VXLAN (i.e. carrying it in the overlay)

· route it at L3, pushing it into the underlay unencapsulated, as just another IP packet.
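
As an illustrative Python sketch of that decision (the MAC addresses, port names and table structures here are hypothetical; real switches keep this state in hardware tables, not dictionaries):

    from ipaddress import IPv4Address

    local_macs = {"00:03:00:11:11:01": "swp1"}  # MAC -> local port
    remote_macs = {"00:03:00:22:22:02": IPv4Address("10.0.0.34")}  # MAC -> remote VTEP
    ROUTER_MAC = "44:38:39:00:00:01"  # the switch's own MAC (hypothetical)

    def forward(dst_mac: str) -> str:
        if dst_mac == ROUTER_MAC:
            return "route at L3: push into the underlay unencapsulated"
        if dst_mac in local_macs:
            return f"switch locally out port {local_macs[dst_mac]}"
        if dst_mac in remote_macs:
            return f"encapsulate in VXLAN towards VTEP {remote_macs[dst_mac]}"
        return "unknown unicast: flood as BUM traffic (discussed later)"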

An example of a VXLAN packet

Introducing VTEP

The term VTEP (VXLAN Tunnel Endpoint) generally refers to any device that originates or terminates VXLAN traffic. There are two major types, distinguished by where encapsulation and de-encapsulation happen: hardware VTEPs process VXLAN packets in dedicated switching silicon, while software VTEPs process them in software.

Examples of hardware VTEPs include switches and routers, such as the Cumulus switches we use in NTT's environment. Software VTEPs include servers and hypervisors, such as NSX-enabled ESXi hosts.

More specifically, a VTEP can refer to a virtual interface, similar to an SVI, that exists on such a device. Such an interface typically connects to the local device's internal bridge and acts as the local source of VXLAN-encapsulated frames and as the destination used to reach remote MACs.
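
Tying those two ideas together, here is a toy software VTEP (purely a sketch under hypothetical assumptions, not any vendor's implementation): it holds a forwarding table of remote MACs and emits encapsulated frames from its own UDP socket.

    import socket
    import struct

    class SoftwareVtep:
        """Toy software VTEP: looks up the remote VTEP for a destination
        MAC and sends the frame, VXLAN-encapsulated, over UDP."""

        def __init__(self, vni: int):
            self.vni = vni
            self.fdb: dict[str, str] = {}  # dst MAC -> remote VTEP IP
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

        def send(self, dst_mac: str, frame: bytes) -> None:
            remote_ip = self.fdb[dst_mac]  # KeyError here is the BUM case
            header = struct.pack("!B3xI", 0x08, self.vni << 8)
            self.sock.sendto(header + frame, (remote_ip, 4789))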

Benefits of VXLAN over pure L2

Probably the greatest advantage a VXLAN solution has over a pure Layer 2 (L2) network is the elimination of the risks associated with L2 domains spanning multiple logical switches. For instance, an entirely L3 network with a VXLAN overlay is not susceptible to the spanning tree faults that have been experienced by some major Australian organisations.

Additionally, VXLAN is more scalable than pure L2, especially when control-plane learning is implemented, because excessive BUM (broadcast, unknown unicast and multicast) frame flooding is suppressed. This, combined with the fact that hardware VTEPs minimise the latency overhead of VXLAN implementations, means we can build a network that is more scalable and robust, without sacrificing performance.

Below is an example of a Clos topology and a VXLAN packet captured flowing over it, as it appears in Wireshark; a script to reconstruct such a packet follows the key.

An example of VXLAN topology

VXLAN packet breakdown

Key:

· Underlay Source/Destination MAC addresses: these are the per-hop MAC addresses on the underlay network, rewritten at each routed hop. In this case, the source is that of swp1 of spine01 and the destination is that of swp51 of leaf01.

· Underlay Source/Destination IPv4 addresses: these are the IPs of the originating and destination VTEPs respectively. In this case, they are actually anycast IPs, where the source is leaf03+04 and the destination is leaf01+02. They are not on the same subnet; they are /32 addresses associated with loopbacks on the leaf switches and advertised via BGP.

· VXLAN header: this shows the VNI (10010) the frame is mapped to, and also the flags set.

· Overlay Source/Destination MAC addresses: these are the MAC addresses of the end devices communicating over the VXLAN tunnel. From their perspective, they are both on the same broadcast domain. The source is server02 and the destination is server01.

· Overlay Source/Destination IPv4 addresses: these are the IP addresses of the end nodes communicating over the VXLAN tunnel, both in the same subnet (10.0.10.0/24). Once again, the source is server02 and the destination is server01.
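
For readers who want to reproduce a packet like this, the capture can be approximated with Scapy (assuming its VXLAN layer is available; every concrete MAC and VTEP IP below is a hypothetical stand-in, since only the VNI and the overlay subnet are given above):

    # pip install scapy
    from scapy.layers.inet import IP, UDP
    from scapy.layers.l2 import Ether
    from scapy.layers.vxlan import VXLAN

    packet = (
        Ether(src="44:38:39:00:00:11", dst="44:38:39:00:00:01")  # underlay hop
        / IP(src="10.0.0.34", dst="10.0.0.12")  # anycast VTEP loopbacks
        / UDP(dport=4789)  # VXLAN's UDP port
        / VXLAN(vni=10010)  # the VNI from the capture
        / Ether(src="00:03:00:22:22:02", dst="00:03:00:11:11:01")  # server02 -> server01
        / IP(src="10.0.10.102", dst="10.0.10.101")  # overlay IPs in 10.0.10.0/24
    )
    packet.show()  # prints the same layering Wireshark displays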

Advanced technical considerations

How BUM traffic behaves differs between deployments. BUM traffic matters because MAC address learning, as well as protocols like ARP and Neighbour Discovery, depend on it.

In general, there are two major approaches to handling this kind of traffic — data-plane learning and control-plane learning.

Data-plane learning is similar to traditional Ethernet flood-and-learn behaviour, where BUM traffic is flooded to all VTEPs terminating a given VNI, usually using multicast.
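
A condensed sketch of flood-and-learn at a single VTEP (the table and multicast group here are hypothetical; real implementations do this in the kernel or in silicon):

    fdb: dict[str, str] = {}  # learned: inner source MAC -> remote VTEP IP
    FLOOD_GROUP = "239.1.1.10"  # hypothetical multicast group for this VNI

    def on_receive(outer_src_ip: str, inner_src_mac: str) -> None:
        # Data-plane learning: remember which VTEP this MAC lives behind.
        fdb[inner_src_mac] = outer_src_ip

    def next_hop(inner_dst_mac: str) -> str:
        # Known MAC: unicast to its VTEP. Unknown or broadcast: flood.
        return fdb.get(inner_dst_mac, FLOOD_GROUP)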

Traditional Ethernet networks use data-plane learning. Control-plane learning, by contrast, is used more at the IP layer, where reachability information is shared ahead of time for faster, optimised decision making.

Data-plane learning requires no centrally synchronised state, so it does not require controllers. It can be configured automatically (as NSX does in multicast mode) or manually (like static VXLAN tunnels in Cumulus).

This approach is easy to configure and simple to understand; however, it comes with the following caveats:

· if the implementation uses multicast flooding, it requires a multicast-capable network with appropriate routing protocols

· if the implementation uses unicast flooding, a source VTEP needs to send all BUM traffic to every other VTEP in the VNI individually (head-end replication), resulting in duplicated packets

· in web-scale networks, forwarding all BUM traffic could cause significant strain on the underlay network.

Various solutions exist to optimise remote MAC learning and address resolution, eliminating the need for data-plane learning. This often involves moving MAC learning into the control plane and suppressing the replication of ARP/ND traffic across the tunnel by using proxy or relay services.
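
ARP suppression, for example, can be sketched like this: the local VTEP answers ARP requests from a neighbour table populated by the control plane, so the broadcast never has to cross the tunnel (table entries hypothetical):

    # Control-plane-populated neighbour table: IP -> MAC
    neigh_table = {"10.0.10.101": "00:03:00:11:11:01"}

    def handle_arp_request(target_ip: str) -> str:
        mac = neigh_table.get(target_ip)
        if mac is not None:
            return f"reply locally with {mac}; nothing crosses the tunnel"
        return "fall back to flooding the request as BUM traffic"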

Many commercial offerings use external controllers to direct the dissemination of L2 information across distributed hardware and software VTEPs. Examples include NSX (in its Hybrid and Unicast control-plane modes) and MidoNet. A controller isn't always necessary, however: Cumulus devices support the standard EVPN address family for MP-BGP (RFC 7432) to synchronise the distributed state.
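
Conceptually, an EVPN type-2 (MAC/IP advertisement) route carries the same information a VTEP would otherwise have to flood and learn. A toy model of applying one (heavily simplified; see RFC 7432 for the actual NLRI format):

    from dataclasses import dataclass

    @dataclass
    class MacIpRoute:  # simplified EVPN type-2 advertisement
        vni: int
        mac: str
        ip: str | None  # the IP part is optional in a real type-2 route
        next_hop_vtep: str  # where to tunnel traffic for this MAC

    def apply_route(fdb: dict, route: MacIpRoute) -> None:
        # BGP delivered this entry; no BUM flooding was needed to learn it.
        fdb[route.mac] = route.next_hop_vtep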

Control-plane approaches are more scalable and use less capacity on the underlay network, but they have their own issues, such as:

· they are more complex to set up and maintain

· they require different tools depending on whether hosts are local to a switch or behind a remote VTEP, which operators may not be used to

· they are resource-intensive because controller-based solutions require additional capacity to host the controller infrastructure.

Learn more

Feel free to connect with me on LinkedIn if you would like to discuss this topic or anything about networking.

Glossary

· API: application programming interface

· BUM: broadcast, unknown unicast and multicast

· ESXi: VMware's bare-metal hypervisor

· EVPN: Ethernet VPN

· IP: Internet Protocol

· MAC: Media Access Control

· MLAG: Multichassis Link Aggregation

· MP-BGP: Multiprotocol Border Gateway Protocol

· NLRI: Network Layer Reachability Information

· NSX: VMware’s SDN solution based on VXLAN

· SDN: software-defined networking

· SVI: switch virtual interface

· VNI: VXLAN Network Identifier

· VTEP: VXLAN Tunnel Endpoint

· VXLAN: Virtual Extensible LAN