Tag Archives: SDN

Viptela SEN – DMVPN Done Right

Recently I had the treat of listening to two Layer 3 routing protocol maestros when the CTO of the startup Viptela, Khalid Raza, appeared on Ivan Pepelnjak’s Software Gone Wild podcast. Interestingly, the first time I had ever heard of Khalid or Ivan was through the Cisco Press books that they each authored. Ivan had the famous ‘MPLS and VPN Architectures’ and Khalid, one of the first CCIEs, wrote ‘CCIE Professional Development: Large Scale IP Network Solutions’ (of which I owned an autographed copy).

In a nutshell, Viptela’s Secure Extensible Network (SEN) creates hybrid connectivity (VPNs) across the WAN. Their target market is any large retailer or financial company that has many branches. Khalid and the founder Amir Khan (of Juniper MX product line fame) come from a very strong Layer 3 background and, consequently, they don’t purport to have a revolutionary solution. Instead, they have harnessed that background to improve on what DMVPN has been attempting to solve for the past 10 years. In Khalid’s words, they have “evolved MPLS concepts for overlay networks”.

Viptela SEN comprises a controller, VPN termination endpoints, and a proprietary protocol that is heavily inspired by BGP. In fact, one of the advisors of Viptela is Tony Li, author of 29 RFCs (mostly BGP-related), and one of the main architects of BGP. Viptela SEN can discover local site characteristics (such as the IGP) and report them to the controller, which then determines the branch’s connectivity policy. So it essentially reduces the number of control planes, which reduces the number of configurations for the WAN. This looks incredibly similar to what DMVPN set out to do a decade ago. Viptela calls these endpoints dataplane points, but they still run routing protocols, so to me they’re just routers.

DMVPN, itself, started as a Cisco proprietary solution, spearheaded by Cisco TAC, in particular a gentleman by the name of Mike Sullenberger, who served as an escalation engineer. He has since coauthored an IETF draft on DMVPN. In fact, one of the earliest tech docs on cisco.com touts how ‘for a 1000-site deployment, DMVPN reduces the configuration effort at the hub from 3900 lines to 13 lines’.

Getting back to Viptela SEN, the endpoints (aka routers) authenticate with the controller through an exchange of certificates. Different circuits from different providers (MPLS or broadband) can be balanced through L3 ECMP. Their datapath endpoints are commodity boxes with Cavium processors that give predictable (AES-256) encryption performance and tunnel to other endpoints (via peer-to-peer keys) as prescribed by the orchestrator/controller. If a site loses contact with the controller but still has dataplane connectivity to another site that it needs to communicate with, then traffic can still be forwarded (provided the keys are still valid), even though the entries are stale.
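To make the L3 ECMP idea a little more concrete, here is a minimal sketch of how a flow could be pinned to one of several equal-cost circuits by hashing its 5-tuple. The circuit names and the choice of hash are my own assumptions for illustration; this is not Viptela's implementation.

```python
import hashlib

# Hypothetical equal-cost circuits at a branch (names are illustrative only).
CIRCUITS = ["mpls-provider-a", "broadband-provider-b"]

def pick_circuit(src_ip, dst_ip, proto, src_port, dst_port, circuits=CIRCUITS):
    """Map a flow 5-tuple to one circuit using a stable hash.

    Every packet of the same flow hashes to the same circuit, which avoids
    reordering, while different flows spread across the circuits.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(circuits)
    return circuits[index]
```

Calling pick_circuit twice with the same 5-tuple always returns the same circuit, so balancing happens per flow rather than per packet.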

One of the differentiators between Viptela and others in this space is that they do not build overlay subnet-based routing adjacencies. This allows each line of business in a large company to have a network topology that is service driven rather than the other way round. Translated into technical terms, each line of business effectively has a VRF with different default routes, but a single peering connection to the controller. In DMVPN terms, the controller is like the headend router, or hub. The biggest difference that I could tell between Viptela SEN and DMVPN is the preference given to L3 BGP over L2 NHRP. One of the biggest advantages of BGP has always been outbound attribute manipulation, in the sense that a hub router could manipulate, via BGP MED, how a site could exit an AS. It is highly customizable. For example, the majority of the sites could exit via a corporate DMZ while some branches (like Devtest in an AWS VPC) could exit through a regional exit point. In DMVPN, NHRP (which is an L2 ARP-like discovery protocol) has more authority and doesn’t allow the outbound attribute manipulation that BGP, an L3 routing protocol, has been doing successfully throughout the Internet for decades. NHRP just isn’t smart enough to provide that level of control-plane complexity.

Viptela SEN allows each site to have different control policies.

The flexibility that Viptela SEN extends to a site can be at a control plane path level (e.g. ensure that certain VPNs trombone through a virtual path or service point like a firewall or IDS before exiting, as done in NFV with service chaining) or at a data plane level (e.g. PBR). Since it promises easy bring-up and configuration, to alleviate concerns about SOHO endpoint boxes being stolen, they have a GPS installed in these lower end boxes. The controller only allows these boxes to authenticate with it if they are at the prescribed GPS coordinates. If the box is moved, it is flagged as a potentially unauthorized move and second-factor authentication is required before the move is considered permissible. The controller can permit this but silently monitor the activities of this new endpoint box without its knowledge, akin to a honeypot. That’s innovation!
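The GPS check described above can be sketched roughly as follows. This is only my guess at the logic: the 1 km radius, the function names, and the three decision labels are assumptions, and the distance uses the standard haversine formula on a spherical Earth.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def admit(registered, reported, radius_km=1.0, second_factor_ok=False):
    """Decide whether an endpoint may authenticate (hypothetical policy).

    registered / reported are (latitude, longitude) tuples in degrees.
    """
    if distance_km(*registered, *reported) <= radius_km:
        return "allow"
    # Box has moved: require a second factor, then allow but keep watching it.
    return "allow-monitored" if second_factor_ok else "deny"
```

A box that reports its registered coordinates is simply allowed; one that shows up elsewhere is denied unless the second factor succeeds, in which case it is admitted in a monitored state.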

White Box switch readiness for prime time

Matthew Stone runs Cumulus Networks switches in his production network. He came on the Software Gone Wild podcast recently to talk about his experiences. Cumulus, Pica8, and Big Switch are the three biggest proponents of white box switching. While Cumulus focuses on the Linux abstractions for L2/L3, Pica8 focuses more on the OpenFlow implementation, and Big Switch on leveraging white boxes to form taps and, more recently, piecing together leaf-spine fabric pods.

I believe white box switches are years away from entering campus networks. Even managed services are not close. You won’t see a Meraki-style deployment of these white box switches in closets for a while. But Stone remains optimistic and makes solid points as an implementer. My favorite part is when he describes how Cumulus has rewritten the ifupdown script to simplify configuration for network switches (which typically have roughly 50 ports, compared to 4-port servers) and repackaged it as ifupdown2 for the Debian distribution. Have a listen.

Head End Replication and VXLAN Compliance

Arista Networks recently announced that its implementation of VXLAN no longer requires IP Multicast in the underlay network. Instead, the implementation will now rely on a technique called Head End Replication to forward BUM (Broadcast, Unknown Unicast, and Multicast) traffic in the VLANs that it transports. But first, let’s rewind to the original VXLAN specification.

Virtual eXtensible Local Area Networks were first defined in an Internet draft called draft-mahalingam-dutt-dcops-vxlan-00.txt in August 2011. It took some time for switch vendors to implement it, but now Broadcom’s Trident II supports it. Of course, software overlay solutions such as VMware NSX and Nuage Virtualized Services Platform (VSP) also implement it. Three years later, in August 2014, this draft became RFC 7348. The draft had 9 revisions to it, so it went up to draft-mahalingam-dutt-dcops-vxlan-09.txt, but there are no significant changes with respect to Multicast requirements in the underlay. They all say the same thing in section 4.2:

Consider the VM on the source host attempting to communicate with the destination VM using IP.  Assuming that they are both on the same subnet, the VM sends out an Address Resolution Protocol (ARP) broadcast frame. In the non-VXLAN environment, this frame would be sent out using MAC broadcast across all switches carrying that VLAN.

With VXLAN, a header including the VXLAN VNI is inserted at the beginning of the packet along with the IP header and UDP header. However, this broadcast packet is sent out to the IP multicast group on which that VXLAN overlay network is realized. To effect this, we need to have a mapping between the VXLAN VNI and the IP multicast group that it will use.

In essence, IP multicast is the control plane in VXLAN. But, as we know, IP multicast is very complex to configure and manage.
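As a rough illustration of the passage quoted above, the sketch below builds the 8-byte VXLAN header from RFC 7348 (a flags byte with the "I" bit set, 24 reserved bits, the 24-bit VNI, and 8 more reserved bits) and looks up the multicast group for a broadcast frame. The VNI-to-group table is made up for the example, and the outer IP and UDP headers are left out.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag defined in RFC 7348

# Hypothetical mapping between VXLAN VNIs and underlay IP multicast groups.
VNI_TO_GROUP = {5000: "239.1.1.1", 5001: "239.1.1.2"}

def vxlan_header(vni):
    """Return the 8-byte VXLAN header for a given 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, 24 reserved bits.
    # Word 2: VNI in the top 24 bits, 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)

def encapsulate_broadcast(vni, inner_frame):
    """Return (outer destination IP, VXLAN payload) for a broadcast frame."""
    group = VNI_TO_GROUP[vni]  # the multicast group this overlay is realized on
    return group, vxlan_header(vni) + inner_frame
```

The outer destination address is the only place multicast shows up: the inner ARP broadcast is untouched, and the VNI-to-group table is what the VTEPs have to agree on.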

In June 2013, Cisco deviated from the VXLAN standard in the Nexus 1000V in two ways:

  1. It makes copies of packets for each possible IP address at which the destination MAC address can be found, and sends them from the head-end of the VXLAN tunnel, the VXLAN Tunnel End Point (VTEP). These packets are then unicast to all VMs within the VXLAN segment, thereby precluding the need to have IP multicast in the core of the network.
  2. The Virtual Supervisor Module (VSM) of the Nexus 1000V acts as the control plane by maintaining the MAC address table of the VMs, which it then distributes, via a proprietary signaling protocol, to the Virtual Ethernet Module (VEM), which, in turn, acts as the data plane in the Nexus 1000V.

To their credit Cisco acknowledged that this mode is not compliant with the standard, although they do support a multicast-mode configuration as well. At that time they expressed hope that the rest of the industry would back their solution. Well, the RFC still states that an IP multicast backbone is needed.

This brings me to the original announcement from Arista. They claim in their press statement: “The Arista VXLAN implementation is truly open and standards based with the ability to interoperate with a wide range of data center switches.”

But nowhere else on their website do they state how they actually adhere to the standard. Cisco breaks the standard by conducting Head End Replication. Adam Raffe does a great job in explaining how this works (basically, the source VTEP will replicate the Broadcast or Multicast packet and send to all VMs in the same VXLAN). Arista should explain how exactly their enhanced implementation works.
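For readers who want to picture Head End Replication in general terms, here is a minimal sketch: instead of sending one copy to a multicast group, the source VTEP makes one unicast copy per remote tunnel endpoint in the segment. The flood list and addresses are hypothetical, and this is not a description of Cisco's or Arista's actual code.

```python
# Hypothetical flood list: VNI -> IP addresses of every VTEP in that segment.
FLOOD_LIST = {5000: ["10.0.0.2", "10.0.0.3", "10.0.0.4"]}

def head_end_replicate(vni, frame, local_vtep, flood_list=FLOOD_LIST):
    """Return one (destination VTEP, frame) unicast copy per remote VTEP.

    The local VTEP is skipped so a frame is never tunneled back to its source.
    """
    return [(vtep, frame) for vtep in flood_list.get(vni, []) if vtep != local_vtep]
```

The cost is visible in the return value: with N remote endpoints the head end transmits N copies of every BUM frame, which is the trade-off for not needing IP multicast in the underlay.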

Linux as a Switch Operating System: Five Lessons Learned

Although this post is nearly a year old, it is still gold. Ken Duda, the CTO of Arista Networks, described five lessons learned while supporting the Extensible Operating System (EOS), Arista’s Linux-based switch operating system. They are listed as:

  1. It’s okay to leave the door unlocked.
  2. Preserve the integrity of the Linux core.
  3. Focus on state, not messages.
  4. Keep your hands out of the kernel.
  5. Provide familiar interfaces to ease adoption.

Definitely worth a read.

Harnessing the Raw Performance of x86 – Snabb Switch

Recently I was listening to an episode of Ivan Pepelnjak’s Software Gone Wild podcast featuring Snabb Switch that inspired me to write this post. Snabb Switch is an open source program, developed by Luke Gorrie, for processing virtualized Ethernet traffic for greenfield deployments using x86 hardware. It caught my attention because the recent announcements of Intel’s networking capabilities at IDF14 were fresh in my mind. Snabb Switch is a networking framework that also defines different building blocks for I/O (such as input/Rx links and output/Tx links), Ethernet interfaces, and packet processing elements leveraging x86 servers and Intel NICs. It speaks natively to Ethernet hardware, hypervisors, and the Linux kernel by virtue of a user-space executable. The cornerstone of Snabb Switch is its super light footprint, which enables it to process tens of millions of Ethernet packets per second per core. Moreover, it has been known to push 200 Gbps on an x86 server. Pretty impressive for an open source program.

Snabb Switch uses the Lua programming language, which is a lightweight scripting language that can make some function calls and change the configuration in real time. It leverages LuaJIT, a Just-In-Time compiler that compiles Lua code for x86 in real time while switching packets. This technology is used in the video games industry as well as high frequency trading in the financial industry, but it is not yet very prevalent in the networking industry. The biggest exception is CloudFlare, the CDN that optimizes website delivery by blocking DoS attacks.

Snabb Switch rides the wave of the vast improvements in hardware performance on x86 servers and NICs. In a nutshell, networking applications on Linux have been moved out of the kernel and into user space. It used to be that each packet arriving from the network at the NIC of an x86-based Linux server would be sent up to the kernel, which would then have to wake up, via an interrupt signal, and process it before sending it out on the network. This was a very time-consuming process and it also made it very difficult for application developers to write networking code because it required intricate knowledge of the kernel. However, with faster hardware, developers realized that with so many packets arriving each microsecond, waking up the kernel to process each packet was too inefficient. Instead, it became more prudent to assume a continuous stream of packets and to set aside a dedicated pool of memory for this traffic. In other words, the NIC is mapped directly into the memory of the user process. Snabb Switch does this with its own driver for the NIC (Intel NICs for now) that drives features such as an embedded Ethernet switch and QoS in around 850 lines of Lua code.
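The shift from interrupts to polling can be sketched in a few lines. This is a toy model in Python, not Snabb Switch's Lua code: the two deques stand in for receive and transmit rings in memory shared with the NIC, and the batch size of 32 is an arbitrary assumption.

```python
from collections import deque

def poll_once(rx_ring, tx_ring, batch=32):
    """Move up to `batch` packets from the receive ring to the transmit ring.

    Nothing blocks and nothing is woken per packet: the caller simply keeps
    calling this in a loop and handles whatever has already arrived.
    """
    moved = 0
    while rx_ring and moved < batch:
        tx_ring.append(rx_ring.popleft())
        moved += 1
    return moved
```

A forwarding process would call poll_once repeatedly on a dedicated core, so the per-packet cost is a ring read and a ring write rather than an interrupt and a trip through the kernel.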

Generally speaking, people with networking backgrounds have traditionally assumed x86-based servers to be limited in their packet-processing capabilities (attributed to PCI bus bottlenecks, slow memory, slow CPU, etc). In reality, the raw performance that can be extracted from x86-based hardware is quite high. 800 Gbps can be attained from DRAM banks, 600 Gbps can be attained from PCI Express, and the interconnect between CPUs is also hundreds of Gbps. There is no reason one cannot attain 500 Gbps using a dual core Xeon server. The bottleneck is quite clearly the software. Of course this works best (10 million packets per second per core) for simple cases such as just sending packets in and out. But for slightly more complicated scenarios, such as accessing an unpredictable address in memory, performance can drop by an order of magnitude.

Snabb Switch is known to have generated 200 Gbps out of a single core at just 10% CPU utilization, which is quite incredible. The way that Gorrie did this is by reading in 32,000 packets into a PCAP file, pushing them out on 20 10G NICs, and programming those ports to run in a loop.
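The arithmetic behind these numbers is easy to check. The sketch below adds up the port speeds and works out the packet rate for a 10G port at line rate; the 20 bytes of per-frame overhead (preamble, start-of-frame delimiter, and inter-frame gap) are standard Ethernet, but the 64-byte frame size in the usage note is my assumption, since the packet sizes used in the test are not stated.

```python
ETHERNET_OVERHEAD_BYTES = 20  # 7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap

def aggregate_gbps(ports, gbps_per_port):
    """Total offered load when every port runs at line rate."""
    return ports * gbps_per_port

def line_rate_pps(link_gbps, frame_bytes):
    """Packets per second needed to fill a link with frames of a given size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_on_wire
```

aggregate_gbps(20, 10) gives the 200 Gbps figure, and line_rate_pps(10, 64) comes to roughly 14.88 million packets per second per port for minimum-size frames, which puts the "10 million packets per second per core" figure in context.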

The outcome of Snabb Switch is quite similar to Intel’s DPDK, in which there is user space-based forwarding, no kernel interrupts, and CPUs are dedicated to particular NICs. However, Snabb Switch is a lightweight platform for ground-up designs, whereas DPDK is intended to allow developers, who have written applications that run inside the kernel, to port their mature code to user space. For newer application designs, user space development is more prevalent because of the higher traffic levels and performance expectations. Snabb Switch’s modus operandi is to poll the NIC for new packets to process rather than be interrupted by it. It runs a scheduler in a polling loop with multiple parallel traffic processes on separate CPUs.

Snabb Switch can also run as a high performance NFV switch for OpenStack environments. The way it can do this is by removing the kernel from the forwarding path and allowing the user space program to talk directly to the device driver on the guest VM. The VMs are only able to address their own memory that they have allocated themselves. A software switch cannot allocate memory to a VM. Instead, for each VM, a separate TX/RX queue in hardware is provisioned in the NIC. So when a VM gives a buffer for packets, the buffer is translated from a standard virtio format (in KVM) directly to hardware format. In other words, when a packet comes in from the network, the NIC determines which VM should get it (typically by looking up the destination MAC address and VLAN ID), picks the appropriate hardware queue with memory that belongs to that VM, grabs a buffer and copies the data from the NIC to that VM. Since Snabb Switch acts as the translation engine between standard virtio and native hardware on the standard Intel NIC, there is no need to write or install a specific device driver for guest VMs to access the hardware.
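Here is a small model of the receive path just described: look up the VM by destination MAC and VLAN ID, take a buffer that the VM itself posted, and copy the packet into it. The table contents, names, and drop behaviour are assumptions for illustration; the real work happens in NIC hardware queues and virtio rings, not Python lists.

```python
# Hypothetical lookup table: (destination MAC, VLAN ID) -> VM name.
QUEUE_TABLE = {
    ("52:54:00:00:00:01", 100): "vm-a",
    ("52:54:00:00:00:02", 200): "vm-b",
}

def deliver(dst_mac, vlan_id, payload, posted_buffers, table=QUEUE_TABLE):
    """Copy a packet into a buffer posted by the destination VM.

    posted_buffers maps a VM name to a list of bytearrays that the VM
    allocated in its own memory. Returns (vm, buffer), or None on a drop.
    """
    vm = table.get((dst_mac.lower(), vlan_id))
    queue = posted_buffers.get(vm, [])
    # Drop if no VM matches, the VM posted no buffer, or the packet will not fit.
    if vm is None or not queue or len(payload) > len(queue[0]):
        return None
    buf = queue.pop(0)
    buf[:len(payload)] = payload  # same-length slice assignment, buffer size unchanged
    return vm, buf
```

Note that the switch never allocates memory on a VM's behalf: if the VM has not posted a buffer, the packet is dropped in this sketch.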

I believe that Snabb Switch has a lot of promise though it may take a while for deployments to be more mainstream.