Category Archives: Ethernet

White Box switch readiness for prime time

Matthew Stone runs Cumulus Networks switches in his production network. He came on the Software Gone Wild podcast recently to talk about his experiences. Cumulus, Pica8, and Big Switch are the three biggest proponents of white box switching. While Cumulus focuses on Linux abstractions for L2/L3, Pica8 focuses more on its OpenFlow implementation, and Big Switch on leveraging white boxes to build tap fabrics and, more recently, to piece together leaf-spine fabric pods.

I believe white box switches are years away from entering campus networks. Even managed services are not close. You won’t see a Meraki-style deployment of white box switches in wiring closets for a while. But Stone remains optimistic and makes solid points as an implementer. My favorite part is when he describes how Cumulus rewrote the ifupdown script to simplify configuration on network switches (which typically have around 50 ports, compared to a server’s 4) and contributed it back to the Debian distribution as ifupdown2. Have a listen.
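
To give a flavor of why that rewrite matters, here is a minimal sketch of what an /etc/network/interfaces stanza might look like with ifupdown2 on a Cumulus-style switch. The port names (swp1-48) and the glob and bridge-vids keywords are illustrative assumptions on my part, not something taken from the podcast, and may not match the exact syntax on a given release:

# Hypothetical ifupdown2 configuration for a 48-port switch
# (interface names and keywords are illustrative)
auto bridge
iface bridge
    bridge-ports glob swp1-48    # expand a whole range of switch ports in one line
    bridge-vids 100 200          # VLANs carried on the bridge

Expressing 48 ports in one stanza, rather than repeating a 4-line block per interface the way a server admin would, is exactly the kind of switch-oriented convenience the rewrite was aiming for.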

NBASE-T or MGBASE-T?

Last week I wrote about five new speeds that the Ethernet Alliance (effectively the marketing arm for IEEE Ethernet standards) is working on. The lower speeds, 2.5 Gbps and 5 Gbps, are called MGBASE-T, and according to this post from the Ethernet Alliance, the MGBASE-T Alliance is overseeing the development of these standards outside of the IEEE. This week, news broke that leading PHY vendor Aquantia is teaming up with Cisco, Freescale, and Xilinx to form the NBASE-T Alliance. This raises some questions about the work and causes that the MGBASE-T Alliance and the NBASE-T Alliance are each committed to.

Both NBASE-T and MGBASE-T are trademarks of Aquantia, and both the MGBASE-T Alliance and the NBASE-T Alliance are Delaware corporations. It appears that the MGBASE-T Alliance was formed around June 2014, while the NBASE-T Alliance is newer, dating to September 2014.

The NBASE-T Alliance website defines the technology as follows:

NBASE-T™  is a proven technology boosting the speed of twisted pair copper cabling up to 100 meters in length well beyond the designed limits of 1 Gbps.

Capable of reaching 2.5 and 5 Gigabits per second over 100m of Cat 5e cable, the disruptive NBASE-T solution allows a new type of signaling over twisted-pair cabling. Should the silicon have the capability, auto-negotiation can allow the NBASE-T solution to accurately select the best speed: 100 Megabit Ethernet (100MbE), 1 Gigabit Ethernet (GbE), 2.5 Gigabit Ethernet (2.5GbE) and 5 Gigabit Ethernet (5GbE).
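
As a conceptual illustration of that last point (and not the actual IEEE 802.3 auto-negotiation state machine), selecting the “best” speed simply means resolving to the highest rate that both link partners advertise. A toy sketch in Lua:

-- Toy illustration of speed resolution: pick the highest speed (in Mbps)
-- advertised by both link partners. Purely conceptual, not the 802.3
-- auto-negotiation protocol itself.
local function resolve_speed (local_adv, peer_adv)
   local best = 0
   for _, speed in ipairs(local_adv) do
      for _, peer in ipairs(peer_adv) do
         if speed == peer and speed > best then best = speed end
      end
   end
   return best
end

print(resolve_speed({100, 1000, 2500, 5000}, {100, 1000, 2500}))  -- prints 2500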

So what happens to MGBASE-T, given that Aquantia was part of both? My hunch is that it fizzles away and that the other vendors who were working on it (no names here) have lost the race to Cisco, Freescale, and Xilinx.

Ethernet Alliance unveils five new speeds

This week Network World laid out some details of the work the IEEE 802.3 community and the Ethernet Alliance are doing with respect to new data rates. As mentioned in this blog post, while there are five shipping speeds of Ethernet (100 Mbps, 1 Gbps, 10 Gbps, 40 Gbps, and 100 Gbps), five new speeds are currently being worked on (2.5 Gbps, 5 Gbps, 25 Gbps, 50 Gbps, and 400 Gbps). The last time Ethernet got this sexy was when promiscuous mode was introduced.

Some of the drivers for these new speeds are the adoption rates of the older speeds. As detailed in the July 2014 IEEE Call for Interest, initial adoption of 10G, 40G, and 100G came in 2004, 2012, and 2015 (anticipated) respectively, but because these speeds are turning out to be cost prohibitive, the transition to higher speeds has been slower than previously forecast. For example, the 1G -> 10G transition has repeatedly slipped (from 2012 to 2014 and now to 2016). This creates a window where new technology can deliver higher port speeds at lower cost. So, as an example, SFP+-style pluggable technology can be leveraged to deliver 25 Gbps over a single lane and 50 Gbps over two lanes.

The 2.5 and 5 Gbps speeds (known as MGBASE-T) address the growing demands of BYOD in campus networks. Many newer APs ship with 802.11ac, and the standard’s second wave, expected in 2015, will push the uplinks (or backhauls) between the APs and the access switches beyond 1 Gbps, hence the need for multi-gigabit rates. The key requirement is to reuse the existing cabling infrastructure: Cat 5e and Cat 6 would still be supported over the usual 100 meters, with no need to rip and replace cables.

Ethernet has come a long way since the days of the 2.94 Mbps flavor that Bob Metcalfe invented. There is very little in common between the Ethernet standards the IEEE publishes today and the original specification. One thing that has remained constant, however, is the ability to evolve with market needs, from single-pair automotive Ethernet to four-pair PoE and everything in between. More on this in another post.

3.2 Tbps on a single chip – Merchant silicon cranks it up

Recently I stumbled upon the blog of David Gee in the UK. He covered Cavium’s acquisition of Xpliant as well as Broadcom’s announcement of the StrataXGS Tomahawk chipset less than two months later. The remarkable thing is that both chipsets are capable of 3.2 Tbps and feature programmability, something the Trident II (a 1.28 Tbps chipset) doesn’t have. The Trident II is used in Cisco’s Nexus 9000, Juniper’s QFX5100, and HP’s 5930, to name a few switches. There had been great anticipation for the Trident II because it supports VXLAN, which the original Trident did not. However, the most recent tunnel encapsulation protocol, Generic Network Virtualization Encapsulation (GENEVE), isn’t supported on the Trident II. With Tomahawk and Xpliant, by virtue of their programmable pipelines, it should be supportable, at least in theory.

Broadcom’s press announcement page contains an impressive array of quotes from vendors such as Brocade, Big Switch, Cumulus, HP, Juniper, Pica8, and VMware. It remains to be seen which vendors will adopt Xpliant.

Harnessing the Raw Performance of x86 – Snabb Switch

Recently I was listening to an episode of Ivan Pepelnjak’s Software Gone Wild podcast featuring Snabb Switch, and it inspired me to write this post. Snabb Switch is an open source program, developed by Luke Gorrie, for processing virtualized Ethernet traffic on commodity x86 hardware in greenfield deployments. It caught my attention because the recent announcements of Intel’s networking capabilities at IDF14 were fresh in my mind. Snabb Switch is a networking framework that defines building blocks for I/O (such as input/Rx links and output/Tx links), Ethernet interfaces, and packet processing elements, all running on x86 servers with Intel NICs. It speaks natively to Ethernet hardware, hypervisors, and the Linux kernel from a single user-space executable. The cornerstone of Snabb Switch is its very light footprint, which lets it process tens of millions of Ethernet packets per second per core; it has been shown to push 200 Gbps on a single x86 server. Pretty impressive for an open source program.
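
To make the “building blocks” idea concrete, here is a minimal sketch of how a Snabb-style app graph gets wired together, based on my reading of the project’s documentation. The module paths and app names (basic_apps.Source, basic_apps.Sink) are assumptions that may not match the current source tree exactly:

-- Minimal Snabb-style app graph: a synthetic packet source wired to a sink.
local config     = require("core.config")            -- graph description
local engine     = require("core.app")               -- the engine that runs it
local basic_apps = require("apps.basic.basic_apps")  -- assumed module path

local c = config.new()
config.app(c, "source", basic_apps.Source)           -- app that generates packets
config.app(c, "sink",   basic_apps.Sink)             -- app that discards packets
config.link(c, "source.output -> sink.input")        -- an output/Tx to input/Rx link

engine.configure(c)
engine.main({duration = 1})                          -- run the graph for one second

Apps, the links between them, and the engine that drives them are essentially the whole programming model; everything else (NIC drivers, virtio, pcap I/O) is just another app in the graph.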

Snabb Switch is written in Lua, a lightweight scripting language, which makes it easy to call into functions and change the configuration in real time. It leverages LuaJIT, a just-in-time compiler that compiles Lua code to native x86 machine code on the fly, even while packets are being switched. This technology is widely used in the video game industry and in high-frequency trading, but it is not very prevalent in networking yet. The biggest exception is CloudFlare, the CDN that accelerates website delivery and blocks DoS attacks.

Snabb Switch rides the wave of vast improvements in the performance of x86 servers and NICs. In a nutshell, networking applications on Linux have moved out of the kernel and into user space. It used to be that each packet arriving at the NIC of an x86-based Linux server would be handed to the kernel, which had to wake up via an interrupt, process the packet, and send it back out on the network. This was very time-consuming, and it also made writing networking code difficult for application developers because it required intimate knowledge of the kernel. With faster hardware, developers realized that with so many packets arriving every microsecond, waking up the kernel for each one was too inefficient. Instead, it became more prudent to assume a continuous stream of packets and to set aside a dedicated pool of memory for that traffic: the NIC’s packet buffers are mapped directly into the memory of the user-space process. Snabb Switch does this by implementing its own user-space driver for the NIC (Intel NICs for now), which drives features such as the embedded Ethernet switch and QoS in around 850 lines of Lua code.
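
To illustrate the poll-driven model, here is a sketch of what a tiny Snabb-style packet-processing app might look like, using the link.receive/link.transmit primitives from Snabb’s core. Treat it as a conceptual example rather than a drop-in app:

-- Conceptual poll-mode app: drain whatever packets have accumulated on the
-- input link and forward them to the output link; no interrupts involved.
local link = require("core.link")

Forwarder = {}

function Forwarder:new ()
   return setmetatable({}, { __index = Forwarder })
end

function Forwarder:push ()
   local input, output = self.input.input, self.output.output
   while not link.empty(input) do
      local p = link.receive(input)   -- take a packet off the input link
      link.transmit(output, p)        -- hand it to the output link
   end
end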

Generally speaking, people with networking backgrounds have traditionally assumed x86-based servers to be limited in their packet-processing capabilities (blaming PCI bus bottlenecks, slow memory, slow CPUs, and so on). In reality, the raw performance that can be extracted from x86 hardware is quite high: DRAM banks can sustain on the order of 800 Gbps, PCI Express around 600 Gbps, and the interconnect between CPUs is also good for hundreds of Gbps. There is no reason one cannot attain 500 Gbps on a dual-socket Xeon server. The bottleneck is quite clearly the software. Of course, this works best (10 million packets per second per core) for simple cases such as just moving packets in and out; for slightly more complicated scenarios, such as touching an unpredictable address in memory, performance can drop by an order of magnitude.

Snabb Switch has been shown to generate 200 Gbps from a single core at just 10% CPU utilization, which is quite incredible. Gorrie did this by reading 32,000 packets from a PCAP file, pushing them out of twenty 10G NIC ports, and programming those ports to replay the traffic in a loop.

The outcome of Snabb Switch is quite similar to Intel’s DPDK: forwarding happens in user space, there are no kernel interrupts, and CPUs are dedicated to particular NICs. However, Snabb Switch is a lightweight platform for ground-up designs, whereas DPDK is intended to let developers who have written applications inside the kernel port their mature code to user space. For newer application designs, user-space development is becoming the norm because of higher traffic levels and performance expectations. Snabb Switch’s modus operandi is to poll the hardware for new packets rather than wait for interrupts: it runs a scheduler in a polling loop, with multiple parallel traffic processes pinned to separate CPUs.
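
Conceptually, that scheduler amounts to a loop like the one below, a simplification of what Snabb calls a “breath”; the real engine adds timers, back-pressure, and per-link accounting, and the apps table here is just an assumed list of app instances in the graph:

-- Simplified polling scheduler: every iteration ("breath"), ask each app
-- with a pull method to bring new packets in from hardware, then let each
-- app with a push method process whatever is queued on its input links.
while true do
   for _, app in ipairs(apps) do
      if app.pull then app:pull() end    -- inhale: read packets from NICs
   end
   for _, app in ipairs(apps) do
      if app.push then app:push() end    -- process: move packets along the graph
   end
end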

Snabb Switch can also run as a high-performance NFV switch in OpenStack environments. It does this by removing the kernel from the forwarding path and letting the user-space program talk directly to the device driver on behalf of the guest VM. VMs can only address the memory they have allocated themselves, and a software switch cannot allocate memory on a VM’s behalf. Instead, a separate TX/RX queue pair is provisioned in the NIC hardware for each VM. When a VM posts a buffer for packets, the buffer is translated from the standard virtio format (used by KVM) directly into the hardware’s format. In other words, when a packet arrives from the network, the NIC determines which VM should receive it (typically by looking up the destination MAC address and VLAN ID), picks the hardware queue whose memory belongs to that VM, grabs a buffer, and copies the data from the NIC into that VM’s memory. Since Snabb Switch acts as the translation engine between standard virtio and the native hardware of a standard Intel NIC, there is no need to write or install a special device driver in the guest VM to access the hardware.
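
A conceptual sketch of that dispatch step, with invented table and field names purely for illustration (this is not Snabb or NIC firmware code):

-- Conceptual dispatch: map (destination MAC, VLAN ID) to the hardware
-- RX queue provisioned for the owning VM. Names are illustrative only.
local vm_queues = {
   ["52:54:00:aa:bb:01/100"] = 1,   -- VM1's queue: its MAC on VLAN 100
   ["52:54:00:aa:bb:02/100"] = 2,   -- VM2's queue
}

local function pick_queue (dst_mac, vlan_id)
   return vm_queues[dst_mac .. "/" .. vlan_id]  -- nil means no VM owns this flow
end

print(pick_queue("52:54:00:aa:bb:01", 100))  -- 1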

I believe that Snabb Switch has a lot of promise, though it may take a while for deployments to become mainstream.