Category Archives: Ethernet

White Box switch readiness for prime time

Matthew Stone runs Cumulus Networks switches in his production network. He came on the Software Gone Wild podcast recently to talk about his experiences. Cumulus, Pica8, and Big Switch are the three biggest proponents of white box switching. While Cumulus focuses on Linux abstractions for L2/L3, Pica8 focuses more on its OpenFlow implementation, and Big Switch on leveraging white boxes as taps and, more recently, on piecing together leaf-spine fabric pods.

I believe white box switches are years away from entering campus networks. Even managed services are not close. You won’t see a Meraki-style deployment of these white box switches in closets for a while. But Stone remains optimistic and makes solid points as an implementer. My favorite part is when he describes how Cumulus rewrote the ifupdown script to simplify configuration for network switches (which typically have around 50 ports, compared to the handful on a server) and contributed it back to the Debian distribution as ifupdown2. Have a listen.
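To give a rough flavor of what that rewrite buys you, ifupdown2 lets a 50-port switch be configured with glob patterns instead of dozens of near-identical stanzas. The snippet below is a hypothetical /etc/network/interfaces sketch in Cumulus Linux style; the interface names and VLAN numbers are my own, not from the podcast.

```
# Hypothetical ifupdown2 sketch -- names and VLANs are illustrative
auto bridge
iface bridge
    bridge-vlan-aware yes
    bridge-ports glob swp1-48    # one glob covers 48 front-panel ports
    bridge-vids 10 20 30         # VLANs trunked to every member port
```

On a plain server with four NICs, the glob syntax would barely matter; on a switch it is the difference between six lines and a few hundred.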

NBASE-T or MGBASE-T?

Last week I wrote about five new speeds that the Ethernet Alliance (the marketing arm of the IEEE) is working on. The lower speeds, 2.5 Gbps and 5 Gbps, are called MGBASE-T, and according to this post from the Ethernet Alliance, the MGBASE-T Alliance is overseeing the development of these standards outside the IEEE. This week, news broke that leading PHY vendor Aquantia is teaming up with Cisco, Freescale, and Xilinx to form the NBASE-T Alliance. This raises some questions about the work and causes that the MGBASE-T Alliance and the NBASE-T Alliance are committed to.

Both NBASE-T and MGBASE-T are trademarks of Aquantia, and both the MGBASE-T Alliance and the NBASE-T Alliance are Delaware corporations. It appears that the MGBASE-T Alliance was formed around June 2014, while the NBASE-T Alliance is newer, dating to September 2014.

The NBASE-T Alliance website defines the technology as follows:

NBASE-T™  is a proven technology boosting the speed of twisted pair copper cabling up to 100 meters in length well beyond the designed limits of 1 Gbps.

Capable of reaching 2.5 and 5 Gigabits per second over 100m of Cat 5e cable, the disruptive NBASE-T solution allows a new type of signaling over twisted-pair cabling. Should the silicon have the capability, auto-negotiation can allow the NBASE-T solution to accurately select the best speed: 100 Megabit Ethernet (100MbE), 1 Gigabit Ethernet (GbE), 2.5 Gigabit Ethernet (2.5GbE) and 5 Gigabit Ethernet (5GbE).

So what happens to MGBASE-T, given that Aquantia was part of both? My hunch is that it fizzles away, and that the other vendors working on it (no names here) have lost the race to Cisco, Freescale, and Xilinx.

Ethernet Alliance unveils five new speeds

This week Network World laid out some details of the work the IEEE group, the Ethernet Alliance, is doing with respect to new data rates. As mentioned in this blog post, while there are 5 shipping speeds of Ethernet (100 Mbps, 1 Gbps, 10 Gbps, 40 Gbps, and 100 Gbps), there are 5 new speeds that are currently being worked on (2.5 Gbps, 5 Gbps, 25 Gbps, 50 Gbps, and 400 Gbps). The last time Ethernet got this sexy was when promiscuous mode was introduced.

Some of the drivers for these new speeds are the adoption rates of the older speeds. As detailed in the July 2014 IEEE Call for Interest, while the initial adoption of 10G, 40G, and 100G came in 2004, 2012, and 2015 (anticipated) respectively, these speeds are turning out to be cost prohibitive, and the transition to higher speeds has been slower than previously forecast. For example, the 1G -> 10G transition has repeatedly moved out (from 2012 to 2014 to 2016 now). This creates a window where new technology can provide higher port speeds at lower cost. So, as an example, SFP+ technology can be leveraged for 25 Gbps as a single lane and for 50 Gbps as two lanes.

The 2.5 and 5 Gbps speeds (known as MGBASE-T) address the growing demands of BYOD in campus networks. Many newer APs ship with 802.11ac, and this Wi-Fi standard will have a second wave in 2015 in which the uplinks (or backhaul) between the APs and the access switches will run at multi-gigabit rates. The key requirement here is the ability to reuse the existing cabling infrastructure: Cat 5e and Cat 6 would still be supported over the usual 100 meters, and there would be no need to rip and replace cables.

Ethernet has come a long way since the days of the 2.94 Mbps flavor that Bob Metcalfe invented. There is very little in common between the Ethernet standards we have today from the IEEE and the original specification. One thing that is common, however, is the ability to evolve according to market needs, from single-pair vehicular Ethernet to four-pair PoE and everything in between. More on this in another post.

3.2 Tbps on a single chip – Merchant silicon cranks it up

Recently I stumbled upon the blog of David Gee in the UK. He covered the Cavium acquisition of Xpliant as well as Broadcom’s announcement of the StrataXGS Tomahawk chipset less than two months later. The remarkable thing is that both chipsets are capable of 3.2 Tbps and feature programmability, something the Trident II (a 1.28 Tbps chipset) didn’t have. The Trident II is used in Cisco’s Nexus 9000, Juniper’s QFX5100, and HP’s 5930, to name a few switches. There had been great anticipation for the Trident II because it supports VXLAN, which the Trident did not. However, the most recent tunnel encapsulation protocol, Generic Network Virtualization Encapsulation (GENEVE), isn’t supported on the Trident II. With Tomahawk and Xpliant, because of their programmable nature, it should be supportable, at least in theory.

Broadcom’s press announcement page contains an impressive array of quotes from vendors such as Brocade, Big Switch, Cumulus, HP, Juniper, Pica8, and VMware, to name a few. It remains to be seen which vendors will implement Xpliant.

Harnessing the Raw Performance of x86 – Snabb Switch

Recently I was listening to an episode of Ivan Pepelnjak’s Software Gone Wild podcast featuring Snabb Switch that inspired me to write this post. Snabb Switch is an open source program, developed by Luke Gorrie, for processing virtualized Ethernet traffic in greenfield deployments on x86 hardware. It caught my attention because the recent announcements of Intel’s networking capabilities at IDF14 were fresh in my mind. Snabb Switch is a networking framework that defines building blocks for I/O (such as input/Rx links and output/Tx links), Ethernet interfaces, and packet processing elements, leveraging x86 servers and Intel NICs. It speaks natively to Ethernet hardware, hypervisors, and the Linux kernel by virtue of being a user-space executable. The cornerstone of Snabb Switch is its super light footprint, which enables it to process tens of millions of Ethernet packets per second per core. Moreover, it has been known to push 200 Gbps on an x86 server. Pretty impressive for an open source program.

Snabb Switch uses the Lua programming language, a lightweight scripting language well suited to making function calls and changing configuration in real time. It leverages LuaJIT, a Just-In-Time compiler that compiles Lua code to x86 machine code on the fly while switching packets. This technology is used in the video game industry as well as in high-frequency trading in the financial industry, but it is not yet very prevalent in the networking industry. The biggest exception is CloudFlare, the CDN that optimizes website delivery and blocks DoS attacks.

Snabb Switch rides the wave of vast improvements in the performance of x86 servers and NICs. In a nutshell, networking applications on Linux have moved out of the kernel and into user space. It used to be that each packet arriving at the NIC of an x86-based Linux server was sent up to the kernel, which had to wake up via an interrupt and process it before sending it back out on the network. This was very time-consuming, and it also made it difficult for application developers to write networking code, because doing so required intricate knowledge of the kernel. With faster hardware, developers realized that with so many packets arriving each microsecond, waking up the kernel for each one was too inefficient. Instead, it became more prudent to assume a continuous stream of packets and set aside a dedicated pool of memory for this traffic. In other words, the NIC’s buffers are mapped directly into the memory of the user process. Snabb Switch does this with its own driver for the NIC (Intel NICs for now), which drives features such as an embedded Ethernet switch and QoS in around 850 lines of Lua code.

Generally speaking, people with networking backgrounds have traditionally assumed x86-based servers to be limited in their packet-processing capabilities (attributed to PCI bus bottlenecks, slow memory, slow CPU, etc). In reality, the raw performance that can be extracted from x86-based hardware is quite high. 800 Gbps can be attained from DRAM banks, 600 Gbps can be attained from PCI Express, and the interconnect between CPUs is also hundreds of Gbps. There is no reason one cannot attain 500 Gbps using a dual core Xeon server. The bottleneck is quite clearly the software. Of course this works best (10 million packets per second per core) for simple cases such as just sending packets in and out. But for slightly more complicated scenarios, such as accessing an unpredictable address in memory, performance can drop by an order of magnitude.
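A back-of-the-envelope calculation makes the budget concrete. Using the roughly 10 million packets per second per core figure from above, and assuming an illustrative 3 GHz clock (my number, not from the podcast), each packet gets only a few hundred CPU cycles:

```python
# Per-packet budget at ~10 Mpps/core, assuming a 3 GHz core (illustrative).
clock_hz = 3e9
pps_per_core = 10e6           # ~10 million packets/sec/core (simple forwarding)

ns_per_packet = 1e9 / pps_per_core
cycles_per_packet = clock_hz / pps_per_core

print(ns_per_packet)      # 100.0 ns to handle each packet
print(cycles_per_packet)  # 300.0 cycles -- a single cache miss (~100+ cycles)
                          # eats a third of the budget, which is why an
                          # unpredictable memory access can drop throughput
                          # by an order of magnitude
```

This is why the simple send-in/send-out cases fly while pointer-chasing workloads fall off a cliff.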

Snabb Switch is known to have generated 200 Gbps from a single core at just 10% CPU utilization, which is quite incredible. Gorrie did this by reading 32,000 packets from a PCAP file, pushing them out on 20 10G NICs, and programming those ports to run in a loop.

The outcome of Snabb Switch is quite similar to Intel’s DPDK: both provide user space-based forwarding, avoid kernel interrupts, and dedicate CPUs to particular NICs. However, Snabb Switch is a lightweight platform for ground-up designs, whereas DPDK is intended to let developers who have written applications that run inside the kernel port their mature code to user space. For newer application designs, user-space development is more prevalent because of the higher traffic levels and performance expectations. Snabb Switch’s modus operandi is to poll for new packets rather than have each packet raise an interrupt. It runs a scheduler in a polling loop, with multiple parallel traffic processes on separate CPUs.
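The polling loop described above can be sketched in a few lines. This is a toy model, not Snabb code: the FakeRing class stands in for a memory-mapped NIC RX ring, and the worker asks it for batches of packets instead of blocking on interrupts.

```python
from collections import deque

class FakeRing:
    """Stand-in for a memory-mapped NIC RX ring (illustrative only)."""
    def __init__(self, packets):
        self._ring = deque(packets)

    def poll(self, budget=32):
        # Return up to `budget` packets without blocking -- an empty list
        # just means "nothing arrived yet, spin and ask again".
        batch = []
        while self._ring and len(batch) < budget:
            batch.append(self._ring.popleft())
        return batch

def run_worker(ring):
    """Poll-mode worker: never sleeps, processes packets in batches."""
    processed = 0
    while True:
        batch = ring.poll()
        if not batch:
            break                # toy version: stop once the ring drains
        processed += len(batch)  # real code would forward each packet here
    return processed

print(run_worker(FakeRing(range(100))))  # 100
```

In a real deployment one such loop is pinned to each CPU that owns a NIC queue, which is exactly the scheduler-per-CPU arrangement described above.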

Snabb Switch can also run as a high-performance NFV switch for OpenStack environments. It does this by removing the kernel from the forwarding path and letting the user-space program talk directly to the device driver on the guest VM. The VMs can only address memory they have allocated themselves; a software switch cannot allocate memory to a VM. Instead, for each VM, a separate TX/RX queue is provisioned in the NIC hardware. When a VM supplies a buffer for packets, the buffer is translated from the standard virtio format (in KVM) directly to the hardware format. In other words, when a packet comes in from the network, the NIC determines which VM should get it (typically by looking up the destination MAC address and VLAN ID), picks the hardware queue whose memory belongs to that VM, grabs a buffer, and copies the data from the NIC to that VM. Since Snabb Switch acts as the translation engine between standard virtio and the native hardware of the standard Intel NIC, there is no need to write or install a specific device driver for guest VMs to access the hardware.
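The demultiplexing step above can be sketched as a lookup table keyed by (destination MAC, VLAN ID). This is purely illustrative: in reality the lookup and the copy happen in NIC hardware, and the MAC addresses below are made up.

```python
# One hardware RX queue per VM, keyed by (dst MAC, VLAN ID) -- illustrative.
vm_queues = {
    ("52:54:00:aa:bb:01", 100): [],   # queue backed by VM 1's own memory
    ("52:54:00:aa:bb:02", 100): [],   # queue backed by VM 2's own memory
}

def nic_receive(frame):
    """Copy an arriving frame into the queue owned by the matching VM."""
    key = (frame["dst_mac"], frame["vlan"])
    queue = vm_queues.get(key)
    if queue is not None:
        queue.append(frame["payload"])  # stands in for the DMA copy
    # else: no matching VM -> frame is dropped (or sent to a default queue)

nic_receive({"dst_mac": "52:54:00:aa:bb:01", "vlan": 100, "payload": b"hello"})
print(vm_queues[("52:54:00:aa:bb:01", 100)])  # [b'hello']
```

The key property is that each queue is backed only by memory the VM allocated itself, so no VM can read another VM's traffic.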

I believe that Snabb Switch has a lot of promise though it may take a while for deployments to be more mainstream.

IDF14 – Will Bare Metal Servers Obviate Bare Metal Switches?

Recently I wrote about the Networking Field Day 8 presentations on Nuage Networks and Big Switch Networks. A noticeable absentee at Networking Field Day 8 was the co-host of the popular Packet Pushers show, Greg Ferro. What was so important that kept Mr. Ferro away from NFD8? Well, it turns out that he was attending Intel Developer Forum 2014 and discussed his findings on his show – The Network Break (I guess you can call me a Greg Ferro stalker). This prompted me to dig a bit deeper into Intel’s Software Defined Infrastructure vision and what I think it means to the networking industry.

Intel’s announcements included new products such as the XL710 controller and E5 chipset, and technologies, such as QuickAssist Network Acceleration APIs and Data Plane Development Kit (DPDK).

NFV and Intel DPDK

DPDK has actually been around since 2010. As defined on its website, it is a set of libraries and drivers for fast packet processing on x86 platforms, running mostly in Linux userland. This allows for higher packet-processing throughput than is achievable with the standard Linux kernel network stack; in fact, according to these slides, it can achieve a 25X improvement in per-core L3 packet performance over standard Linux. Using DPDK, the latest Intel chips can support Geneve, a highly extensible UDP encapsulation for overlays that allows flexible packet matching across tunnel protocols (such as VXLAN and NVGRE). Within the Geneve header is an Options field that can contain metadata and context, which is invaluable for NFV and service chaining. So it is not surprising that Intel has a partnership with VMware (the champions of overlay networks) catering to NFV solutions.
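To make the extensibility claim concrete, here is a sketch of parsing the fixed 8-byte Geneve header (per the Geneve specification) to pull out the VNI and locate the variable-length Options field. The sample bytes are fabricated for illustration.

```python
import struct

def parse_geneve(buf):
    """Parse the fixed Geneve header; options follow as TLVs."""
    b0, b1, proto = struct.unpack_from("!BBH", buf, 0)
    vni = int.from_bytes(buf[4:7], "big")        # 24-bit Virtual Network ID
    opt_len = (b0 & 0x3F) * 4                    # option length in 4-byte units
    return {
        "version": b0 >> 6,
        "protocol": hex(proto),                  # 0x6558 = Transparent Ethernet
        "vni": vni,
        "options": buf[8:8 + opt_len],           # TLVs carrying metadata/context
    }

# Version 0, no options, Ethernet payload, VNI 5001 (0x001389) -- made-up sample
hdr = bytes([0x00, 0x00, 0x65, 0x58, 0x00, 0x13, 0x89, 0x00])
print(parse_geneve(hdr))
# {'version': 0, 'protocol': '0x6558', 'vni': 5001, 'options': b''}
```

The open-ended Options field is what distinguishes Geneve from fixed-format encapsulations like VXLAN, and it is where the NFV/service-chaining metadata rides.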

Intel XL710

The Ethernet Controller XL710 is 40 Gbps ready on a single virtual core, and 160 Gbps per CPU socket. It can terminate Geneve tunnels at line rate (39.39 Gbps on the 40 Gbps adapters), as the IDF14 demos showed. It can do this because of Receive Side Scaling for VXLAN, which balances CPU utilization across cores.

Intel E5 with QuickAssist

The E5-2600 v3 chipsets, powered by QuickAssist technology (which provides stateless offload and protocol acceleration), offer packet, security, and compression acceleration features: 100 Gbps SSL termination (a boon for SEO), 160K key operations per second for IPsec, and 80 Gbps platform compression (applicable to Big Data analytics like Hadoop). These should keep pure-play networking vendors (including firewall, VPN concentrator, and load balancer vendors) on their toes.

The overlay vs underlay network debate has become a hot topic in recent years, perhaps best exemplified by the Cisco ACI vs VMware NSX solutions. VMware believes that overlays on top of bare metal servers running x86 chips are the way of the future, and that protocol offload technologies like QuickAssist are the solution for building scalable infrastructures. Pure-play vendors like Cisco believe that there is still value in custom networking ASICs on the switches that form the underlays. Still other networking startups like Pica8, Cumulus Networks, and Big Switch Networks are the poster children of bare metal switches, i.e., switches that leverage merchant silicon, such as Broadcom or Marvell, and whose sheet metal is assembled by white box vendors such as Celestica, Delta Networks, Accton, or Quanta.

How will Intel’s recent announcements affect networking vendors? Well, network virtualization poses very different challenges from server virtualization. Protocol offload has been around for several years but isn’t as ubiquitous as you’d think: I ran into performance issues firsthand with TCP offload in 2011, when disabling it gave much better results. And simply slapping an overlay on top doesn’t solve every networking problem. Scaling in network virtualization is far more difficult than scaling in server virtualization. For example, the number of ACLs needed grows quadratically as the number of web servers or database servers increases linearly. I think the future is still bright for bare metal switch vendors, but I would love to hear back from you.
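The quadratic-growth claim is easy to see with a pairwise ACL model: allowing each of n web servers to reach each of m database servers takes n × m rules, so doubling the server count quadruples the rule count.

```python
# Pairwise ACL model: one permit rule per (web server, db server) pair.
def acl_rules(web, db):
    return web * db

for n in (10, 20, 40):
    print(n, acl_rules(n, n))   # servers double, rules quadruple
# 10 100
# 20 400
# 40 1600
```

Real policies use summarization and groups to tame this, but the underlying combinatorics is why per-flow state is so much harder to scale than per-server state.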

Introducing the HP 5400R zl2 Switch Series

I’m very proud to launch the HP 5400R zl2 Switch Series at HP Discover this week in Las Vegas. I am the Product Manager of this switch, which is a line extension to the HP 5400 zl Switch Series.

The 5400R offers enterprise-class resiliency via redundant management and redundant power. Like the HP 5400 zl and HP 8200 zl switch series, it is available in 6-slot and 12-slot chassis, and as a base switch as well as in five bundles with v2 modules. A new management module offers non-stop switching and hitless failover. The nice thing about this capability is that customers are not bound to the chassis type up front. If they decide on redundancy later on, they can attain it simply by adding a second management module.

Three new power supplies offer N+1 and N+N redundancy. Moreover, full IEEE 802.3at PoE+ power (30 W per port) can be supplied to a maximum of 288 ports simultaneously.
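The PoE budget behind that claim is straightforward arithmetic; the per-supply wattage below is a hypothetical figure of mine, used only to show how N+1 sizing works, not an HP specification.

```python
import math

# Full PoE+ budget: 288 ports at the IEEE 802.3at maximum of 30 W per port.
ports, watts_per_port = 288, 30
poe_budget = ports * watts_per_port
print(poe_budget)  # 8640 W of PoE power the supplies must deliver

# N+1 sizing: N supplies must cover the budget on their own.
supply_watts = 2750            # hypothetical per-supply PoE capacity
n = math.ceil(poe_budget / supply_watts)
print(n)  # 4 supplies carry the load; the +1 is the redundant spare
```

The same arithmetic with N+N doubles the supply count, trading slots for the ability to lose an entire power feed.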

With a production-grade HP 5406R zl2 switch, less than two months before launching it.

The HP 5400R zl2 switch is the only modular (chassis) switch available at the price of a stackable switch. It outperforms the Cisco Catalyst 4500 in nearly every category and comes with HP Networking’s renowned hardware Lifetime Warranty (and 3 years of free software support). Add to that the rich OpenFlow 1.3 capabilities that are offered by the custom ProVision ASIC (with support for SDN applications such as Network Optimizer and Network Protector to name a couple) and you have what it takes to beat Cisco in the Campus.