Tag Archives: virtualized networks

Harnessing the Raw Performance of x86 – Snabb Switch

Recently I was listening to an episode of Ivan Pepeljnak’s Software Gone Wild podcast featuring Snabb Switch that inspired me to write this post. Snabb Switch is an open source program, developed by Luke Gorrie, for processing virtualized Ethernet traffic for white field deployments using x86 hardware. It caught my attention because the recent announcements of Intel’s networking capabilities at IDF14 were fresh in my mind. Snabb Switch is a networking framework that also defines different building blocks for I/O (such as input/Rx links and output/Tx links), Ethernet interfaces, and packet processing elements leveraging x86 servers and Intel NICs. It speaks natively to Ethernet hardware, Hypervisors, and the Linux kernel by virtue of a user-space executable. The cornerstone of Snabb Switch is its super light footprint, which enables it to process tens of millions of ethernet packets per second per core. Moreover, it has been known to push 200 Gbps on an x86 server. Pretty impressive for an open source program.

Snabb Switch uses the Lua programming language, which is a lightweight scripting language that can make some function calls and change the configuration in real time. It leverages LuaJit, a Just-In-Time compiler that compiles Lua code for x86 in real-time while switching packets. This technology is used in the video games industry as well as high frequency trading in the financial industry, but not very prevalent in the networking industry yet. The biggest exception is CloudFlare, the CDN that optimizes website delivery by blocking DOS attacks.

Snabb Switch rides the wave of the vast improvements in hardware performance on x86 servers and NICs. In a nutshell, networking applications on Linux have been moved out of the kernel and into user space. It used to be that each packet arriving from the network to the NIC of an x86-based Linux server would be sent up to the kernel, which would then have to wake up, via an Interrupt signal, and process them before sending them out on the network. This was a very time-consuming process and it also made it very difficult for application developers to write networking code because it involved intricate knowledge of the kernel. However, with faster hardware, developers realized that with so many packets arriving each microsecond, waking up the kernel to process each packet was too inefficient. Instead, it became more prudent to assume a continuous stream of packets and setting aside a dedicated pool of memory for this traffic. In other words, the NIC is mapped directly with the memory of the user process. Snabb Switch does this by writing their own driver for the NIC (Intel NICs for now) that drives features such as an embedded Ethernet switch and QoS on around 850 lines of Lua code.

Generally speaking, people with networking backgrounds have traditionally assumed x86-based servers to be limited in their packet-processing capabilities (attributed to PCI bus bottlenecks, slow memory, slow CPU, etc). In reality, the raw performance that can be extracted from x86-based hardware is quite high. 800 Gbps can be attained from DRAM banks, 600 Gbps can be attained from PCI Express, and the interconnect between CPUs is also hundreds of Gbps. There is no reason one cannot attain 500 Gbps using a dual core Xeon server. The bottleneck is quite clearly the software. Of course this works best (10 million packets per second per core) for simple cases such as just sending packets in and out. But for slightly more complicated scenarios, such as accessing an unpredictable address in memory, performance can drop by an order of magnitude.

Snabb Switch is known to have generated 200 Gbps out of a single core at just 10% CPU utilization, which is quite incredible. The way that Gorrie did this is by reading in 32,000 packets into a PCAP file, pushing them out on 20 10G NICs, and programming those ports to run in a loop.

The outcome of Snabb Switch is quite similar to Intel’s DPDK, in which there is user space-based forwarding, no Kernel interrupts, and CPUs are dedicated to particular NICs. However, Snabb Switch is a lightweight platform for ground up designs, whereas DPDK is intended to allow developers, who have written applications that run inside the kernel, to port their mature code to user space. For newer application designs, user space development is more prevalent because of the higher traffic levels and performance expectations. Snabb Switch modus operandi is to poll the kernel for new packets to process rather than interrupting it. It runs a scheduler in a polling loop with multiple parallel traffic processes on separate CPUs.

Snabb Switch can also run as a high performance NFV switch for OpenStack environments. The way it can do this is by removing the kernel from the forwarding path and allowing the user space program to talk directly to the device driver on the guest VM. The VMs are only able to address their own memory that they have allocated themselves. A software switch cannot allocate memory to a VM. Instead, for each VM, a separate TX/RX queue in hardware is provisioned in the NIC. So when a VM gives a buffer for packets, the buffer is translated from a standard virtio format (in KVM) directly to hardware format. In other words, when a packet comes in from the network, the NIC determines which VM should get it (typically by looking up the destination MAC address and VLAN ID), picks the appropriate hardware queue with memory that belongs to that VM, grabs a buffer and copies the data from the NIC to that VM. Since Snabb Switch acts as the translation engine between standard virtio and native hardware on the standard Intel NIC, there is no need to write or install a specific device driver for guest VMs to access the hardware.

I believe that Snabb Switch has a lot of promise though it may take a while for deployments to be more mainstream.

Advertisements

Deconstructing Big Switch Networks at NFD8

I recently caught up with the presentation made by Big Switch Networks at Networking Field Day 8.

Founder Kyle Forster kicked things off with an introduction to Big Switch. He used the term ‘Hyperscale Data Centers’ to describe the data center of today and tomorrow that Big Switch targets. Big Switch has two products based on the following three principles:

  1. Bare metal switches for Hardware using Broadcom ASICs.
  2. Controller-based design for software
  3. Core-pod architecture replacing the traditional Core-Aggregation-Edge tier.

The two products are:

  1. Big Tap Monitoring Fabric – Taps for offline network monitoring
  2. Big Cloud Fabric – Clos switching fabric

Next up, CTO Rob Sherwood went into more detail. He defined the core-pod architecture as essentially a spine-leaf architecture where there are 16 pods (racks that have servers), at the top of which are two leaves. Each server is dual-connected to a leaf via 10G links. The leaves themselves connect up to spines via 40G links. The leaf switches are 48x10G and 6x40G for uplinks; the spine switches are 32x40G. So the maximum number of spine switches in a pod is 6. (In a leaf-spine fabric every leaf must connect to every spine). That also means a maximum of 32 leaves can connect to a spine. These numbers will definitely increase in future generations of switches when Broadcom can produce them at scale. This solution is targeted at Fortune 1000 companies, not really as much on smaller enterprises. Pods are very modular and can be replaced without disrupting the older network designs.

What I thought was pretty cool was the use of Open Network Installation Environment (ONIE) for provisioning. The switches get shipped to customers from Dell or Accton with a very lightweight OS, then as it turn on the box it netboots from the Controller (an ONIE server). Both Switch Light (which is the Big Switch OS), as well as the relevant configurations, get downloaded from the Controller to the switch. LLDP is used to auto-discover the links in the topology, and management software will tell if there are missing or double-connected links.

In the first demo, the Automated Fabric Visibility Tool was used to first allocate and assign roles in the topology. At that point, any errors in cabling would appear in the GUI, which was pretty user-friendly. The Big Cloud Fabric architecture has a dedicated OOB control/management network that connects to the Controller. Amongst the management primitives are a logical L2 segment (ala VLAN) that have logical ports and end-points, tenants that are logical grouping of L2/L3 networks, and logical routers that are the tenant routers for inter-segment or intra-segment routing. Each logical router corresponds to a VRF. VLAN tags can be mixed and matched and added into bridged domains. The use case would be analogous to a multi-tenant environment in each ESX instance, when you declare egress VLAN tags on vswitch in VMware deployments. You have the choice of specifying the tag as global fabric or local to the vswitch. Interestingly, Big Switch used to have an Overlay product two years ago and ended up tossing it away (because they feel they are L2 solutions only, not L3 solutions) to come up with the current solution because they believe it uses the hardware the way it was designed to be used.

The next demo was to create tenants, assign servers and VMs to logical segments by VLAN, physical ports, or port-groups to meet a use case of a common two-tier application.

The fabric in Big Cloud Fabric is analogous to a sheet metal chassis-based fabric that has fabric backplanes, line cards, and supervisors/management modules in that the spine switches are the backplanes, the leaf switches are the line cards, and the Controllers are the supervisors. The analogies actually don’t end with the components. Sherwood explained that traditional chassis switch vendors use proprietary protocols between their backplanes and their line cards that is actually Ethernet and is, therefore, no different from the OOB management network between spine switches and leaf switches. The control planes and the data planes in Big Cloud Fabric are completely decoupled so that in the event of the management switch completely going down, you only lose the ability to change and manage the network. So for example, if a new server comes up, routes for that host don’t get propagated. Of course, if both supervisors in a Nexus 7K go down, the whole switch stops working. If both Controllers go down simultaneously, the time needed to bring up a third Controller is about 5 minutes.

Big Cloud Fabric is based on OpenFlow with extensions. The white box switches that Big Switch sells have Broadcom ASICs that have several types of forwarding tables (programmable memory that can be populated). Some of the early implementations of OpenFlow only exposed the ACL table (which had only about 2000 entries), which didn’t scale well. The way Big Switch implements OpenFlow in Switch Light OS is to expose an L2 table and an L3 table, each with over 100,000 entries. They couldn’t go into more details as they were under NDA with Broadcom. Switch Light OS is Big Switch’s Indigo OpenFlow Agent running on Open Network Linux on x86 or ASIC-based hardware. Whereas traditional networks have clear L2/L3 boundaries in terms of devices, in Big Cloud Fabric L3 packets are routed on the first hop switch. If a tenant needs to talk to another tenant, packets go through a system router, which resides only on a spine switch.

Next up was Service Chaining and Network Functions Virtualization support. Service Chaining is implemented via next-hop forwarding. For example, at a policy level, if one VM or app needed to talk to another app, it could be passed through a service such as a load balancer or firewall (while leveraging the ECMP bits of the hardware) before reaching the destination. The demo showed how to create a policy and then, with a firewall service example, how to apply that policy, which is known as service insertion to an ECMP group. However, it is early days for this NFV use case and for more elaborate needs such as health monitoring, the recommendation is to use OpenStack. Interestingly, Big Switch integrates with OpenStack, but not VMware at this time (it is on the roadmap though).

Operational Simplicity was next on the agenda. Here Big Switch kicked off with the choice of control plane APIs to the user – CLI, GUI, or REST, which, generally speaking, would appeal to network engineers, vCenter administrators, and DevOps personnel respectively. High Availability is designed so that reactions to outages are localized as much as possible. For example, the loss of a spine only reduces capacity, the loss of a leaf is covered by the other leaf in the same rack (thanks to a dedicated link between the two) that has connections to the same servers (so the servers failover to the other leaf via LACP). The Hitless Upgrade process is truly hitless from an application perspective (a few milliseconds of data packets are lost) though capacity is reduced. A feature called Test Path shows the logical (at a policy level) as well as physical path a host takes to reach another host.

The final session was on the Network Monitoring features of Big Switch, namely Big Tap Monitoring Fabric. Sunit Chauhan, head of Product Marketing, said that the monitoring infrastructure is developed using the same bare metal switches that is managed from the same centralized controller. The goal is to monitor and tap every rack and ideally every vswitch. In a campus network that means the ability to filter traffic from all locations to the tools. The Big Tap Monitoring Controller is separate from the Big Cloud Fabric Controller and runs Switch Light OS as well. The example he gave was of a large mobile operator in Japan that needed thousands of taps. The only scalable (in terms of cost and performance) solution to to monitor such a network was to use bare metal Ethernet switches that report to a centralized SDN Controller.

The Big Tap Monitoring demo was based off a common design with a production network (which could be Big Cloud Fabric or traditional L2/L3 networks) with filter ports connected to a Big Tap Controller, which was then connected via delivery ports to the Tool Farm, where all the visibility tools existed.Of course Big Switch eats its own dogfood like every noble startup by deploying Big Tap Monitoring Fabric in its own office. They were able to capture the actual video stream of the NFD event that went out to the Internet from their office. Remote Data Center Monitoring is also supported now (though not demonstrable at NFD8), which reminded me of RSPAN except that this used L2-GRE tunnels.

A few afterthoughts: Big Switch used the marketing term ‘hyperscale data center’ like it was an industry standard and they gave examples of networks that were not hyperscale without explaining how they weren’t. In fact there was a slide that was dedicated to terminology used in a demo, but ‘hyperscale’ was not there. It reminded me of my days in a startup that used that same term in its marketing headlines without ever defining it.

From a personal perspective, in 2010 I worked as a Network Engineer in a large financial exchange where the Big Tap Monitoring Fabric would have been invaluable. Any time a trade was delayed by a couple of seconds resulted in potentially millions of dollars. The exchange would be spared that penalty if it could be proved that the delay was due to the application or the remote network and not the exchange’s own network. At that time we used network monitoring switches to determine where the delay occurred. But the location of those taps was critical. Moreover, it was just not scalable to have taps at every location off of every port. Since it was a reactive (troubleshooting) effort, it was really a Whac-a-Mole exercise. Ultimately, we went with a vendor that built the infrastructure to collect latency data from exchanges, and then offered the results to firms to allow them to monitor data and order execution latency on those markets. But it was expensive and those investments were between $10 and $15 million and ultimately that vendor went out of business. A solution like Big Tap Monitoring Fabric would have been a godsend. If Big Switch can figure out how to keep their costs down, they may have a huge opportunity in hand.

Tech Field Day events are intended to be very technical and this was no different. Slides with Gartner Magic Quadrants are usually met with rolling eyeballs, but I think Big Switch can be forgiven for having one reference to an industry analyst. Apparently according to Dell ‘Oro, in 2013 more ports were shipped from bare metal switches (from vendors such as Dell, Accton, and Quanta) that from Arista, Juniper, and Extreme combined!

While Big Cloud Fabric competes against the Cisco Nexus 7K product line, Big Tap Monitoring goes head to head against Gigamon. It was very refreshing to see a startup take on two behemoths, sheerly with clever engineering and nimble principles.