The Etymology of Elephant and Mice Flows

Over the past 3-4 years, the term elephant flows has been used to refer to east-west (machine-to-machine) traffic, such as vMotion, migration, backup, and replication. The term mice flows is used to refer to north-south (user-to-machine) traffic. Why are we using these terms all of a sudden, and where did they come from?

Wikipedia states: “It is not clear who coined ‘elephant flow’, but the term began occurring in published Internet network research in 2001, when the observations were made that a small number of flows carry the majority of Internet traffic and the remainder consists of a large number of flows that carry very little Internet traffic.”

The traffic that traverses Data Center Interconnects (DCI) is typically east-west and flow-oriented (TCP-based). These applications have huge bandwidth requirements compared to north-south traffic. RFC 1072 defines the term LFN (Long Fat Network) for a path whose Bandwidth Delay Product (BDP) is significantly larger than 10^5 bits (12,500 bytes). BDP and LFN have existed in the world of WAN Optimization (traditionally for north-south traffic) for over a decade; it is only more recently, in the era of east-west traffic across DCIs, that elephant flows have become prominent. The terms apply even within a data center, as the folks from VMware have shown in this well-written piece from exactly a year ago.
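To make the LFN threshold concrete, here is a minimal sketch of the BDP arithmetic in Python. The 10 Gbps link speed and 5 ms round-trip time are hypothetical examples, not figures from any particular deployment.

```python
# Rough sketch: Bandwidth Delay Product (BDP) = bandwidth x round-trip time,
# i.e. the amount of data that can be "in flight" on the path at once.
# The numbers below are hypothetical, chosen only for illustration.

def bdp_bits(bandwidth_bps: float, rtt_seconds: float) -> float:
    return bandwidth_bps * rtt_seconds

# A 10 Gbps DCI link with a 5 ms round-trip time:
bdp = bdp_bits(10e9, 0.005)  # 50,000,000 bits (6,250,000 bytes)
print(f"BDP: {bdp:,.0f} bits ({bdp / 8:,.0f} bytes)")

# Far above the ~10^5-bit threshold, so this path counts as an LFN and
# needs TCP window scaling to keep the pipe full.
print("LFN?", bdp > 1e5)
```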

White Box switch readiness for prime time

Matthew Stone runs Cumulus Networks switches in his production network. He came on the Software Gone Wild podcast recently to talk about his experiences. Cumulus, Pica8, and Big Switch are the three biggest proponents of white box switching. While Cumulus focuses on Linux abstractions for L2/L3, Pica8 focuses more on its OpenFlow implementation, and Big Switch on leveraging white boxes to form taps and, more recently, on piecing together leaf-spine fabric pods.

I believe white box switches are years away from entering campus networks. Even managed services are not close, and you won’t see a Meraki-style deployment of these white box switches in wiring closets for a while. But Stone remains optimistic and makes solid points as an implementer. My favorite part is when he describes how Cumulus rewrote the ifupdown script to simplify configuration for network switches (which typically have around 50 ports, compared to servers with 4), and contributed it back to the Debian distribution as ifupdown2. Have a listen.

NBASE-T or MGBASE-T?

Last week I wrote about five new speeds that the Ethernet Alliance (the marketing arm of the IEEE) is working on. The lower speeds, 2.5 Gbps and 5 Gbps, are called MGBASE-T, and according to this post from the Ethernet Alliance, the MGBASE-T Alliance is overseeing the development of these standards outside of the IEEE. This week, news broke that leading PHY vendor Aquantia has teamed up with Cisco, Freescale, and Xilinx to form the NBASE-T Alliance. This raises some questions about the work and causes to which the MGBASE-T Alliance and the NBASE-T Alliance are committed.

Both NBASE-T and MGBASE-T are trademarks of Aquantia. Both the MGBASE-T Alliance and the NBASE-T Alliance are Delaware corporations. It appears as though the MGBASE-T Alliance was formed around June 2014, while the NBASE-T Alliance is newer, formed in September 2014.

The NBASE-T Alliance website defines the technology as follows:

NBASE-T™  is a proven technology boosting the speed of twisted pair copper cabling up to 100 meters in length well beyond the designed limits of 1 Gbps.

Capable of reaching 2.5 and 5 Gigabits per second over 100m of Cat 5e cable, the disruptive NBASE-T solution allows a new type of signaling over twisted-pair cabling. Should the silicon have the capability, auto-negotiation can allow the NBASE-T solution to accurately select the best speed: 100 Megabit Ethernet (100MbE), 1 Gigabit Ethernet (GbE), 2.5 Gigabit Ethernet (2.5GbE) and 5 Gigabit Ethernet (5GbE).
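As a rough illustration of the auto-negotiation behavior described above, here is a minimal sketch of highest-common-rate selection between two link partners. The function name and the advertised rate sets are hypothetical; this is not an NBASE-T implementation.

```python
# Illustrative only: pick the highest Ethernet rate that both link partners
# advertise, mirroring how auto-negotiation settles on the best common speed.

def negotiate_speed_mbps(local_rates, partner_rates):
    """Return the highest rate (in Mbps) common to both sides, or None."""
    common = local_rates & partner_rates
    return max(common) if common else None

# Hypothetical example: an NBASE-T capable switch port meets an access point
# whose PHY tops out at 5 Gbps over the installed Cat 5e run.
switch_port = {100, 1000, 2500, 5000, 10000}
access_point = {100, 1000, 2500, 5000}
print(negotiate_speed_mbps(switch_port, access_point))  # -> 5000 (5GbE)
```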

So what happens to MGBASE-T, given that Aquantia was a part of both? My hunch is that it will fizzle out and that the other vendors who were working on it (no names here) have lost the race to Cisco, Freescale, and Xilinx.

Head End Replication and VXLAN Compliance

Arista Networks recently announced that its implementation of VXLAN no longer requires IP Multicast in the underlay network. Instead, the implementation will now rely on a technique called Head End Replication to forward BUM (Broadcast, Unknown Unicast, and Multicast) traffic in the VLANs that it transports. But first, let’s rewind to the original VXLAN specification.

Virtual eXtensible Local Area Networks were first defined in an Internet draft called draft-mahalingam-dutt-dcops-vxlan-00.txt in August 2011. It took some time for switch vendors to implement it, but Broadcom’s Trident II now supports it, and software overlay solutions such as VMware NSX and Nuage Virtualized Services Platform (VSP) implement it as well. Three years later, in August 2014, the draft became RFC 7348. The draft went through 9 revisions (up to draft-mahalingam-dutt-dcops-vxlan-09.txt), but none made significant changes to the multicast requirements in the underlay. They all say the same thing in section 4.2:

Consider the VM on the source host attempting to communicate with the destination VM using IP.  Assuming that they are both on the same subnet, the VM sends out an Address Resolution Protocol (ARP) broadcast frame. In the non-VXLAN environment, this frame would be sent out using MAC broadcast across all switches carrying that VLAN.

With VXLAN, a header including the VXLAN VNI is inserted at the beginning of the packet along with the IP header and UDP header. However, this broadcast packet is sent out to the IP multicast group on which that VXLAN overlay network is realized. To effect this, we need to have a mapping between the VXLAN VNI and the IP multicast group that it will use.

In essence, IP multicast is the control plane in VXLAN. But, as we know, IP multicast is very complex to configure and manage.
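For illustration only, the VNI-to-group mapping the RFC calls for can be pictured as a simple lookup table on the VTEP; the VNIs and multicast group addresses below are made-up examples.

```python
# Illustrative sketch of the VNI -> IP multicast group mapping that the
# original VXLAN spec relies on. VNIs and group addresses are made up.

VNI_TO_MCAST_GROUP = {
    10010: "239.1.1.10",   # overlay segment for VNI 10010
    10020: "239.1.1.20",   # overlay segment for VNI 10020
}

def underlay_destination(vni, is_bum, remote_vtep_ip=None):
    """BUM traffic is sent to the segment's multicast group; known unicast
    traffic goes straight to the VTEP learned for the destination MAC."""
    if is_bum:
        return VNI_TO_MCAST_GROUP[vni]
    return remote_vtep_ip

print(underlay_destination(10010, is_bum=True))                                # 239.1.1.10
print(underlay_destination(10010, is_bum=False, remote_vtep_ip="192.0.2.11"))  # 192.0.2.11
```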

In June 2013, Cisco deviated from the VXLAN standard in the Nexus 1000V in two ways:

  1. It makes copies of packets for each possible IP address at which the destination MAC address can be found and sends them from the head end of the VXLAN tunnel, the VXLAN Tunnel End Point (VTEP). These packets are then unicast to every VTEP participating in the VXLAN segment, removing the need for IP multicast in the core of the network.
  2. The Virtual Supervisor Module (VSM) of the Nexus 1000V acts as the control plane by maintaining the MAC address table of the VMs, which it then distributes, via a proprietary signaling protocol, to the Virtual Ethernet Module (VEM), which, in turn, acts as the data plane in the Nexus 1000V.

To their credit, Cisco acknowledged that this mode is not compliant with the standard, although they do support a multicast-mode configuration as well. At the time they expressed hope that the rest of the industry would back their solution. Well, the RFC still states that an IP multicast backbone is needed.

This brings me to the original announcement from Arista. They claim in their press statement: “The Arista VXLAN implementation is truly open and standards based with the ability to interoperate with a wide range of data center switches.”

But nowhere else on their website do they state how they actually adhere to the standard. Cisco breaks the standard by performing Head End Replication. Adam Raffe does a great job of explaining how this works (basically, the source VTEP replicates the Broadcast or Multicast packet and sends a copy to every other VTEP in the same VXLAN). Arista should explain exactly how their enhanced implementation works.
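To make the idea concrete, here is a minimal sketch of head-end replication at an ingress VTEP. This is not Cisco's or Arista's actual code; the flood list, addresses, and function names are hypothetical.

```python
# Hypothetical sketch of head-end replication: instead of sending one packet
# to a multicast group, the ingress VTEP unicasts one copy of each BUM frame
# to every remote VTEP that participates in the VNI.

FLOOD_LIST = {
    10010: ["192.0.2.11", "192.0.2.12", "192.0.2.13"],  # remote VTEPs for VNI 10010
}

def send_vxlan_unicast(frame, vni, vtep_ip):
    # Placeholder for VXLAN encapsulation and a UDP send to the remote VTEP.
    print(f"unicast {len(frame)}-byte frame, VNI {vni}, to VTEP {vtep_ip}")

def flood_bum_frame(frame, vni):
    """Replicate a Broadcast/Unknown-unicast/Multicast frame at the head end."""
    for vtep_ip in FLOOD_LIST.get(vni, []):
        send_vxlan_unicast(frame, vni, vtep_ip)

flood_bum_frame(b"\xff" * 64, 10010)  # e.g. an ARP broadcast from a local VM
```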

Linux as a Switch Operating System: Five Lessons Learned

Although this post is nearly a year old, it is still gold. Ken Duda, the CTO of Arista Networks, described five lessons learned along the way of building and supporting Extensible Operating System (EOS), Arista’s Linux-based switch operating system. They are:

  1. It’s okay to leave the door unlocked.
  2. Preserve the integrity of the Linux core.
  3. Focus on state, not messages.
  4. Keep your hands out of the kernel.
  5. Provide familiar interfaces to ease adoption.

Definitely worth a read.

What the world outside of IT can learn from open source

Earlier this week, the world’s leading drugmaker Johnson & Johnson (J&J) announced that it would join hands with rival GlaxoSmithKline (GSK) to develop a vaccine to combat Ebola. Apparently, both companies had been working on a vaccine, but now they are collaborating.

Yawn. Tech companies have been doing that for decades, since the early days of Linux. It’s called Open Source, people. And it’s a beautiful thing. When competitors get together to come up with solutions, obviously much of it is for publicity, but much good does come out of it. The world would be a much better place if other major corporations would follow suit for a change, and come up with ideas together to solve real world problems.

Ethernet Alliance unveils five new speeds

This week Network World laid out some details of the work the Ethernet Alliance is doing with respect to new data rates. As mentioned in this blog post, while there are 5 shipping speeds of Ethernet (100 Mbps, 1 Gbps, 10 Gbps, 40 Gbps, and 100 Gbps), there are 5 new speeds currently being worked on (2.5 Gbps, 5 Gbps, 25 Gbps, 50 Gbps, and 400 Gbps). The last time Ethernet got this sexy was when promiscuous mode was introduced.

Some of the drivers for these new speeds are the adoption rates of the older speeds. As detailed in the July 2014 IEEE Call for Interest, initial adoption of 10G, 40G, and 100G came (or is anticipated to come) in 2004, 2012, and 2015 respectively, but because these speeds have turned out to be cost prohibitive, the transition to higher speeds has been slower than previously forecast. For example, the 1G to 10G transition has repeatedly slipped (from 2012 to 2014 and now to 2016). This creates a window in which new technology can provide higher port speeds at lower cost. As an example, SFP+ technology can be leveraged to deliver 25 Gbps over a single lane and 50 Gbps over two lanes.

The 2.5 and 5 Gbps speeds (known as MGBASE-T) address the growing demands of BYOD in campus networks. Many of the newer APs ship with 802.11ac, and the second wave of this Wi-Fi standard, expected in 2015, will push the uplinks (or backhauls) between the APs and the access switches to multi-gigabit rates. The key requirement here is to reuse the existing cabling infrastructure: Cat 5e and Cat 6 would still be supported over the usual 100 meters, with no need to rip and replace cables.

Ethernet has come a long way since the days of the 2.94 Mbps flavor that Bob Metcalfe invented. There is very little in common between the Ethernet standards we have today from the IEEE and the original specification. One thing that has stayed constant, however, is the ability to evolve according to market needs, from single-pair vehicular Ethernet to four-pair PoE and everything in between. More on this in another post.