Recently I had the treat of listening to two Layer 3 routing protocol maestros when the CTO of the startup Viptela, Khalid Raza, appeared on Ivan Pepelnjak’s Software Gone Wild podcast. Interestingly, the first time I had ever heard of Khalid or Ivan was through the Cisco Press books that they each authored. Ivan had the famous ‘MPLS and VPN Architectures‘ and Khalid, one of the first CCIEs, wrote ‘CCIE Professional Development: Large Scale IP Network Solutions‘, (which I owned an autographed copy of).
In a nutshell, Viptela’s Secure Extensible Network (SEN) creates hybrid connectivity (VPNs) across the WAN. Their target market is any large retailer or financial company, that has many branches. Khalid and the founder Amir Khan (of Juniper MX product line fame), come from super strong Layer 3 background and, consequently, they don’t purport to have a revolutionary solution. Instead, they have harnessed that background to improve on what DMVPN has been attempting to solve for the past 10 years. In Khalid’s words, they have “evolved MPLS concepts for overlay networks”.
Viptela SEN comprises a controller, VPN termination endpoints, and a proprietary protocol that is heavily inspired by BGP. In fact, one of the advisors of Viptela is Tony Li, author of 29 RFCs (mostly BGP-related), and one of the main architects of BGP. Viptela SEN can discover local site characteristics (such as the IGP) and report them to the controller, which then determines the branch’s connectivity policy. So it essentially reduces the number of control planes, which reduces the number of configurations for the WAN. This looks incredibly similar to what DMVPN sought out to do a decade ago. Viptela calls these endpoints dataplane points, but they still run routing protocols, so to me they’re just routers.
DMVPN, itself, started as a Cisco proprietary solution, spearheaded by Cisco TAC, in particular a gentleman by the name of Mike Sullenberger, who served as an escalation engineer. He has since coauthored an IETF draft on DMVPN. In fact, one of the earliest tech docs on cisco.com touts how ‘for a 1000-site deployment, DMVPN reduces the configuration effort at the hub from 3900 lines to 13 lines’.
Getting back to Viptela SEN, the endpoints (aka routers) authenticate with the controller (through exchange of certificates). Different circuits from different providers (MPLs or broadband) can be balanced through L3 ECMP. Their datapath endpoints are commodity boxes with Cavium processors that can give predictable (AES-256) encryption performance that tunnel to other endpoints (via peer-to-peer keys) as prescribed by the orchestrator/controller. In the event of a site-controller failures, if a site still has dataplane connectivity to another site that it needs to communicate with, then traffic can still forward (provided the keys are still valid) and all is well though the entries are stale.
One of the differentiators between Viptela and others in this space is that they do not build overlay subnet-based routing adjacencies. This allows them to offer each line of business in a large company to have a network topology that is service driven rather than the other way round. Translated in technical terms, each line of business effectively has a VRF with different default routes, but a single peering connection to the controller. In DMVPN terms, the controller is like the headend router, or hub. The biggest difference that I could tell between Viptela SEN and DMVPN is the preference given to L3 BGP over L2 NHRP. One of the biggest advantages of BGP has always been the outbound attribute change in the sense that a hub router could manipulate, via BGP MED, how a site could exit an AS. It is highly customizable. For example, majority of the sites could exit via a corporate DMZ while some branches (like Devtest in an AWS VPC) could exit through a regional exit point. In DMVPN, NHRP (which is a L2 ARP-like discovery protocol) has more authority and doesn’t allow outbound attribute manipulation which BGP, a L3 routing protocol has been doing successfully throughout the Internet for decades. NHRP just isn’t smart enough to provide that level of control-plane complexity.
Viptela SEN allows for each site to have different control policies – it could be a control plane path that says
The flexibility that Viptela SEN extends to a site can be at a control plane path level (e.g. ensure that certain VPNs trombone through a virtual path or service point like a firewall or IDS before exiting, as done in NFV with service chaining ) or data plane level (e.g. PBR). Since it promises easy bring-up and configuration, to alleviate concerns about SOHO endpoint boxes being stolen, they have a GPS installed in these lower end boxes. The controller only allows these boxes to authenticate with it if they are in the prescribed GPS coordinates. If the box is moved, it is flagged as a potentially unauthorized move and second-factor authentication is required in order to be considered as permissible. The controller can permit this but silently monitor the activities of this new endpoint box without its knowledge, akin to a honeypot. That’s innovation!
Founder Kyle Forster kicked things off with an introduction to Big Switch. He used the term ‘Hyperscale Data Centers’ to describe the data center of today and tomorrow that Big Switch targets. Big Switch has two products based on the following three principles:
Bare metal switches for Hardware using Broadcom ASICs.
Controller-based design for software
Core-pod architecture replacing the traditional Core-Aggregation-Edge tier.
The two products are:
Big Tap Monitoring Fabric – Taps for offline network monitoring
Big Cloud Fabric – Clos switching fabric
Next up, CTO Rob Sherwood went into more detail. He defined the core-pod architecture as essentially a spine-leaf architecture where there are 16 pods (racks that have servers), at the top of which are two leaves. Each server is dual-connected to a leaf via 10G links. The leaves themselves connect up to spines via 40G links. The leaf switches are 48x10G and 6x40G for uplinks; the spine switches are 32x40G. So the maximum number of spine switches in a pod is 6. (In a leaf-spine fabric every leaf must connect to every spine). That also means a maximum of 32 leaves can connect to a spine. These numbers will definitely increase in future generations of switches when Broadcom can produce them at scale. This solution is targeted at Fortune 1000 companies, not really as much on smaller enterprises. Pods are very modular and can be replaced without disrupting the older network designs.
What I thought was pretty cool was the use of Open Network Installation Environment (ONIE) for provisioning. The switches get shipped to customers from Dell or Accton with a very lightweight OS, then as it turn on the box it netboots from the Controller (an ONIE server). Both Switch Light (which is the Big Switch OS), as well as the relevant configurations, get downloaded from the Controller to the switch. LLDP is used to auto-discover the links in the topology, and management software will tell if there are missing or double-connected links.
In the first demo, the Automated Fabric Visibility Tool was used to first allocate and assign roles in the topology. At that point, any errors in cabling would appear in the GUI, which was pretty user-friendly. The Big Cloud Fabric architecture has a dedicated OOB control/management network that connects to the Controller. Amongst the management primitives are a logical L2 segment (ala VLAN) that have logical ports and end-points, tenants that are logical grouping of L2/L3 networks, and logical routers that are the tenant routers for inter-segment or intra-segment routing. Each logical router corresponds to a VRF. VLAN tags can be mixed and matched and added into bridged domains. The use case would be analogous to a multi-tenant environment in each ESX instance, when you declare egress VLAN tags on vswitch in VMware deployments. You have the choice of specifying the tag as global fabric or local to the vswitch. Interestingly, Big Switch used to have an Overlay product two years ago and ended up tossing it away (because they feel they are L2 solutions only, not L3 solutions) to come up with the current solution because they believe it uses the hardware the way it was designed to be used.
The next demo was to create tenants, assign servers and VMs to logical segments by VLAN, physical ports, or port-groups to meet a use case of a common two-tier application.
The fabric in Big Cloud Fabric is analogous to a sheet metal chassis-based fabric that has fabric backplanes, line cards, and supervisors/management modules in that the spine switches are the backplanes, the leaf switches are the line cards, and the Controllers are the supervisors. The analogies actually don’t end with the components. Sherwood explained that traditional chassis switch vendors use proprietary protocols between their backplanes and their line cards that is actually Ethernet and is, therefore, no different from the OOB management network between spine switches and leaf switches. The control planes and the data planes in Big Cloud Fabric are completely decoupled so that in the event of the management switch completely going down, you only lose the ability to change and manage the network. So for example, if a new server comes up, routes for that host don’t get propagated. Of course, if both supervisors in a Nexus 7K go down, the whole switch stops working. If both Controllers go down simultaneously, the time needed to bring up a third Controller is about 5 minutes.
Big Cloud Fabric is based on OpenFlow with extensions. The white box switches that Big Switch sells have Broadcom ASICs that have several types of forwarding tables (programmable memory that can be populated). Some of the early implementations of OpenFlow only exposed the ACL table (which had only about 2000 entries), which didn’t scale well. The way Big Switch implements OpenFlow in Switch Light OS is to expose an L2 table and an L3 table, each with over 100,000 entries. They couldn’t go into more details as they were under NDA with Broadcom. Switch Light OS is Big Switch’s Indigo OpenFlow Agent running on Open Network Linux on x86 or ASIC-based hardware. Whereas traditional networks have clear L2/L3 boundaries in terms of devices, in Big Cloud Fabric L3 packets are routed on the first hop switch. If a tenant needs to talk to another tenant, packets go through a system router, which resides only on a spine switch.
Next up was Service Chaining and Network Functions Virtualization support. Service Chaining is implemented via next-hop forwarding. For example, at a policy level, if one VM or app needed to talk to another app, it could be passed through a service such as a load balancer or firewall (while leveraging the ECMP bits of the hardware) before reaching the destination. The demo showed how to create a policy and then, with a firewall service example, how to apply that policy, which is known as service insertion to an ECMP group. However, it is early days for this NFV use case and for more elaborate needs such as health monitoring, the recommendation is to use OpenStack. Interestingly, Big Switch integrates with OpenStack, but not VMware at this time (it is on the roadmap though).
Operational Simplicity was next on the agenda. Here Big Switch kicked off with the choice of control plane APIs to the user – CLI, GUI, or REST, which, generally speaking, would appeal to network engineers, vCenter administrators, and DevOps personnel respectively. High Availability is designed so that reactions to outages are localized as much as possible. For example, the loss of a spine only reduces capacity, the loss of a leaf is covered by the other leaf in the same rack (thanks to a dedicated link between the two) that has connections to the same servers (so the servers failover to the other leaf via LACP). The Hitless Upgrade process is truly hitless from an application perspective (a few milliseconds of data packets are lost) though capacity is reduced. A feature called Test Path shows the logical (at a policy level) as well as physical path a host takes to reach another host.
The final session was on the Network Monitoring features of Big Switch, namely Big Tap Monitoring Fabric. Sunit Chauhan, head of Product Marketing, said that the monitoring infrastructure is developed using the same bare metal switches that is managed from the same centralized controller. The goal is to monitor and tap every rack and ideally every vswitch. In a campus network that means the ability to filter traffic from all locations to the tools. The Big Tap Monitoring Controller is separate from the Big Cloud Fabric Controller and runs Switch Light OS as well. The example he gave was of a large mobile operator in Japan that needed thousands of taps. The only scalable (in terms of cost and performance) solution to to monitor such a network was to use bare metal Ethernet switches that report to a centralized SDN Controller.
The Big Tap Monitoring demo was based off a common design with a production network (which could be Big Cloud Fabric or traditional L2/L3 networks) with filter ports connected to a Big Tap Controller, which was then connected via delivery ports to the Tool Farm, where all the visibility tools existed.Of course Big Switch eats its own dogfood like every noble startup by deploying Big Tap Monitoring Fabric in its own office. They were able to capture the actual video stream of the NFD event that went out to the Internet from their office. Remote Data Center Monitoring is also supported now (though not demonstrable at NFD8), which reminded me of RSPAN except that this used L2-GRE tunnels.
A few afterthoughts: Big Switch used the marketing term ‘hyperscale data center’ like it was an industry standard and they gave examples of networks that were not hyperscale without explaining how they weren’t. In fact there was a slide that was dedicated to terminology used in a demo, but ‘hyperscale’ was not there. It reminded me of my days in a startup that used that same term in its marketing headlines without ever defining it.
From a personal perspective, in 2010 I worked as a Network Engineer in a large financial exchange where the Big Tap Monitoring Fabric would have been invaluable. Any time a trade was delayed by a couple of seconds resulted in potentially millions of dollars. The exchange would be spared that penalty if it could be proved that the delay was due to the application or the remote network and not the exchange’s own network. At that time we used network monitoring switches to determine where the delay occurred. But the location of those taps was critical. Moreover, it was just not scalable to have taps at every location off of every port. Since it was a reactive (troubleshooting) effort, it was really a Whac-a-Mole exercise. Ultimately, we went with a vendor that built the infrastructure to collect latency data from exchanges, and then offered the results to firms to allow them to monitor data and order execution latency on those markets. But it was expensive and those investments were between $10 and $15 million and ultimately that vendor went out of business. A solution like Big Tap Monitoring Fabric would have been a godsend. If Big Switch can figure out how to keep their costs down, they may have a huge opportunity in hand.
Tech Field Day events are intended to be very technical and this was no different. Slides with Gartner Magic Quadrants are usually met with rolling eyeballs, but I think Big Switch can be forgiven for having one reference to an industry analyst. Apparently according to Dell ‘Oro, in 2013 more ports were shipped from bare metal switches (from vendors such as Dell, Accton, and Quanta) that from Arista, Juniper, and Extreme combined!
While Big Cloud Fabric competes against the Cisco Nexus 7K product line, Big Tap Monitoring goes head to head against Gigamon. It was very refreshing to see a startup take on two behemoths, sheerly with clever engineering and nimble principles.
Musings on How to Build Modern Applications in the Cloud and Onprem