A War Story on STP

I’m a big fan of the Packet Pushers. The quality of the content covered is unmatched and I make every effort to keep on top of their podcasts. In a recent two-part podcast just before the Christmas 2013 holiday, the hosts of the show, Greg Ferro and Ethan Banks, brought together a few network engineers to relive some of their worst nightmares on the job – the moments when the network went down unexpectedly and all hell broke loose. This made me think back to one of my experiences as a network engineer on a steamy afternoon in 2008…

I was a Consultant filling in as a Network Engineer in a staff augmentation project at a large enterprise’s corporate headquarters, which completely occupied a ten-story tower. It was a hot, humid Tuesday with temperatures in the upper 80s (low 30s Celsius) and humidity hitting 90%. I was starting my first day on the project at this client. I had barely introduced myself when complaints started to file in about a network outage. As the complaints multiplied and a pattern emerged, it became clear that this was more serious than just an application being offline. Theories started emerging about a worm attack. (In those days the threat of worms was more prevalent than the fear of being hacked by the Chinese.) The CTO called all hands on deck and asked me, the new guy on his first day, the contractor who had no clue about their network, to help out. What follows is an edited version of the report I submitted to my client on the root cause of the outage. Obviously, the names have been hidden to protect the guilty.

The client’s practice stated that BPDU Guard must be enabled on Access ports, in conjunction with PortFast, in order to enforce STP domain borders and provide a loop-free Access layer.

Their reasoning behind that was as follows: Ports enabled with PortFast should not receive any BPDUs because the ports do not participate in the Spanning Tree. Hence, any BPDUs received on these ports are invalid. In the event that someone mistakenly connects a new switch to the PortFast port, or loops the port to another switch, PortFast will bring the port to blocking mode if it receives BPDUs. But then it brings the port back up through the normal spanning-tree process, which essentially renders the PortFast feature useless in that event.

This is why the BPDU Guard feature was needed. BPDU Guard error-disables the port completely when a BPDU is heard on a PortFast interface, and logs a message. In theory, no host behind a PortFast-enabled port can influence the STP topology, because the first BPDU received puts the port into the errdisable state.
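
To make that reasoning concrete, here is a minimal Python sketch of the decision a PortFast port makes when a BPDU arrives. It is a toy model only: the class and field names (AccessPort, portfast_active and so on) are made up for illustration and do not correspond to any real switch API.

```python
from dataclasses import dataclass, field
from enum import Enum


class PortState(Enum):
    FORWARDING = "forwarding"
    ERR_DISABLED = "err-disabled"


@dataclass
class AccessPort:
    """Toy model of a PortFast access port; names are hypothetical."""
    name: str
    bpduguard: bool = True
    portfast_active: bool = True          # operational PortFast status
    state: PortState = PortState.FORWARDING
    log: list = field(default_factory=list)

    def receive_bpdu(self) -> None:
        # An err-disabled port is down and processes nothing further.
        if self.state is PortState.ERR_DISABLED:
            return
        if self.bpduguard:
            # BPDU Guard: the first BPDU takes the port down and logs it.
            self.state = PortState.ERR_DISABLED
            self.log.append(f"{self.name}: BPDU received, err-disabled by BPDU Guard")
        else:
            # Without BPDU Guard the port only loses its PortFast status
            # and rejoins normal spanning-tree processing.
            self.portfast_active = False
            self.log.append(f"{self.name}: BPDU received, PortFast status lost")


guarded = AccessPort("Fa0/1")
guarded.receive_bpdu()

unguarded = AccessPort("Fa0/2", bpduguard=False)
unguarded.receive_bpdu()

print(guarded.state.value, unguarded.state.value, unguarded.portfast_active)
```

The point of the sketch is that BPDU Guard can only act on a BPDU that actually reaches it, which is what matters in the rest of this story.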

The client believed their network was correctly configured because of the following standard safeguards they had on each access switch:

interface FastEthernetX/X
 switchport access vlan X
 switchport mode access
 switchport voice vlan Y
 switchport port-security maximum 2 
 switchport port-security
 switchport port-security aging time 2
 switchport port-security violation restrict
 switchport port-security aging 
 no logging event link-status
 srr-queue bandwidth share 10 10 60 20
 srr-queue bandwidth shape 10 0 0 0
 mls qos trust device cisco-phone
 mls qos trust cos
 no snmp trap link-status
 macro description cisco-phone
 auto qos voip cisco-phone
 spanning-tree portfast
 spanning-tree bpduguard enable
end

At approximately 3:18 PM on August 5, 2008, a PC technician inadvertently looped a cable between two jacks on the 8th floor of the building. These jacks corresponded to ports on two different closet switches on the 8th floor. The following occurred immediately after the cable was looped:

  1. Ports on both ends went straight to the STP Forwarding state because the STP PortFast feature was enabled.
  2. Both ports started receiving MAC addresses from each other, which triggered Port Security.
  3. Both switches were unable to transmit a BPDU because Port Security blocked all traffic from unknown unicast MAC addresses.
  4. Since no BPDU was sent or received, BPDU Guard was unable to take effect and the port was not error disabled.

The BPDU Guard feature would have worked fine as long as both end ports were not configured with PortFast, which put the ports immediately into forwarding state. Port Security kept the switches from transmitting a BPDU across the link. The ports were so busy trying to get their CAM tables across the link that Port Security kept kicking in and kept the BPDU from crossing the link. Once the 8th floor was isolated by disconnecting the uplinks to the switches on the Core (at 4:04 PM and 4:07 PM respectively), the loop was broken. Once the loop was broken, the Access switches were able to send BPDUs, which consequently allowed BPDU Guard to take effect and put the ports into err-disable state.
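
The sequence above can be replayed in a few lines of Python. This is a sketch of the report's account, not of verified switch internals: it assumes, as the report concluded, that BPDUs are subject to the same Port Security check as any other frame. The class SecuredPort, the function replay_loop and the MAC labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SecuredPort:
    """Toy model of one looped access port; not real switch behavior."""
    violation_mode: str                 # "restrict" or "shutdown"
    max_macs: int = 2                   # mirrors "port-security maximum 2"
    learned: set = field(default_factory=set)
    violations: int = 0
    err_disabled_by: str = ""           # empty string means the port is up

    def receive(self, src_mac: str, is_bpdu: bool = False) -> str:
        if self.err_disabled_by:
            return "ignored: port is down"
        # ASSUMPTION (taken from the report): every frame, BPDUs included,
        # has to pass the Port Security check first.
        if src_mac not in self.learned and len(self.learned) >= self.max_macs:
            self.violations += 1
            if self.violation_mode == "shutdown":
                self.err_disabled_by = "psecure-violation"
                return "violation: port err-disabled"
            return "violation: frame dropped, port stays up"
        self.learned.add(src_mac)
        if is_bpdu:
            self.err_disabled_by = "bpduguard"
            return "bpdu seen: port err-disabled"
        return "forwarded"


def replay_loop(mode: str) -> SecuredPort:
    """Replay ten looped host frames followed by the neighbor's BPDU."""
    port = SecuredPort(violation_mode=mode)
    for i in range(10):
        port.receive(f"host-{i}")
    port.receive("neighbor-switch", is_bpdu=True)
    return port


restrict = replay_loop("restrict")
shutdown = replay_loop("shutdown")
print(restrict.err_disabled_by or "up", restrict.violations)
print(shutdown.err_disabled_by or "up", shutdown.violations)
```

Under that assumption, the port in restrict mode keeps counting violations and stays up, so BPDU Guard never gets its BPDU. The same replay in shutdown mode takes the port down on the first violation.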

The recommendation I gave to my client to avoid similar outages in the future was to replace the restrict keyword in the Port Security settings with shutdown under each interface as follows:

Switch(config-if)# switchport port-security violation shutdown

This would have put the port into the err-disable state, where it stays until it is either manually re-enabled or recovered automatically by an errdisable recovery option specified globally as follows:

Switch(config)# errdisable recovery cause psecure-violation
Switch(config)# errdisable recovery cause bpduguard
Switch(config)# errdisable recovery interval 300

Alternatively, they could have disabled Port Security altogether to avoid running into this scenario.
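
One detail of the recovery option is worth spelling out: automatic recovery only applies to the causes that have been enabled for it, and a port shut down by Port Security is tracked under the cause psecure-violation rather than bpduguard. The following Python sketch shows that logic with a simplified timer. The names ErrDisabledPort and is_recovered are made up for illustration, and real switches may time the recovery slightly differently.

```python
from dataclasses import dataclass


@dataclass
class ErrDisabledPort:
    """Toy record of an err-disabled port; names are hypothetical."""
    cause: str           # e.g. "psecure-violation" or "bpduguard"
    disabled_at: float   # time the port went down, in seconds


def is_recovered(port: ErrDisabledPort, now: float,
                 recovery_causes: set, interval: float = 300) -> bool:
    """Return True if the port would have been re-enabled automatically."""
    if port.cause not in recovery_causes:
        # No recovery configured for this cause: the port stays down
        # until someone re-enables it by hand.
        return False
    return now - port.disabled_at >= interval


port = ErrDisabledPort(cause="psecure-violation", disabled_at=0)

only_bpduguard = {"bpduguard"}
both_causes = {"bpduguard", "psecure-violation"}

print(is_recovered(port, 301, only_bpduguard))   # cause not enabled for recovery
print(is_recovered(port, 299, both_causes))      # interval not yet elapsed
print(is_recovered(port, 300, both_causes))      # interval elapsed
```

Keep in mind that automatic recovery brings a port back whether or not the loop has been removed, so the interval is a trade-off between convenience and repeated flapping.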

The other recommendation I gave was to extend the Layer 3 boundary to the Access Layer. However, that redesign would have required considerable planning because the access VLANs spanned multiple wiring closets across multiple floors.

STP loops and broadcast storms have caused many a network outage and are the reason why newer technologies, such as the IETF standard TRILL, the IEEE standard SPB, and the numerous vendor-proprietary solutions, such as Cisco’s FabricPath, Brocade’s VCS, and Juniper’s QFabric, were created. However, they tend to be greenfield solutions that require hardware with redesigned chipsets and are huge investments for any enterprise.

Network Engineers are rarely made into heroes. But luckily, since this was my first day on the job and I carried no baggage, I got a pretty big thumbs up!

One thought on “A War Story on STP”

  1. Interesting post. I ran this up in the lab and indeed with ‘violation restrict’ I can connect a switch up to the port and not hit bpduguard.

    I’ll be changing my build scripts to ‘violation shutdown’ for future builds!

    Surprisingly not much else on the web describing this scenario…
