A couple of weeks ago, our Network Operations team noticed numerous MAC addresses flapping on their Cisco switches. Because every switch stack in every row was seeing the issue, we began investigating where these MACs lived in the data center. An example of what we saw is listed below:
|10/24/2012 2:27:19 PM||appsw01-c8-gis-omaedc||Warning||1622524: . Host 0021.5add.383d in vlan 700 is flapping between port Po9 and port Po8|
|10/24/2012 2:27:19 PM||appsw01-c8-gis-omaedc||Warning||1622523: . Host 0025.b382.2561 in vlan 707 is flapping between port Po37 and port Po36|
|10/24/2012 2:27:19 PM||appsw01-c8-gis-omaedc||Warning||1622520: . Host 0025.b382.2561 in vlan 425 is flapping between port Po37 and port Po36|
|10/24/2012 2:27:19 PM||appsw01-c8-gis-omaedc||Warning||1622522: . Host 0025.b382.2561 in vlan 703 is flapping between port Po37 and port Po36|
|10/24/2012 2:27:19 PM||appsw01-c8-gis-omaedc||Warning||1622521: . Host 0025.b382.2561 in vlan 450 is flapping between port Po37 and port Po36|
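With hundreds of these entries per switch, it helps to tally them by MAC address before hunting for the source. The following is a minimal sketch, not the tooling we actually used: it assumes the log has been exported as text in the format shown above, and the names `FLAP_RE`, `parse_flaps`, and `summarize` are hypothetical helpers made up for illustration.

```python
import re

# Matches the message portion of the switch log entries shown above.
# Port names stop at whitespace or at the trailing "|" of the export format.
FLAP_RE = re.compile(
    r"Host (?P<mac>[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}) "
    r"in vlan (?P<vlan>\d+) is flapping between "
    r"port (?P<port_a>[^\s|]+) and port (?P<port_b>[^\s|]+)",
    re.IGNORECASE,
)


def parse_flaps(lines):
    """Yield (mac, vlan, port_a, port_b) for each line that reports a flap."""
    for line in lines:
        m = FLAP_RE.search(line)
        if m:
            yield m["mac"].lower(), int(m["vlan"]), m["port_a"], m["port_b"]


def summarize(lines):
    """Count flap events per MAC and collect the VLANs and port pairs involved."""
    summary = {}
    for mac, vlan, port_a, port_b in parse_flaps(lines):
        entry = summary.setdefault(mac, {"count": 0, "vlans": set(), "ports": set()})
        entry["count"] += 1
        entry["vlans"].add(vlan)
        # Sort the pair so "Po37/Po36" and "Po36/Po37" count as the same pair.
        entry["ports"].add(tuple(sorted((port_a, port_b))))
    return summary


# The five entries from the example above.
SAMPLE = [
    "|10/24/2012 2:27:19 PM||appsw01-c8-gis-omaedc||Warning||1622524: . Host 0021.5add.383d in vlan 700 is flapping between port Po9 and port Po8|",
    "|10/24/2012 2:27:19 PM||appsw01-c8-gis-omaedc||Warning||1622523: . Host 0025.b382.2561 in vlan 707 is flapping between port Po37 and port Po36|",
    "|10/24/2012 2:27:19 PM||appsw01-c8-gis-omaedc||Warning||1622520: . Host 0025.b382.2561 in vlan 425 is flapping between port Po37 and port Po36|",
    "|10/24/2012 2:27:19 PM||appsw01-c8-gis-omaedc||Warning||1622522: . Host 0025.b382.2561 in vlan 703 is flapping between port Po37 and port Po36|",
    "|10/24/2012 2:27:19 PM||appsw01-c8-gis-omaedc||Warning||1622521: . Host 0025.b382.2561 in vlan 450 is flapping between port Po37 and port Po36|",
]

if __name__ == "__main__":
    for mac, info in summarize(SAMPLE).items():
        print(mac, info["count"], sorted(info["vlans"]), sorted(info["ports"]))
```

Grouping by MAC makes it easier to see that a small number of addresses account for most of the noise, and the port-channel pairs point at which uplinks to trace back to an enclosure.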
After digging, we found that the flapping address was the stacking link MAC address of the Virtual Connect modules in our HP c7000 enclosures. Next, I had to determine whether it was coming from every enclosure (29 in total) or only from certain enclosures that had something in common. I emailed our HP Account Support Manager to describe the issue and ask whether he had any prior experience with it. He mentioned that he had seen similar instances related to ESX servers, NIC drivers, or LLDP packets not being handled correctly.
Our investigation showed that enclosures without ESX servers were also causing the issue, so ESX was unlikely to be the cause. We doubted the NIC driver explanation as well, since the MAC belonged to the stacking link. Our network team initially went down the LLDP route, but it resulted in no change. We called HP support to dig further, and one of the engineers provided the following customer advisory. The description matched our issue: the one thing we had noticed was that the enclosures still on VC 3.15 (we are in our last month of VC upgrades to 3.60, just in time to start upgrades to 3.70!) were not causing the problem. The advisory indicates that the Network Loop Protection setting was introduced in version 3.51 and affects later versions. The NLP frame being transmitted every five seconds also aligned with what we saw in the logs.
HP support could not say whether any pings would be lost when the setting was disabled. The advisory's description is a bit vague on that question. Both HP support and our Account Support Manager said it shouldn’t cause any loss, but neither had firsthand knowledge of whether disabling the setting would disrupt the network. We therefore scheduled the change for late at night.
Good news followed immediately. When I applied the change, no pings were lost and our switch logs began clearing up. Days later, there is still no flapping and our switch CPU usage has dropped to normal levels. Unfortunately, it doesn’t look like this issue is listed among the fixes in VC 3.70.