This past week, I have run into two different HA Agent alerts and issues that have thrown up alerts or caused me some administrative headaches. As a reference point, we are running vSphere/vCenter 5.1 but I feel these issues affect a broader range of products based on the KB articles I have come across.
Issue 1: Within one of our clusters, a VM was rebooted by HA because of a backup issue. I’m thankful that HA saw the issue and rebooted the VM. So quickly that our monitoring solution didn’t even notice downtime. That’s great! The alert was thrown at the cluster level for obvious reasons and put a yellow alert banner on the Summary tab of the cluster, not in the Alarms tab. The yellow banner indicated that “HA initiated a failover in <cluster> on <datacenter>”. I don’t see an alarm in the Definitions that is specific to this alert as it was for a single VM. I guess that is why it wasn’t displayed in the Alarms tab. Now how do I acknowledge and clear the alert? I discovered a KB article (2004802) and it describes my issue exactly. The cause is written as:
This issue occurs when a HA failover event occurs in the cluster, which triggers the warning message. This locks the warning message and prevents it from being removed.
I don’t like the last sentence of that cause. Why lock it? Let me acknowledge and clear the warning. As described in the resolution, I disabled HA on the cluster and re-enabled it. The alert was gone as expected.
It looks like this affects vCenter 4.0-5.5. I assume this is not seen often as clearing an alert in this manner is downright inefficient.
Issue 2: During a troubleshooting session with VMware support, I was asked to reboot a host. No big deal. After our troubleshooting completed, I noticed that DRS was not migrating the VMs back to the rebooted host. I attempted to manually vMotion a VM to the host in question but the wizard indicated that the HA Agent on the host was “Unreachable.” I did a quick search and found the following KB article (2011192). The symptom description was word for word what I was seeing from the host. Some relevant notes:
1. The host was accessible by vCenter
2. This host was the only host showing these symptoms.
3. All hosts and vCenter reside in the same VLAN.
I attempted the following to resolve the issue with no luck:
1. “Reconfigure for vSphere HA” on the host.
2. Restarted management agents on the host.
3. Rebooted the host again.
In the KB article, in mentions to restart the vCenter service. I felt this was overkill as the issue was isolated to a single host so I did not perform that troubleshooting step. Much of the resolution steps in the KB article talk about the host as Not Responding, but this was not the case.
In the end, I disabled HA at the cluster level and then re-enabled it. After that, all of the HA Agents on each of the hosts in that cluster reported back correctly.
**When in doubt, just disable HA and re-enable it across the cluster. In the vCenter HA world, it is the equivalent to rebooting a computer to clear any weird issues.**