This is one of those things that will likely not impact you, but I am writing it down just in case! Late yesterday, a vSphere host of mine died. This was in fact in the middle of a SureBackup job, and the host held my vCenter and my main View server. The server did not die completely right away – first it dropped its networks, then the networks came back, which looked like a network switch issue except there was no switch issue. After trying a few things – maintenance mode, powering VMs off – where nothing actually worked but I could at least sort of try, I decided to restart the host. I was able to log in fine and start the restart, but after it forced off the VMs and showed restarting on the console, it just stayed there for 20 or 30 minutes. After 30 minutes of no change I hard powered off the host and brought it back. I used the Host Client (love that tool) to power on the VMs, and all was good. I could access vCenter and View again, but then I had to head to dinner. I did see something that I thought might make my morning interesting: one of the VMs in the SureBackup job was still around.
So in the morning when I checked my lab email there were a lot of failures. Some were normal, like the BackupCopy jobs that had nothing to copy (because the backup jobs didn't fire), but three or four jobs did not run and all had the same error message. Here is one of them.
BTW, another look at this error – from a different VM – is seen below.
So time to figure this out. Errors around snapshots and too many disks is what I get out of this. So on this server – VUM – I will check the number of disks and see if I can take a snapshot. VUM doesn't have any extra disks, but the Veeam server certainly does. I think that when the Veeam server restarts, or a job starts, it is supposed to clean up those disks. My VUM VM has no snaps, but I also could not take one if I wanted to. Interesting.

I have restarted two hosts now and things seem to be mostly working. I can in fact take snaps on VUM OK. Veeam has been restarted but still has four extra disks. So I will try the backup for VUM again – it had both of the error messages above before I started. It took a long time – five minutes – to start the backup, so I suspect it was doing some cleanup. The VUM job went fine and finished perfectly. I am doing another one now that had the same error, and I suspect it will work fine. It also has one of the failed SureBackup jobs associated with it. While the backup job worked fine, the SureBackup job did not, and it got an interesting vSphere HA error. WTF!
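Those "extra disks" are the snapshot delta disks (the `-000001.vmdk` style files) that a hot-add backup mounts on the proxy and normally detaches when the job finishes cleanly; a host crash mid-job can leave them attached. As a rough sketch of what "check the number of disks" means in practice – using a hypothetical, hard-coded disk list rather than the real vSphere API – here is a small Python helper that flags delta disks left attached to a VM:

```python
import re

# Snapshot delta disks follow the "<base>-NNNNNN.vmdk" naming convention.
DELTA_PATTERN = re.compile(r"-\d{6}\.vmdk$")

def find_leftover_delta_disks(disk_paths):
    """Return any attached disks that look like snapshot deltas.

    A VM that reports no snapshots but still has delta disks attached
    (e.g. left behind by an interrupted hot-add backup) is a cleanup
    candidate before the next backup job runs.
    """
    return [path for path in disk_paths if DELTA_PATTERN.search(path)]

# Hypothetical example: disks as they might look on the backup proxy VM
# after a crashed SureBackup job. These names are made up for illustration.
attached = [
    "[datastore1] veeam/veeam.vmdk",
    "[datastore1] vum/vum-000001.vmdk",          # leftover delta
    "[datastore1] vcenter/vcenter-000002.vmdk",  # leftover delta
]
print(find_leftover_delta_disks(attached))
```

In a real environment you would feed this the disk backing file names reported by vCenter; anything it flags on the proxy that belongs to another VM is a disk the backup software should have cleaned up.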
Why did my host die, you ask? I have not had that behavior in my lab for years. Two things happened in the last couple of days – I applied the outstanding host patches, and I turned on VSAN. If I had support I would really love to submit my logs and chat with them, as I think there is something to learn here. This was one of the most interesting server crashes! But as a vExpert I get licenses and no support. I have grabbed full Veeam and VMware logs in case it happens again and on the off chance I get to talk to VMware support!

Oh, crap. While investigating my Veeam issue, I noticed my VUM VM is not compliant, meaning something is wrong – in this case with VSAN. A host, not the one that crashed last night, is a little confused. I can't get HA to work on it, and I cannot vMotion anything off of it – "not in that state" is the error. So while my services of View, Log Insight, and vCenter were working after the crashed host came back, not all was good! I could not put this host into maintenance mode, so it was time for a hard power reset. And again, this is not the host that crashed yesterday.

Things seem to be working better now after two host restarts. But I cannot make the VSAN-hosted VMs compliant, and when you look at VSAN health there is a lot of red. Lots of resync time is necessary to make VSAN happy now, and it is all on the host that crashed last night. I hope it improves as time goes on? I so wish I could have a support call. This has been the worst VSAN experience ever for me. And remember, I go back to when VSAN wasn't in the vSphere branch and had no UI.
So I got everything cleaned up, had to recreate a SureBackup job, and got things running again. Then, a few days later, I had the same issue. One host (a different host this time) is disconnected but looks fine. I try to shut it down and it hangs on Shutdown in progress. I guess it is time to get rid of VSAN. Not its fault, mind you.
This article is mostly about me tracking what happened, along with some of my frustration. I am happy I can recover my VMs after a host crash by restarting the host that crashed. It has happened to two hosts now, but each time I got things back. The second time nothing of Veeam was running, so that was easy. I should be able to handle a single host outage, but for some reason some VMs survive it and some don't. The ones that don't, I suspect, were on the crashed host but are not getting restarted by HA.
I so wish I could share logs with VMware! If anyone at VMware reads this and wants them or access let me know.
- Here is more on this issue.