This is one of the weirder issues I have had to deal with (and if we ever drink together I can share some really weird issues – some problems involving steel, and some salt in fact). Things are working again, but I do not have a true root cause. I am sharing my story in case it helps someone, and certainly it may provide some entertainment.
While my lab worked fine, there was some port flapping on one of the 10 Gb ports in my production array, and since a customer or two had this issue they thought to check it out on my array. So I turned all my VMs and hosts off. Not a big deal as I can do it in 20 minutes and another 20 minutes gets it back up. I had done this once this week already with no issues.
I am remote, but use LogMeIn to connect to a laptop in this environment.
When I got the word that I could continue I was pretty happy. I have several blog articles pending, and I want my lab working to confirm some things, and for demo next week as I head off to VMworld.
So I start the hosts, make sure the array is awake, and turn on the AD DC. I only have one – although I am deploying another right now – and it came up fine. But I could not ping it. I checked and the network was not connected. I remembered that from the past. It is a VM Network on a vDS and 10 Gb. I changed it to the VM Network that is on a vSS and I still could not ping. Things are immediately weird now. No more script – uncharted territory. The Windows OS was asking me about the Network Location – home or public. Of course, while it is asking that there is no network connection. So I said home. Now ping worked.
Next up is vCenter, then the View Connection and Security servers. Both of my View servers had the same network location prompt. But I could only ping them once I got them off the vDS. There were some additional odd issues now – I could ping some servers and not others. Some I could ping by both FQDN and short name, and some only by short name.
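For anyone who wants to check the short name versus FQDN symptom without pinging by hand, here is a minimal Python sketch. It is not what I ran that night. The function names and the host and domain names in the usage note are made up for illustration, and it only checks name resolution, not whether the machine actually answers a ping.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional


def resolve(name: str) -> Optional[str]:
    """Return the first IPv4 address for name, or None if the lookup fails."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def compare_names(
    short_names: Iterable[str],
    domain: str,
    resolver: Callable[[str], Optional[str]] = resolve,
) -> Dict[str, Dict[str, Optional[str]]]:
    """Resolve each host by short name and by FQDN and record both answers."""
    report = {}
    for short in short_names:
        fqdn = f"{short}.{domain}"
        report[short] = {"short": resolver(short), "fqdn": resolver(fqdn)}
    return report


def inconsistent(report: Dict[str, Dict[str, Optional[str]]]) -> List[str]:
    """Hosts whose short name and FQDN gave different answers (or only one resolved)."""
    return [host for host, r in report.items() if r["short"] != r["fqdn"]]
```

Calling `inconsistent(compare_names(["dc01", "vc01"], "lab.local"))` (hypothetical names) would list the machines where only one form of the name resolves, which is the pattern I was seeing. The `resolver` argument is there so the logic can be tried with a plain dictionary instead of real DNS.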
My storage is all NFS on its own VLAN, and my Management is not on a VLAN, but on 1 Gb, and all of that was fine. The storage is all IP based and not DNS based, which may be important, but it does use the vDS 10 Gb.
Everything was red in View, since it could not properly see the domain, vC, Connection Server or Security Server. But it was running. I could ping the vC but the C# client could not connect – timeout error. And the Web Client could connect, but I could never log in successfully. Always got authentication errors.
By this point it is getting late, and it has been a long day, and I put out sort of a moan on Twitter and fairly quickly Rob and Rob say nice things and I appreciate that very much.
It is stressful since I am in class next week and VMworld the following week and need to do demos and show off the product. Plus I need to confirm some things are working. So late at night and no idea of how to make things work.
In the morning Olif checks the network gear and arrays and all is good – minor DNS issue but that is expected.
So I am desperate and now I say reduce and test so I can call VMware and talk about things better. So I move the SQL server, vC, CS, SS, AD, and a desktop to one server and make sure all are on the same vSS. Same odd behavior. Some odd ping where short works but not long, and issues getting to vC. So I restart each of the VMs. Then I turn on only the AD and ping everything – meaning hosts and arrays – and it all works. Then I turn on vC and ping everything and it all works. Meaning hosts, arrays, and DC. Can I connect to vCenter? Yes, Holy Crap. I turn on the View servers and I can actually connect in. No more LogMeIn. I now power on some of the appliances and machines that are still connected to 10 Gb and they all work. I even put some of the original machines like AD and View back on the 10 Gb vDS and it all still works.
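The "turn on one thing, ping everything, then turn on the next" routine can be written down as a small Python sketch. This is only an outline under assumptions: `staged_bring_up`, the `start` callables, and `is_reachable` are hypothetical names, and how you actually power on a VM or test reachability (ping, a port check, whatever you trust) is left to you. It also does not wait for a guest to finish booting, which a real version would need.

```python
from typing import Callable, List, Sequence, Tuple

# A stage is a name (for example "AD") plus a function that powers it on.
Stage = Tuple[str, Callable[[], None]]


def staged_bring_up(
    stages: Sequence[Stage],
    base_targets: Sequence[str],
    is_reachable: Callable[[str], bool],
) -> List[Tuple[str, List[str]]]:
    """Start stages one at a time and check every known target after each.

    After a stage starts, its own name joins the list of targets, so later
    stages are also checked against the machines started before them.
    Stops at the first stage that leaves anything unreachable.
    """
    targets = list(base_targets)
    results = []
    for name, start in stages:
        start()
        targets.append(name)
        failed = [t for t in targets if not is_reachable(t)]
        results.append((name, failed))
        if failed:
            break  # do not power on anything else until this is understood
    return results
```

With the hosts and arrays as `base_targets` and AD, vC, and the View servers as stages in that order, the result shows which step first broke reachability, which is the kind of thing I wanted to be able to tell support.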
Now I am happy and quite nervous. So I have some breakfast, and warn my wife that, guests tonight or not, I get the bone (of the big Prime Rib roast – she does mention there are three so I am ok) and a lot of wine.
I touch everything again and still good. So I use Support Assistant to submit an SR so I can have someone look at the logs and see what the heck. And I cannot. Somehow our access to support is gone. We may be a TAP Elite partner, and presenting with VMware, and Dave Siles and I know a lot of people at VMware, in fact it is filled with friends of mine. And yet no support. So I figure to take a break and write this. I must admit I skipped a bunch of stuff that I did not think important enough to share.
So all this above is fact. Here is some guesswork.
- I think vDS did not work at first. Maybe due to host restarts and vC not running.
- AD – my DC and DNS – was frazzled somehow. So not everything could authenticate. Think network related.
- vSS – VM Network might have been frazzled somehow. And the VMs restarting may have got different ports or something. But it seemed to help.
I used to do this same sort of stuff, in fact on the same type of gear – although I didn’t have DataGravity storage – and it was sure nice having a few operations guys, and a few fellow architects to talk with. And when that did not work, I could pace a little and bump into people like William Lam and Kit Colbert. Even sometimes Duncan Epping. Doing it alone is a bit tougher, so I do appreciate very much Rob Nelson saying I was not alone and offering to help, along with Rob Nolan. Thanks very much. And yes, I do miss PromB where I used to run into those smart and cool dudes.
I do have some new storage coming in – thanks to my boss Dave – which means I will not have to turn everything off and I think that should be helpful.
If I can get the logs up to VMware support and chat with them, and I learn anything, I will be sure to update this article.
If anyone has thoughts or ideas on any of this, please leave them in the comments – I will watch for them carefully. And if they help someone else avoid this it would be great.
And yes, you can laugh at me and how this went. Even I can laugh a little.
- 8/26/15 – thanks to the comments, and to Google, I found some interesting articles to help me understand a bit of what happened and how to mitigate it in the future if it happens again. First via Michael Webster I found this, which is old but still educational, and from Duncan Epping I found this and this. Again very educational. And finally I found this. So lots of education and even a potential mitigation. I will now make it a standard practice to have a distributed port group with ephemeral binding in my labs. If I have the issue again, I will put my vC and domain controllers on that port group. Once everything is working again, I will put them back where they started, since as per the last article it is best to use ephemeral ports as a rescue and not for long-term usage.
=== END ===