Solving issues using CloudPhysics

Hello all,

I know I said this was the last post of the year, but I thought of some cool things I wanted to share, so here we are.  I had some interesting issues when I was at my previous employer, and I am going to share how I solved them using CloudPhysics.  I am not going to show you screenshots of the issues at my previous employer, for obvious reasons I think, but I will show screenshots from my own lab.  One of the other issues I will discuss was found in my lab, so that makes the screenshots easy.

Time Crisis
One of my jobs was to build a pod that could be duplicated a bunch of times for onsite beta, and a whole lot of times for hosted beta.  These are two very important tools for increasing the amount and quality of feedback during the alpha and beta process of developing software.  Our first issue occurred in this environment.  One pod, which is really a vApp with about 13 virtual machines in it, firewalled off from all of the other vApps, had an issue: the application that was in onsite alpha would not let users log in.  I realized quickly that it was a time sync (or lack of time sync) issue.  All of the virtual machines in the vApp were synced except for vC / SSO.  The time was only 3 or 4 minutes off, but that was all it took for this application to refuse logins.  So two virtual machines were wrong, and 10 or 11 were right.  Interesting.  I thought I knew what the issue was.  When virtual machines start, they take the time from the host they start on and use it to populate the BIOS time, which then feeds the OS time.  Once the VM has started up, NTP or Windows starts taking care of the OS time, if it can.  It looked like maybe one host had bad time.  We had no customers in our lab yet, so while I thought I knew the problem and the solution, I was worried about the other vApps and whether they might have the same issue.  Potentially 24 customers would have been impacted by this.

I had just had CloudPhysics added to this environment, or a part of it anyway.  I thought I would be able to use it to find out the status of NTP on the hosts.  And I could.  This is what I saw, as shown in my lab and not my previous employer's.

time1

The arrow points at the card I am going to use.

time

Above we can see the list of my four hosts.  Soon to be 5 I hope.  Notice how the three NTP servers look the same here for each host?  At work I had one entry that did not look like the rest: all three NTP servers were on one line.  It turns out that all three NTP servers had been added without commas, and NTP couldn’t handle it.  I was looking at a very large number of hosts, and it was very easy to skim down the list and find something that did not match.  I then connected to that host and fixed the NTP settings.  While I was using CloudPhysics to solve this issue, I did have a plan B.  A friend (thanks Alan!) did an interesting script for me.  I will be sharing that soon.  It would have worked quite well too.  I would have used one script to find the issue, and then another to fix it.  The fix script would have touched each server and taken quite a lot of time, unless I edited it.  So CloudPhysics was faster for finding the problem, and fixing it by hand was easier for me in this case.  If there had been multiple problem hosts, the script would have been the better way to fix them.
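If you want to do the same "skim down the list and find the one that does not match" check in code, here is a minimal Python sketch.  It is not Alan's script and it does not talk to CloudPhysics or vCenter.  It assumes you have already exported each host's NTP server entries into a dictionary.  The host names, server names and the function name `find_ntp_outliers` are all made up for illustration.

```python
from collections import Counter


def find_ntp_outliers(ntp_by_host):
    """Return {host: reason} for hosts whose NTP server list looks wrong.

    ntp_by_host maps a host name to its list of NTP server entries.
    """
    if not ntp_by_host:
        return {}
    # Assumption: the most common configuration is the intended one.
    counts = Counter(tuple(servers) for servers in ntp_by_host.values())
    expected = counts.most_common(1)[0][0]

    outliers = {}
    for host, servers in ntp_by_host.items():
        # Several names inside one entry is the "added without commas" case.
        if any(len(entry.split()) > 1 for entry in servers):
            outliers[host] = "several servers in one entry (missing commas?)"
        elif tuple(servers) != expected:
            outliers[host] = "differs from the most common configuration"
    return outliers


# Hypothetical inventory; these host and server names are placeholders.
good = ["0.pool.ntp.org", "1.pool.ntp.org", "2.pool.ntp.org"]
inventory = {
    "esx01.lab.example": list(good),
    "esx02.lab.example": list(good),
    "esx03.lab.example": ["0.pool.ntp.org 1.pool.ntp.org 2.pool.ntp.org"],
    "esx04.lab.example": list(good),
}

for host, reason in find_ntp_outliers(inventory).items():
    print(f"{host}: {reason}")
```

Treating the most common configuration as the expected one is only a guess that works when most hosts are right, which was the case here.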

BTW, the software was vCAC, so remember this story: it is sensitive to time!

Servers not reporting to Syslog
I wanted to be able to do a monitoring check in CloudPhysics and see the status of a number of things.  NTP was a big help, as it was something I was nervous about in my previous job.  But I was also concerned about whether all of my hosts, hundreds of them, were reporting to Syslog properly.  I thought some were missing.  I knew I could have used a script (and yes, I have a good one I am going to share soon), but I did not want a script.  I wanted to look where I looked for other things: CloudPhysics.  I met some of the CloudPhysics dev guys at a VMworld party and they said no problem.  They were pretty sure they had the information and just had to surface it, which I learned means to create or update a card.  They said it would be there the next morning.  I held myself back until around lunch, and there it was.

syslog1

Below is a screenshot of my own small lab, and not the huge environment I did this in originally.

syslog

So again, I was able to scroll and very quickly see which hosts were not configured correctly.  I ended up finding approximately 100 hosts that were not configured right.  Of course that was too many to fix myself, so the Operations crew helped out.  I believe they fixed them with Host Profiles.  I had a script as plan B.
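The same kind of check is easy to sketch in Python if you have the per-host syslog setting exported somewhere.  This is not the script I mentioned above, and it does not query CloudPhysics or the hosts themselves.  The dictionary, the host names, the target `udp://syslog.lab.example:514` and the function name `hosts_with_bad_syslog` are assumptions for illustration only.

```python
def hosts_with_bad_syslog(syslog_by_host, expected_target):
    """Split hosts into (missing, wrong) lists.

    missing: no syslog target configured at all.
    wrong:   a target is set, but it is not the expected one.
    """
    missing, wrong = [], []
    for host in sorted(syslog_by_host):
        # Treat None and whitespace-only values as "not configured".
        target = (syslog_by_host[host] or "").strip()
        if not target:
            missing.append(host)
        elif target.lower() != expected_target.lower():
            wrong.append(host)
    return missing, wrong


# Hypothetical export of the syslog setting per host.
expected = "udp://syslog.lab.example:514"
settings = {
    "esx01.lab.example": "udp://syslog.lab.example:514",
    "esx02.lab.example": "",
    "esx03.lab.example": "udp://old-syslog.lab.example:514",
    "esx04.lab.example": None,
}

missing, wrong = hosts_with_bad_syslog(settings, expected)
print("No syslog target:", missing)
print("Unexpected target:", wrong)
```

Keeping "missing" and "wrong" apart is just a design choice.  With around 100 hosts to hand over to someone else, it helps to say which ones have nothing set and which ones point somewhere unexpected.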

Hosts not manageable with Dell’s OpenManage Plug-in for vCenter
This is a tool I used in the 1.x days to help manage my hosts, and it was very cool indeed.  I believe it was the best tool around, as it was so well integrated with vC and easy to use.  When I told it to update host firmware, it would schedule the update with Maintenance Mode if I wanted, or for the next reboot.  Plus, if something like a host power supply died, it could put that host into maintenance mode.  Version 2 of this product was a vSphere Web Client version, so I was eager to try it.  So I did.  It would only manage one of my hosts and none of the other Dell hosts.  So I was curious.  It turns out the Release Notes mentioned that you needed a certain version of BIOS for this tool to recognize and work with hosts.  Not really very logical to me, but OK.  Again, I looked at CloudPhysics.  I had just seen something there about host BIOS.

bios1

bios

I saw right away that I needed to update the BIOS on some of my hosts, and of course the only way I knew to do that without an outage was the tool that could not talk to them at this BIOS level.  Wow.
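Here is a minimal Python sketch of that "which hosts are below the required BIOS level" check.  It assumes the BIOS versions are dotted numbers such as 2.9.1, which will not be true for every vendor or model.  The host names, versions, the minimum version and the function names are made up, and the real minimum comes from the Release Notes.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '2.10.0' into (2, 10, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def hosts_below_minimum(bios_by_host, minimum):
    """Return the sorted host names whose BIOS version is below minimum."""
    floor = parse_version(minimum)
    return sorted(
        host
        for host, version in bios_by_host.items()
        if parse_version(version) < floor
    )


# Hypothetical inventory and a hypothetical minimum version.
bios = {
    "esx01.lab.example": "2.10.0",
    "esx02.lab.example": "2.9.1",
    "esx03.lab.example": "1.12.0",
}

print(hosts_below_minimum(bios, "2.10.0"))
```

Comparing tuples of integers rather than the raw strings matters here: as plain text, "2.9.1" sorts after "2.10.0", which would hide a host that really does need the update.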

But I did get the hosts updated (manually) and got the tool working.  Then the eval license expired, and then I bought an Intel host.  So the tool is not so useful to me now, but it is still very important and key for Dell shops.

Time Machine
I thought that this was secret, but I do see it on a lot of the newer cards, so I am glad to be able to talk about it.  This is a big deal.  You can see it on the screenshot above of my host BIOS information.  You can use the drop-down list of the Time Machine field to see what the values were as of each date / time.  Being able to know when something changed is a very important and useful troubleshooting capability.  So I love this feature and look forward to seeing it on all of the cards.  Some of the other cards that currently have Time Machine include Cluster CPU Inventory, Cluster Overall Alarm Status, Cluster Overview, and ESXi Hosts Information.
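To show why knowing when something changed is so handy, here is a minimal Python sketch of the idea.  It is not how Time Machine is implemented, which I do not know.  It simply assumes you have a list of dated snapshots of one value, and the dates, values and the function name `find_changes` are made up for illustration.

```python
def find_changes(snapshots):
    """Return (timestamp, old_value, new_value) for every change.

    snapshots is a list of (timestamp, value) pairs. Timestamps are
    ISO date strings here, so sorting them as text sorts them by date.
    """
    ordered = sorted(snapshots, key=lambda item: item[0])
    changes = []
    # Walk through neighbouring snapshots and record each difference.
    for (_, previous), (timestamp, value) in zip(ordered, ordered[1:]):
        if value != previous:
            changes.append((timestamp, previous, value))
    return changes


# Hypothetical history of one host's BIOS version.
history = [
    ("2000-01-03", "2.10.0"),
    ("2000-01-01", "2.9.1"),
    ("2000-01-02", "2.9.1"),
    ("2000-01-04", "2.10.0"),
]

for timestamp, old, new in find_changes(history):
    print(f"{timestamp}: {old} -> {new}")
```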

Today we have looked at how CloudPhysics can help you solve problems by giving you the information you need to decide what the issue is, or perhaps what the solution might be.  I find this tool very useful, and I try to look at it daily to see how my lab is behaving.

I hope that this helps you with CloudPhysics and your infrastructure, or maybe gets you excited and interested enough to try out the 30-day eval, as it is a very cool product.

Michael

Posted in CloudPhysics, Home Lab
