My vSphere Best Practices

Hi all,

I have been thinking of doing this for a while. I recently visited a customer who used VMware but not HA, DRS, or VUM. When I showed him what they could do he was excited, and he is now using them. That got me thinking about other things I had seen, and I realized it was a good time to share my own vSphere best practices. These are things that I think are very useful and appropriate. I use them in the lab, and I also used them at customer sites when I was a Professional Services guy. This is not meant to suggest you should follow them blindly, but I do suggest they are well worth thinking about!

Update: someone asked why. VMware works when you install it, so why have these BPs? Well, they improve vSphere in the areas of security, performance, troubleshooting, reliability, and supportability.

  • Do a memory burn-in using a tool such as memtest86, not the BIOS tools. I used to do this over the weekend in the old days, but now I just do it overnight.
  • BIOS
    • all virtualization features should be enabled – usually found in the Processor config
      • Execute Disable Bit (NX on AMD) enabled
      • Intel VT-x or AMD-V enabled (along with EPT or RVI if the BIOS has a setting for them)
    • In Power and Performance use Performance (or OS Controlled), with more info here
  • Your vCenter and hosts should all be in DNS before you start.
  • DNS should be healthy – your vC and hosts should all be resolvable by long name (FQDN) and short name, forward and reverse.
  • Time needs to be consistent, and then if possible correct. The PDC emulator in your Active Directory domain should point at an edge router, or at NTP on the Internet, and the ESXi hosts should do the same.
    • Canada – 0.ca.pool.ntp.org, 1.ca.pool.ntp.org, 2.ca.pool.ntp.org
    • US – 0.us.pool.ntp.org, 1.us.pool.ntp.org, 2.us.pool.ntp.org
  • If you boot from SD, like I do, even if it is mirrored like mine, or if you boot from USB, you should do some additional things.
    • Move your /scratch to somewhere on disk. Normally it will not use USB or SD but rather memory, so it is good to move it. See this for more info.
    • Set up an external core dump target – see here for more info.
    • Redirect your logs to an external log server – via Syslog – I use Log Insight but any syslog destination will work.
    • BTW, I do these three things even if I am not using SD.
    • If you are using vSphere 6.0 U3 (and maybe later, such as vSphere 6.5) you can use this info to redirect VMware Tools to a RAM disk to minimize writes to your SD or USB. BTW, if you don’t use SD, which I think you should, and you use USB instead – like I do in two of my home servers – you should use this USB as I believe it is the best / most affordable quality.
  • You can skip the extra clicks to get into the vCenter Web Client. See here. That fixes it for you, but if you want to fix it for everyone use this. You could also do a custom cert!
  • Once you install vC (if it is a VCSA) you should regenerate the cert so that it matches your host name (log into VAMI for this). I do this right away.
  • On your second vCenter, you need to change the vCenter ID (vC Settings area) to something different from the first.
  • I always configure SMTP and Runtime Extensions (vC Settings) when I first install vCenter.  Never know when you will need them working!
  • DRS and HA should be in use!
    • I like to use % based admission control, normally around 20%. Update: if you use this option you should do the math on CPU / RAM reservation and use for every VM and for the hosts, and pick a realistic number. If you monitor the utilization of your hosts with vR Ops or some other tool, and you think you have a good handle on what normal utilization looks like, including when doing VUM patching or when a host has failed, you may want to disable HA Admission Control (I am starting to do this now – see this for more info).
    • I like to use Fully Automated DRS.
    • After vSphere is installed, vMotion a VM to and from every host – this makes sure networking, labeling and spelling are all good.
    • You should PSOD a host to confirm the VM(s) come back to life (thanks to HA). It is good to know that HA works properly. You can find out here how to do a PSOD.
    • Use Distributed Power Management (DPM) if reasonable, but remember it is not supported with VSAN.
  • I use folders to organize, and don’t use Resource Pools to do that.
  • I really like tags and notes and I use them.  See more here.
  • If you are short on networks or VLANs you can put vMotion and FT on the same isolated network – but if you use FT much, put it on a separate network.
  • Use 10 GbE if possible, but if not, use two network cards for the vMotion network, two more for the storage network, and two more for the VM network.
  • Where possible use LAG and remember that it requires switch support.  Learn more about this topic in this great book.
  • All of a virtual machine’s VMDKs should be in the same folder – unless there is a reason not to do that. And question the reason hard.
  • Don’t have SSH enabled on your hosts unless you need it. When done, disable it. That keeps the attack surface small, and the small yellow icon that shows while SSH is enabled can obscure things in certain views.
  • If you have vR Ops, make sure to have VIN installed – here is why. This will allow you to know what apps are in use on a host or VM before you power it off. Also, you can see the health of a host or VM on the Summary page of Web Client. Good way to catch issues sometimes.
  • Use a log aggregation tool – I use and like Log Insight. Get all logs to it. ESXi, vC, storage, and network. Plus anything else. Will help you to troubleshoot, AND will help your support people support you better.
  • Unless I am in a customer situation that is very paranoid, I will enable TPS to improve memory consolidation and memory usage. In home and office labs in particular this is often very useful. See how in this. Remember you make this change on each host, and it requires the VMs to be powered off and on, or vMotioned off and back onto the host. I suggest doing this at the beginning when there are no VMs.
  • NFS and iSCSI should be on a separate, private, non-routed VLAN. No CHAP necessary for iSCSI. If you have more than one array, or one array that is very large, you should have separate private networks for NFS and iSCSI.
  • You should install and use VMware Support Assistant. Even if you don’t use it to create / update / manage support calls it can upload your logs weekly to be checked for known issues. See more info here. I really like this tool. BTW, you can see all of the articles I have done about this wonderful tool using this tag.
  • You may have more than one Active Directory Domain Controller in your cluster, and if so create a DRS anti-affinity rule to keep them separated. But use a soft rule – should (or even ignore) – so that in case of patching or whatever they will be moved if necessary. See how to do this.
  • If you have a web server that talks to a database server and the web server is important, use a DRS affinity rule to keep them together – on the same host they will talk much quicker than going through physical switches.
  • For a customer consider the VMware HCL to be key, but for a home lab only a guideline.
  • If you use Dell gear, it doesn’t cost much more to get OpenManage for vCenter so make sure to get it. It really helps you be proactive with patching, using maintenance mode to do so, and it has other useful functionality as well (for example, if you want, you can configure risky hardware issues to trigger Maintenance mode!) See more here.
  • I believe in patching, and keeping my vCenter and hosts current and I use VMware Update Manager to do hosts and VM Tools / Hardware updating.
    • My patch philosophy is simple – I patch my vC and hosts pretty quickly, often 5 – 30 days after the patch is released.  But I Google to see if I am going to hit anything and I read the release notes carefully.  I also check them a couple of times as they do get updated sometimes.
    • You can find help with Install & Configure of VUM.
    • As part of setting up VUM I do in fact add my email address to the notification of newly downloaded patches, and to the notification of recalled patches.  Very handy as there was one of those recently.
  • I use the vCenter appliance.
  • I do not use RDM. Repeat after me, I do not use RDM. Unless I have to.
  • I always check for storage vendor suggestions / recommendations for the implementation of their storage. You need to be careful and understand what they are asking. Also be careful when you have multiple storage vendors, and watch for recommended settings that step on each other. For example, in my home lab I have Synology, QNAP, and FalconStor. So I need to make sure I manage all of their recommended settings so that it works, and it doesn’t negatively impact my lab!
  • Something of interest that I no longer do: I used to use the VMware CPUID tool to see what the processor supported, to make sure that the hosts in the cluster were all the same. This helped prevent or identify vMotion issues. Here is an article about this. I checked the date on the CPUID CD I have in my library – 4/11/12 so I guess it is not that old. I mention this as while I don’t use it with every server any longer, it is good to know about.
  • You might disagree or debate with this one but I install the ESXi Host Client into my servers.  I use VUM to do that normally.  Find out more here, and here.  I find this most useful. Update – it is now part of vSphere (as of 6.0 U2) so you get it now without doing VUM.
  • I suggest using externally hosted VMware Tools.  This is more of a suggestion than a best practice but I think it is a very good idea.  It allows you more flexibility and a smaller ESXi image.  If you are using Auto Deploy this would in fact be a best practice.  Learn more about this here.
  • In case Distributed Power Management (DPM) is going to be used at some point, I suggest doing a Standby / Resume power test on each host. In dev / test / home labs DPM can be quite useful. Background info on DPM here and here.
  • I suggest having PowerActions configured so you can run PowerCLI scripts really easily from within vCenter. Find out how here.
  • I suggest having vCheck running so that each morning you can get a nice snapshot of your environment.  Find out how here.
  • If you use Supermicro hardware, and are installing, or upgrading to, vSphere 6.5 you may have a performance issue on local disks configured as datastores.  This article can help you solve that issue.
  • If you forward your ESXi logs to Log Insight (or another product for that matter) you may want to use the info in this article to avoid getting your events from both the host short name as well as the FQDN.
  • Here is an article that can show you how to set alarms on the database in your VCSA, which I think is a pretty good idea. I think the default values of the alerts are good too.
  • If you use Veeam replication you will note there are duplicate MAC address conflict messages. Luca has a great article on how to hide those messages for the replicas while still showing real MAC address issues.
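The DNS item in the list above (long, short, forward and back) is easy to check with a small script before you install anything. Here is a minimal sketch in Python using only the standard `socket` module. The function name `check_dns` and the host and domain names in the usage note are made up for illustration, and the check assumes the reverse record returns the FQDN.

```python
import socket


def check_dns(short_names, domain, forward=socket.gethostbyname,
              reverse=socket.gethostbyaddr):
    """Return a list of problems found; an empty list means DNS looks healthy.

    forward and reverse default to the real resolver calls, but can be
    swapped out (for example with fakes) when there is no network.
    """
    problems = []
    for short in short_names:
        fqdn = f"{short}.{domain}"
        addresses = {}
        # Forward lookups: both the short name and the long name must resolve.
        for name in (short, fqdn):
            try:
                addresses[name] = forward(name)
            except OSError as err:
                problems.append(f"{name}: forward lookup failed ({err})")
        if len(set(addresses.values())) > 1:
            problems.append(f"{short}: short name and FQDN resolve differently")
        # Reverse lookups: each address should map back to the FQDN
        # (an assumption of this sketch; adjust if your PTR records differ).
        for ip in sorted(set(addresses.values())):
            try:
                reverse_name = reverse(ip)[0]
            except OSError as err:
                problems.append(f"{ip}: reverse lookup failed ({err})")
                continue
            if reverse_name.lower() != fqdn.lower():
                problems.append(
                    f"{ip}: reverse lookup gives {reverse_name}, expected {fqdn}")
    return problems
```

Something like `check_dns(["vc01", "esx01", "esx02"], "lab.example.com")` (hypothetical names) run from the machine you will manage vSphere from should return an empty list before you start the install.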

It is important to repeat that while I do all that is listed here and believe it worthwhile, you need to think about how it impacts you and tweak it to make it your best practice.
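As one example of tweaking a number to fit your own environment, here is a rough sketch of the math behind a percentage for HA admission control. It assumes a simple rule of thumb: reserve enough capacity to absorb the loss of your largest host(s). The function name `failover_percentage` and the host sizes are made up for illustration, and the sketch looks only at raw host capacity, so you would still want to check VM reservations and real utilization as well.

```python
import math


def failover_percentage(host_capacities, host_failures=1):
    """Percent of cluster capacity to reserve so the largest hosts can fail.

    host_capacities is one number per host (for example GB of RAM or GHz
    of CPU). Run it once for memory and once for CPU.
    """
    if not 0 < host_failures < len(host_capacities):
        raise ValueError("host_failures must be between 1 and hosts - 1")
    total = sum(host_capacities)
    # Assume the worst case: the biggest hosts are the ones that fail.
    lost = sum(sorted(host_capacities, reverse=True)[:host_failures])
    # Round up so the reservation is never short.
    return math.ceil(100 * lost / total)
```

For five identical hosts, `failover_percentage([256] * 5)` gives 20. With one host twice the size of the others, such as `[512, 256, 256, 256]`, the same call gives 40, which is why mixed clusters need a closer look.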

As vSphere changes functionality and features I will update this list.  In addition I will try and add to it by connecting it with more how to articles.

Updates:

  • 6/17/17 – added a link to Luca’s article on hiding the MAC conflicts for Veeam replicas.
  • 4/30/17 – added the link to William’s article about setting VCSA alarms on DB usage.
  • 3/14/17 – added the Update comment way above about why.
  • 3/11/17 – added a link to an article about how you can specifically set the hostname that your ESXi host will use in sending out logs. If you do not do this, you will sometimes see a FQDN and sometimes just the short name. This can be irritating but also, if you use Log Insight, increase your OSI count.
  • 3/1/17 – saw a great post on using SD more safely by John Nicholson that I added above. As of 6.0 U3 you can redirect VMware Tools to a RAM disk, which will minimize writes to the SD or USB that you boot your host from. Particularly good for smaller and lower quality USB.
  • 12/31/16 – added the link to Anthony’s article about SM local disk issues.  I have confirmed it impacts both new installs and upgrades.
  • 10/23/16 – added a link to an article that talks about admission control and updated info above to be more clear around Cluster admission control.
  • 4/23/16 – added link to my article on separating DC’s.
  • 4/15/16 – updated my comments / suggestions around using HA Admission Control.
  • 4/15/16 – updated with the PowerActions and vCheck info.
  • 3/31/16 – updated with the DPM info.
  • 3/30/16 – updated for the EHC being GA and the comment around externally hosted VMware Tools.
  • 2/10/16 – Added the hint or reminder about when you return TPS to normal, that VMs need to be restarted.
  • 2/9/16 – Added the info on VUM install, ESXi Host Client, and emails on notification.
  • 2/6/16 – Updated spelling and grammar and added comment on storage vendor recommendations.

Questions, and comments are greatly appreciated.

Michael

=== END ===

9 comments on “My vSphere Best Practices”
  1. Thanks Michael, I’m sure many of these tips will come in handy. Great post!

  2. mroushdy says:

    Thanks for the info

  3. Patrick says:

    Sorry if it is a stupid question, but what exactly do you mean by “I always configure Runtime Extensions (vC Settings)”? What are these runtime extensions?

  4. Not stupid at all. The Runtime Extensions are how applications like VIN talk to vC. By enabling the Extensions, VIN, with the proper credentials, can talk to the vC via API. With no credentials, or with the Runtime Extensions not enabled, there would be no communication. I choose to always enable it, so I am dependent on credentials to protect things, but that is normal for me and it means better troubleshooting if there is an issue.

    Michael

  5. Patrick says:

    Thanks. I will check if I can find those settings. Maybe it is a good idea to switch the web client to English instead of the German translation 🙂

  6. Nice post.

    A couple of notes on SSH and TPS:

    Enable SSH timeouts to ensure that SSH is disabled even if you forget to switch it off.
    TPS won’t be very efficient with default settings unless memory salting is disabled or set to a common value on all VMs.

  7. Another thing – I support the idea of disabling Admission Control where possible. Provided that there is a proper monitoring system and vROps is configured for capacity management, there is no need for Admission Control. Very often it gives people a false feeling of safety, making them think they have enough resources reserved while leaving default per-VM reservations.
    Here are my thoughts on Admission Control – http://vmnomad.blogspot.com/2016/04/why-i-prefer-to-disable-vsphere-ha.html

    VMware finally made some improvements to Admission Control in vSphere 6.5 by taking into consideration the amount of allocated resources and generating an alert when there is a risk that a failure can cause performance impact.
