8.1 vSphere High Availability

vSphere High Availability (vSphere HA) operates at the cluster level. When enabled, it monitors the vSphere hosts for failures and automatically restarts the affected virtual machines when necessary. vSphere HA monitors all hosts in the cluster through both network and datastore heartbeats. If the network heartbeat between a slave node and the master node is lost, the master node uses the datastore heartbeat to verify whether the slave node is still operational. If the datastore heartbeat has stopped as well, the slave node is determined to have failed, and the master node begins restarting the appropriate virtual machines on other nodes in the cluster. Because these restarts consume resources on the surviving hosts, the vSphere cluster must hold unused capacity in reserve. The VMware Cloud Provider Program platform uses this core vSphere functionality as the foundation for service availability to its enterprise business tenants.
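The failure-detection sequence described above can be illustrated with a minimal Python sketch. It is a simplified model for design discussion only, not the vSphere HA implementation. All names (HeartbeatStatus, classify_host, should_restart_vms, reserved_fraction) are hypothetical, and the capacity calculation assumes that every host in the cluster is the same size.

```python
from dataclasses import dataclass
from enum import Enum


class HostState(Enum):
    LIVE = "live"
    UNREACHABLE_BUT_ALIVE = "network heartbeat lost, datastore heartbeat present"
    FAILED = "failed"


@dataclass
class HeartbeatStatus:
    """Heartbeats for one slave node as observed by the master (hypothetical model)."""
    network_ok: bool    # network heartbeat still received
    datastore_ok: bool  # datastore heartbeat still being updated


def classify_host(status: HeartbeatStatus) -> HostState:
    """Apply the two-step check: network heartbeat first, then datastore heartbeat."""
    if status.network_ok:
        return HostState.LIVE
    # Network heartbeat lost: fall back to the datastore heartbeat.
    if status.datastore_ok:
        return HostState.UNREACHABLE_BUT_ALIVE
    # Both heartbeats have stopped: the host is treated as failed.
    return HostState.FAILED


def should_restart_vms(status: HeartbeatStatus) -> bool:
    """Virtual machines are restarted elsewhere only when the host has failed."""
    return classify_host(status) is HostState.FAILED


def reserved_fraction(total_hosts: int, host_failures_to_tolerate: int) -> float:
    """Fraction of cluster capacity to keep unused, assuming equally sized hosts."""
    if not 0 <= host_failures_to_tolerate < total_hosts:
        raise ValueError("tolerated failures must be between 0 and total_hosts - 1")
    return host_failures_to_tolerate / total_hosts


if __name__ == "__main__":
    lost_both = HeartbeatStatus(network_ok=False, datastore_ok=False)
    print(classify_host(lost_both).name, should_restart_vms(lost_both))
    print(reserved_fraction(total_hosts=8, host_failures_to_tolerate=1))
```

Under the equal-size assumption, an eight-host cluster that must tolerate one host failure keeps one eighth of its capacity unused. Clusters with hosts of different sizes require a more conservative calculation.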

In a vSAN-based cluster, this mechanism works slightly differently. On a vSAN platform, datastore heartbeats become irrelevant, and the vSphere HA agent communicates over the vSAN network instead of the management network. However, the host still uses the management gateway to detect whether it has become isolated. For more information about vSphere HA behavior in a vSAN-enabled cluster, see the vSAN Support Center at http://www.vmware.com/uk/support/virtual-san.
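The difference between the two cluster types can be summarized in a short Python sketch. It is an illustrative model under stated assumptions rather than a description of the agent's internals. The names HaChannels, ha_channels, and is_isolated are hypothetical, and the isolation rule shown (no HA agent traffic and no reply from the management gateway) is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HaChannels:
    """Which paths vSphere HA relies on in a given cluster type (hypothetical model)."""
    agent_network: str              # network that carries HA agent traffic
    datastore_heartbeat_used: bool  # whether datastore heartbeats play a role
    isolation_check: str            # address the host tests to detect isolation


def ha_channels(vsan_enabled: bool) -> HaChannels:
    """Return the communication paths for a vSAN or non-vSAN cluster."""
    if vsan_enabled:
        # HA agent traffic moves to the vSAN network; datastore heartbeats drop out.
        return HaChannels("vsan", False, "management gateway")
    return HaChannels("management", True, "management gateway")


def is_isolated(agent_traffic_seen: bool, gateway_reachable: bool) -> bool:
    """Simplified rule: a host is isolated only if both checks fail."""
    return not agent_traffic_seen and not gateway_reachable


if __name__ == "__main__":
    print(ha_channels(vsan_enabled=True))
    print(is_isolated(agent_traffic_seen=False, gateway_reachable=True))
```

In this model, a vSAN host that loses HA agent traffic on the vSAN network but can still reach the management gateway does not declare itself isolated. The design therefore needs to consider how failures on the two networks relate to each other.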

Designing a scalable and reliable vSphere HA environment for a service provider does not differ significantly from designing one for an enterprise business customer. However, a service provider’s platform is inherently heterogeneous, because the provider has no control over the applications that tenants provision.

Each vSphere cluster within the VMware Cloud Provider Program is typically configured for vSphere HA so that virtual machines are recovered automatically if an ESXi host fails, or if an individual virtual machine fails. An exception might exist where a lower-cost service offering is provided to tenants without a strict availability clause in its SLAs.