Appendix A: Availability Considerations
vCloud availability depends on eliminating single points of failure (SPOF) in the underlying infrastructure, on having personnel with the appropriate skills available, and on having suitable operational processes in place and followed.
Table 18. vCloud Availability Considerations
Component
Availability
Failure Impact
Maintaining Running Workload
vSphere hosts
Configure all vSphere hosts in highly available clusters with a minimum of n+1 redundancy. This provides protection for the customer’s virtual machines, the virtual machines hosting the platform portal/management applications, and all of the vCloud Networking and Security Edge appliances.
If a host fails, vSphere HA detects the failure within 13 seconds and begins to power on the host’s virtual machines on other hosts within the cluster.
vSphere HA Admission Control makes sure that sufficient resources are available in the cluster to restart the virtual machines. The Percentage of cluster resources admission control policy is recommended because it is flexible while still providing resource availability.
For design guidelines on increasing availability and resiliency, see the white paper VMware High Availability: Deployment Best Practices: VMware vSphere 4.1 (http://www.vmware.com/files/pdf/techpaper/VMW-Server-WP-BestPractices.pdf).
It is also recommended to configure vCenter to proactively migrate virtual machines off a host whose health becomes unstable. Rules can be defined in vCenter to monitor host system health.
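The sizing arithmetic behind the Percentage of cluster resources policy can be sketched as follows. This is a minimal illustration that assumes all hosts in the cluster are identically sized; the function name and parameters are hypothetical and are not part of any VMware API.

```python
def reserved_failover_percentage(host_count: int, host_failures_to_tolerate: int = 1) -> int:
    """Return the whole-number percentage of cluster capacity to reserve.

    Assumption: every host contributes the same CPU and memory capacity,
    so tolerating k failures out of n hosts means reserving k/n of the cluster.
    """
    if host_count < 2:
        raise ValueError("an HA cluster needs at least two hosts")
    if not 0 < host_failures_to_tolerate < host_count:
        raise ValueError("failures to tolerate must be between 1 and host_count - 1")
    # Integer ceiling division, so the reservation is never rounded down.
    return (100 * host_failures_to_tolerate + host_count - 1) // host_count


if __name__ == "__main__":
    for hosts in (4, 8, 16):
        pct = reserved_failover_percentage(hosts)
        print(f"{hosts} hosts, N+1: reserve {pct}% of CPU and memory")
```

With hosts of different sizes, the reservation would need to cover the largest host instead, so the result here should be treated as a lower bound.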
Virtual machine resource consumption
vSphere DRS and SDRS automatically migrate virtual machines between hosts to balance the cluster and reduce the risk of a “noisy neighbor” virtual machine monopolizing CPU, memory and storage resources within a host at the expense of other virtual machines running on the same host.
vSphere Storage I/O Control automatically throttles hosts and virtual machines when it detects I/O contention, and preserves fairness of disk shares across the virtual machines in a datastore. This makes sure that a noisy neighbor virtual machine does not monopolize storage I/O resources. Storage I/O Control uses the shares mechanism to make sure that each virtual machine receives the resources it is entitled to.
No impact. Virtual machines are automatically migrated between hosts with no downtime by vSphere DRS or SDRS.
No impact. Virtual machines and vSphere hosts are throttled automatically by Storage I/O Control based on their entitlement, which is determined by the number of shares or the maximum number of IOPS configured. For more information on Storage I/O Control, see the white paper Storage I/O Control Technical Overview and Considerations for Deployment (http://www.vmware.com/files/pdf/techpaper/VMW-vSphere41-SIOC.pdf).
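The idea of entitlement by shares, capped by an optional IOPS limit, can be illustrated with a simplified model. This sketch is not the actual Storage I/O Control algorithm; it only shows how proportional shares and limits interact during contention. The function name, the input layout, and the sample values are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


def allocate_iops(
    total_iops: float,
    vms: Dict[str, Tuple[int, Optional[float]]],
) -> Dict[str, float]:
    """Split total_iops across VMs in proportion to their shares.

    vms maps a VM name to (shares, iops_limit); iops_limit may be None.
    A VM whose proportional slice exceeds its limit is capped, and the
    surplus is redistributed among the remaining VMs by shares.
    """
    if any(shares <= 0 for shares, _ in vms.values()):
        raise ValueError("shares must be positive")
    allocation = {name: 0.0 for name in vms}
    active = set(vms)
    remaining = float(total_iops)
    while active:
        total_shares = sum(vms[name][0] for name in active)
        # Find VMs whose proportional slice would exceed their limit.
        capped = {
            name for name in active
            if vms[name][1] is not None
            and remaining * vms[name][0] / total_shares >= vms[name][1]
        }
        if not capped:
            for name in active:
                allocation[name] = remaining * vms[name][0] / total_shares
            break
        for name in capped:
            allocation[name] = float(vms[name][1])
            remaining -= allocation[name]
        active -= capped
    return allocation


if __name__ == "__main__":
    demo = {"vm-a": (1000, None), "vm-b": (1000, None), "vm-c": (2000, 300)}
    print(allocate_iops(1000, demo))
```

In the demo, vm-c holds half of the shares but is capped at its 300 IOPS limit, and the other two virtual machines split the remainder equally.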
vSphere host network connectivity
Configure port groups with a minimum of two physical paths to prevent a single link failure from impacting platform or virtual machine connectivity. This includes management and vMotion networks. Use the Load Based Teaming mechanism to avoid oversubscribed network links.
No impact. Failover occurs with no interruption to service. Failover and failback must be configured, along with the corresponding physical settings such as PortFast.
vSphere host storage connectivity
vSphere hosts are configured with a minimum of two physical paths to each LUN or NFS share to prevent a single storage path failure from impacting service. The path selection plug-in is selected based on the storage vendor’s design guidelines.
No impact. Failover occurs with no interruption to service.
Maintaining Workload Accessibility
VMware vCenter Server
vCenter Server runs as a virtual machine and makes use of vCenter Server Heartbeat.
vCenter Server Heartbeat provides a clustered solution for vCenter Server with fully automated failover between nodes providing near zero downtime.
VMware vCenter Database
vCenter database resiliency is provided by vCenter Server Heartbeat if Microsoft SQL Server is used, or by Oracle RAC.
vCenter Server Heartbeat or Oracle RAC provides a clustered solution for the vCenter database, with fully automated failover between nodes providing zero downtime.
vCloud component databases (vCloud Director and Chargeback)
VMware vCloud component database resiliency is provided through database clustering. Microsoft Cluster Service for SQL and Oracle RAC are supported.
Microsoft Cluster Service and Oracle RAC support the resiliency of the vCloud Director and Chargeback databases, which maintain vCloud Director state information and the critical Chargeback data required for customer billing, respectively. Although not required to maintain workload accessibility, clustering the Chargeback database protects the ability to collect chargeback transactions so that providers can accurately produce customer billing information.
VMware vCenter Chargeback
Multiple Chargeback, vCloud, and vCloud Networking and Security Manager data collectors are installed for active/passive protection.
If one of the data collectors goes offline, the other picks up the load so that transactions continue to be captured by vCenter Chargeback.
vCloud Infrastructure Protection
vCloud Networking and Security Manager
VM Monitoring is enabled at the cluster level within HA and uses the VMware Tools heartbeat to verify that virtual machines are alive. When a virtual machine fails and the VMware Tools heartbeat is no longer updated, VM Monitoring checks whether any storage or networking I/O has occurred over the last 120 seconds before restarting the virtual machine.
It is highly recommended to configure scheduled backups of vCloud Networking and Security Manager to an external FTP or SFTP server.
Infrastructure availability is impacted, but service availability is not. vCloud Networking and Security Edge devices continue to run without the management control, but no additional edge appliances can be added and no modifications can occur until the service comes back online.
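The VM Monitoring check described above, a missed heartbeat followed by an I/O activity check, can be expressed as a small decision function. This is a sketch of the logic as stated in the text, not VMware's implementation; the function and parameter names are hypothetical.

```python
IO_CHECK_WINDOW_SECONDS = 120  # I/O look-back window stated in the text


def should_restart_vm(
    heartbeat_received: bool,
    seconds_since_last_io: float,
    io_window_seconds: float = IO_CHECK_WINDOW_SECONDS,
) -> bool:
    """Return True when the VM should be restarted.

    A restart is warranted only if the VMware Tools heartbeat has stopped
    AND no storage or network I/O was seen within the look-back window.
    """
    if heartbeat_received:
        return False  # the guest is alive, nothing to do
    # Heartbeat lost: recent I/O suggests the guest is still doing work.
    return seconds_since_last_io > io_window_seconds


if __name__ == "__main__":
    print(should_restart_vm(False, 300))  # heartbeat lost, idle for 5 minutes
    print(should_restart_vm(False, 15))   # heartbeat lost, but recent I/O
```

The second check exists to avoid restarting a guest whose VMware Tools service has stopped while the operating system itself is still working.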
vCenter Chargeback
vCenter Chargeback virtual machines can be deployed in a cluster configuration. Multiple Chargeback data collectors can be deployed to avoid a single point of failure.
There is no impact on Infrastructure availability or customer virtual machines. However, it is important to keep vCenter Chargeback available to preserve all resource metering data.
Clustering the vCenter Chargeback servers protects the ability to collect chargeback transactions so that providers can accurately produce customer billing information and usage reports.
vCloud Director
The vCloud Director cell virtual machines are deployed as a load-balanced, highly available clustered pair in an N+1 redundancy setup, with the option to scale out when needed.
* Session state of users connected through the portal to the failed instance is lost. Users can reconnect immediately.
* No impact to customer virtual machines.
vCloud Networking and Security Edge (Edge)
Edge can be deployed through the API and vCloud Director web console. To provide network reliability, VM Monitoring is enabled. In case of an Edge guest OS failure, VM Monitoring restarts the Edge device. Edge appliances use a custom version of VMware Tools and are not monitored by vSphere HA guest OS monitoring.
Edge Gateway 5.1 provides the following HA capabilities:
* Network HA – The customer can choose to deploy two appliances in an active-passive configuration. A stateful failover occurs if the active appliance fails. A replacement appliance is then deployed and becomes the new passive appliance.
* VMware HA – If a vSphere host fails and takes an appliance down with it, the appliance is restarted on another vSphere host.
* Application HA – The internals of the appliance are monitored for process lock-ups and similar conditions, and a VMware HA failover is triggered if problems are detected.
* Partial temporary loss of service. Edge is a possible connection into the organization.
* No impact to customer virtual machines or Virtual Machine Remote Console (VMRC) access.
* All external network routed connectivity is lost if the corresponding Edge appliance is lost.
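The Network HA sequence described for Edge Gateway (the passive appliance takes over, then a replacement passive appliance is deployed) can be modeled in a few lines. This is a conceptual sketch only; the class, the appliance names, and the `deploy_appliance` callback are hypothetical and do not correspond to any Edge or vCloud API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EdgeHaPair:
    """Tracks which appliance is active and which is passive."""
    active: str
    passive: str

    def fail_active(self, deploy_appliance: Callable[[], str]) -> str:
        """Handle loss of the active appliance and return its name.

        The passive appliance is promoted, and a freshly deployed
        appliance becomes the new passive member of the pair.
        """
        failed = self.active
        self.active = self.passive          # failover to the passive node
        self.passive = deploy_appliance()   # restore redundancy
        return failed


if __name__ == "__main__":
    pair = EdgeHaPair(active="edge-0", passive="edge-1")
    lost = pair.fail_active(lambda: "edge-2")
    print(f"lost {lost}; active={pair.active}, passive={pair.passive}")
```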
vCenter Orchestrator
Plan for high availability of all systems involved in the orchestration workflow. Design the workflows to remediate the non-availability of orchestrated systems (for example, by alerting and retrying periodically).
High availability for vCO can be provided by vSphere HA and vSphere FT in addition to application-based clustering.
As long as a copy of the database is available, a vCenter Orchestrator Application Server with the appropriate configuration can resume workflow operations. An active-passive node configuration best suits vCenter Orchestrator.
Temporary loss of access for end users interacting directly with vCenter Orchestrator.
Disruption to workflows executed by vCenter Orchestrator. This includes workflows started by vCenter Orchestrator and workflows started by external applications.
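The guidance to remediate the non-availability of orchestrated systems by alerting and retrying periodically can be sketched as a generic retry wrapper. This is illustrative Python, not vCenter Orchestrator workflow code; the function name, the default retry count and delay, and the `alert` callback are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 30.0,
    alert: Callable[[str], None] = print,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, alerting on each failure and retrying after a delay.

    Raises RuntimeError once all attempts are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            alert(f"attempt {attempt}/{attempts} failed: {error}")
            if attempt < attempts:
                sleep(delay_seconds)  # wait before the next periodic retry
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error
```

A workflow step that calls an external system could be wrapped as `call_with_retry(step, attempts=5, delay_seconds=60)`, with `alert` pointed at whatever notification channel the operations team uses.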