8.8.1 Disaster Recovery
Disaster Recovery (DR) focuses on the recovery of systems and infrastructure after an incident that interrupts normal operations. A disaster can be defined as partial or complete unavailability of resources and services, including software, the virtualization layer, the vCloud layer, and the workloads running in the resource groups. Different approaches and technologies are supported, but there are at least two areas that require disaster recovery: the management cluster and consumer resources.
Management Cluster Disaster Recovery
Good practices at the infrastructure level lead to easier disaster recovery of the management cluster. This includes technologies such as HA and DRS for reactive and proactive protection at the primary site. VMware vCenter Heartbeat™ can also be used to protect vCenter Server at the primary site. For multi-site protection of virtual machines, VMware vCenter Site Recovery Manager™ (SRM) is a VMware solution that works well, because the management virtual machines are not part of a vCloud instance of any type (they run the vCloud instances). For a detailed description of using SRM to provide a disaster recovery solution for the management cluster, see http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.pdf.
Disaster Recovery operational considerations for the vCloud management cluster are the same as for a virtualized environment. A vCloud infrastructure risk assessment must be undertaken to determine the threat risk exposure and the corresponding mitigation activities. The actions necessary for executing the mitigation activities, including those for the management cluster, should be captured in a vCloud infrastructure continuity plan. After the vCloud infrastructure disaster recovery planning and technical implementation are complete, awareness building, disaster recovery training, disaster recovery testing and review/adjustment should be considered part of ongoing vCloud operations.
VMware vCenter Site Recovery Manager 5 can perform a disaster recovery workflow test of the vCloud management cluster. This can be useful to verify that the steps taken to move the vCloud management stack from the protected site to the recovery site complete without failure. However, the SRM test feature validates only the workflow. It does not functionally test connectivity, because the fencing feature is used to protect the production vCloud management cluster.
vCloud Consumer Resources Disaster Recovery
The vCloud consumer resources (workloads or vApps) can be failed over to an alternate site, but VMware vCenter Site Recovery Manager (SRM) cannot be used. Although SRM is vCenter Server-aware, it is not vCloud Director-aware. Without collaboration between vCloud Director and SRM, the underlying mechanisms that synchronize virtual machines cannot be used to keep vCloud consumer resources in sync.
One solution for vCloud consumer workload disaster recovery is storage replication, which replicates the LUNs that contain vCloud consumer workloads from the protected site to the recovery site. Because the LUNs and datastores containing vCloud consumer workloads cannot currently be managed by SRM, manual steps might be required during failover. Depending on the type of storage used, these steps could potentially be automated by leveraging storage system API calls.
Operationally, the recovery point objectives (RPOs) to be supported must be determined for consumer workloads and included in any consumer Service Level Agreements (SLAs). Along with the distance between the protected and recovery sites, the RPO helps determine the type of storage replication to use for consumer workloads: synchronous or asynchronous.
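The decision described above can be sketched as a small helper. This is an illustrative sketch only, not a VMware or storage vendor tool. The function name, the 100 km cutoff, and the rule that only a zero RPO justifies synchronous replication are all assumptions made for the example. Actual limits depend on the storage system and on the latency of the inter-site link.

```python
# Hypothetical cutoff: beyond this distance, round-trip latency is assumed
# to be too high for synchronous writes. Replace with the vendor's figure.
MAX_SYNC_DISTANCE_KM = 100.0


def choose_replication_mode(rpo_seconds: float, distance_km: float,
                            max_sync_distance_km: float = MAX_SYNC_DISTANCE_KM) -> str:
    """Return 'synchronous' or 'asynchronous' for a consumer workload tier.

    rpo_seconds: maximum tolerable data loss promised in the consumer SLA.
    distance_km: distance between the protected and recovery sites.
    """
    if rpo_seconds < 0 or distance_km < 0:
        raise ValueError("rpo_seconds and distance_km must be non-negative")
    if distance_km > max_sync_distance_km:
        # Synchronous replication is assumed impractical at this distance,
        # so a zero RPO cannot be offered in the SLA.
        if rpo_seconds == 0:
            raise ValueError("a zero RPO is not achievable at this distance")
        return "asynchronous"
    # Within range: use synchronous only when the SLA demands zero data loss.
    return "synchronous" if rpo_seconds == 0 else "asynchronous"


if __name__ == "__main__":
    # Example tiers: (RPO in seconds, site distance in km).
    for rpo, km in [(0, 40), (900, 40), (900, 800)]:
        print(rpo, km, choose_replication_mode(rpo, km))
```

Raising an error for a zero RPO at long distance, rather than silently falling back to asynchronous replication, keeps an SLA commitment that cannot be met visible during planning.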
For more information about vCloud management cluster disaster recovery, see http://www.vmware.com/files/pdf/techpaper/vcloud-director-infrastructure-resiliency.pdf.