5.8.2 Degraded Failures

Unlike absent failure events, degraded failure events immediately trigger resynchronization of the data across the remaining hosts in the cluster. This behavior is not configurable: when one of the previously listed devices fails, the failure is unlikely to be temporary, so there is no benefit in waiting for the device to come back online.

When this type of event is detected, the resynchronization operation is performed for objects that are no longer in compliance with their policies due to the failure. It is an expensive operation because it creates new replicas of the data that existed on the failed device or component, using the remaining replica for the component as the source. This data exists elsewhere in the cluster, on other hosts or on other mechanical disks of the same host, based on the configured failures-to-tolerate (FTT). The replicas of individual objects are not all created in the same place. Rather, the replicas are distributed around the rest of the cluster wherever there is spare capacity. Thus, the entire cluster is used as a hot spare.
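The "entire cluster as a hot spare" idea can be illustrated with a small sketch. It is not vSAN's actual placement logic, which is considerably more involved. It is a toy greedy placement that assumes each component to rebuild has a size and a set of hosts that already hold a surviving replica of the same object, and that a new replica must land on a different host with enough free capacity. The function name, host names, and sizes are hypothetical.

```python
def place_rebuilds(free_gb, components):
    """Toy greedy placement of replicas that must be rebuilt.

    free_gb:    dict mapping surviving host name -> free capacity in GB
    components: list of (name, size_gb, occupied_hosts) tuples, where
                occupied_hosts is the set of hosts that already hold a
                surviving replica of the same object (hypothetical model)
    Returns a dict mapping component name -> chosen host, or None when
    no surviving host qualifies.
    """
    free = dict(free_gb)  # work on a copy so the caller's data is untouched
    placement = {}
    for name, size_gb, occupied in components:
        # A host qualifies if it holds no replica of this object and has room.
        candidates = [h for h in free
                      if h not in occupied and free[h] >= size_gb]
        if not candidates:
            placement[name] = None  # rebuild has to wait for more capacity
            continue
        # Pick the host with the most free capacity to spread the load.
        target = max(candidates, key=lambda h: free[h])
        free[target] -= size_gb
        placement[name] = target
    return placement


if __name__ == "__main__":
    surviving = {"esx2": 100, "esx3": 80, "esx4": 60}
    to_rebuild = [("vm1-disk", 50, {"esx2"}), ("vm2-disk", 50, {"esx3"})]
    print(place_rebuilds(surviving, to_rebuild))
```

In this example the two rebuilt replicas land on different hosts, which mirrors the point that replicas are spread across whatever spare capacity the cluster has, with no single host acting as a dedicated spare.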

Regardless of when it is activated, the resynchronization operation competes with virtual machine I/O for the resources available in the cluster. If the cluster is not sized and designed correctly, the operation can degrade the overall capabilities of the vSAN cluster. From a performance perspective, resynchronization reduces the IOPS available to virtual machines because part of the cluster's I/O capacity is spent recovering the affected objects and components.

From a data availability perspective, data accessibility could be at risk whenever the resynchronization operation cannot complete because the cluster lacks sufficient collective spare capacity. For instance, consider a three-node cluster configured with the default availability policy setting of FTT=1. When a host fails, the remaining two hosts cannot serve as resynchronization targets because they already host objects and components for the affected virtual machines. In this scenario, resynchronization waits until the failed node is brought back online or a new node is added to the cluster. Only then does it resume and restore compliance with the data availability policy.
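The three-node scenario can be expressed as a simple host-count check. The sketch below assumes mirrored (RAID-1) objects, for which tolerating FTT failures takes 2 × FTT + 1 hosts (the replicas plus witness components). That rule is an assumption added here for illustration, and the function names are hypothetical.

```python
def min_hosts_for_mirroring(ftt):
    """Hosts needed to place a mirrored object (assumed rule: 2 * FTT + 1)."""
    return 2 * ftt + 1


def can_rebuild_after_host_failures(total_hosts, failed_hosts, ftt=1):
    """Return True if the surviving hosts can restore full policy compliance.

    Only host count is considered here; free capacity is ignored.
    """
    surviving = total_hosts - failed_hosts
    return surviving >= min_hosts_for_mirroring(ftt)


if __name__ == "__main__":
    # Three-node cluster, FTT=1, one host fails: two hosts remain, three needed.
    print(can_rebuild_after_host_failures(3, 1, ftt=1))
    # A fourth node leaves three survivors, enough to rebuild.
    print(can_rebuild_after_host_failures(4, 1, ftt=1))
```

Under these assumptions, a three-node cluster has no host left to receive the rebuilt components, while a four-node cluster does. This is why an extra host beyond the policy minimum is commonly recommended when automatic rebuilds matter.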

Component	VM behavior	User experience	Restore process	vSAN recovery process
HDD	Keep running	No impact	Replace HDD; add to vSAN cluster	Copy data from the failed HDD to another HDD (immediately)
SSD	Keep running	No impact	Replace SSD; add to vSAN cluster	Copy data from all HDDs in the disk group to other HDDs (immediately)
RAID Adapter	Keep running	No impact	Replace RAID adapter; reconfigure RAID; add to vSAN cluster	Copy data from all HDDs under the failed RAID adapter to other HDDs (immediately)
ESXi Host	VMs on the failed host restart	A few minutes of downtime (VMs restart automatically)	Replace host; perform initial setup; add to vSAN cluster	Copy data from all HDDs in the failed host to other HDDs (after 60 minutes)
Network	VMs restart	A few minutes of downtime (VMs restart automatically)	Restore network	Copy data from all HDDs in the isolated hosts to other HDDs (after 60 minutes)
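
The timing column of the table can be captured in a small lookup, which may be handy in runbooks or monitoring scripts. The values below are copied from the table. The dictionary and function names are hypothetical and do not correspond to any vSAN API.

```python
# Minutes vSAN waits before starting the copy, per the table above.
# 0 means the copy starts immediately (degraded failure).
RESYNC_DELAY_MINUTES = {
    "HDD": 0,
    "SSD": 0,
    "RAID Adapter": 0,
    "ESXi Host": 60,
    "Network": 60,
}

# Whether running VMs are restarted as a result of the failure, per the table.
VMS_RESTART = {
    "HDD": False,
    "SSD": False,
    "RAID Adapter": False,
    "ESXi Host": True,
    "Network": True,
}


def describe_failure(component):
    """Summarize the expected behavior for a failed component type."""
    delay = RESYNC_DELAY_MINUTES[component]
    timing = "immediately" if delay == 0 else f"after {delay} minutes"
    vm_state = "restart automatically" if VMS_RESTART[component] else "keep running"
    return f"{component}: VMs {vm_state}; data copy starts {timing}"


if __name__ == "__main__":
    for name in RESYNC_DELAY_MINUTES:
        print(describe_failure(name))
```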