5.8.2 Degraded Failures
Degraded failure events are recognized in vSAN whenever I/O failures are detected with any of the following hardware devices:
Mechanical disks
Flash-based devices
Storage controllers
Unlike absent failure events, degraded failure events immediately trigger resynchronization of the affected data across the other hosts in the cluster. This behavior is not configurable: when one of the devices listed above fails, the failure is unlikely to be temporary, so there is no benefit in waiting for the device to come back online.
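The timing difference between the two failure types can be sketched as a small decision function. This is an illustrative model only, not a vSAN API: the `FailureKind` enum, the `resync_delay_minutes` function, and the `ABSENT_DELAY_MINUTES` constant are hypothetical names, and the 60-minute value is taken from the "60 minutes later" entries in Table 10.

```python
from enum import Enum


class FailureKind(Enum):
    DEGRADED = "degraded"  # I/O errors on a disk, flash device, or storage controller
    ABSENT = "absent"      # component unreachable, for example a host or network outage


# Assumed wait for absent failures, based on the "60 minutes later" rows in Table 10.
ABSENT_DELAY_MINUTES = 60


def resync_delay_minutes(kind: FailureKind) -> int:
    """Return how long this model waits before starting resynchronization."""
    if kind is FailureKind.DEGRADED:
        # The device is not expected to return, so rebuilding starts at once.
        return 0
    # The component may come back on its own, so the rebuild is deferred.
    return ABSENT_DELAY_MINUTES
```

For example, `resync_delay_minutes(FailureKind.DEGRADED)` returns 0, while `resync_delay_minutes(FailureKind.ABSENT)` returns 60.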
When this type of event is detected, the resynchronization operation is performed for objects that are no longer in compliance with their policies due to the failure. It is an expensive operation because it creates new replicas of the data that existed on the failed device or component, using the remaining replica for the component as the source. This data exists elsewhere in the cluster, on other hosts or on other mechanical disks of the same host, based on the configured failures-to-tolerate (FTT). The replicas of individual objects are not all created in the same place. Rather, the replicas are distributed around the rest of the cluster wherever there is spare capacity. Thus, the entire cluster is used as a hot spare.
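The "entire cluster as a hot spare" idea can be illustrated with a simple placement sketch. It is a simplified model under stated assumptions, not how vSAN actually places components: the `place_rebuilds` function, its host-level granularity, and the "largest component first, most spare capacity first" heuristic are all hypothetical choices made for this example. The one rule it takes from the text is that a rebuilt replica is not placed on a host that already holds a surviving copy of the same object.

```python
def place_rebuilds(lost, holders, spare_gb):
    """Choose a target host for each component lost in a failure.

    lost:     {component: size in GB} for components on the failed device
    holders:  {component: set of hosts still holding a copy of the same object}
    spare_gb: {host: free capacity in GB} on the surviving hosts
    Returns {component: target host}, with None where no host qualifies.
    """
    spare = dict(spare_gb)  # work on a copy so the caller's data is unchanged
    plan = {}
    # Place the largest components first so they are not squeezed out later.
    for name, size in sorted(lost.items(), key=lambda item: -item[1]):
        candidates = [
            host for host, free in spare.items()
            if host not in holders.get(name, set()) and free >= size
        ]
        if not candidates:
            plan[name] = None  # not enough eligible spare capacity anywhere
            continue
        target = max(candidates, key=lambda host: spare[host])
        spare[target] -= size
        plan[name] = target
    return plan


plan = place_rebuilds(
    lost={"vm1-disk": 100, "vm2-disk": 50},
    holders={"vm1-disk": {"esx2"}, "vm2-disk": {"esx3"}},
    spare_gb={"esx2": 500, "esx3": 300, "esx4": 120},
)
print(plan)
```

In this run the two lost components land on different hosts (`esx3` and `esx2`), which mirrors the point that rebuilt replicas are spread across the cluster rather than sent to a single dedicated spare.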
Regardless of when it is activated, the resynchronization operation contends with virtual machine I/O for the resources available in the cluster. If the cluster is not sized and designed correctly, the operation can have a detrimental impact on the overall capabilities of the vSAN cluster. From a performance perspective, resynchronization reduces the IOPS available to virtual machines, because part of the cluster's I/O capacity is consumed by recovering the affected objects and components.
From a data availability perspective, whenever the resynchronization operation cannot complete because the cluster lacks sufficient collective spare capacity, data accessibility could be at risk. For instance, consider a three-node cluster configured with the default availability policy setting of FTT=1. When a host fails, the remaining two hosts are excluded from the resynchronization operation because they already host objects and components for the affected virtual machines. In this scenario, resynchronization waits until the failed node is brought back online, or a new node is added to the cluster, and only then resumes and restores policy compliance for data availability.
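The three-node example follows from a host-count rule that can be checked with a few lines of code. The sketch below assumes RAID-1 mirroring, where tolerating FTT failures takes 2 x FTT + 1 hosts (FTT + 1 replicas plus witness components); other protection schemes, such as erasure coding, have different host requirements. The function names are hypothetical and exist only for this illustration.

```python
def min_hosts(ftt: int) -> int:
    """Minimum hosts needed to place an object with mirroring at the given FTT."""
    return 2 * ftt + 1


def can_resync_after_host_failures(total_hosts: int, ftt: int, failed_hosts: int = 1) -> bool:
    """Return True if the surviving hosts can still satisfy the policy.

    Resynchronization needs enough surviving hosts to hold every replica
    and witness on a separate host, which is the same count as min_hosts().
    """
    return total_hosts - failed_hosts >= min_hosts(ftt)


print(can_resync_after_host_failures(total_hosts=3, ftt=1))  # three-node cluster
print(can_resync_after_host_failures(total_hosts=4, ftt=1))  # one extra host
```

Under these assumptions, the three-node cluster cannot rebuild after a host failure, while a four-node cluster can, provided it also has enough spare capacity.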
Table 10. vSAN Resilience
| Failed component | VM state | User experience | Restore process | vSAN recovery process |
|---|---|---|---|---|
| HDD | Keep running | No impact | Replace HDD; add to vSAN cluster | Copy the failed HDD's data to another HDD (immediately) |
| SSD | Keep running | No impact | Replace SSD; add to vSAN cluster | Copy the data on all HDDs in the disk group to other HDDs (immediately) |
| RAID adapter | Keep running | No impact | Replace RAID adapter; reconfigure RAID; add to vSAN cluster | Copy the data on all HDDs under the failed RAID adapter to other HDDs (immediately) |
| ESXi host | VMs on the failed host reboot | A few minutes of downtime (VMs reboot automatically) | Replace host; perform initial setup; add to vSAN cluster | Copy the data on all HDDs in the failed host (60 minutes later) |
| Network | VMs reboot | A few minutes of downtime (VMs reboot automatically) | Restore network | Copy the data on all HDDs in isolated hosts to other HDDs (60 minutes later) |