3.2.4.1 Data Integrity (Software Checksum)

Software checksum enables service providers to detect corruption caused by hardware or software components, such as memory and drives, during read or write operations. For drives, there are two basic kinds of corruption. The first is latent sector errors, which are typically the result of a physical disk drive malfunction. The second is silent data corruption, which happens without warning. Undetected errors can lead to lost or inaccurate data and significant downtime. Without an end-to-end integrity checking mechanism, there is no effective means of detecting these errors.

During the read/write operations, vSAN checks for the validity of the data based on the checksum. If the data is not valid, vSAN takes the necessary steps to either correct the data or report it to the user to take action. These actions could be as follows:

• Retrieve a new copy of the data from another replica of the information, stored within the RAID-1 or RAID-5/6 constructs. This is referred to as recoverable data.
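The recoverable case can be illustrated with a minimal sketch. It assumes each piece of data has a checksum stored alongside it and that replicas are tried in order until one verifies. The function names are hypothetical, and Python's zlib.crc32 is only a stand-in for whichever CRC variant the product actually uses.

```python
import zlib
from typing import Optional, Sequence


def crc32(block: bytes) -> int:
    """CRC-32 via zlib; a stand-in for the product's checksum (assumption)."""
    return zlib.crc32(block) & 0xFFFFFFFF


def read_verified(replicas: Sequence[bytes], stored_checksum: int) -> Optional[bytes]:
    """Return the first replica whose checksum matches, or None if none does."""
    for data in replicas:
        if crc32(data) == stored_checksum:
            return data  # a valid copy was found: recoverable
    return None  # no replica verified: the caller must report an error


good = b"payload" * 100
bad = b"pXyload" + good[7:]  # same length, one corrupted byte
checksum = crc32(good)       # recorded when the data was written

recovered = read_verified([bad, good], checksum)
```

Here the first replica fails verification and the second one is returned instead. If every replica fails, the function returns None, which corresponds to the case that must be reported to the user.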

In the case of data errors, issues are reported in the vSphere Web Client user interface and in log files. These include impacted blocks and their associated VMs, allowing the administrator to see the following:

The data integrity feature uses a CRC32 algorithm, which also supports CPU offload to reduce overhead. In addition, there are two levels of scrubbing employed:

1. Component-level scrubbing – every block of each component is checked. If there is a checksum mismatch, the scrubber tries to repair the block by reading other components.

2. Object-level scrubbing – for every block of the object, data of each mirror (or the parity blocks in RAID-5/6) is read and checked. For inconsistent data, all data in the affected stripe is marked as bad.
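As a rough illustration of a scrubbing pass, the sketch below walks every 4KB block of a two-way mirror, compares each copy against a stored checksum, and repairs a bad copy from the good one. It is a simplified model under stated assumptions, not the actual scrubber: the function name, the in-memory bytearray mirrors, and the use of zlib.crc32 are all hypothetical.

```python
import zlib
from typing import List, Tuple

BLOCK_SIZE = 4096  # 4KB blocks


def scrub_mirrors(
    mirror_a: bytearray, mirror_b: bytearray, checksums: List[int]
) -> Tuple[List[int], List[int]]:
    """Check every block of both mirrors; repair from the good copy if possible.

    Returns (repaired_block_indexes, unrecoverable_block_indexes).
    """
    repaired: List[int] = []
    unrecoverable: List[int] = []
    for i, expected in enumerate(checksums):
        lo, hi = i * BLOCK_SIZE, (i + 1) * BLOCK_SIZE
        ok_a = zlib.crc32(mirror_a[lo:hi]) == expected
        ok_b = zlib.crc32(mirror_b[lo:hi]) == expected
        if ok_a and ok_b:
            continue
        if ok_a:
            mirror_b[lo:hi] = mirror_a[lo:hi]  # overwrite the bad location
            repaired.append(i)
        elif ok_b:
            mirror_a[lo:hi] = mirror_b[lo:hi]
            repaired.append(i)
        else:
            unrecoverable.append(i)  # neither copy verifies
    return repaired, unrecoverable


original = bytes(range(256)) * 64  # 16 KB, i.e. four 4KB blocks
checksums = [
    zlib.crc32(original[off:off + BLOCK_SIZE])
    for off in range(0, len(original), BLOCK_SIZE)
]
a, b = bytearray(original), bytearray(original)
a[5000] ^= 0xFF   # corrupt block 1 on mirror A only
b[9000] ^= 0xFF   # corrupt block 2 on mirror B only
a[13000] ^= 0x01  # corrupt block 3 on both mirrors
b[13000] ^= 0x02

repaired, unrecoverable = scrub_mirrors(a, b, checksums)
```

In this run, blocks 1 and 2 are repaired from the healthy mirror, while block 3 is reported as unrecoverable because both copies fail verification.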

The repair can happen either during normal I/O operations, performed by the DOM owner of that object, or during scrubbing, performed by the scrubber. The repair mechanism differs between mirrored and RAID-5/6 objects. If checksum verification fails, the scrubber or DOM owner reads the other copy of the data (or the other data in the same stripe in the case of RAID-5/6), and rebuilds the correct data by writing it out to the bad location.
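The difference between the two repair mechanisms can be sketched as follows. For a mirror, repair amounts to copying the good replica over the bad one. For a single-parity stripe, as in generic RAID-5, a bad block is rebuilt by XOR-ing the remaining data blocks with the parity block. The example below shows only the generic XOR technique; the helper name and the tiny 4-byte blocks are illustrative assumptions, and the second parity used by RAID-6 is not covered.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# A stripe with three data blocks and one parity block (illustrative sizes).
d0 = b"\x01\x02\x03\x04"
d1 = b"\x10\x20\x30\x40"
d2 = b"\xaa\xbb\xcc\xdd"
parity = xor_blocks([d0, d1, d2])

# Suppose d1 fails checksum verification: rebuild it from the rest of the stripe.
rebuilt_d1 = xor_blocks([d0, d2, parity])
```

Because XOR is its own inverse, combining the surviving data blocks with the parity block yields exactly the missing block, which can then be written back to the bad location.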

This end-to-end checksum aims to prevent data integrity issues caused by silent disk errors: the checksum is calculated and stored on the write path, and silent corruption is detected on the read path by verifying the data against the stored checksum.

When checksum verification fails, vSAN automatically reads a different copy of the data (or other data in the same stripe in the case of RAID-5/6), then rebuilds the correct data and writes it out to the bad location, working in 4KB blocks.
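The per-block granularity can be illustrated with a short sketch: a checksum is computed for each 4KB block when data is written, and the same computation on read pinpoints which blocks need repair. The function names are hypothetical and zlib.crc32 stands in for the real checksum routine; where and how the checksums are actually persisted is not modeled here.

```python
import zlib
from typing import List

BLOCK_SIZE = 4096  # 4KB


def checksums_on_write(data: bytes) -> List[int]:
    """Compute one CRC-32 per 4KB block, as would be stored on the write path."""
    return [
        zlib.crc32(data[off:off + BLOCK_SIZE])
        for off in range(0, len(data), BLOCK_SIZE)
    ]


def bad_blocks_on_read(data: bytes, stored: List[int]) -> List[int]:
    """Return indexes of blocks whose checksum no longer matches the stored one."""
    actual = checksums_on_write(data)
    return [i for i, (a, e) in enumerate(zip(actual, stored)) if a != e]


written = bytes(3 * BLOCK_SIZE)        # three zero-filled blocks
stored = checksums_on_write(written)   # kept alongside the data

corrupted = bytearray(written)
corrupted[BLOCK_SIZE + 10] ^= 0x01     # flip one bit inside block 1

bad = bad_blocks_on_read(bytes(corrupted), stored)  # -> [1]
```

Only the block containing the flipped bit is flagged, so only that 4KB block has to be rebuilt rather than the whole object.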

Data Integrity (Software Checksum) is a cluster-wide setting that is switched on by default. However, it can be disabled on a per-object basis using storage policies.