3.2.8 Data Locality (Locality of Reference)

In computer science, data locality, also known as locality of reference, is the tendency of a workload to access a set of data entities or storage locations within some period of time in a predictable pattern. Two types are commonly distinguished:

• Temporal locality – The probability that if some data (or a storage location) is accessed at one point in time, it will be accessed again soon afterwards.

• Spatial locality – The probability of accessing some data (or a storage location) soon after some nearby data (or a storage location) on the same medium has been accessed. Sequential locality is a special case of spatial locality, where data (or storage locations) are accessed linearly and according to their physical locations.
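
The two definitions above can be made concrete with a small measurement over an access trace. The following is a minimal sketch, not part of vSAN: the function name `locality_scores`, the `window` and `distance` parameters, and the sample traces are all illustrative assumptions. It counts an access as temporal reuse if the same block address appeared among the last few accesses, and as spatial reuse if a nearby (but different) address did.

```python
from collections import deque


def locality_scores(trace, window=4, distance=2):
    """Return (temporal, spatial) fractions for a list of integer block addresses.

    window   -- how many recent accesses count as "soon" (illustrative choice)
    distance -- how far apart two addresses may be and still count as "nearby"
    """
    recent = deque(maxlen=window)  # sliding window of the most recent addresses
    temporal = 0
    spatial = 0
    for addr in trace:
        if addr in recent:
            temporal += 1          # same address seen again shortly afterwards
        elif any(abs(addr - r) <= distance for r in recent):
            spatial += 1           # a neighbour of a recently accessed address
        recent.append(addr)
    n = len(trace)
    return temporal / n, spatial / n


sequential = list(range(16))                  # linear scan: sequential locality
hot_loop = [7, 8, 7, 8, 7, 8, 7, 8]           # two blocks reused repeatedly
scattered = [0, 100, 200, 300, 400, 500, 600, 700]

print(locality_scores(sequential))  # (0.0, 0.9375)
print(locality_scores(hot_loop))    # (0.75, 0.125)
print(locality_scores(scattered))   # (0.0, 0.0)
```

The sequential scan scores high on spatial locality only, the two-block loop scores high on temporal locality, and the scattered trace shows neither, which is the case where a cache helps least.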

Data locality is particularly relevant when designing storage caches. Flash devices, for example, offer substantial performance improvements but at a higher cost, so using these resources efficiently becomes an important design factor.

Like any storage system, vSAN makes use of data locality. vSAN uses a combination of algorithms that take advantage of both temporal and spatial locality of reference to populate the flash-based read caches across a cluster and provide high performance from available flash resources.

• Every time application data is read by a virtual machine, vSAN saves a copy of the data in the Read Cache portion of the flash device associated with the disk group where the copy of the data resides. Temporal locality implies a high probability that the same data will be accessed again before long. In addition, vSAN predictively caches disk blocks in the vicinity of the accessed data (1 MB at a time) to take advantage of spatial locality as well.

• vSAN uses an adaptive replacement algorithm to evict data from the Read Cache when it is deemed unlikely that the data will be accessed again soon, and uses the space for new data more likely to be accessed repeatedly.

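• vSAN makes replicas of storage objects across multiple servers for protection purposes. Reads are distributed across the replicas of an object for better load balancing. However, a certain range of logical addresses of an object is always read from the same replica. This approach has two important benefits: it increases the chance that the data being read is already in the Read Cache, and it ensures that a data block is never cached in more than one flash device.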
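
The three behaviors in the list above (chunk-granular cache fill, eviction of cold data, and address-range affinity to a replica) can be illustrated with a toy model. This is a rough sketch under stated assumptions, not vSAN's implementation: the class `ReadCache`, the functions `pick_replica` and `run`, the modulo routing rule, and the use of the 1 MB chunk as the routing range are all hypothetical. Plain LRU eviction is used as a deliberately simpler stand-in for the adaptive replacement algorithm, whose internals are not described here.

```python
from collections import OrderedDict

CHUNK = 1 << 20  # 1 MB, the caching granularity mentioned in the text


class ReadCache:
    """Toy read cache for one disk group: fills by whole chunk, evicts by LRU.

    LRU is a simplified stand-in for the adaptive replacement algorithm.
    """

    def __init__(self, capacity_chunks):
        self.capacity = capacity_chunks
        self.chunks = OrderedDict()  # chunk index -> True, least recent first
        self.hits = 0
        self.misses = 0

    def read(self, offset):
        """Record a read at a byte offset; return True on a cache hit."""
        chunk = offset // CHUNK
        if chunk in self.chunks:
            self.chunks.move_to_end(chunk)   # temporal locality: keep hot data
            self.hits += 1
            return True
        self.misses += 1
        self.chunks[chunk] = True            # spatial locality: cache the whole chunk
        if len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)  # evict the least recently used chunk
        return False


def pick_replica(offset, num_replicas, range_size=CHUNK):
    """Hypothetical routing rule: an address range always maps to the same replica."""
    return (offset // range_size) % num_replicas


def run(offsets, num_replicas=2, capacity_chunks=4):
    """Replay byte offsets against one ReadCache per replica and return the caches."""
    caches = [ReadCache(capacity_chunks) for _ in range(num_replicas)]
    for off in offsets:
        caches[pick_replica(off, num_replicas)].read(off)
    return caches


# Two passes of 4 KB reads over the first 4 MB of an object.
offsets = [i * 4096 for i in range(1024)] * 2
caches = run(offsets)
for i, cache in enumerate(caches):
    print(i, sorted(cache.chunks), cache.hits, cache.misses)
```

In this model each 1 MB range is cached on exactly one replica, so the caches hold disjoint chunks and the second pass over the data is served entirely from cache. A different routing rule or range size would change the numbers but not that property.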

To be clear, a fundamental design decision in vSAN is not to implement a persistent client-side local read cache. This decision was based on the following observations regarding local read caching on flash:

• Local read caching results in very poor balancing of flash utilization (both capacity and performance) across the cluster.

• Local read caching requires transferring hundreds of gigabytes of data and cache re-warming when virtual machines are migrated using vSphere vMotion between hosts to keep compute resources balanced.

• Local read caching offers negligible practical benefits in terms of performance metrics, such as latency.