5.2 Eliminating Single Points of Failure

In addition to protecting the physical infrastructure, any VMware Cloud Provider Program platform must have high availability at the core of every design decision. Eliminating single points of failure is therefore a key design requirement.

A single point of failure is a component whose failure alone could render an entire cloud platform unavailable, with dire consequences for the service provider. Single points of failure are typically mitigated with redundant hardware, which limits the impact of a component failure and helps avoid service interruptions. For this reason, it is paramount that VMware Cloud Providers design their core platform with the following considerations:

• Spinning hard disk drives have a relatively low mean time between failures (MTBF) and must be protected because they contain customer and provider data that is frequently accessed. To mitigate single or multiple hard disk failures, employ RAID or similar technologies.

• Servers, switches, and other critical cloud platform hardware must support multiple power supplies, fed from separate circuits. Fans and other moving parts that are susceptible to failure must be fully redundant within the hardware.

• Ethernet or converged network adapter (CNA) cards can be protected by aggregating links with IEEE 802.3ad (since moved to IEEE 802.1AX) or a similar protocol.

• A storage area network (SAN) must be designed with redundant components throughout. This includes, but might not be limited to, host bus adapter (HBA) cards, Fibre Channel (FC) switches, and storage controllers. Many storage vendors also support additional technologies within the array, such as Asymmetric Logical Unit Access (ALUA), mirrored cache memory, and multiple paths to access disks. Redundant paths across the fabric keep data accessible from the ESXi host if one path fails, through a process referred to as multipathing or multipath I/O (MPIO). This technology is native to vSphere and forms part of the hypervisor's Pluggable Storage Architecture (PSA). However, some storage vendors might recommend third-party products, such as EMC PowerPath, to further enhance this native functionality.

• Other, less common technologies for maintaining hardware availability might also provide good value, depending on the hardware vendors chosen. For instance, memory mirroring, or systems such as the NEC Fault Tolerant or Stratus Server System, where the motherboard architecture makes these components redundant, might also be considered as part of the design process.
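The considerations above can be checked informally during design. The following Python sketch is illustrative only and is not part of any VMware tooling. It assumes each storage path is written down as the list of components it traverses, and it flags any component that every path shares, because losing that component would cut all access. It also estimates the combined availability of redundant components under the simplifying assumption that they fail independently. All component names and availability figures are hypothetical.

```python
from typing import Iterable, Sequence, Set


def single_points_of_failure(paths: Iterable[Sequence[str]]) -> Set[str]:
    """Return the components that appear on every path.

    If a component is on all paths, its failure alone removes every
    route from the host to the storage, so it is a single point of failure.
    """
    path_sets = [set(path) for path in paths]
    if not path_sets:
        return set()
    return set.intersection(*path_sets)


def parallel_availability(component_availability: float, count: int) -> float:
    """Availability of `count` redundant components, any one of which suffices.

    Assumes failures are independent, which is optimistic when components
    share a power feed, firmware version, or chassis.
    """
    return 1.0 - (1.0 - component_availability) ** count


# Hypothetical paths from one ESXi host to a LUN: HBA, FC switch, controller.
redundant_paths = [
    ("hba0", "fc-switch-a", "controller-a"),
    ("hba1", "fc-switch-b", "controller-b"),
]

# Same host, but both HBAs are cabled to the same FC switch.
shared_switch_paths = [
    ("hba0", "fc-switch-a", "controller-a"),
    ("hba1", "fc-switch-a", "controller-b"),
]

print(single_points_of_failure(redundant_paths))      # set()
print(single_points_of_failure(shared_switch_paths))  # {'fc-switch-a'}
```

In the second layout, two HBAs and two controllers give the appearance of redundancy, yet the shared switch remains a single point of failure. Under the independence assumption, `parallel_availability(0.99, 2)` evaluates to approximately 0.9999, which shows why duplicating a component is usually worth far more than hardening a single one.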

The following figure illustrates a redundant storage architecture from which single points of failure have been removed, and outlines the logical storage design employed by the sample use cases.