6.3.2.5 Design Option 3b – Dedicated Edge Cluster with ECMP Edges
In the previous design option, there was one provider edge in the edge cluster for each transit vCloud Director external network to which Org VDC edge gateways are connected.
To provide access to a shared service (for example, the Internet) where multiple Org VDC edge gateways of different tenants are connected to the same external network, all external network traffic must go through a single provider edge, which can become a bottleneck.
VMware NSX Edge gateways can be deployed in an Equal Cost Multi-Path (ECMP) configuration in which the bandwidth of up to eight edges (8 x 10 Gbps = 80 Gbps of throughput) can be aggregated. High availability of ECMP edges is achieved with a dynamic routing protocol (BGP or OSPF) configured with aggressive timers for short failover times (3 seconds), which quickly removes failed paths from the routing tables.
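For illustration, the following is a minimal sketch of how ECMP and aggressive BGP timers might be configured on a provider edge through the NSX for vSphere REST API from Python. The NSX Manager address, edge ID, credentials, AS numbers, and IP addresses are hypothetical placeholders, and the endpoint paths and XML element names are assumptions that should be verified against the NSX API guide for the version in use.

    # Sketch only: enable ECMP and set aggressive BGP timers on a provider edge
    # via the NSX-v REST API. Paths and XML element names are assumptions.
    import requests

    NSX_MANAGER = "https://nsx-manager.example.com"   # hypothetical NSX Manager
    EDGE_ID = "edge-1"                                 # hypothetical provider edge ID
    AUTH = ("admin", "password")                       # use proper credential handling
    HEADERS = {"Content-Type": "application/xml"}

    # 1. Enable ECMP in the edge routing global configuration.
    ecmp_xml = """
    <routingGlobalConfig>
        <ecmp>true</ecmp>
        <routerId>192.0.2.1</routerId>
    </routingGlobalConfig>
    """
    requests.put(f"{NSX_MANAGER}/api/4.0/edges/{EDGE_ID}/routing/config/global",
                 data=ecmp_xml, headers=HEADERS, auth=AUTH, verify=False)

    # 2. Configure BGP with aggressive timers (1 s keepalive / 3 s hold time)
    #    so that failed paths are withdrawn within roughly three seconds.
    bgp_xml = """
    <bgp>
        <enabled>true</enabled>
        <localAS>65001</localAS>
        <bgpNeighbours>
            <bgpNeighbour>
                <ipAddress>192.0.2.254</ipAddress>
                <remoteAS>65000</remoteAS>
                <keepAliveTimer>1</keepAliveTimer>
                <holdDownTimer>3</holdDownTimer>
            </bgpNeighbour>
        </bgpNeighbours>
    </bgp>
    """
    requests.put(f"{NSX_MANAGER}/api/4.0/edges/{EDGE_ID}/routing/config/bgp",
                 data=bgp_xml, headers=HEADERS, auth=AUTH, verify=False)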
The challenge is that, to take advantage of multiple paths, the tenant Org VDC edge gateways must peer with the provider edges to exchange routing information and path availability. This is not manageable in a shared environment, because every newly deployed tenant Org VDC edge gateway would need its own peering with the provider edges. The following design works around this limitation by deploying a distributed logical router (DLR) between the provider and Org VDC edges. The DLR then provides a single distributed, highly available default gateway for all tenant Org VDC edge gateways.
Figure 21. Leaf and Spine with Dedicated Edge Cluster and ECMP Edges
 
The previous figure shows two provider ECMP edges (the design can scale up to eight), each with two physical VLAN connections to the upstream physical router and one internal interface to the transit edge logical switch. The DLR connects the transit edge logical switch with the transit vCloud Director external network to which all tenant Org VDC edge gateways are connected. The DLR has ECMP routing enabled as well as OSPF or BGP dynamic routing peering with the provider edges. The DLR therefore has two (or more) equal-cost paths to the upstream provider edges and chooses one based on a hash of the source and destination IP addresses of the routed packet.
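As a sketch of the corresponding DLR-side configuration, the following assumes the DLR is managed through the same /api/4.0/edges endpoint and shows ECMP being enabled and OSPF peering configured with the same aggressive timers used on the provider edges. The DLR ID, interface index, addresses, and XML element names (including protocolAddress and forwardingAddress) are assumptions to be checked against the NSX API guide for the version in use.

    # Sketch only: enable ECMP on the DLR and configure OSPF peering toward the
    # provider edges via the NSX-v REST API. Element names are assumptions.
    import requests

    NSX_MANAGER = "https://nsx-manager.example.com"   # hypothetical NSX Manager
    DLR_ID = "edge-10"                                 # hypothetical DLR edge ID
    AUTH = ("admin", "password")
    HEADERS = {"Content-Type": "application/xml"}

    # Enable ECMP so both provider edges are installed as equal-cost next hops.
    requests.put(f"{NSX_MANAGER}/api/4.0/edges/{DLR_ID}/routing/config/global",
                 data="<routingGlobalConfig><ecmp>true</ecmp>"
                      "<routerId>10.0.0.10</routerId></routingGlobalConfig>",
                 headers=HEADERS, auth=AUTH, verify=False)

    # OSPF on the DLR uplink toward the transit edge logical switch, with the
    # same aggressive hello/dead timers (1 s / 3 s).
    ospf_xml = """
    <ospf>
        <enabled>true</enabled>
        <protocolAddress>10.0.0.11</protocolAddress>
        <forwardingAddress>10.0.0.10</forwardingAddress>
        <ospfAreas><ospfArea><areaId>0</areaId></ospfArea></ospfAreas>
        <ospfInterfaces>
            <ospfInterface>
                <vnic>0</vnic>
                <areaId>0</areaId>
                <helloInterval>1</helloInterval>
                <deadInterval>3</deadInterval>
            </ospfInterface>
        </ospfInterfaces>
    </ospf>
    """
    requests.put(f"{NSX_MANAGER}/api/4.0/edges/{DLR_ID}/routing/config/ospf",
                 data=ospf_xml, headers=HEADERS, auth=AUTH, verify=False)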
The two Org VDC edge gateways shown (which can belong to two different tenants) then take advantage of the aggregate bandwidth provided by the edge cluster (indicated by the orange arrows).
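The hashing-based path selection described above can be illustrated conceptually. The exact hash used by the DLR in the ESXi VMkernel is internal to NSX; the sketch below only shows the principle that packets of one flow (the same source and destination IP pair) always take the same equal-cost path, while different flows spread across the available provider edges. All addresses are examples.

    # Conceptual sketch of ECMP next-hop selection based on a hash of the
    # source and destination IP addresses (not the actual NSX algorithm).
    import ipaddress

    def select_next_hop(src_ip: str, dst_ip: str, next_hops: list) -> str:
        """Pick one of the equal-cost provider edges for a given flow."""
        key = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
        return next_hops[key % len(next_hops)]

    # Two ECMP provider edges reachable on the transit edge logical switch.
    provider_edges = ["10.0.0.1", "10.0.0.2"]

    # The same flow always hashes to the same edge; other flows may take the other path.
    print(select_next_hop("192.168.10.5", "203.0.113.10", provider_edges))
    print(select_next_hop("192.168.20.7", "203.0.113.10", provider_edges))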
The figure also depicts the DLR control VM. This is the protocol endpoint that peers with the provider edges and learns and announces routes, which are then distributed to the VMware ESXi™ host VMkernel routing processes by the VMware NSX Controller™ cluster (not shown in the figure). Even though the DLR is highly available in an active-standby configuration, a failure of the DLR control VM impacts the routing information learned through OSPF/BGP, because control VM failover takes longer than the aggressive protocol timers allow (more than 3 seconds). Therefore, a static route for the transit vCloud Director external network subnet is created on all ECMP provider edges. This is sufficient for north-south routing, because the Org VDC subnets are always behind the tenant Org VDC edge gateways, which provide Network Address Translation (NAT). South-north routing is also static, because the Org VDC edge gateways are configured with the default gateway defined in the external network properties.
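The static route described above could be pushed to every ECMP provider edge with a short script along the following lines. The /routing/config/static endpoint and XML element names are assumptions based on the NSX-v API, and the edge IDs, transit subnet, and DLR forwarding address are placeholders to adapt to the actual environment.

    # Sketch only: add a static route for the transit vCloud Director external
    # network subnet on each ECMP provider edge, pointing at the DLR.
    import requests

    NSX_MANAGER = "https://nsx-manager.example.com"    # hypothetical NSX Manager
    PROVIDER_EDGES = ["edge-1", "edge-2"]               # hypothetical edge IDs
    TRANSIT_SUBNET = "192.168.100.0/24"                 # example transit external network
    DLR_FORWARDING_ADDRESS = "10.0.0.10"                # DLR interface on the edge transit switch

    static_route_xml = f"""
    <staticRouting>
        <staticRoutes>
            <route>
                <network>{TRANSIT_SUBNET}</network>
                <nextHop>{DLR_FORWARDING_ADDRESS}</nextHop>
            </route>
        </staticRoutes>
    </staticRouting>
    """

    for edge_id in PROVIDER_EDGES:
        requests.put(f"{NSX_MANAGER}/api/4.0/edges/{edge_id}/routing/config/static",
                     data=static_route_xml,
                     headers={"Content-Type": "application/xml"},
                     auth=("admin", "password"), verify=False)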
The other consideration is the placement of the DLR control VMs. If the active control VM fails together with one of the ECMP provider edges, the ESXi host VMkernel routes are not updated until the DLR control VM functionality fails over to the passive instance; in the meantime, the path through the failed provider edge blackholes traffic. If there are enough hosts in the edge cluster, deploy the DLR control VMs with a DRS anti-affinity rule that separates them from all ECMP edges. More likely there are not enough hosts, in which case deploy the DLR control VMs to one of the compute clusters. The VMs are very small (512 MB RAM, 1 vCPU), so the impact on cluster capacity is negligible.
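Where the edge cluster does have enough hosts, the DRS anti-affinity rule mentioned above could be created with pyVmomi as sketched below. The vCenter Server address, credentials, cluster name, and VM names are hypothetical placeholders.

    # Sketch only: DRS anti-affinity rule keeping the DLR control VMs and the
    # ECMP provider edges on different hosts, created with pyVmomi.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    def find_objects(content, vimtype, names):
        """Return managed objects of the given type whose names match."""
        view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
        return [obj for obj in view.view if obj.name in names]

    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local", pwd="***",
                      sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    cluster = find_objects(content, vim.ClusterComputeResource, ["Edge-Cluster"])[0]
    vms = find_objects(content, vim.VirtualMachine,
                       ["provider-edge-0", "provider-edge-1",
                        "dlr-control-0", "dlr-control-1"])

    # Anti-affinity rule: DRS keeps each listed VM on a different host, so a
    # single host failure cannot take out an ECMP edge and the active DLR
    # control VM at the same time.
    rule = vim.cluster.AntiAffinityRuleSpec(name="separate-dlr-control-and-ecmp-edges",
                                            enabled=True, vm=vms)
    spec = vim.cluster.ConfigSpecEx(
        rulesSpec=[vim.cluster.RuleSpec(operation="add", info=rule)])
    cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

    Disconnect(si)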