Load balancing on VMware Virtual Standard Switch (vSwitch)

Two Linux VMs are running inside a vSphere host:

  • ubuntu1 (IP address 172.31.30.22, MAC address 00:50:56:AF:49:3A)
  • ubuntu2 (IP address 172.31.30.24, MAC address 00:50:56:AF:1B:18)

Both are pinging the default gateway (IP address 172.31.30.1, MAC address 00:23:7D:34:9A:DA) and a physical switch (IP address 172.31.30.2, MAC address 00:22:BE:9E:A2:40).

Now add a second physical network adapter to vSwitch0:

[Image: add_uplink (second physical adapter added as an uplink to vSwitch0)]

Route based on originating virtual port

The default load balancing algorithm is “Route based on originating virtual port”, and each VM takes a different uplink based on its virtual port ID:

Switch#show mac address-table | i 0050.56af.1b18|0050.56af.493a
   1    0050.56af.1b18    DYNAMIC     Gi1/0/3
   1    0050.56af.493a    DYNAMIC     Gi1/0/4
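
Behind the scenes this selection is commonly described as the virtual port ID modulo the number of active uplinks. A minimal Python sketch of that idea (the port IDs below are made up for illustration, and the actual ESXi implementation may differ):

def uplink_for_port(port_id: int, active_uplinks: int) -> int:
    # Commonly described selection: virtual port ID modulo the number of active uplinks.
    return port_id % active_uplinks

# Hypothetical virtual port IDs for the two VMs on a two-uplink vSwitch:
for name, port_id in (("ubuntu1", 8), ("ubuntu2", 9)):
    print(name, "-> vmnic%d" % uplink_for_port(port_id, 2))
# Consecutive port IDs land on different uplinks, matching the MAC table above.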

On the switch side no link aggregation group (LAG) is required. If a LAG is configured, packets will be duplicated:

# ping 172.31.30.24
PING 172.31.30.24 (172.31.30.24) 56(84) bytes of data.
64 bytes from 172.31.30.24: icmp_seq=1 ttl=64 time=0.320 ms
64 bytes from 172.31.30.24: icmp_seq=1 ttl=64 time=0.362 ms (DUP!)

Route based on source MAC hash

Now switch to “Route based on source MAC hash”:

[Image: lb_source_mac (load balancing set to “Route based on source MAC hash”)]

Both VMs are taking the same uplink:

Switch#show mac address-table | i 0050.56af.1b18|0050.56af.493a
   1    0050.56af.1b18    DYNAMIC     Gi1/0/3
   1    0050.56af.493a    DYNAMIC     Gi1/0/3

The reason is simple. The algorithm works as follows:

used vmnic = HEX(source MAC address) mod (number of available vmnics)

So:

0050.56af.1b18 mod 2 = 0
0050.56af.493a mod 2 = 0
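
A minimal Python sketch of the formula above, applied to the two MAC addresses in this lab (vmnic numbering is illustrative):

def uplink_for_mac(mac: str, vmnics: int) -> int:
    # Apply the formula above: MAC address read as a hex number, modulo the uplink count.
    return int(mac.replace(".", "").replace(":", ""), 16) % vmnics

for mac in ("0050.56af.1b18", "0050.56af.493a"):
    print(mac, "-> vmnic%d" % uplink_for_mac(mac, 2))
# Both MACs are even numbers, so both VMs end up on the same uplink (index 0).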

In this specific case the default algorithm distributes traffic across the uplinks better than this one.

Again, do not configure LAG on the switch side or packets will be duplicated.

Route based on IP hash

Now change the algorithm to “Route based on IP hash”:

[Image: lb_iphash (load balancing set to “Route based on IP hash”)]

Because this is the only algorithm that involves the destination address, a static link aggregation group (LAG) must be configured on the switch side, otherwise the physical switch will report MAC address flapping:

Jun 20 13:42:01.295: %SW_MATM-4-MACFLAP_NOTIF: Host 0050.56af.493a in vlan 1 is flapping between port Gi1/0/3 and port Gi1/0/4
Jun 20 13:42:02.789: %SW_MATM-4-MACFLAP_NOTIF: Host 0050.56af.1b18 in vlan 1 is flapping between port Gi1/0/3 and port Gi1/0/4

Now both vmnics are bundled into a single 2 Gbit/s interface:

Switch#show mac address-table | i 0050.56af.1b18|0050.56af.493a
   1    0050.56af.1b18    DYNAMIC     Po2
   1    0050.56af.493a    DYNAMIC     Po2
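
The per-flow uplink choice is commonly described as an XOR of source and destination IP addresses reduced modulo the number of uplinks. A Python sketch under that simplifying assumption (the exact ESXi hash may differ in detail):

import ipaddress

def uplink_for_flow(src: str, dst: str, vmnics: int) -> int:
    # Simplified IP-hash selection: XOR the two addresses, then reduce modulo the uplink count.
    return (int(ipaddress.ip_address(src)) ^ int(ipaddress.ip_address(dst))) % vmnics

# ubuntu1 talking to the two physical devices in this lab:
for dst in ("172.31.30.1", "172.31.30.2"):
    print("172.31.30.22 ->", dst, ": vmnic%d" % uplink_for_flow("172.31.30.22", dst, 2))
# Different destinations can hash to different uplinks, so a single VM can use both vmnics.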

Mind that:

  • data from VMs to the switch will be balanced using source and destination IP hash;
  • data from the switch to the VMs will be balanced using the algorithm configured on the switch side (Cisco uses src-mac by default);
  • a single flow cannot exceed the bandwidth of a single physical interface (in other words, a single flow cannot reach 2 Gbit/s).

Use explicit failover order

The last choice is not a load balancing algorithm, it is just a failover/failback method:

[Image: lb_disabled (teaming policy set to “Use explicit failover order”)]

This option always uses the highest-order active adapter and fails over to the next one in case of failure. Standby adapters are used only if all active adapters fail. In this case vmnic0 is the active interface under normal conditions. If vmnic0 fails, vmnic1 takes over. When vmnic0 recovers, it becomes active again (Failback option). Failures are detected using network link status, and the events are notified to the upstream (physical) switches using RARP (see “Notify switches” below).
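
A Python sketch of this selection logic, under the assumption that it boils down to “first healthy adapter in the configured order” (adapter names and link states below are illustrative):

def pick_uplink(active_order, standby_order, link_up):
    # Walk the active adapters in order, then the standby ones; return the first with link up.
    for nic in list(active_order) + list(standby_order):
        if link_up.get(nic):
            return nic
    return None  # no healthy uplink left

# vmnic0 is preferred; vmnic1 takes over only while vmnic0's link is down (failback on recovery).
print(pick_uplink(["vmnic0", "vmnic1"], [], {"vmnic0": True, "vmnic1": True}))    # vmnic0
print(pick_uplink(["vmnic0", "vmnic1"], [], {"vmnic0": False, "vmnic1": True}))   # vmnic1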

Notify switches

When a VM is moved (vMotion) to another host, or an active adapter fails, one or more MAC addresses are moved to another active adapter. Upstream physical switches must update their MAC address tables in order to make the VMs reachable again. The Notify switches feature broadcasts RARP packets to refresh the physical switches. RARP is an obsolete protocol and is usually ignored by hosts; this is not an issue, because any frame leaving the new uplink can update the switches’ MAC address tables. I suppose GARP would be a better choice, but I suppose VMware had good reasons.
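
For illustration only, a Python sketch of what such a notification could look like on the wire, assuming a broadcast frame with the RARP EtherType (0x8035) and an ARP-style payload carrying the VM’s MAC; this is a sketch, not VMware’s actual code:

import struct

def build_rarp_notify(vm_mac: bytes) -> bytes:
    # Broadcast Ethernet header, EtherType 0x8035 (RARP), ARP-style body, opcode 3 (request reverse).
    ether = b"\xff" * 6 + vm_mac + struct.pack("!H", 0x8035)
    body = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)          # Ethernet, IPv4, hlen 6, plen 4
    body += vm_mac + b"\x00" * 4 + vm_mac + b"\x00" * 4       # sender/target MACs, IPs left empty
    return ether + body

frame = build_rarp_notify(bytes.fromhex("005056af493a"))      # ubuntu1's MAC
print(len(frame), "bytes:", frame.hex())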

Beacon Probing

The default method for detecting a network failure is network link status. In the following example link status is not enough:

[Image: access_core (access/core switch topology used in this example)]

If the link between the Access #1 switch and the Core switch goes down, the ESXi active uplink remains in the active state. With Beacon Probing:

ESXi/ESX periodically broadcasts beacon packets from all uplinks in a team. The physical switch is expected to forward all packets to other ports on the same broadcast domain. Therefore, a team member is expected to see beacon packets from other team members. If an uplink fails to receive three consecutive beacon packets, it is marked as bad. The failure can be due to the immediate link or a downstream link.

With Beacon Probing, if the Access #1 to Core link goes down, the ESXi host can detect the failure and switch to the standby vmnic. Beacon Probing can also help when unidirectional links occur.
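
A Python sketch of the “three consecutive missed beacons” bookkeeping quoted above (the threshold comes from the quote, everything else is illustrative):

MISSED_THRESHOLD = 3  # consecutive missed beacons before an uplink is marked bad

def update_beacon_state(missed, uplink, beacon_seen):
    # Reset the counter when a beacon arrives, otherwise count a miss; bad at the threshold.
    missed[uplink] = 0 if beacon_seen else missed.get(uplink, 0) + 1
    return missed[uplink] >= MISSED_THRESHOLD

missed = {}
for _ in range(3):  # vmnic0 misses three beacons in a row (e.g. the Access #1 to Core link is down)
    bad = update_beacon_state(missed, "vmnic0", beacon_seen=False)
print("vmnic0 marked bad:", bad)  # True -> traffic fails over to the standby vmnic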

Beacon Probing should be used with three or more active adapters to prevent split-network scenarios. With only two adapters, if a single uplink fails without affecting link status, the hypervisor stops receiving beacon packets on both adapters and cannot tell which one has failed. The default behavior seems to be to report the error and “do nothing”: some VMs will still be able to reach outside networks, some won’t:

[Image: beacon_probing_error (beacon probing error reported by the host)]

Posted on 20 Jun 2014 by Andrea.