The integration of devices made by different vendors is a delicate task, because each component can behave differently than expected. A data center core network built with Cisco, HP, NetApp and VMware equipment and software will be discussed:
The previous diagram does not show the best data center implementation: it just shows one possible implementation, as a basis for a discussion about limits, risks and solutions.
Switching within the physical core-edge network
The physical network is built with Cisco equipment:
- 2 Cisco Catalyst 6500 (no VSS) will implement the Core level;
- 2 Cisco Nexus 5K will implement the Distribution level;
- 2 (or more) Cisco Nexus 2000 (Fabric Extender) will implement the Access level (HP Virtual Connects will implement an additional Access level and it will be discussed later).
STP is a required protocol to avoid loops in layer two networks. During a topology change a VLAN can be frozen for up to 50 seconds. Rapid STP (802.1w) can shorten the worst-case convergence to about 30 seconds, not less. In this scenario a lot of Virtual Machines are running from an NFS datastore: each VM boots from a virtual disk, formally a file stored in the NFS datastore. Each virtual disk emulates a SCSI disk, so the guest OS sends SCSI commands to the virtual disk. A SCSI command needs a loss-less medium: it can't be delayed and it can't be lost (the error recovery process takes too long). So a topology change will stop the SCSI commands of hundreds of VMs: all OSes will crash or become unusable (Windows will crash with a blue screen, and Linux will remount all filesystems read-only). There are Virtual-Machine-level fixes to avoid this behavior, and they will be discussed later.
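The 50-second figure comes straight from the default classic STP timers: a port can wait a full max_age for stored BPDU information to expire, then spend forward_delay in listening and forward_delay again in learning before it forwards traffic:

```shell
# Worst-case classic STP outage with default timers:
#   max_age       = 20 s  (waiting for stored BPDU information to age out)
# + forward_delay = 15 s  (listening state)
# + forward_delay = 15 s  (learning state)
max_age=20
forward_delay=15
echo "$((max_age + 2 * forward_delay)) seconds"   # prints "50 seconds"
```

During that window no frames are forwarded on the affected ports, which is exactly the pause that outstanding SCSI commands cannot survive.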
STP could be almost completely eliminated by configuring a VSS between the two Catalyst switches, but this discussion is not about VSS. So the outage caused by an STP convergence must be minimized.
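One way to shrink the impact of a convergence is to make sure every host-facing port is an edge port, so it never transitions through the blocking states and never triggers a topology change notification when it flaps. A sketch of the NX-OS side (interface range is hypothetical):

```
! Nexus 5K -- host ports on the Fabric Extender, NX-OS sketch
spanning-tree mode rapid-pvst
!
interface Ethernet100/1/1-32
  switchport mode trunk
  spanning-tree port type edge trunk
```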
Each Catalyst configures a port-channel (a vPC on the Nexus side) to both Nexus 5Ks. CoreA is the STP Root for the NFS VLAN, and CoreB is the Root for the other VLANs. The N5Ks should not be Root because they present themselves with two different Bridge IDs: if the switch with the lowest BID reboots, all VLANs need to reconverge. Both Catalyst switches have two Supervisors, so each Root is more stable.
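The root placement described above could be expressed like this on the Catalysts (a sketch; the VLAN IDs are made up, with 600 standing in for the NFS VLAN and 10,20 for the routed VLANs):

```
! CoreA -- IOS sketch: Root for the NFS VLAN, backup Root for the rest
spanning-tree mode rapid-pvst
spanning-tree vlan 600 root primary
spanning-tree vlan 10,20 root secondary
!
! CoreB -- mirror configuration
spanning-tree mode rapid-pvst
spanning-tree vlan 10,20 root primary
spanning-tree vlan 600 root secondary
```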
Both Catalysts act as gateways for the routed VLANs (management and production). The DMZ VLAN is routed by a firewall (not shown in the diagram). The vMotion and NFS VLANs are isolated. An FHRP (First Hop Redundancy Protocol, e.g. HSRP) can be configured between the two Catalysts: the primary gateway should follow the STP Root.
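The "primary gateway follows the STP Root" idea can be expressed with HSRP priorities (a sketch; addresses, VLAN and group numbers are hypothetical):

```
! CoreB -- STP Root for the production VLAN, so it gets the higher HSRP priority
interface Vlan10
 ip address 10.0.10.3 255.255.255.0
 standby 10 ip 10.0.10.1
 standby 10 priority 110
 standby 10 preempt
!
! CoreA -- default priority (100), becomes active only if CoreB fails
interface Vlan10
 ip address 10.0.10.2 255.255.255.0
 standby 10 ip 10.0.10.1
 standby 10 preempt
```

Aligning the active gateway with the STP Root keeps routed traffic from hairpinning across the inter-switch link.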
Each server has a teaming (Windows) or bonding (Linux) mechanism to overcome a single path failure. One path becomes active and the other one remains standby. Depending on the driver, a path can be configured as preferred. In this scenario no preferred path is assumed, so the worst case will be analyzed. Each NetApp storage can be configured with an active-standby path mechanism or with an LACP port-channel.
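On the Linux side, the active-standby behavior described above corresponds to an active-backup bond. A configuration sketch using iproute2 (interface names are hypothetical; no primary is set, matching the "no preferred path" assumption):

```
# Create an active-backup bond, checking link state every 100 ms
ip link add bond0 type bond mode active-backup miimon 100
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
ip link set bond0 up
# A preferred path would instead be declared with:
#   ip link set bond0 type bond primary eth0
```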
The path between each pair of hosts should be analyzed:
- From ServerA to ServerC: the active path of both servers is attached to the same Fabric Extender (Nexus 2000), which is not a switching device. All frames must be forwarded to the EdgeA switch and then back to the destination. The uplink path of the Fabric Extender can be a bottleneck.
- From ServerA to ServerB: the active path of each server is attached to a different Fabric Extender, which forwards frames to the upper Edge switch. Analyzing the STP topology, the link between the edge switches should be in Alternate state (rapid-pvst is used): frames should flow to the Root Catalyst switch, then back to the destination Nexus 5K. But because the vPC peer link is always non-blocking (a vPC domain uses other mechanisms to ensure a loop-free topology), frames flow through the vPC peer link, using the shortest path. The vPC peer link can also be a bottleneck.
- Between NetApp storages: traffic between the NetApp controllers depends on the interface configuration (active-standby, or active-active using LACP). As discussed in the previous point, the shortest path will be used. Because storage traffic can heavily impact the network, the NetApp controllers should be directly attached to the Nexus 5K switches rather than to Fabric Extenders, otherwise the N2K-N5K links will be overused (N2Ks are non-switching devices).
- From a server to a NetApp storage: an active-standby NetApp configuration follows the ServerA-to-ServerB example. An LACP NetApp configuration behaves differently: the storage presents its MAC address to both N5K switches, so ServerA's active path finds the destination storage on the nearest N5K switch. The LACP configuration lets servers always reach the storage over the shortest path. On the other hand, an active-standby configuration allows moving cables to other switches without a Takeover/Giveback action.
N2K devices are not switches: they are just extensions (Fabric Extenders) of the N5K/N7K switches, with important limitations:
- they are non-switching devices;
- they provide end-host connectivity only.
No switch can be attached to an N2K device: each N2K has an implicit BPDU Guard configured. If a BPDU is received, the port is immediately placed in an error-disabled state, which keeps the link down.
The additional access layer: HP Virtual Connects
HP Virtual Connects (VCs) are hybrid devices and can be compared to Cisco UCS 61xx devices: a VC virtualizes server identity information as UCS does. This is not a VC versus UCS 61xx comparison; it is made only to point out what kind of devices VCs are. Each HP blade is attached, through internal hardware connections (VC downlinks), to both VCs located in the same chassis. Each VC defines WWNs and MAC addresses for the servers using profiles, and each profile can be moved/cloned to other servers. A bunch of trunk links from the upper Nexus layer brings all VLANs to the VCs (VC uplinks). Each chassis must include a pair of VCs, which communicate through internal 10GbE cross-connect links; VCs on different chassis are connected with external fibers and form one VC domain.
Within the same VC domain, traffic is forwarded internally. In other words:
- within a single VC, if both blades are using links to the same VC;
- across the VC domain, using the loop connection, if the blades are using links to different VCs.
A “Shared uplink set” is a group of VC uplinks; in “Auto” mode the Shared uplink set negotiates an LACP port-channel. Each blade can use one or more Shared uplink sets to transmit and receive data. From the OS perspective, a single Shared uplink set is a network interface.
An LACP port-channel can be configured within the same VC only.
With the Smart Link feature, each Shared uplink set brings the blade network connections down if there are no more active VC uplinks in the same Shared uplink set. See the following example:
In the leftmost diagram the blade servers are configured with a bonding/teaming device spanning two Shared uplink sets. Each Shared uplink set has only one VC uplink, which connects to the external networks. The bonding/teaming active path is indicated by the black arrow.
In the center diagram a VC uplink is broken and a Shared uplink set becomes unusable. But the blade servers still use the same path, because it is in an up/up state: the data path is broken and the blade servers are unreachable.
In the rightmost diagram the Smart Link feature is enabled and shuts down the unusable links. The blade servers recognize the broken link, and the active paths now use the other Shared uplink set (thanks to the bonding/teaming feature).
Now see the following more complex example:
In the above configuration each Virtual Connect configures one Shared uplink set with a single VC uplink inside it. The rightmost blade servers are using a remote Shared uplink set as the active path. If the Virtual Connect B uplink goes down, Smart Link notifies the directly connected blade servers, but it won't notify the other Virtual Connects about the broken link. So the right blade servers will keep using the same link and will become unreachable.
A Shared uplink set can contain interfaces from different Virtual Connects, but this is outside the scope of this document.
A VC behaves as an edge device on the network and won't transmit BPDUs: it does not participate in the spanning tree process. BPDU Guard is configured by default: a BPDU received from a blade server will shut down that blade server's internal link. There is no BPDU Filter feature.
This is a good thing: if a BPDU were forwarded to the N2K, the implicit BPDU Guard on the N2K would shut down the entire VC uplink.
The end-host network access is provided by VMware software. Physically, VMware vSphere is software installed on the servers: inside there is a complex system which emulates an entire data center.
A VMware Virtual Network (vNet) does not run STP and does not participate in any spanning tree process. Let's see the following example:
The red VM is transmitting BPDUs:
- The first BPDU is forwarded through the blade server where the VM is active. When the BPDU reaches Virtual Connect A, the link is shut down because of BPDU Guard.
- The VMware ESXi server moves the red VM to the other active link (through Virtual Connect B).
- Another BPDU causes the shutdown of the second link. Now the first VMware ESXi server is unreachable.
- The VMware HA feature restarts all VMs on another server.
- The red VM is active again and can transmit another BPDU.
- The process continues until all servers are unreachable.
A VM can transmit a BPDU if:
- two vmnics are configured and bridged together: in this case, if a BPDU (originated by the N5K) is received from one vmnic, it is forwarded through the other vmnic;
- an STP instance is running within the VM.
The first case can be avoided by setting Forged Transmits to Reject in the security policy of each vSwitch/port group:
The second case cannot be avoided with VMware ESXi 4.1. The solution is to install a Cisco Nexus 1000v, or to upgrade to VMware ESXi 5.1, which has a BPDU Filter feature.
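On ESXi 5.1 the guest BPDU filter is an advanced host setting (Net.BlockGuestBPDU); it can be enabled from the ESXi shell, for example:

```
esxcli system settings advanced set -o /Net/BlockGuestBPDU -i 1
```

With this set, BPDUs generated inside guests are dropped at the host instead of reaching the physical network and tripping BPDU Guard upstream.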
As discussed before, another potential issue is caused by STP convergence: because each Virtual Machine uses the SCSI protocol, a lost or delayed packet can be dangerous. To avoid crashes during an STP convergence, it is mandatory to increase the SCSI and NFS timeouts.
On Windows (persistent):
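On Windows, the persistent setting is the disk-class timeout in the registry. A sketch (the value is an example; use whatever your storage vendor recommends, e.g. 190 seconds in NetApp guidance):

```
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeOutValue /t REG_DWORD /d 190 /f
```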
On Linux (non-persistent):
echo 180 > /sys/block/sdc/device/timeout
On Linux systems, VMware Tools also sets the proper timeout (persistent across reboots).
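Without VMware Tools, the Linux timeout can be made persistent with a udev rule similar to the one the Tools install (a sketch; match keys and attribute names vary across distributions, so verify against your distro's udev documentation):

```
# /etc/udev/rules.d/99-scsi-timeout.rules
# Set a 180 s timeout on VMware virtual SCSI disks as they appear
ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="VMware*", \
  RUN+="/bin/sh -c 'echo 180 > /sys$DEVPATH/device/timeout'"
```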
ESXi servers must also be configured with the proper NFS settings. The easiest way is to install the Virtual Storage Console by NetApp and check the NetApp tab in the vSphere Client:
Building a data center requires deep knowledge. This example showed:
- Cisco Fabric Extender (Nexus 2K) is a non-switching device with an implicit BPDU Guard configured.
- A Cisco Nexus 5K vPC acts as a single switch for a port-channel, but originates two different Bridge IDs.
- HP Virtual Connects have BPDU Guard by default and a Smart Link feature to notify blade servers about link problems. Smart Link is not forwarded to other Virtual Connects (it has local significance only).
- A VMware Virtual Machine can originate BPDUs, which are forwarded to other network devices.
- NFS datastores transport SCSI commands over a non-reliable medium (frames can be delayed by STP).
References
- Dear VMware, BPDU Filter != BPDU Guard
- ESXi 5.1 and BPDU Guard
- HP Virtual Connect: Common Myths, Misperceptions, and Objections
- HP Virtual Connect Smart Link
- Overview of HP Virtual Connect technologies
- Cisco Nexus 2000 Series Fabric Extender Software Configuration Guide, Release 4.1
- Cisco Nexus 2000: A Love/Hate Relationship
- HP Virtual Connect for the Cisco Network Administrator