Switching is not working on VMware NSX

In this scenario we have three VMs deployed on an NSX logical switch with VNI 5002. Two of them run on the same host and can ping each other; the third runs on a separate host and cannot ping the other two.

Check if at least one interface is configured for VXLAN:

~ # esxcli network vswitch dvs vmware vxlan list
VDS ID                                           VDS Name   MTU  Segment ID   Gateway IP   Gateway MAC        Network Count  Vmknic Count
-----------------------------------------------  --------  ----  -----------  -----------  -----------------  -------------  ------------
6f 8c 2f 50 90 23 85 a6-a5 36 18 fa 99 4f ba c7  DSwitch0  1600  172.31.30.0  172.31.30.1  00:23:7d:34:9a:da              1             1
~ # esxcfg-vmknic -l
Interface  Port Group/DVPort   IP Family IP Address                              Netmask         Broadcast       MAC Address       MTU     TSO MSS   Enabled Type
vmk0       VMkernel            IPv4      172.31.30.11                            255.255.255.224 172.31.30.31    00:1b:78:b8:13:fc 1500    65535     true    STATIC
vmk0       VMkernel            IPv6      fe80::21b:78ff:feb8:13fc                64                              00:1b:78:b8:13:fc 1500    65535     true    STATIC, PREFERRED
vmk1       57                  IPv4      172.31.30.13                            255.255.255.224 172.31.30.31    00:50:56:60:8c:ad 1600    65535     true    DHCP
vmk1       57                  IPv6      fe80::250:56ff:fe60:8cad                64                              00:50:56:60:8c:ad 1600    65535     true    STATIC, PREFERRED

The first step is to check VXLAN connectivity. On each ESXi host, try to ping the remote ESXi hosts:

ping ++netstack=vxlan -I vmk1 172.31.30.9

vmk1 is the local VXLAN interface and 172.31.30.9 is the VXLAN interface of the remote host. If the ping works, check the MTU:

ping ++netstack=vxlan -d -s 1572 -I vmk1 172.31.30.9

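1572 bytes is the largest ICMP payload that fits in the 1600-byte VXLAN MTU without fragmentation (the -d option sets the don't-fragment bit). If it doesn't work, check the MTU configured on the physical switches.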
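
To derive the ping payload for a different MTU, subtract the IPv4 header (20 bytes, assuming no IP options) and the ICMP header (8 bytes) from the MTU. A minimal sketch of that arithmetic; the function name is illustrative and not part of any VMware tool:

```python
IPV4_HEADER = 20  # bytes, assuming an IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(mtu):
    """Return the largest ICMP echo payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


print(max_ping_payload(1600))  # 1572, the value passed to -s above
```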

If the above tests are OK, connect to the controller:

nsx-controller # show control-cluster logical-switches vni 5002
Error: Not found

It seems VNI 5002 is not configured on the controller, even though the Web Client shows it. Go back to the ESXi hosts:

~ # net-vdl2 -l
VXLAN Global States:
        Control plane Out-Of-Sync:      No
        UDP port:       8472
VXLAN VDS:      DSwitch0
        VDS ID: 6f 8c 2f 50 90 23 85 a6-a5 36 18 fa 99 4f ba c7
        MTU:    1600
        Segment ID:     172.31.30.0
        Gateway IP:     172.31.30.1
        Gateway MAC:    00:23:7d:34:9a:da
        Vmknic count:   1
                VXLAN vmknic:   vmk1
                        VDS port ID:    57
                        Switch port ID: 50331675
                        Endpoint ID:    0
                        VLAN ID:        0
                        IP:             172.31.30.13
                        Netmask:        255.255.255.224
                        Segment ID:     172.31.30.0
                        IP acquire timeout:     0
                        Multicast group count:  0
        Network count:  1
                VXLAN network:  5002
                        Multicast IP:   N/A (headend replication)
                        Control plane:  Enabled ()
                        Controller:     172.31.30.19 (down)
                        MAC entry count:        0
                        ARP entry count:        0
                        Port count:     1
~ # esxcli network vswitch dvs vmware vxlan network list --vds-name=DSwitch0
VXLAN ID  Multicast IP               Control Plane  Controller Connection  Port Count  MAC Entry Count  ARP Entry Count  MTEP Count
--------  -------------------------  -------------  ---------------------  ----------  ---------------  ---------------  ----------
    5002  N/A (headend replication)  Enabled ()     172.31.30.19 (down)             1                0                0           0
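
When many hosts have to be checked, the controller state can be pulled out of saved net-vdl2 -l output with a short script. The following is only a sketch: it assumes the output layout shown above, and the function name controller_status is hypothetical.

```python
import re


def controller_status(net_vdl2_output):
    """Map each VXLAN network (VNI) to its (controller IP, state) pair."""
    status = {}
    vni = None
    for raw in net_vdl2_output.splitlines():
        line = raw.strip()
        match = re.match(r"VXLAN network:\s+(\d+)", line)
        if match:
            vni = int(match.group(1))  # remember which network the next lines belong to
            continue
        match = re.match(r"Controller:\s+(\S+)\s+\((\w+)\)", line)
        if match and vni is not None:
            status[vni] = (match.group(1), match.group(2))
    return status


# Excerpt of the output above, pasted as text.
sample = """\
        Network count:  1
                VXLAN network:  5002
                        Multicast IP:   N/A (headend replication)
                        Control plane:  Enabled ()
                        Controller:     172.31.30.19 (down)
"""

for vni, (ip, state) in controller_status(sample).items():
    print(vni, ip, state)
```

Any VNI whose state is not "up" is a candidate for the checks that follow.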

The ESXi host is not connected to the controller (the controller is reported as "down"). Check the configuration file:

~ # cat /etc/vmware/netcpa/config-by-vsm.xml

And try to restart the netcpad service:

~ # /etc/init.d/netcpad restart

Check the logs:

~ # cat /var/log/netcpa.log
[...]
2015-01-09T09:50:32.291Z [FFC8FB70 error 'Default'] Vxlan: notification from controller error/sub:1/2
[...]

On the controller:

nsx-controller # show log cloudnet/cloudnet_java-vnet-controller.20150109-095816.4763.log filtered-by 5002
[...]
2015-01-09 11:17:15,617 4738815 [vxlan worker 0] WARN com.vmware.controller.apps.vxlan.VxlanService  - Try to join not existed VNI 5002 by Connection [ip=172.31.30.11, cnnId=227]
[...]
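
If the controller log has been saved to a file, the rejected join attempts can be listed per VNI and host. This is a rough sketch that assumes the exact message wording shown above; the names JOIN_FAILURE and rejected_joins are made up for the example.

```python
import re

# Pattern built from the WARN message shown above (wording assumed stable).
JOIN_FAILURE = re.compile(
    r"Try to join not existed VNI (\d+) by Connection \[ip=([\d.]+), cnnId=(\d+)\]"
)


def rejected_joins(log_text):
    """Return (VNI, host IP) pairs for every rejected join found in the log text."""
    return [(int(vni), ip) for vni, ip, _cnn_id in JOIN_FAILURE.findall(log_text)]


log_line = (
    "2015-01-09 11:17:15,617 4738815 [vxlan worker 0] WARN "
    "com.vmware.controller.apps.vxlan.VxlanService  - Try to join not existed "
    "VNI 5002 by Connection [ip=172.31.30.11, cnnId=227]"
)

print(rejected_joins(log_line))
```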

It seems that something is out of sync between the controllers and NSX Manager. In this case the solution was to delete the logical switch on NSX Manager and add it again (the connected VMs must be disconnected from it first). After that:

nsx-controller # show control-cluster logical-switches vni 5002
VNI      Controller      BUM-Replication ARP-Proxy Connections VTEPs
5002     172.31.30.19    Enabled         Enabled   0           0

And the ESXi hosts are connected to the controller (reconnect the VMs to the logical switch first):

~ # net-vdl2 -l
VXLAN Global States:
        Control plane Out-Of-Sync:      No
        UDP port:       8472
VXLAN VDS:      DSwitch0
        VDS ID: 6f 8c 2f 50 90 23 85 a6-a5 36 18 fa 99 4f ba c7
        MTU:    1600
        Segment ID:     172.31.30.0
        Gateway IP:     172.31.30.1
        Gateway MAC:    00:23:7d:34:9a:da
        Vmknic count:   1
                VXLAN vmknic:   vmk1
                        VDS port ID:    57
                        Switch port ID: 50331675
                        Endpoint ID:    0
                        VLAN ID:        0
                        IP:             172.31.30.13
                        Netmask:        255.255.255.224
                        Segment ID:     172.31.30.0
                        IP acquire timeout:     0
                        Multicast group count:  0
        Network count:  1
                VXLAN network:  5002
                        Multicast IP:   N/A (headend replication)
                        Control plane:  Enabled (multicast proxy,ARP proxy)
                        Controller:     172.31.30.19 (up)
                        MAC entry count:        0
                        ARP entry count:        0
                        Port count:     1

Ping now works between VMs located on different hosts, and the controller has learned their MAC addresses:

nsx-controller # show control-cluster logical-switches mac-table 5002
VNI      MAC               VTEP-IP         Connection-ID
5002     00:50:56:af:32:9b 172.31.30.13    33854
5002     00:50:56:af:49:3a 172.31.30.9     33853
5002     00:50:56:af:1b:18 172.31.30.9     33853
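
To confirm that VMs behind more than one VTEP are registered, the MAC table can be grouped by VTEP IP. A small sketch, assuming the four-column layout shown above; macs_by_vtep is an illustrative name, not a controller command.

```python
from collections import defaultdict


def macs_by_vtep(mac_table_output):
    """Group the MAC addresses of a mac-table listing by VTEP IP."""
    groups = defaultdict(list)
    for line in mac_table_output.splitlines():
        fields = line.split()
        # Data rows have four columns and start with the numeric VNI;
        # this skips the header row and any blank lines.
        if len(fields) == 4 and fields[0].isdigit():
            _vni, mac, vtep, _connection_id = fields
            groups[vtep].append(mac)
    return dict(groups)


table = """\
VNI      MAC               VTEP-IP         Connection-ID
5002     00:50:56:af:32:9b 172.31.30.13    33854
5002     00:50:56:af:49:3a 172.31.30.9     33853
5002     00:50:56:af:1b:18 172.31.30.9     33853
"""

for vtep, macs in macs_by_vtep(table).items():
    print(vtep, macs)
```

Here two MAC addresses sit behind 172.31.30.9 and one behind 172.31.30.13, which matches the two-plus-one VM placement described at the start.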

Posted on 12 Jan 2015 by Andrea.