Analyzing a network performance issue

Abstract

In this case the application team reported sporadic high RTT between a client and a server. Both client and server are Linux virtual machines running under VMware vSphere, and the path between them crosses a lot of firewalls, routers, switches, load balancers, and so on.

The issue

The very first thing is to replicate the issue from the client:

client# PORT=6000; while true; do echo -ne "${PORT} "; nmap -n -g ${PORT} -p 143 10.0.1.1 | grep scanned; PORT=$((${PORT} + 1)); done

After a while I could see some requests taking more than one second:

6000 Nmap finished: 1 IP address (1 host up) scanned in 0.119 seconds
6001 Nmap finished: 1 IP address (1 host up) scanned in 0.119 seconds
6002 Nmap finished: 1 IP address (1 host up) scanned in 0.119 seconds
6003 Nmap finished: 1 IP address (1 host up) scanned in 0.116 seconds
6004 Nmap finished: 1 IP address (1 host up) scanned in 0.118 seconds
6005 Nmap finished: 1 IP address (1 host up) scanned in 3.861 seconds
6006 Nmap finished: 1 IP address (1 host up) scanned in 0.116 seconds
6007 Nmap finished: 1 IP address (1 host up) scanned in 0.119 seconds
6008 Nmap finished: 1 IP address (1 host up) scanned in 0.128 seconds
6009 Nmap finished: 1 IP address (1 host up) scanned in 0.113 seconds
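
To spot the slow probes automatically, the same loop could be piped through awk, printing only the lines whose scan time exceeds one second (a minimal sketch; the field positions assume the output format shown above):

client# PORT=6000; while true; do echo -ne "${PORT} "; nmap -n -g ${PORT} -p 143 10.0.1.1 | grep scanned; PORT=$((${PORT} + 1)); done | awk '$(NF-1) > 1'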

The issue is now confirmed. Before continuing, make sure both client and server clocks are synchronized:

client# clock
Fri 05 Feb 2016 12:28:57 PM CET  -0.016003 seconds
server# clock
Fri 05 Feb 2016 12:26:57 PM CET  -0.347022 seconds
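
If ntpd is running on the hosts, the offset against the time sources can also be cross-checked with ntpq (shown here only as an optional sanity check):

client# ntpq -p -n
server# ntpq -p -n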

Assuming the network is stable and deterministic, the next step is to capture the end-to-end traffic on both client and server:

client# tcpdump -i eth0 -s 0 -nn -w /tmp/dump.pcap host 10.0.2.2 and portrange 6000-6999 and host 10.0.1.1 and port 143
server# tcpdump -i eth0 -s 0 -nn -w /tmp/dump.pcap host 10.0.2.2 and portrange 6000-6999 and host 10.0.1.1 and port 143

The source port is set explicitly (nmap's -g option) so the tcpdump filter can be configured to match the test requests only.

After a couple of requests with high latency, both tcpdumps can be stopped and the captures analyzed.
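
To list the handshake round-trip time of every probe without inspecting packets by hand, the client capture could be post-processed with tshark (a minimal sketch; on a SYN+ACK, tcp.analysis.ack_rtt is the time elapsed since the matching SYN):

client# tshark -r /tmp/dump.pcap -Y 'tcp.flags.syn==1 and tcp.flags.ack==1' -T fields -e tcp.dstport -e tcp.analysis.ack_rtt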

Let’s review some high latency requests:

6007 Nmap finished: 1 IP address (1 host up) scanned in 3.285 seconds
6089 Nmap finished: 1 IP address (1 host up) scanned in 2.911 seconds
6200 Nmap finished: 1 IP address (1 host up) scanned in 4.025 seconds

Let’s dump the request with source port 6200 on both client and server:

client# tcpdump -r /tmp/dump.pcap -nn "port 6200 and tcp[tcpflags] & (tcp-syn) != 0"
reading from file /tmp/dump.pcap, link-type EN10MB (Ethernet)
12:35:02.639543 IP 10.0.2.2.6200 > 10.0.1.1.143: S 2917316624:2917316624(0) win 2048 <mss 1460>
12:35:02.640273 IP 10.0.1.1.143 > 10.0.2.2.6200: S 2145376395:2145376395(0) ack 2917316625 win 5840 <mss 1460>
server# tcpdump -r /tmp/dump.pcap -nn "port 6200 and tcp[tcpflags] & (tcp-syn) != 0"
reading from file /tmp/dump.pcap, link-type EN10MB (Ethernet)
12:35:02.635480 IP 10.0.2.2.6200 > 10.0.1.1.143: S 2917316624:2917316624(0) win 2048 <mss 1460>
12:35:02.635488 IP 10.0.1.1.143 > 10.0.2.2.6200: S 2145376395:2145376395(0) ack 2917316625 win 5840 <mss 1460>

So, we have the following timeline:

  • 12:35:02.639543 client sends a SYN packet
  • 12:35:02.635480 server receives the SYN packet
  • 12:35:02.635488 server sends back a SYN+ACK packet
  • 12:35:02.640273 client receives the SYN+ACK packet

Aside from a slight clock misalignment between the two hosts, the whole SYN/SYN+ACK exchange completed in less than a millisecond (0.73 ms on the client side). So the network is not the cause of the latency. The cause is the source host itself, which is spending time scheduling its processes.
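
Since both machines are VMware guests, one way to corroborate this is to watch the st (steal) column of vmstat on the client while the test loop is running; a sustained non-zero value means the hypervisor is withholding CPU cycles from the virtual machine (an optional cross-check, not part of the original capture):

client# vmstat 1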

Posted on 05 Feb 2016 by Andrea.