Strange problem with network [TCP], netperf CRR test fails.
Hello,
I have 6 identical physical machines in one cluster, all running Debian
6.0. Initially they were used to run Cassandra nodes, but these nodes
started to go down randomly after several hours of work, with
connections hung in the CLOSE_WAIT state. Typically, CLOSE_WAIT is an
indicator of incorrect application behavior, but I've reproduced similar
symptoms with the netperf CRR test, even with localhost as the host:
'netperf -H localhost -t TCP_CRR -l -5' results in
'TCP Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to localhost (127.0.0.1) port 0 AF_INET : demo
send_tcp_conn_rr: data recv error: Connection reset by peer'
The connection then hangs in the CLOSE_WAIT state with a strange 1
byte in Recv-Q:
'tcp 1 0 127.0.0.1:12865 127.0.0.1:39664 CLOSE_WAIT'
However, if I set the test duration in seconds (e.g. -l 5), it works
correctly, and TCP_RR works correctly all the time.
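To make the symptom easier to spot across several machines, here is a minimal sketch that filters netstat-style lines for sockets stuck in CLOSE_WAIT with unread bytes in Recv-Q. The names `SocketLine` and `stuck_close_wait` are hypothetical helpers invented for this illustration, and the parser assumes the six-column layout shown in the line above (proto, Recv-Q, Send-Q, local address, remote address, state).

```python
from typing import Iterable, List, NamedTuple


class SocketLine(NamedTuple):
    """One TCP row of netstat-style output (hypothetical helper type)."""
    proto: str
    recv_q: int
    send_q: int
    local: str
    remote: str
    state: str


def stuck_close_wait(lines: Iterable[str]) -> List[SocketLine]:
    """Return TCP sockets in CLOSE_WAIT that still have unread bytes queued."""
    hits = []
    for line in lines:
        parts = line.split()
        # Skip headers and anything that is not a six-column tcp row.
        if len(parts) < 6 or not parts[0].startswith("tcp"):
            continue
        row = SocketLine(parts[0], int(parts[1]), int(parts[2]),
                         parts[3], parts[4], parts[5])
        if row.state == "CLOSE_WAIT" and row.recv_q > 0:
            hits.append(row)
    return hits


if __name__ == "__main__":
    sample = ["tcp 1 0 127.0.0.1:12865 127.0.0.1:39664 CLOSE_WAIT"]
    for row in stuck_close_wait(sample):
        print(row.local, row.remote, "unread bytes:", row.recv_q)
```

The function could be fed the output of a netstat run split into lines; a non-empty result means the peer has already closed while the local process has neither read the pending data nor closed its end.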
Also, I've captured a tcpdump of the conversation between two nodes in
a similar TCP_CRR test, and it also looks strange. The nodes open the
connection correctly, the 'client' sends its data, and then the 'server'
side just resets the connection.
'netstat -s' after 40 minutes of uptime (reboot, test, and writing this
message) shows a suspicious '6 TCP data loss events' and '11
connections reset due to early user close':
Ip:
    2645347 total packets received
    76 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    2645271 incoming packets delivered
    2636980 requests sent out
Icmp:
    22 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 22
    22 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 22
IcmpMsg:
    InType3: 22
    OutType3: 22
Tcp:
    263419 active connections openings
    263458 passive connection openings
    0 failed connection attempts
    62 connection resets received
    1 connections established
    2636459 segments received
    2636437 segments send out
    8 segments retransmited
    0 bad segments received.
    21 resets sent
Udp:
    531 packets received
    2 packets to unknown port received.
    0 packet receive errors
    553 packets sent
UdpLite:
TcpExt:
    9 invalid SYN cookies received
    264883 TCP sockets finished time wait in fast timer
    3 time wait sockets recycled by time stamp
    20 delayed acks sent
    Quick ack mode was activated 1 times
    264978 packets directly queued to recvmsg prequeue.
    473 bytes directly in process context from backlog
    265473 bytes directly received in process context from prequeue
    69 packet headers predicted
    1573 packets header predicted and directly queued to user
    1055284 acknowledgments not containing data payload received
    193 predicted acknowledgments
    6 TCP data loss events
    1 timeouts in loss state
    5 retransmits in slow start
    2 other TCP timeouts
    2 DSACKs sent for old packets
    11 connections reset due to early user close
    TCPSackMerged: 7
    TCPSackShiftFallback: 13
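For context on the 'connections reset due to early user close' counter: on Linux, closing a socket that still has unread data in its receive queue makes the kernel send a RST instead of a normal FIN, and the peer then sees 'Connection reset by peer'. The sketch below reproduces that pattern over loopback so it can be compared with the netperf symptom. It is only an illustration, not a diagnosis of this particular failure; the function name `close_with_unread_data` is made up here, and the outcome may differ on other operating systems.

```python
import select
import socket


def close_with_unread_data() -> str:
    """Close the accepting side while one byte is still unread.

    Returns "reset" if the client sees ECONNRESET, "eof" if it sees a
    clean close, or "timeout" if nothing arrives in time.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5.0)
    try:
        client.connect(listener.getsockname())
        conn, _ = listener.accept()
        client.sendall(b"x")
        # Wait until the byte is queued on the accepting side, then close
        # without reading it. This is the "early user close" pattern.
        select.select([conn], [], [], 2.0)
        conn.close()
        try:
            data = client.recv(16)
        except ConnectionResetError:
            return "reset"
        except socket.timeout:
            return "timeout"
        return "eof" if data == b"" else "data"
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(close_with_unread_data())
```

If this reports a reset on an affected machine, that is ordinary kernel behavior for this pattern; whether something similar happens on the 'server' side of the TCP_CRR test would still need to be confirmed from the tcpdump.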
I've already upgraded the 'ixgbe' driver to the latest 3.9-NAPI, but
the problem still persists, and I cannot even find its source.
Best regards,
Anatoly Rybalchenko