
Strange problem with network [TCP], netperf CRR test fails.



Hello,
I have 6 identical physical machines in one cluster, all running Debian
6.0. Initially they were used to run Cassandra nodes, but these nodes
started to go down randomly after several hours of work, with connections
hung in the CLOSE_WAIT state. Typically, the CLOSE_WAIT state is an
indicator of incorrect application behavior, but I've reproduced similar
symptoms with the netperf CRR test, even with localhost as the host:
'netperf -H localhost -t TCP_CRR -l -5' results in

'TCP Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to localhost (127.0.0.1) port 0 AF_INET : demo
send_tcp_conn_rr: data recv error: Connection reset by peer'

And the connection hangs in the CLOSE_WAIT state with a strange 1
byte in Recv-Q:

'tcp 1 0 127.0.0.1:12865 127.0.0.1:39664 CLOSE_WAIT'
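To spot sockets like this one across several machines, the check can be scripted. The following is only a minimal sketch: it assumes 'netstat -tan' style lines in the column order shown above (proto, Recv-Q, Send-Q, local, remote, state), and the names `Sock` and `stuck_close_wait` are made up for illustration.

```python
from typing import List, NamedTuple


class Sock(NamedTuple):
    recv_q: int
    send_q: int
    local: str
    remote: str
    state: str


def stuck_close_wait(netstat_output: str) -> List[Sock]:
    """Return TCP sockets in CLOSE_WAIT that still hold unread bytes in Recv-Q."""
    found = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: proto recv-q send-q local remote state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        # Skip header lines or anything where the queue columns are not numbers
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        sock = Sock(int(fields[1]), int(fields[2]), fields[3], fields[4], fields[5])
        if sock.state == "CLOSE_WAIT" and sock.recv_q > 0:
            found.append(sock)
    return found
```

Feeding it the text of 'netstat -tan' returns one entry per socket that the peer has closed while the local application still has data left unread.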

With the test duration given in seconds (e.g. -l 5) the test works
correctly, and TCP_RR works correctly all the time.
Also, I've captured a tcpdump of the conversation between two nodes in a
similar TCP_CRR test, and it also looks strange. The nodes open the
connection correctly, the 'client' sends its data, and then the 'server'
side just resets the connection.
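For comparison, the connect/request/response pattern can be exercised without netperf. The sketch below is not netperf's implementation, only an assumed equivalent: a new loopback connection per transaction, a one-byte request and a one-byte response. The function names `serve` and `crr` and the default of 5 transactions (mirroring '-l -5', where a negative value asks netperf for a number of transactions rather than seconds) are illustrative choices.

```python
import socket
import threading


def serve(listener: socket.socket, count: int) -> None:
    """Accept `count` connections; read one byte, answer one byte, close."""
    for _ in range(count):
        conn, _addr = listener.accept()
        with conn:
            if conn.recv(1):            # the request
                conn.sendall(b"r")      # the response


def crr(transactions: int = 5) -> int:
    """Run connect/request/response transactions over loopback.

    Returns the number of transactions that completed. A reset from the
    server side would surface here as ConnectionResetError from recv().
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))     # port 0: let the kernel pick a free port
    listener.listen(16)
    port = listener.getsockname()[1]

    worker = threading.Thread(target=serve, args=(listener, transactions), daemon=True)
    worker.start()

    done = 0
    for _ in range(transactions):
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(b"q")
            if client.recv(1) == b"r":
                done += 1

    worker.join(timeout=5)
    listener.close()
    return done
```

If a loop like this completes every transaction while netperf still fails, that would point away from the loopback TCP path itself; if it also raises ConnectionResetError, the behaviour is reproducible outside netperf.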

'netstat -s' after 40 minutes of uptime (reboot, test, and writing this
message) shows a suspicious '6 TCP data loss events' and '11
connections reset due to early user close':

Ip:
   2645347 total packets received
   76 with invalid addresses
   0 forwarded
   0 incoming packets discarded
   2645271 incoming packets delivered
   2636980 requests sent out
Icmp:
   22 ICMP messages received
   0 input ICMP message failed.
   ICMP input histogram:
       destination unreachable: 22
   22 ICMP messages sent
   0 ICMP messages failed
   ICMP output histogram:
       destination unreachable: 22
IcmpMsg:
       InType3: 22
       OutType3: 22
Tcp:
   263419 active connections openings
   263458 passive connection openings
   0 failed connection attempts
   62 connection resets received
   1 connections established
   2636459 segments received
   2636437 segments send out
   8 segments retransmited
   0 bad segments received.
   21 resets sent
Udp:
   531 packets received
   2 packets to unknown port received.
   0 packet receive errors
   553 packets sent
UdpLite:
TcpExt:
   9 invalid SYN cookies received
   264883 TCP sockets finished time wait in fast timer
   3 time wait sockets recycled by time stamp
   20 delayed acks sent
   Quick ack mode was activated 1 times
   264978 packets directly queued to recvmsg prequeue.
   473 bytes directly in process context from backlog
   265473 bytes directly received in process context from prequeue
   69 packet headers predicted
   1573 packets header predicted and directly queued to user
   1055284 acknowledgments not containing data payload received
   193 predicted acknowledgments
   6 TCP data loss events
   1 timeouts in loss state
   5 retransmits in slow start
   2 other TCP timeouts
   2 DSACKs sent for old packets
   11 connections reset due to early user close
   TCPSackMerged: 7
   TCPSackShiftFallback: 13
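
Counters such as the two singled out above are easier to track when pulled out programmatically, for example to compare a snapshot taken before a test run with one taken after. This is a rough sketch under the assumption that the output looks like the listing above: unindented section headers ending in a colon, followed by indented '<count> <description>' lines. Lines in any other shape (e.g. 'TCPSackMerged: 7' or the ICMP histograms) are deliberately skipped, and the name `parse_netstat_s` is made up for illustration.

```python
import re
from typing import Dict

COUNT_LINE = re.compile(r"^(\d+)\s+(.+)$")   # e.g. "6 TCP data loss events"


def parse_netstat_s(text: str) -> Dict[str, Dict[str, int]]:
    """Map section name -> {description: count} for '<count> <description>' lines."""
    sections: Dict[str, Dict[str, int]] = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        # Section headers ("Tcp:", "TcpExt:") start in column 0 and end with ':'
        if not raw[0].isspace() and raw.rstrip().endswith(":"):
            current = sections.setdefault(raw.strip()[:-1], {})
            continue
        if current is None:
            continue
        match = COUNT_LINE.match(raw.strip())
        if match:
            # Drop a trailing '.' so "0 bad segments received." gets a clean key
            current[match.group(2).rstrip(".")] = int(match.group(1))
    return sections
```

Subtracting the values of two such snapshots shows which counters actually move during a failing TCP_CRR run.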

I've already upgraded the 'ixgbe' driver to the latest 3.9-NAPI, but
the problem still persists, and I cannot even find its source.

Best regards,
Anatoly Rybalchenko

