Re: Idle TCP connections freeze
Pascal Hambourg <pascal@plouf.fr.eu.org> writes:
> Hello,
>
> Nikolaus Rath wrote:
>>
>> I'm having trouble with an internet connection that seems to randomly
>> "freeze" arbitrary tcp connections when they have not been used for a
>> while. The connections stay established, but no data is coming through.
>
> How long is "a while", at a minimum?
I wrote a small test program. The threshold seems to be exactly 302
seconds: a connection that has been idle for 301 seconds still works.
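
A probe of this kind can be sketched roughly as follows. This is an
illustration rather than the exact program: it assumes a hypothetical echo
service on the server that returns whatever it receives, and the names
`survives_idle` and `find_timeout` are made up for the example.

```python
import socket
import time


def _echo_ok(sock, payload=b"ping\n"):
    """Send payload and check that the same bytes come back."""
    sock.sendall(payload)
    received = b""
    while len(received) < len(payload):
        chunk = sock.recv(len(payload) - len(received))
        if not chunk:  # peer closed the connection
            return False
        received += chunk
    return received == payload


def survives_idle(host, port, idle_seconds, reply_timeout=30.0):
    """Return True if a connection still carries data after idle_seconds.

    Assumes an echo service at host:port (hypothetical). Errors while
    connecting or during the first exchange are not caught, since they
    mean the path was broken before the idle period started.
    """
    with socket.create_connection((host, port), timeout=reply_timeout) as sock:
        if not _echo_ok(sock):
            return False
        time.sleep(idle_seconds)  # no traffic at all during this period
        try:
            return _echo_ok(sock)
        except OSError:
            # Includes the timeout raised when no reply arrives in time.
            return False


def find_timeout(probe, low, high):
    """Binary search for the smallest idle time (in seconds) that freezes.

    probe(seconds) must return True if the connection survives.
    Assumes probe(low) is True and probe(high) is False.
    """
    while high - low > 1:
        mid = (low + high) // 2
        if probe(mid):
            low = mid
        else:
            high = mid
    return high
```

It could be called as
`find_timeout(lambda s: survives_idle("server.example", 7, s), 60, 600)`,
where the host name and port are placeholders. Each probe really waits for
the full idle period, so a complete search takes a while.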
>> When this happens, netstat still shows the connection status as
>> `ESTABLISHED` on both the local computer:
>>
>> Proto Recv-Q Send-Q Local Address         Foreign Address       State       PID/Program name  Timer
>> tcp        0     53 192.168.0.10:41129    173.255.235.238:143   ESTABLISHED 8219/gnutls-cli   on (79.31/13/0)
>>
>> ..and the remote server:
>>
>> Proto Recv-Q Send-Q Local Address         Foreign Address       State       PID/Program name  Timer
>> tcp        0      0 173.255.235.238:143   68.5.174.98:41129     ESTABLISHED 5303/imapd        off (0.00/0/0)
>
> It appears that the client has a private address and the server has a
> public address. So I guess that there is a NAT device between them, and
> its stateful NAT engine may be the cause of the problem, by deleting
> connections from its translation table after a delay of inactivity.
>
>> When I look at a packet capture of this connection on the client side,
>> there is a long (expected) period of inactivity that seems to trigger
>> the problem, then the local end tries to transmit some data again but
>> never receives an ACK. Instead, 15 TCP Retransmissions go out, with
>> intervals increasing from 0.3 seconds to 120 seconds. No activity is
>> captured after that.
>
> Can you do a packet capture on the server side as well?
Yes, just tried it. The server does not receive anything at all when the
client starts retransmitting. I guess that is consistent with the NAT
explanation?
>> Does anyone have a suggestion of how I could debug this further to find
>> out where the problem lies and how to fix it?
>>
>> Also, is there some way to globally reduce the timeout on client
>> and/or server to reduce the time before the local application aborts?
>
> The Linux kernel supports TCP keepalive, with system-wide settings.
> However, the application must enable it on a per-socket basis, and the
> minimum recommended idle time of 2 hours (which is the default in Linux)
> is quite high; the inactivity timeout in your NAT device may be shorter.
> The best workaround for this is to generate traffic with some kind of
> application-level keepalive, either defined in the application protocol,
> as in SSH, or by periodically sending dummy commands or data.
Yes, I guess your NAT theory makes sense. If I use ssh with
"ServerAliveInterval", or force libkeepalive use with LD_PRELOAD, the
connections survive beyond 302 seconds.
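
For programs whose source can be changed, the per-socket equivalent might
look like the sketch below. The function name and the timer values
(240/30/4) are assumptions chosen only to stay under the roughly 300 second
timeout observed here; the TCP_KEEP* options are the Linux names and may be
missing on other platforms, in which case only SO_KEEPALIVE is set and the
system defaults apply.

```python
import socket


def enable_keepalive(sock, idle=240, interval=30, count=4):
    """Turn on TCP keepalive for one socket.

    idle:     seconds of inactivity before the first probe is sent
    interval: seconds between probes that go unanswered
    count:    unanswered probes before the kernel drops the connection
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket timers; skip any option this platform does not define.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

With these values an idle connection sends a probe every 240 seconds, which
should keep the NAT entry alive if the timeout really is around 5 minutes.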
However, unfortunately this isn't a good solution, because I have
non-Linux devices in the same network that suffer from the same problem.
Is there a way to figure out at which device the NAT timeout happens? I
have a Cisco DPC3825 cable modem that does NAT. But it has just 4
Ethernet connections and WLAN, so I have a hard time believing that it
would need to force a 5 min timeout. The web administration page also
doesn't mention any timeouts (which may of course mean nothing). Is it
possible that there's a second NAT at work behind the modem?
Thanks,
-Nikolaus
--
»Time flies like an arrow, fruit flies like a Banana.«
PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C