[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

#502444 (ssh SIGTERM/SIGHUP) workaround investigations



Colin and I discussed #502444 on IRC.

This is in the context of a failure in the Xen Project test system.
The failure is that during bootup the machine goes rapidly through
phases where 1. nc 22 fails; 2. nc 22 succeeds; 3. ssh fails getting
ECONNREFUSED; 4. ssh works.  Our test system occasionally notices 3
and declares a failure.

I observed log entries showing ssh being restarted unexpectedly during
boot.  I think this is because the IPv4 and IPv6 addresses are racing
to be configured.

14:57 <Diziet> cjwatson: So FYI my test case is: attempt to start the
               guest.  Wait for it to have an IPv4 address (there is
               no DHCPv6 or v6 RAs) and responding to ping.  Wait for
               it to be listening on its port 22 (as determined by
               nc).  Attempt to ssh to it and run a simple command.
               Test passes if all that works.

That was with /etc/network/if-up.d/openssh-server changed to use
`reload' rather than `restart'.

20:18 <Diziet> cjwatson: FYI that reboot with ssh reload test failed
               after 98 iterations, not because anything went wrong
               with the reload, but because reload apparently also
               closes and reopens the sockets.

20:18 <Diziet> When I found the failure I was able to ssh into the
               guest and look at its log which looks entirely as
               expected.

20:18 <Diziet> The only problem seems to have been the same transient
               ECONNREFUSED as before.

20:19 <Diziet> cjwatson: For now I have given the machine back to the
               test pool.  If 98 iterations is not enough to convince
               you that #502444 is gone in wheezy, I can run more...

Below is an extract of relevant log messages.  The first SIGTERM is
from the shutdown.  At 15:48 sshd comes up during boot but it is soon
sent a HUP.  "Did not receive identification string" is due to nc; I
think that connection came in before the HUP - I saw similar late
messages about "Did not receive identification string" when we were
restarting, where the pid in the message demonstrated that it was
forked from the pre-restart sshd.

Sep 25 15:47:43 debianhvm sshd[2464]: Received signal 15; terminating.
Sep 25 15:48:37 debianhvm sshd[2363]: Server listening on 0.0.0.0 port 22.
Sep 25 15:48:37 debianhvm sshd[2363]: Server listening on :: port 22.
Sep 25 15:48:37 debianhvm sshd[2363]: Received SIGHUP; restarting.
Sep 25 15:48:37 debianhvm sshd[2498]: Server listening on 0.0.0.0 port 22.
Sep 25 15:48:37 debianhvm sshd[2498]: Server listening on :: port 22.
Sep 25 15:48:37 debianhvm sshd[2487]: Did not receive identification string from 172.16.144.3
Sep 25 19:15:48 debianhvm sshd[2897]: Accepted publickey for root from 172.16.144.1 port 56815 ssh2
Sep 25 19:15:48 debianhvm sshd[2897]: pam_unix(sshd:session): session opened for user root by (uid=0)

Log from the test controller:

2015-09-25 15:48:37 Z guest debianhvm.guest.osstest 5a:36:0e:7d:00:01 22 link/ip/tcp: ok. (31s)
2015-09-25 15:48:37 Z executing ssh ... root@172.16.146.211 echo guest debianhvm.guest.osstest: ok
ssh: connect to host 172.16.146.211 port 22: Connection refused
2015-09-25 15:48:37 Z command nonzero waitstatus 65280: timeout 60 ssh -o StrictHostKeyChecking=no -o BatchMode=yes -o ConnectTimeout=100 -o ServerAliveInterval=100 -o PasswordAuthentication=no -o ChallengeResponseAuthentication=no -o UserKnownHostsFile=tmp/t.known_hosts_62333.test-amd64-amd64-xl-qemuu-debianhvm-amd64 root@172.16.146.211 echo guest debianhvm.guest.osstest: ok
status 65280 at Osstest/TestSupport.pm line 402.
./ts-guest-start: status 65280

The message `22 link/ip/tcp: ok.' means that nc to port 22 connected
successfully.  The 31s is the total time it waited while polling first
with ping and then with nc.

(All times UTC.)


If this is convincing, then maybe the workaround in #502444 can be
reverted.

However, it appears that sshd closes and reopens all its sockets on
SIGHUP, even if there is no change.  I think this is probably
undesirable.  But I think I am going to have to work around it,
anyway.

Ian.


Reply to: