
aptitude blocking SIGWINCH when restarting sshd



A number of times in the past, I've run into problems where remote
systems were doing bad things wrt window resizing.  Basically, they'd
stop responding to resizes.  This can be really annoying.

Today I had a system of mine do that to me, and I think I tracked down
why.  Whenever I've googled for this in the past, I haven't really been
able to get anywhere with it, so this post is mostly an attempt to
record what I found in hopes that it will help future searchers (maybe
even myself a couple of months from now).

I remember that when I'd had this problem before, restarting sshd seemed
to clear it up.  Not so today.  I knew this had something to do with
SIGWINCH not being propagated correctly.  I knew it worked most of the
time; indeed it had been working fine on my machine for months, but just
stopped working today.  I had other (remote) systems that seemed to
never work, and I didn't really know why.

Today my searching did come up with 'ps s', which shows which signals a
process has blocked.  Of course, my searching didn't give any clues on
how to read it, but I figured it out.

Here's what I found:

pescado1:~# ps s -C sshd
  UID   PID   PENDING   BLOCKED   IGNORED    CAUGHT STAT TTY        TIME COMMAND
    0 31900  00000000  08000000  00001000  80014005 Ss   ?          0:14 /usr/sbin/sshd
    0 19689  00000000  08000000  00001000  80012000 Ss   ?        204:13 sshd: root    
    0 26486  00000000  08000000  00001000  80012000 Ss   ?          0:00 sshd: root@pts/1

See how those say 08000000 in their BLOCKED columns?  That actually corresponds
to SIGWINCH being blocked.  How do you know?

Run 'kill -l', and it gives you a list of 64 signals.  I couldn't really
make sense of those 8-character signal masks just yet.  Then I went and
looked at /proc/$$/status.  There they were shown as 16-character strings
(with a bunch of leading zeroes).  Those smelled a lot like 64-bit
numbers.  Indeed, it turns out that they do represent a bit for each of
64 signals.  So 08000000 in the BLOCKED column refers to signal 28
(SIGWINCH) being blocked. [1]
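
For anyone who would rather not count bits by hand, here is a minimal
sketch of the decoding in Python.  The function names are my own
invention for illustration, and the name lookup depends on the platform's
signal numbering (SIGWINCH is 28 on the systems shown here, but that is
not universal).

```python
import signal


def decode_sigmask(hex_mask):
    """Return the signal numbers whose bits are set in a hex mask string."""
    value = int(hex_mask, 16)
    # The rightmost bit (bit 0) stands for signal 1, the next for signal 2, ...
    return [bit + 1 for bit in range(value.bit_length()) if (value >> bit) & 1]


def signal_name(signum):
    """Best-effort name lookup; not every number has a name on every platform."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return "signal %d" % signum


if __name__ == "__main__":
    # The 8-character form from 'ps s' and the 16-character form from
    # /proc/<pid>/status decode the same way.
    for mask in ("08000000", "0000000008000000", "80000001"):
        print(mask, [(n, signal_name(n)) for n in decode_sigmask(mask)])
```

The same function works on any of the PENDING, BLOCKED, IGNORED and
CAUGHT columns, since they all use the same bit layout.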

As you can see, since sshd has that signal blocked, so do all of its
child processes.  That includes my login shell, which is itself a
descendant of sshd.  So even if I run '/etc/init.d/ssh restart' from an
ssh session, the new sshd begins as a child process of my current shell,
which also has signal 28 blocked!  I guessed that restarting it from the
console would work, and it did:

vineet@quesadilla:~$ ps s -C sshd
  UID   PID   PENDING   BLOCKED   IGNORED    CAUGHT STAT TTY        TIME COMMAND
    0  6117  00000000  00000000  00001000  80006001 Ss   ?          0:00 sshd: davidb [priv]
 5008  6119  00000000  00000000  00001000  80012000 S    ?          0:02 sshd: davidb@pts/7
    0 28555  00000000  08000000  00001000  80004001 Ss   ?          0:00 sshd: danh [priv]
 5054 28557  00000000  08000000  00001000  80010000 S    ?          0:00 sshd: danh@pts/1
    0 28609  00000000  00000000  00001000  00014005 Ss   ?          0:00 /usr/sbin/sshd
    0 28617  00000000  00000000  00001000  80004001 Ss   ?          0:00 sshd: vineet [priv]
 5025 28619  00000000  00000000  00001000  80010000 S    ?          0:00 sshd: vineet@pts/5

So you can see that old processes still have their old signal blocks,
but the new sshd (started from the console) doesn't, and nor do new
child processes from there.  Sure enough, window resizes in new sessions
work just fine.
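
The inheritance described above (a blocked signal stays blocked across
fork and exec) can be checked with a small sketch.  This is only an
illustration in Python on a Unix system, not anything sshd or aptitude
actually runs; the helper names are made up.

```python
import signal
import subprocess
import sys

# Code run in the child: report whether SIGWINCH is in its blocked set.
CHILD_CODE = (
    "import signal\n"
    "mask = signal.pthread_sigmask(signal.SIG_BLOCK, [])\n"
    "print(signal.SIGWINCH in mask)\n"
)


def child_sees_sigwinch_blocked():
    """Start a fresh interpreter and ask it whether SIGWINCH is blocked."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == "True"


def demo():
    """Return (child result with SIGWINCH unblocked, child result with it blocked)."""
    # Start from a known state, remembering the mask we were given.
    original = signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGWINCH})
    try:
        before = child_sees_sigwinch_blocked()
        signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGWINCH})
        after = child_sees_sigwinch_blocked()
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, original)
    return before, after


if __name__ == "__main__":
    print(demo())
```

The child never touches its own mask; whatever it reports is what it was
handed by its parent, which is the situation a freshly started sshd is in.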

OK, so why did that happen just now?  Well, I had just updated ssh from
within aptitude.  I decided to watch aptitude more closely.  Here I
fired it up, and I see this:

vineet@quesadilla:~$ ps s -C aptitude
  UID   PID   PENDING   BLOCKED   IGNORED    CAUGHT STAT TTY        TIME COMMAND
    0 19809  00000000  00000000  00001000  88084426 S+   pts/4      0:01 aptitude


Looks fine so far... then I hit 'g'.  This time there aren't any packages to
install (it just shows me some packages being held back.)  ps again ... still
looks the same.  Give aptitude another 'g', and it pops up a dialog:
"Downloaded 0B in 0s (0B/s)".  ps again:

vineet@quesadilla:~$ ps s -C aptitude
  UID   PID   PENDING   BLOCKED   IGNORED    CAUGHT STAT TTY        TIME COMMAND
    0 19809  00000000  08000000  00001000  88084426 S+   pts/4      0:01 aptitude

Hey!

I tell aptitude to go ahead and continue with package installation and I
keep running ps in my other window.  I see that sig 28 stays blocked
until it's done.  So any time it upgrades or installs a daemon and
invokes its rc script to start it, those start out with SIGWINCH
blocked.  This was why it was always hosed on my remote machines at the
colo; I had updated their sshd from within aptitude a long time ago.
Now I don't really have another way in (short of driving to the colo and
plugging in a console), so all ssh sessions on those systems are
basically hosed (wrt SIGWINCH processing) until they get rebooted.
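
To see which other daemons picked up the blocked signal this way, one
could scan /proc instead of running ps by hand.  The following is a rough
sketch that assumes a Linux-style /proc where each /proc/<pid>/status has
a "SigBlk:" line (the 16-character mask mentioned earlier); the function
names are hypothetical.

```python
import os
import signal

# Bit for SIGWINCH in the mask: signal N lives in bit N-1.
SIGWINCH_BIT = 1 << (signal.SIGWINCH - 1)


def sigblk_from_status(status_text):
    """Extract the SigBlk mask as an int from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("SigBlk:"):
            return int(line.split(":", 1)[1].strip(), 16)
    return None  # no SigBlk line found


def pids_blocking_sigwinch(proc_root="/proc"):
    """Return the PIDs under proc_root whose SigBlk mask includes SIGWINCH."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                mask = sigblk_from_status(fh.read())
        except OSError:
            continue  # process exited, or we may not read it
        if mask is not None and mask & SIGWINCH_BIT:
            found.append(int(entry))
    return sorted(found)
```

Calling pids_blocking_sigwinch() after an aptitude run should list the
daemons that were started with the signal blocked.  Note that some
programs block SIGWINCH on purpose, so a hit is not automatically a
problem.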

So my remaining questions are:

(1) Should this be considered an aptitude bug?

(2) Is there an easy way for me to unblock that signal in a shell, so
    that I can then restart ssh from within that shell to have it start
    with a clean slate?  Actually, if I could figure out an easy way to
    do it within a running shell, I'd probably go ahead and put that in
    the init script, so that future aptitude updates wouldn't be able to
    re-hose it.

(3) If there's no easy way to do it in a shell, how about from a system
    call within a process?

(4) If so, is it a bug that sshd doesn't explicitly unblock SIGWINCH
    when starting up (or at least provide an option to do so)?
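
On the "system call within a process" question, one possible approach
is a tiny wrapper that unblocks SIGWINCH and then execs the real command,
since the signal mask is carried across exec.  This is an untested-in-
production sketch in Python, not a fix that sshd or aptitude provides;
the function names are my own.

```python
import os
import signal


def unblock_sigwinch():
    """Remove SIGWINCH from the blocked set; return the previous mask."""
    return signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGWINCH})


def exec_with_sigwinch_unblocked(argv):
    """Unblock SIGWINCH, then replace this process with the given command.

    On success this never returns, because os.execvp replaces the
    running program; the new program inherits the cleaned-up mask.
    """
    unblock_sigwinch()
    os.execvp(argv[0], argv)
```

For example, calling
exec_with_sigwinch_unblocked(["/etc/init.d/ssh", "restart"]) from a root
session would run the init script with SIGWINCH unblocked, so the new
sshd should start with a clean mask even if the calling shell had the
signal blocked.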

All followup comments and questions are welcome.

good times,
Vineet

[1]  Here's some more detail on how to read those numbers:  Each
character is a "hexit" that represents 4 bits.  So for example, 00000001
corresponds to just bit #1, and hence signal #1.  80000000 refers to bit
32, and hence signal 32.  80000001 is the addition of those two, and
indicates that signals 32 and 1 are both blocked.  08000000 refers to
signal 28 since only the 28th bit from the right is a 1, and all the
rest are zeroes.

-- 
http://www.doorstop.net/
-- 
#include<stdio.h>
int main() {
    puts("Reader! Think not that \n"
         "technical information \n"
         "ought not be called speech;");
    return 0;
}
