Bug#66919: weird unkillable apt-get hang
Package: apt
Version: 0.3.10slink
Here is the output of apt-get -v:
apt 0.3.11 for i386 compiled on Aug 8 1999 10:12:36
dye@TRANSAM: 20% uname -a
Linux transam 2.0.36 #5 Thu Jan 20 04:47:43 CST 2000 i586 unknown
Sorry for the poor data on this bug; I thought it was reproducible (it was,
over several different attempts), but a second FSCK seemed to clean up the
problem.
From a remote system via telnet, I was running dselect to install a single
package, zip. When I performed the <install> operation, the telnet session
hung. Further attempts to telnet to the machine failed, as did all
ftp, www, and smtp attempts. Ping, however, succeeded.
When I got home, I was greeted with a blank screen; virtual terminals did
not work.
A power cycle triggered a boot-time FSCK that required a manual FSCK, which
revealed filesystem damage, including /var/cache/apt inodes and, if I
remember correctly, /var/lib/dpkg/cache stuff.
I tried to continue the dselect process, but it hung on the first thing I
tried, which was install, I believe. The machine did not seize up as
before, but top revealed that apt-get -update was taking 98% of the
CPU time.
I let this run for several minutes, then tried "strace" on the apt-get
pid to see what it was hung on. There was no output from strace, so no
system calls were being made; it appeared to be in a tight infinite loop.
I tried killing the apt-get pid; all attempts, from -3 and -15 to even -9,
were unsuccessful and yielded no strace output either. I tried
"gdb /usr/bin/apt-get pid#", which started but then hung as well.
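
One way to tell a user-space spin from a process stuck inside the kernel
is to look at its state letter. What follows is only a minimal sketch,
assuming a Linux /proc filesystem is mounted; the helper names
parse_proc_state and read_proc_state are illustrative and not part of apt
or any standard tool.

```python
def parse_proc_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    close = stat_line.rindex(")")
    fields = stat_line[close + 1:].split()
    # The first field after the command name is the state letter,
    # for example R (running), S (sleeping), D (uninterruptible sleep).
    return fields[0]


def read_proc_state(pid):
    """Read the state letter of a live process (assumes /proc is mounted)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_state(f.read())
```

A state of D (uninterruptible sleep) means signals, SIGKILL included, are
not acted on until the process leaves that state. A state of R together
with no strace output suggests a loop that makes no system calls.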
A reboot of the system again forced a manual FSCK, which brought up
more cache directory errors. I tried another dselect (had to remove a lock
file), which ran apt-get, which again hung. As I thought it was
reproducible and it was getting kind of late, I went to bed and left the
problem alone for a couple of days. I searched your bug list and found
nothing of this sort, and started to file it and get more info by
reproducing the problem, but a power failure or stupid-brother-in-law
reboot seems to have cleared up the problem.
Could you get the package maintainer to look at this? My guess is that the
cache-scanning code runs under an interrupt lock (though I don't know why
kill -9 didn't work) and is getting hung by a corrupted cache.
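
To illustrate how a corrupted cache could hang a scanner, here is a hedged
sketch of a traversal that follows record links but refuses to loop
forever. The layout (a mapping from record offset to next offset, with 0
as the end marker) and the name walk_records are assumptions made for this
example only; this is not apt's actual cache format or code.

```python
def walk_records(next_offset, start):
    """Follow a chain of record offsets, raising on a cycle or bad link.

    next_offset is a hypothetical mapping {offset: next offset}, with 0
    meaning "end of chain" in this sketch.
    """
    seen = set()
    order = []
    offset = start
    while offset != 0:
        if offset in seen:
            # A damaged link pointing backwards would otherwise spin forever.
            raise ValueError(f"cycle detected at offset {offset}")
        if offset not in next_offset:
            raise ValueError(f"dangling offset {offset}")
        seen.add(offset)
        order.append(offset)
        offset = next_offset[offset]
    return order
```

With an intact chain the walk simply returns the offsets in order; with a
link that points back at an earlier record it fails fast instead of
consuming CPU indefinitely.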
I consider it a huge bug that the process did not respond to interrupts;
I realize that you don't want to leave a FUBAR'ed cache lying around, but
I think it would be better on more severe interrupts (QUIT or better) to
mark the cache invalid and exit...
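
The behaviour suggested above could look roughly like the following
sketch. It is an assumption-laden illustration, not apt's implementation:
the class name CacheScan, the marker file, and the record loop are all
hypothetical. The handler only sets a flag, and the scan loop checks it
between records, writes an "invalid" marker, and stops.

```python
import signal


class CacheScan:
    """Hypothetical scanner that marks its cache invalid when interrupted."""

    def __init__(self, marker_path):
        self.marker_path = marker_path
        self.interrupted = False

    def handle_signal(self, signum, frame):
        # Keep the handler tiny: just record that a signal arrived.
        self.interrupted = True

    def install(self):
        # React to the "more severe" signals: QUIT (-3) and TERM (-15).
        # SIGKILL (-9) cannot be caught by any process.
        for sig in (signal.SIGQUIT, signal.SIGTERM):
            signal.signal(sig, self.handle_signal)

    def run(self, records):
        """Process records; return (count processed, finished cleanly?)."""
        done = 0
        for _record in records:
            if self.interrupted:
                # Leave a marker so the next run rebuilds the cache.
                with open(self.marker_path, "w") as f:
                    f.write("invalid\n")
                return done, False
            done += 1
        return done, True
```

This only helps when the loop actually reaches the flag check between
records; it does nothing for a process that is stuck inside the kernel.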
--Ken "dye@stg.com" Dye