
Strange boot-time hang (long, sorry)




  Greetings all --

  I am having difficulty diagnosing a strange boot-time hang on
one of my systems.  I have googled around for clues, but I'm 
coming up empty.

  The system is a file server: a dual-core Opteron machine with
4G RAM, root FS on a software RAID1 array, and the served files
on a 3ware RAID5 array with a capacity of about 2.6 TB.  The
important services are ssh, nfsd, and nfs, but it's also a 
CUPS server.

  Several days ago, this machine crashed, probably due to a disk
failure in the RAID1 root-fs array, and subsequent to the disk 
replacement, was not able to reboot fully into run-level 2, instead
always hanging somewhere in the /etc/rc2.d/S20* part of the start-up
sequence.

  The hang-up seemed, from available error messages, to be related 
to the serial console on ttyS0, so I removed that from the boot line
and from /etc/inittab, but even so, found I was only able to get the
machine into single-user mode.  It still will not boot fully into 
multi-user mode, so it's not clear if the console issue is an important
factor.

  The strange thing is, once I boot it into single-user mode, I can 
manually start all the level-2 services, and they all come up just fine,
no hangs and only one error, the "smartmontools" init says "failed",
which I think I understand and I think is unrelated.

  Obviously, this is a very annoying state for the machine to be in,
and I'd like to get it back to a state where it can boot autonomously.
The last successful autonomous boot was one week prior to the crash
and onset of this problem.

  I have added "set -x" to /etc/init.d/rc, and to the two start-up
scripts /etc/init.d/nullmailer and /etc/init.d/openbsd-inetd, since they
seem to be near the locus of the problem.  What I am seeing is that,
when attempting a full boot, the system does not always hang in the
same place.  In the one case for which I have really reliable info,
it seemed to hang inside /etc/init.d/rc, having just completed nullmailer
and not yet started openbsd-inetd -- both have "set -x", but there
were no trace lines from within the openbsd-inetd file.
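
  For reference, the same kind of tracing can be captured to a file
rather than only to the console, so that the last line before a hang
survives a reset.  This is only a minimal sketch: the helper name
trace_start and the log path /tmp/rc-trace.log are hypothetical, not
part of the setup described above, and it assumes the log location is
writable at that point in the boot.

```shell
#!/bin/sh
# Hypothetical helper: run one init script with shell tracing enabled,
# appending the trace to a log file.
# TRACE_LOG is an assumed path, not something from the original setup.
TRACE_LOG="${TRACE_LOG:-/tmp/rc-trace.log}"

trace_start() {
    script="$1"
    echo "=== begin $script ===" >> "$TRACE_LOG"
    # "sh -x" prints each command before running it, like "set -x"
    # inside the script, but without editing the script itself.
    sh -x "$script" start >> "$TRACE_LOG" 2>&1
    status=$?
    echo "=== end $script (exit $status) ===" >> "$TRACE_LOG"
    return $status
}
```

  Calling, say, "trace_start /etc/init.d/openbsd-inetd" would leave a
"begin" marker with no matching "end" marker in the log if the hang
occurs inside that script.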

  As part of the investigative process, I have commented out all
of the virtual and serial consoles from /etc/inittab, but it's
still got the ctrl-alt-del directives and various power-fail
and power-restore directives in it.

  I should probably add that, after the crash recovery, the root
file system fsck'd with no errors.  The system is also monitored
by an intrusion-detection system, which maintains a database of file
signatures encoding ctime, name, inode number, size, and so forth, for
critical files in the root filesystem (all of /usr/lib, /usr/bin, /bin,
/etc, and so forth).  This IDS did not show any signature differences
after the recovery as compared with before, with the exception of the
grub boot-loader files -- grub had to be installed on the MBR of the
replacement disk in the software RAID1.

  So, based on the fsck and IDS, I believe I can rule out root file
system corruption.

  It's possible there was some kind of intentional configuration change in 
the week prior to the crash that broke something -- such a change would
show up in the IDS, but would have been passed as intentional.
  The machine did have a permissions audit during that week, but this did
not involve the root filesystem, and I am having serious difficulty
imagining a change that would break the run-level 2 init, but not break
the single-user init followed by the manual start of the /etc/rc2.d 
services.

  It's also possible that there's some kind of hardware problem --
the console issue sort of points to this, but again, it has to break
the level-2 init but not the single-user-followed-by-manual-rc2 process.
And, once it's up in its sort of "enhanced single user" mode, it seems
fine.

  So, sorry for the long-windedness, but with all that, the actual question
is this:  What is the difference between these two scenarios, the straight
boot into run-level 2, which hangs, versus the single-user boot and
manual start-up of level 2 services?  Since /etc/inittab isn't doing 
consoles anymore, what else is there?  How can the first one fail and the 
second one succeed?
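
  One way to compare the two scenarios empirically would be to record
the context the scripts run in under each and diff the results.  The
following is only a sketch under assumptions: the function name
snapshot_context and any output paths are hypothetical, and it captures
just the environment and terminal state, which may or may not be where
the difference lies.

```shell
#!/bin/sh
# Hypothetical helper: record the context a script runs in, so that
# two runs (automatic boot vs. manual start) can be diffed afterwards.
snapshot_context() {
    out="$1"
    {
        echo "--- environment ---"
        env | sort
        echo "--- controlling terminal ---"
        # tty reports "not a tty" and exits non-zero without a terminal
        tty 2>&1 || true
        echo "--- stdin ---"
        if [ -t 0 ]; then
            echo "stdin is a terminal"
        else
            echo "stdin is not a terminal"
        fi
    } > "$out"
}
```

  The function could be called once from near the top of
/etc/init.d/rc during a normal boot, and once from the single-user
shell before starting services by hand, writing to two different
(hypothetical) files that are then compared with diff.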

  Any clues greatly appreciated.

					-- A.
-- 
Andrew Reid / reidac@bellatlantic.net


