Strange boot-time hang (long, sorry)
Greetings all --
I am having difficulty diagnosing a strange boot-time hang on
one of my systems. I have googled around for clues, but I'm
coming up empty.
The system is a file server: a dual-core Opteron machine,
4G RAM, root FS on a software RAID1 array and the served files
on a 3ware RAID5 system with a capacity of about 2.6 TB. The
important services are ssh, nfsd, and nfs, but it's also a
Several days ago, this machine crashed, probably due to a disk
failure in the RAID1 root-fs array, and subsequent to the disk
replacement, was not able to reboot fully into run-level 2, instead
always hanging somewhere in the /etc/rc2.d/S20* part of the start-up
The hang-up seemed, from available error messages, to be related
to the serial console on ttyS0, so I removed that from the boot line
and from /etc/inittab, but even so, found I was only able to get the
machine into single-user mode. It still will not boot fully into
multi-user mode, so it's not clear if the console issue is an important
The strange thing is, once I boot it into single-user mode, I can
manually start all the level-2 services, and they all come up just fine,
with no hangs and only one error: the "smartmontools" init script reports "failed",
which I think I understand and I think is unrelated.
Obviously, this is a very annoying state for the machine to be in,
and I'd like to get it back to a state where it can boot autonomously.
The last successful autonomous boot was one week prior to the crash
and onset of this problem.
I have added "set -x" to /etc/init.d/rc, and to the two start-up
scripts /etc/init.d/nullmailer and /etc/init.d/openbsd-inetd, since they
seem to be near the locus of the problem, and what I am seeing is that,
when attempting a full boot, the system does not always hang in the
same place, and in the one case for which I have really reliable info,
it seemed to hang inside /etc/init.d/rc, having just completed nullmailer
and not yet starting openbsd-inetd -- both have "set -x", but there
were no lines from within the openbsd-inetd file.
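
For anyone who wants to reproduce this, the breadcrumb logging I could add alongside "set -x" looks roughly like the sketch below. It is only a sketch, not what /etc/init.d/rc actually does: the function names are made up for illustration, and it assumes the log file lives on a filesystem that is already mounted read-write at that point in the boot.

```shell
#!/bin/sh
# Hypothetical helpers, not part of sysv-rc.

# Print the S* entries of a run-level directory in glob (lexical) order,
# which is the order I assume rc starts them in.
list_start_scripts() {
    dir=$1
    for script in "$dir"/S*; do
        [ -e "$script" ] || continue   # skip the literal pattern if nothing matches
        printf '%s\n' "$script"
    done
}

# Run each start script, writing a line to the log (and flushing it to disk)
# before and after, so the last line on disk shows where a hang occurred.
run_with_breadcrumbs() {
    dir=$1
    log=$2
    list_start_scripts "$dir" | while IFS= read -r script; do
        printf 'starting %s\n' "$script" >> "$log"
        sync
        rc=0
        "$script" start </dev/null || rc=$?
        printf 'finished %s (exit %s)\n' "$script" "$rc" >> "$log"
        sync
    done
}
```

Calling run_with_breadcrumbs with /etc/rc2.d and a log path of your choosing really does start the services, so it only makes sense from single-user mode or in place of the normal rc loop. If the last line in the log is a "starting" line with no matching "finished" line, that script is where the boot stopped.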
As part of the investigative process, I have commented out all
of the virtual and serial consoles from /etc/inittab, but it's
still got the ctrl-alt-del directives and various power-fail
and power-restore directives in it.
I should probably add that, after the crash recovery, the root
file system fsck'd with no errors, and that the system also is monitored
by an intrusion-detection system, which monitors a database of file
signatures encoding ctime, name, inode number, size, and so forth, for
critical files in the root filesystem (all of /usr/lib, /usr/bin, /bin,
/etc, and so forth), and this IDS did not show any signature differences
after the recovery as compared with before, with the exception of the
grub boot-loader files -- grub had to be installed on the MBR of the
replacement disk in the software RAID1.
So, based on the fsck and IDS, I believe I can rule out root file
It's possible there was some kind of intentional configuration change in
the week prior to the crash that broke something -- such a change would
show up in the IDS, but would have been passed as intentional.
The machine did have a permissions audit during this week, but this did
not involve the root filesystem, and I am having serious difficulty
imagining a change that would break the run-level 2 init, but not break
the single-user init followed by the manual start of the /etc/rc2.d
It's also possible that there's some kind of hardware problem --
the console issue sort of points to this, but again, it has to break
the level-2 init but not the single-user-followed-by-manual-rc2 process.
And, once it's up in its sort of "enhanced single user" mode, it seems
So, sorry for the long-windedness, but with all that, the actual question is
this: What is the difference between these two scenarios, the straight
boot into run-level 2, which hangs, versus the single-user boot and
manual start-up of level 2 services? Since /etc/inittab isn't doing
consoles anymore, what else is there? How can the first one fail and the
second one succeed?
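
One way to compare the two scenarios directly would be to record the context the scripts run in, once during a normal boot and once from the single-user shell, and then diff the two files. As far as I know, the obvious candidates are the environment, whether stdin is a terminal, and the RUNLEVEL and PREVLEVEL variables that init passes down. A minimal sketch, with a made-up function name, and assuming the output file is on a writable filesystem:

```shell
#!/bin/sh
# Hypothetical helper: write a snapshot of the calling context to a file.
snapshot_context() {
    out=$1
    {
        echo "== environment =="
        env | sort
        echo "== stdin =="
        if [ -t 0 ]; then
            echo "stdin is a terminal: $(tty)"
        else
            echo "stdin is not a terminal"
        fi
        echo "== runlevel variables =="
        echo "RUNLEVEL=${RUNLEVEL:-unset} PREVLEVEL=${PREVLEVEL:-unset}"
    } > "$out" 2>&1
}
```

Called near the top of one of the S20 scripts during a real boot, and again by hand from single-user mode (with a different output file each time), this would at least show whether the two cases differ in anything beyond the consoles.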
Any clues greatly appreciated.
Andrew Reid / email@example.com