
Re: [SOLVED] Could do with some help - Wheezy, Kernel updated, now cannot boot



On 14/03/2015 22:45, Bob Proulx wrote:

Normally the idea in rescue mode is that you are presented with a
shell with the root in your main system.  At that point you can mount
the rest of your system.  It would normally be like this:

   # mount -a

However seeing those large numbers 121 and 122 in your device paths
/dev/md121 above I worry that you had /dev/md0 and /dev/md1 types of
names and those have been swapped for /dev/md121 and /dev/md122 and
therefore the paths in your fstab won't work at the moment.  Or they
might be using other paths, labels, uuids, and so forth.

But if the device numbers have changed
then they could only be mounted manually because they won't match what
is in the /etc/fstab.


[Bob, I snipped other sections of your helpful and wide-ranging reply, including some relevant remarks on /dev/md[x] numbering which I'll explain how I got round.]

In short, I have now restored the system, including testing that it can boot from either disk.

At first, in the rescue shell, I used the long mount commands, giving each /dev/md[x] and its mount point, because I thought the /dev/md[x] numbers wouldn't match fstab. Once I had the filesystems mounted and started checking for damage (there seems to have been none, fortunately), I saw in fstab that, apart from the / partition, all the other md partitions were mounted by UUID. So, since the rescue system had already mounted /dev/md122 on /,

# mount -a

worked for all others. (Except for the nfs mount, but I never looked further into why the rescue system has difficulty with mounting an nfs share. Maybe another topic, under less stress.)
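
A quick way to see which fstab entries depend on md device numbers, rather than UUIDs, is sketched below. The function name fstab_md_paths is made up for this illustration; it prints the device and mount point of every entry whose first field is a /dev/md path, and reads /etc/fstab unless given another file.

```shell
# Hypothetical helper: list fstab entries that name a raw /dev/md* device.
# These are the entries that stop matching when the arrays are renumbered.
fstab_md_paths() {
    # $1: fstab file to inspect (defaults to /etc/fstab)
    # Comment lines never match, because their first field starts with '#'.
    awk '$1 ~ /^\/dev\/md/ { print $1, $2 }' "${1:-/etc/fstab}"
}
```

On a system like the one described here, where only / is mounted by device name, this would be expected to print a single line for the root entry.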

I then made a mistake. I had seen that the installation of the kernel package security update had changed some grub files in /boot. I looked at those - lots of useful initrd and vmlinuz files, and a complicated-looking grub.cfg. The rescue shell gives you 'man' commands and 'info' commands, so I read the grub documentation installed on my system. So I re-inserted grub on the boot partition.

# grub-install /dev/md121
Grub cannot be installed on a partition-less disk
[and a couple of other related warnings]

That frightened me. There obviously were partitions, because mdadm had found them, and the rescue shell was happy with 7 of them mounted. Using gdisk I had another fright when it reported that sda was a gpt disk, with a protective MBR (what?) and no other partitions. That couldn't be right because the installer had found the partitions, and so had mdadm.

I then wondered if perhaps grub wasn't involved and I shouldn't be looking at things from a grub and gdisk/gpt viewpoint. Though I thought I had seen the kernel update actually alter grub, maybe I had only seen the initrd and vmlinuz files get updated (and 'assumed' grub was there). Plus, the basic symptom I get on boot is that the loader says

LILO 23 LILO loadiEBDA: too large kernel
or something like that.

Booting was saying, it's LILO, not grub.  Maybe it's right, I thought.

# man lilo
lilo not found
# lilo
lilo not found

Well, it's not right, because lilo isn't there, the machine says. But some version of lilo is on the boot sectors. I read the lilo manpage on the web and saw that it has a fairly simple config in /etc/lilo.conf. Checking, I saw there was such a file on the system, dating from 2010, referring to /dev/md0 (/boot, as it was before the rescue shell renumbered them) and /dev/md1 (the root fs, before the rescue shell renumbered them). So I needed to alter the lilo conf file and then execute lilo. Lilo wasn't on the machine, so

# apt-get install lilo

and changed /etc/lilo.conf to say (I'll list these to help others in the future):

(a) the new initrd,
(b) the new vmlinuz,
(c) set boot to using /dev/md121 as the boot device, and
(d) set root to /dev/md122
saved the changed file, and

# lilo

No errors so, exiting the rescue shell, I rebooted.

LILO started, didn't complain about the large kernel, and moved on to assemble the /dev/md[x] (as md0, md1, md2, etc, perfect!) before dying with

Aborted waiting for root fs
Dev [something]:122 not found

Well, that was progress - the machine could boot, there was nothing wrong with the partitions, lilo was no longer corrupted, the md[x] all assembled. But, /dev/md122 was no longer called that. In 'real life' as opposed to the 'rescue shell life', the root filesystem is on /dev/md1. The line

root=/dev/md122

in /etc/lilo.conf meant that the real root fs on /dev/md1 was never found at boot. There seems to be a very simple solution: go back to the rescue shell and change lilo.conf to say

root=/dev/md1

but it doesn't work. When you execute lilo after doing this change, it objects:

# lilo
Invalid root filesystem: /dev/md1

The reason is that /dev/md1 doesn't exist in the rescue shell, and lilo carefully protects you against specifying a root filesystem that it doesn't think exists. It's right to do that, so I somehow had to have names that were consistent in both the rescue shell and the real installation. I looked long at man mdadm and inferred that the only way to alter the names was to dis-assemble and re-assemble with a --name= parameter. But I didn't have an example of a command line with the right order and the other things that needed to be there, and I really didn't want to risk compromising the md system that was running and was as yet undamaged.
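
To check which md names the environment you are currently in has assigned, without touching the arrays, /proc/mdstat can be read. This is a minimal sketch; the helper name md_names is hypothetical, and it assumes the usual "mdN : active ..." line layout of /proc/mdstat.

```shell
# Hypothetical helper: print the names of the md arrays listed in an
# mdstat-style file (defaults to the live /proc/mdstat). Read-only.
md_names() {
    # Array lines look like: "md122 : active raid1 sda2[0] sdb2[1]"
    awk '$1 ~ /^md[0-9]+$/ && $2 == ":" { print $1 }' "${1:-/proc/mdstat}"
}
```

Run in the rescue shell and again in the normally booted system, this would show the two sets of names side by side (md121 and md122 versus md0 and md1 in the case described here).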

The web page for man lilo.conf

http://linux.die.net/man/5/lilo.conf

mentions that partitions can be named in lilo.conf by UUID. On this machine, fstab uses UUID for all but the root filesystem, so I couldn't get the UUID for /dev/md1 from there. But I did find a UUID string in the /boot/grub.cfg file. Rebooting back into the rescue shell, and editing /etc/lilo.conf to use that UUID string

root="UUID={some hex string mixed with some dashes}"
(the inverted commas are important to ensure the second '=' gets passed to the kernel during the boot sequence)

and exiting the shell, then rebooting

the system fully booted and everything sprang back into life.  Phew.
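
The two lilo.conf changes can also be made with sed instead of an editor. The sketch below is only an illustration: set_lilo_devices is a made-up helper, not part of lilo, and it assumes the global lines are written as boot=... and root=... at the start of a line with no spaces around the '='. It keeps a .bak copy of the file first.

```shell
# Hypothetical helper: rewrite the unindented boot= and root= lines of a
# lilo.conf-style file, leaving every other line alone.
#   $1: config file   $2: new boot device   $3: new root value
set_lilo_devices() {
    conf=$1 boot_dev=$2 root_spec=$3
    # Keep the original so the change can be undone.
    cp -p "$conf" "$conf.bak" || return 1
    # '|' is used as the sed delimiter because device paths contain '/'.
    sed -e "s|^boot=.*|boot=$boot_dev|" \
        -e "s|^root=.*|root=$root_spec|" \
        "$conf.bak" > "$conf"
}

# Example call (the UUID is a placeholder, not a real value):
#   set_lilo_devices /etc/lilo.conf /dev/md121 '"UUID=<your-root-fs-uuid>"'
```

As with the earlier edit, lilo has to be run again afterwards for the changed file to take effect.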

For posterity, here are the key aspects of recovering from this type of problem with a raid1 boot failure.

1. If the boot message says LILO, it *IS* lilo, and it may be necessary to install the lilo package if it is not present, following a distribution upgrade, for example.

2. The rescue shell uses mdadm's 'emergency' names. That will be fine for setting the *boot* device in lilo.conf, but will not be fine for setting the *root* device in lilo.conf.

3. Using the rescue shell, get a root filesystem mounted on the target machine's filesystem because you need to edit your real /etc/lilo.conf

4.  In lilo.conf change
boot=/dev/md0 (or whatever your file says here)
to
boot=/dev/md121 (or whatever the rescue shell has labelled the device that you previously had in your 'boot=' line)

5.  In lilo.conf change
root=/dev/md1 (or whatever your lilo.conf file says here)
to
root="UUID={the UUID of your md device that you normally mount as '/'}", including the inverted commas and the extra '='. This UUID string could take some time to find. Don't panic if you find the wrong string; try to find the right one another way. If fstab uses a UUID to mount '/', then try that one. Maybe some other posters could improve this suggestion if they know the 'trick' to get the correct UUID.

6.  Run lilo again so that the changed lilo.conf takes effect, then reboot.  The system should now boot to the normal start.
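
On the 'trick' for step 5: one possibility, assuming blkid from util-linux is available in the rescue shell, is to ask it for the filesystem UUID of whichever device the rescue shell has mounted on /. The helper name lilo_root_line below is hypothetical; the -s UUID and -o value options are standard blkid options that print just the UUID string. Note that mdadm also reports a UUID for each array, but that one identifies the array itself, not the filesystem on it.

```shell
# Hypothetical helper: print a ready-made lilo.conf root line for the
# filesystem found on the given device (e.g. /dev/md122 in the rescue shell).
lilo_root_line() {
    dev=$1
    # blkid prints only the filesystem UUID with these options.
    uuid=$(blkid -s UUID -o value "$dev") || return 1
    # Refuse to print a line if no UUID was found.
    [ -n "$uuid" ] || return 1
    printf 'root="UUID=%s"\n' "$uuid"
}

# Example call, run as root in the rescue shell:
#   lilo_root_line /dev/md122
```

The printed line keeps the inverted commas and the second '=' that step 5 asks for, so it can be pasted into lilo.conf as it stands.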

Thanks to everyone who helped with suggestions.

Apologies, again, for the somewhat random chain of unlinked messages, which must have irritated folk. This was due to this server failing, with the result that without this (mail) server we had no access to the emailed messages from the list - so I could not 'reply' properly - instead I could only see web copies of posts. Hopefully, fixed now.

regards, Ron

