Re: [SOLVED] Could do with some help - Wheezy, Kernel updated, now cannot boot

To: debian-user@lists.debian.org
Subject: Re: [SOLVED] Could do with some help - Wheezy, Kernel updated, now cannot boot
From: Ron Leach <ronleach@tesco.net>
Date: Sun, 15 Mar 2015 10:20:20 +0000
Message-id: <55055CE4.4070401@tesco.net>
In-reply-to: <20150314161403298208905.NoCcsPlease@bob.proulx.com>
References: <5502931A.5060100@tesco.net> <20150314161403298208905.NoCcsPlease@bob.proulx.com>

On 14/03/2015 22:45, Bob Proulx wrote:

Normally the idea in rescue mode is that you are presented with a
shell with the root in your main system.  At that point you can mount
the rest of your system.  It would normally be like this:

   # mount -a

However seeing those large numbers 121 and 121 in your device paths
/dev/md121 above I worry that you had /dev/md0 and /dev/md1 types of
names and those have been swapped for /dev/md121 and /dev/md122 and
therefore the paths in your fstab won't work at the moment.  Or they
might be using other paths, labels, uuids, and so forth.

But if the device numbers have changed
then they could only be mounted manually because they won't match what
is in the /etc/fstab.

[Bob, I snipped other sections of your helpful and wide-ranging reply, including some relevant remarks on /dev/md[x] numbering which I'll explain how I got round.]

In short, I have now restored the system, including testing that it can boot from either disk.

At first, in the rescue shell, I used the long mount commands for each /dev/md[x] and its mount point, because I thought the /dev/md[x] numbers wouldn't match fstab. Once I had the filesystems mounted and started checking for damage (there seems to have been none, fortunately), I saw in fstab that - apart from the / partition - all the other md partitions were mounted by UUID. So, since the rescue system had mounted /dev/md122 on /,


# mount -a

worked for all others. (Except for the nfs mount, but I never looked further into why the rescue system has difficulty with mounting an nfs share. Maybe another topic, under less stress.)

I then made a mistake. I had seen that the installation of the kernel package security update had changed some grub files in /boot. I looked at those - lots of useful initrd and vmlinuz files, and a complicated-looking grub.cfg. The rescue shell gives you 'man' commands and 'info' commands, so I read the grub documentation installed on my system. Then I tried to re-install grub on the boot partition.
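Whether 'mount -a' will work in the rescue shell depends on how fstab names each filesystem. As a minimal sketch (the function name md_path_entries is made up for illustration, and the pattern only looks for plain /dev/mdN paths), the entries that would break when the md numbers change can be listed like this:

```shell
# List fstab entries that name an md device by path rather than by
# UUID= or LABEL=. These are the ones that stop matching when the
# rescue shell assembles the arrays under different numbers.
# $1 is an optional fstab path; it defaults to /etc/fstab.
md_path_entries() {
    awk '$1 ~ /^\/dev\/md[0-9]+$/ { print $1, $2 }' "${1:-/etc/fstab}"
}

# Usage in the rescue shell:
#   md_path_entries
```

Any line it prints needs a manual mount with the rescue shell's device name; lines using UUID= are unaffected by the renumbering.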


# grub-install /dev/md121
Grub cannot be installed on a partition-less disk
[and a couple of other related warnings]

That frightened me. There obviously were partitions, because mdadm had found them, and the rescue shell was happy with 7 of them mounted. Using gdisk I had another fright when it reported that sda was a gpt disk, with a protected MBR (what?) and no other partitions. That couldn't be right because the installer had found the partitions, and so had mdadm.

I then wondered if perhaps grub wasn't involved and I shouldn't be looking at things from a grub and gdisk/gpt viewpoint. Though I thought I had seen the kernel update actually alter grub, maybe I had only seen the initrd and vmlinuz files get updated (and 'assumed' grub was there). Plus, the basic symptom I get on boot is that the loader says


LILO 23 LILO loadiEBDA: too large kernel
or something like that.

Booting was saying, it's LILO, not grub.  Maybe it's right, I thought.

# man lilo
lilo not found
# lilo
lilo not found

Well, it's not right, because lilo isn't there, the machine says. But some version of lilo is on the boot sectors. I read the lilo manpage on the web and saw that it has a fairly simple config in /etc/lilo.conf. Checking, I saw there was such a file on the system, dating from 2010, referring to /dev/md0 (/boot, as it was before the rescue shell renumbered them) and /dev/md1 (the root fs, before the rescue shell renumbered them). So I needed to alter the lilo conf file, and then execute lilo. Lilo wasn't on the machine, so


# apt-get install lilo

and changed /etc/lilo.conf to say (I'll list these to help others in the future):


(a) the new initrd,
(b) the new vmlinuz,
(c) set boot to using /dev/md121 as the boot device, and
(d) set root to /dev/md122
saved the changed file, and

# lilo

No errors so, exiting the rescue shell, I rebooted.
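Before rebooting after an edit like this, it can help to print back what lilo.conf now says for 'boot' and 'root'. A small sketch, assuming the usual key=value layout of lilo.conf (the helper name lilo_setting is hypothetical, not part of the lilo package):

```shell
# Print the value(s) of one key from lilo.conf, e.g. boot or root.
# $1 = key name, $2 = optional config path (defaults to /etc/lilo.conf).
# If the key appears more than once (global and per-image), every
# value is printed, one per line.
lilo_setting() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "${2:-/etc/lilo.conf}"
}

# Usage:
#   lilo_setting boot
#   lilo_setting root
```

This only reads the file; lilo itself still has to be run afterwards to write the boot sectors.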

LILO started, didn't complain about the large kernel, moved on to assemble the /dev/md[x] (as md0, md1, md2, etc, perfect!) before dying with


Aborted waiting for root fs
Dev [something]:122 not found

Well, that was progress - the machine could boot, there was nothing wrong with the partitions, lilo was no longer corrupted, the md[x] all assembled. But, /dev/md122 was no longer called that. In 'real life' as opposed to the 'rescue shell life', the root filesystem is on /dev/md1. The line


root=/dev/md122

in /etc/lilo.conf caused lilo not to find the real fs on /dev/md1. There's a very simple solution: go back to the rescue shell and change lilo.conf to say


root=/dev/md1

but it doesn't work. When you execute lilo after doing this change, it objects:


# lilo
Invalid root filesystem: /dev/md1

Reason: /dev/md1 doesn't exist in the rescue shell, and the lilo config preparation system, quite carefully, protects you against specifying a root filesystem that it doesn't think exists. It's right to do that, so I had to somehow have consistent names in both the rescue shell and the real installation. I looked long at man mdadm and inferred that the only way to alter the names was to dis-assemble and re-assemble with a --name= parameter. But I didn't have an example of a command line with the right order and the right other things that needed to be there, and I - really - didn't want to risk compromising the md system that was running and was as-yet undamaged.
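The names the arrays currently have can at least be inspected without touching them, since /proc/mdstat is read-only. A sketch (md_names is a made-up helper name) that lists the arrays the running kernel has assembled:

```shell
# Read-only: list the md device names currently assembled, as shown
# in /proc/mdstat. Nothing is stopped or re-assembled.
# $1 is an optional path, so the function can be tried on a saved copy.
md_names() {
    awk '$1 ~ /^md[0-9]+$/ && $2 == ":" { print "/dev/" $1 }' "${1:-/proc/mdstat}"
}

# Usage:
#   md_names
# For more detail on one array, again without changing anything:
#   mdadm --detail /dev/md122
```

Running it once in the rescue shell and once on the normally booted system shows the two sets of names side by side.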


The web page for man lilo.conf

http://linux.die.net/man/5/lilo.conf

mentions that partitions can be named in lilo.conf by UUID. On this machine, fstab uses UUID for all but the root filesystem, so I couldn't get the UUID for /dev/md1 from there. But I did find a UUID string in the /boot/grub.cfg file. Rebooting back into the rescue shell, and editing /etc/lilo.conf to use that UUID string


root="UUID={some hex string mixed with some dashes}"

(the inverted commas are important to ensure the second '=' gets passed to the kernel during the boot sequence)


and exiting the shell, then rebooting

the system fully booted and everything sprang back into life.  Phew.

For posterity, here are the key aspects of recovering from this type of problem with a raid1 boot failure.

1. If the boot message says LILO, it *IS* lilo, and it may be necessary to install the lilo package if it is not present (following a distribution upgrade, for example).

2. The rescue shell uses mdadm's 'emergency' names. That will be fine for setting the *boot* device in lilo.conf, but will not be fine for setting the *root* device in lilo.conf.

3. Using the rescue shell, get a root filesystem mounted on the target machine's filesystem, because you need to edit your real /etc/lilo.conf


4.  In lilo.conf change
boot=/dev/md0 (or whatever your file says here)
to

boot=/dev/md121 (or whatever the rescue shell has labelled the device that you previously had in your 'boot=' line)


5.  In lilo.conf change
root=/dev/md1 (or whatever your lilo.conf file says here)
to

root="UUID={the UUID of your md device that you normally mount as '/'}" - and include the inverted commas and the extra '='. This UUID string could take some time to find; don't panic if you find the wrong string, just try to find it another way. If fstab uses a UUID to mount '/' then try with that. Maybe some other posters could improve this suggestion if they know the 'trick' to get the correct UUID.


6.  Reboot.  The system should now boot to the normal start.
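On finding the UUID for step 5: one possible approach, offered as an untested-on-this-machine sketch, is to ask blkid (from util-linux) for the filesystem UUID of whatever device the rescue shell has mounted as '/'. The two function names below are invented for illustration; /dev/md122 is just the name the rescue shell used in this case. Note this is the filesystem's UUID, which is not the same thing as the array UUID that mdadm reports.

```shell
# Look up the filesystem UUID of a block device using blkid.
# Assumes blkid is available in the rescue shell.
fs_uuid() {
    blkid -s UUID -o value "$1"
}

# Format a UUID as a lilo.conf root= line, inverted commas included.
lilo_root_line() {
    printf 'root="UUID=%s"\n' "$1"
}

# Usage in the rescue shell (substitute the device mounted as /):
#   lilo_root_line "$(fs_uuid /dev/md122)"
```

The printed line can then be pasted into /etc/lilo.conf in place of the old root= line, followed by running lilo as before.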

Thanks to everyone who helped with suggestions.

Apologies, again, for the somewhat random chain of unlinked messages, which must have irritated folk. This was due to this server failing, with the result that without this (mail) server we had no access to the emailed messages from the list - so I could not 'reply' properly - instead I could only see web copies of posts. Hopefully, fixed now.


regards, Ron
