
Etch Software RAID Upgrade Trouble & Suggested Installer Improvements



Summary:
  Various mishaps when recovering a botched software RAID system.
  The rescue functionality of the installer should be improved.





After a somewhat nightmarish (yet finally successful)
upgrade of my main workhorse PC to Linux software RAID,
I have decided to make this list of suggested improvements.
Following the list is a more detailed account of the reasons.


This is in no way meant to diminish or belittle the nice
work that Debian folks have done so far; I appreciate that
very much. However, doing something about one or another
of these points might help other users in the future.



Suggestions:
************

1. Rescue mode needs MD devices

   The rescue mode of the installer needs a step
   to activate MD devices. Currently, only the plain
   disk partitions are visible; that's no help.
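A minimal sketch of what such a step might do by hand, assuming the rescue shell offers mdadm and the raid1 module (which the rescue mode described here did not). The function names and the DRY_RUN switch are made up for illustration, and the device names in the usage comment are only examples. By default the commands are printed, not executed.

```shell
#!/bin/sh
# Sketch only: assemble an existing MD array from a rescue shell.
# Assumes mdadm and the raid1 kernel module are available.
# run, assemble_md and DRY_RUN are illustrative names.

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    fi
}

assemble_md() {
    md_dev=$1
    shift
    run modprobe raid1                    # RAID 1 personality
    run mdadm --examine "$@"              # show the superblocks found
    run mdadm --assemble "$md_dev" "$@"   # start the array from its members
    run cat /proc/mdstat                  # check the result
}

# Example (hypothetical partitions):
#   DRY_RUN=0 assemble_md /dev/md0 /dev/sda1 /dev/hde1
```

A degraded mirror would be assembled the same way, with only the surviving member listed.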

2. Netinstall image needs a ping

   There should be a ping command available on the
   netinstall image. Otherwise, on a PC with several
   network cards, it is hard to check whether the right
   interface has been configured correctly.

3. Netinstall's ifconfig needs to set MAC address

   The ifconfig on the netinstall image (from busybox)
   does not allow setting the Ethernet hardware (MAC) address.
   In some scenarios this is important and necessary.
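For illustration, a sketch of how the address can be set where the iproute "ip" tool is available; that it is present on a given image is an assumption. The full net-tools ifconfig offers the equivalent "ifconfig IFACE hw ether MAC". The function names, the DRY_RUN switch, the interface name and the sample address are all made up. By default the commands are printed, not executed.

```shell
#!/bin/sh
# Sketch only: set the MAC address of an interface with iproute's ip.
# run, set_mac and DRY_RUN are illustrative names.

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    fi
}

set_mac() {
    iface=$1
    mac=$2
    # Refuse anything that is not six colon-separated hex octets.
    printf '%s\n' "$mac" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$' || {
        echo "not a MAC address: $mac" >&2
        return 1
    }
    run ip link set dev "$iface" down
    run ip link set dev "$iface" address "$mac"
    run ip link set dev "$iface" up
}

# Example (hypothetical interface and address):
#   DRY_RUN=0 set_mac eth1 02:00:00:00:00:01
```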

4. Netinstall image should have some packages

   I'm not sure about this one ... but having grub, a
   kernel and a modules package would have been
   an immense help.

5. Rescue functionality needs improvement

   The rescue functionality of the installer is
   nice but not very useful in practice.
   Polishing the rescue system would have helped me
   in many situations before, not just this case.
   I would love to have more of a standalone
   system (from RAMDISK and/or "Live"-CD).
   In particular, the fact that one can't run many
   elementary Linux commands (tar, gzip, networking,
   e2fsck, mke2fs, dd, NFS mounts...) without going
   far along in the install process is a hindrance.
   And the point at which the installer starts to
   modify the actual installation is not always clear.

6. Grub's built-in documentation is incomprehensible

   Grub is one of those tools that one needs to work
   with when the box isn't running. The built-in help of
   grub and grub-install is not useful in practice.

7. There needs to be a command to copy all data

   Between cp, tar, rsync & friends there are dozens
   of ways to copy the files of a running system to
   another location, but none does all of this well:
     - leave out lost+found
     - leave out /proc, /sys, the automatic /dev
     - copy all "real" files
     - copy the on-disk /dev that sits under the mounted
       one (using mount --bind or so)
   There is a real need for a good program that does this;
   IMHO that program should be cp.
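One of the many variations, as a sketch rather than the missing perfect tool: a tar pipe that stays on one filesystem and skips the top-level lost+found. It assumes GNU tar; copy_tree is a made-up name.

```shell
#!/bin/sh
# Sketch only: copy a filesystem tree with a GNU tar pipe.
# --one-file-system keeps tar from descending into other mounts
# (/proc, /sys, a tmpfs /dev); lost+found at the top is excluded.
# copy_tree is an illustrative name.

copy_tree() {
    src=$1
    dst=$2
    if [ ! -d "$src" ] || [ ! -d "$dst" ]; then
        echo "usage: copy_tree SRC_DIR DST_DIR" >&2
        return 1
    fi
    tar -C "$src" --one-file-system \
        --anchored --exclude=./lost+found -cf - . |
        tar -C "$dst" --numeric-owner -xpf -
}
```

To pick up the on-disk /dev as well, the source can be a plain (non-recursive) bind mount of the root filesystem, for example "mount --bind / /mnt/oldroot" followed by "copy_tree /mnt/oldroot /mnt/newroot", run as root. Both mount points are placeholders.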

8. hdparm's error messages unsatisfying

   When some ATA drivers are not loaded, the hdparm command
   does not let you set DMA mode for a drive. Unfortunately
   the error message is not very helpful in locating and
   fixing the problem.

9. cdrecord's miserable state is well known

   Like the majority of other Linux users, I wonder when
     $ burn_my_iso_to_cd <iso-file> /dev/cdrom
   will work as expected.
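As a sketch, the wished-for command could be a thin wrapper around cdrecord's "dev=" syntax. Whether a given cdrecord version accepts a device node such as /dev/cdrom instead of a SCSI bus,target,lun triple is an assumption that depends on the version and the drive. The CDRECORD variable is made up here so that another burner with the same syntax (wodim, for instance) can be substituted.

```shell
#!/bin/sh
# Sketch only: burn an ISO image with cdrecord (or a compatible tool).
# Assumes the tool accepts dev=<device node>; CDRECORD is an
# illustrative override, not a standard variable.

burn_my_iso_to_cd() {
    iso=$1
    dev=${2:-/dev/cdrom}
    if [ ! -r "$iso" ]; then
        echo "cannot read image: $iso" >&2
        return 1
    fi
    "${CDRECORD:-cdrecord}" -v dev="$dev" "$iso"
}

# Example:
#   burn_my_iso_to_cd netinst.iso /dev/cdrom
```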



Why:
****


Now, on to the specifics. Here is the account of what
happened to me and how I arrived at those suggestions.

A) The upgrade

I decided to buy another IDE disk for my workhorse PC,
to mirror the old one (Software RAID 1) and get some
additional (un-mirrored) space on the new disk for
junk data (VDR movies etc.)

Being an old Debian user, I surely could do that
in-flight without a backup ... :-)
(Some sins get instant punishment).


B) The guide

I followed the excellent guide in
xtronics.com/reference/SATA-RAID-debian-for-2.6.html
In short:
 - create degraded RAID on new disk
 - copy data to new disk
 - modify initrd, fstab, grub
 - test booting new system
 - re-format old disk and add to RAID
 - finalize initrd, fstab, grub
 - done
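The two RAID steps in that list can be sketched with mdadm as follows. This is not taken from the guide itself. The function names and the DRY_RUN switch are made up, and the partition names in the usage comment are examples. By default the commands are printed, not executed, since both of them destroy data on the partitions they are given.

```shell
#!/bin/sh
# Sketch only: build a RAID 1 with one member missing, then
# complete it later with the old disk's partition.
# run, create_degraded_mirror, complete_mirror and DRY_RUN
# are illustrative names.

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    fi
}

create_degraded_mirror() {
    md_dev=$1
    new_part=$2
    # "missing" reserves the slot for the second disk.
    run mdadm --create "$md_dev" --level=1 --raid-devices=2 missing "$new_part"
}

complete_mirror() {
    md_dev=$1
    old_part=$2
    run mdadm "$md_dev" --add "$old_part"   # triggers the resync
    run cat /proc/mdstat                    # shows resync progress
}

# Example (hypothetical partitions):
#   DRY_RUN=0 create_degraded_mirror /dev/md0 /dev/sda1
#   ... copy data, fix initrd/fstab/grub, test boot ...
#   DRY_RUN=0 complete_mirror /dev/md0 /dev/hde1
```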


C) Trouble begins

It was at the testing stage, having successfully booted
into the degraded RAID system on the new disk, where
I decided to record a movie.

Re-formatting the old disk and adding it to the RAID,
I noticed that the system became very unresponsive and xine
had trouble writing the movie to disk. I found out that
the DMA was turned off and reconstruction of the RAID
took a lot of CPU and disk activity.

I could not set the DMA mode with hdparm; apparently some
modules for that were missing. (I can't reconstruct the
details, since the DMA is now miraculously turned on.)


D) The fatal mistake

I had to stop recording since the movie would get chopped
and RAID reconstruction would take forever (20 h).

I decided to reboot to get the DMA working and forgot that
I had just re-formatted the /boot partition on the old disk,
so grub would not find any chain loader, obviously.


E) The painful recovery

- Grub wouldn't load anything, the system did not boot.
- I tried a sarge installer CD that didn't recognize the
  md signatures of the partitions.
- I couldn't figure out how to run the grub installer from
  a mounted pseudo-root directory where the devices were
  named differently (old /dev/hde vs. new /dev/sda for SATA).
- An old Knoppix allowed me to configure the router
  functionality and download the installer image.
- To burn the image, I had to download k3b since
  I couldn't figure out either cdrecord or cdrdao
  within reasonable time (external USB CD writer,
  because the laptop's built-in writer was broken).
- The rescue mode of the netinst RC1 CD didn't
  let me choose the MD partitions for root device.
- I could not get to the Internet since my cable
  modem only responds to a certain MAC address that
  can't be set with the ifconfig on netinst.
- Finally running the install process far enough
  to get the md devices mounted (it's unclear how
  to do that manually instead of using the partitioner),
  I had access to a ping and a working ifconfig
  to get Internet access.
- From the Internet, I could then download grub
  and install it manually after fighting against
  /proc and /sys.
- The installer had overwritten my /etc/fstab
  which I then fixed.
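For the grub step, a sketch of the usual chroot approach, which takes care of /dev, /proc and /sys before grub-install runs. It assumes the array is already assembled, grub is installed inside the target system, and /boot lives on the root filesystem (a separate /boot would need one more mount). The function names, the DRY_RUN switch and the /mnt/target mount point are made up. By default the commands are printed, not executed.

```shell
#!/bin/sh
# Sketch only: reinstall grub from a rescue shell via chroot.
# Assumes grub-install exists inside the target system.
# run, reinstall_grub and DRY_RUN are illustrative names.

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    fi
}

reinstall_grub() {
    root_dev=$1
    boot_disk=$2
    target=${3:-/mnt/target}
    run mkdir -p "$target"
    run mount "$root_dev" "$target"
    run mount -o bind /dev "$target/dev"       # same device names as outside
    run mount -t proc proc "$target/proc"
    run mount -t sysfs sysfs "$target/sys"
    run chroot "$target" grub-install "$boot_disk"
}

# Example (hypothetical devices):
#   DRY_RUN=0 reinstall_grub /dev/md0 /dev/sda
```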


F) Conclusion

I have purposefully omitted the many other failures,
most of them my own fault, that made this endeavour
take a total of 11 hours into the night.

I think the steps that I described show that while the
new installer has become very good at its main function
(as an INSTALLER), it still lacks most features as a
rescue system.

Having gone through various attempts with unbootable USB-stick
rescuers, an old Knoppix and Sarge installers, I'm quite convinced
that an effective rescue system MUST be based on the same kernel
series and system setup philosophies as the primary installed
system (what with udev, /sys, /proc, md partition autostart
for new superblocks, a copyable kernel that allows mounting the
target partition as root etc.).

Therefore I'll conclude with the plea that the fine folks
who did such great work on the new installer might now
turn their eye to its rescue functionality, and I hope this
comment is helpful.


Tired but finally successful

Claus



-- 
Claus Fischer <claus.fischer@clausfischer.com>
http://www.clausfischer.com/


