Title: Linux Software RAID and hot-swap SCSI

I've written a document on using Linux software RAID with hot-swap SCSI hardware. It's slightly specific to the hardware I use (I wrote it for internal use) but can easily be adapted to be more generic. If someone wants to add it to a HOWTO or something then be my guest; please give me appropriate credit. -- http://www.coker.com.au/~russell/
Here is the data you see in /proc/mdstat for a healthy RAID array:

md1 : active raid1 sdb1[1] sda1[0]
      1999936 blocks [2/2] [UU]

This is for a software RAID (Meta-Disk) device named /dev/md1 which is comprised of the /dev/sda1 and /dev/sdb1 devices in a RAID-1 (mirroring) setup. The Dell 1U server machines will have all their disks in software RAID-1 arrays.
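As a sketch of how a script could check those status flags, the following classifies each array as healthy ([UU]) or degraded (a _ in the status). The sample text is hypothetical and simplified to one line per array; on a real system you would read /proc/mdstat directly.

```shell
#!/bin/sh
# Sketch: classify md arrays as healthy or degraded from mdstat-style
# status lines. The sample below is made up for illustration; a real
# script would read /proc/mdstat instead.
sample='md1 : active raid1 sdb1[1] sda1[0] 1999936 blocks [2/2] [UU]
md2 : active raid1 sda2[0] 1999936 blocks [2/1] [U_]'

echo "$sample" | while read -r line; do
    case "$line" in
        *'[UU]'*) echo "${line%% *}: healthy" ;;
        *'[U_]'*|*'[_U]'*) echo "${line%% *}: DEGRADED" ;;
    esac
done
```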
When you have a disk installed and recognised by Linux you can then add partitions to degraded RAID arrays at any time with the raidhotadd command. Here is the data you see in /proc/mdstat for a degraded RAID array:
md1 : active raid1 sdb1[1](F) sda1[0]
      1999936 blocks [2/1] [U_]

Device /dev/sdb1 has failed (to generate this error I unplugged the disk /dev/sdb). Now I have just swapped the hard drive (see below) and want to put the new drive back in the array. Firstly I must remove the record for the disk in the failed (F) state with the command raidhotremove /dev/md1 /dev/sdb1, which gives the following state in /proc/mdstat:
md1 : active raid1 sda1[0]
      1999936 blocks [2/1] [U_]

Once this has been done for every partition that was in a RAID array the drive will be regarded as unused by Linux, which allows it to be repartitioned or unregistered (see below for details of hardware recognition).
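To see which raidhotremove commands are still needed before a drive can be unplugged, a script could scan the mdstat output for arrays that still reference the disk. This is a sketch with a hypothetical mdstat sample and a hypothetical disk name (sdb):

```shell
#!/bin/sh
# Sketch: list the md arrays that still reference partitions of a given
# disk, so you know which raidhotremove commands remain to be run.
# The sample text is hypothetical; a real script would read /proc/mdstat.
disk=sdb
sample='md1 : active raid1 sdb1[1](F) sda1[0] 1999936 blocks [2/1] [U_]
md2 : active raid1 sdb2[1](F) sda2[0] 1999936 blocks [2/1] [U_]
md3 : active raid1 sda3[0] 1999936 blocks [2/1] [U_]'

echo "$sample" | awk -v d="$disk" '$0 ~ d { print "/dev/" $1 " still references " d }'
```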
If you want to instruct the software RAID driver to stop using a partition on a functional disk then use the raidsetfaulty command, e.g. raidsetfaulty /dev/md1 /dev/sdb1, to put the partition into the failed state so that you can then use raidhotremove to remove it.
When you have a new partition you want to add to a RAID set you can use the command raidhotadd ARRAY DEVICE to add it, e.g. raidhotadd /dev/md1 /dev/sdb1, which results in the following data in /proc/mdstat:
md1 : active raid1 sdb1[2] sda1[0]
      1999936 blocks [2/1] [U_]
      [=======>.............]  recovery = 37.8% (755976/1999936) finish=0.4min speed=48732K/sec

Note that when a device name is followed by [2] then it's in a reconstruction state.
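If you want to watch the reconstruction from a script, the progress figure can be pulled out of the recovery line with sed. A sketch, using the (hypothetical) recovery line from above as sample input:

```shell
#!/bin/sh
# Sketch: extract the reconstruction progress percentage from an
# mdstat-style recovery line. The sample line is hypothetical; a real
# script would grep the recovery line out of /proc/mdstat.
line='[=======>.............] recovery = 37.8% (755976/1999936) finish=0.4min speed=48732K/sec'
pct=$(echo "$line" | sed -n 's/.*recovery = \([0-9.]*%\).*/\1/p')
echo "recovery progress: $pct"
```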
Once the disk is no longer in use and it is unplugged (the two operations of making it unused and unplugging it can proceed in any order) then you can inform the SCSI driver of the removal with the command scsiadd -r ID. ID is the identity of the disk you want to remove, which is determined by which bay the drive is in. The bays are numbered 0, 1, and 2 from left to right (and the numbers are printed on top of the case - where you can't see them when it's mounted).
So if you want to swap the second disk, which Linux usually (but not always) knows as sdb, then you use the command scsiadd -r 1 to inform the Linux SCSI driver that the disk is removed. Then you can insert the new drive (or re-insert the drive you just removed) and use the command scsiadd -a ID (in this example scsiadd -a 1) to make Linux recognise the new disk. After that you are free to partition the disk and add it to software RAID ready for use.
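The swap sequence above can be sketched as a small helper that just prints the commands for a given bay. This is a dry run only: it echoes the scsiadd commands rather than running them, since they only make sense on the real hardware, and bay 1 here is just an example.

```shell
#!/bin/sh
# Sketch: print the scsiadd command sequence for swapping the drive in a
# given bay (0, 1 or 2). This only echoes the commands; run them by hand
# once every partition has been removed from its array with raidhotremove.
bay=1
remove_cmd="scsiadd -r $bay"
add_cmd="scsiadd -a $bay"

echo "$remove_cmd    # unregister the old disk in bay $bay"
echo "# physically swap the drive, then leave it in place for a while"
echo "$add_cmd    # register the new disk"
```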
If you see the error message "parity error detected" then after waiting for a minute or two (and seeing many other error messages) the error should be corrected and the drive should be recognised. It is best to leave the drive physically in place for some time before running the scsiadd -a command to reduce the risk of this error. Sometimes this error can only be solved with a hardware reset...
Then you have to configure LILO with the root=/dev/md1 and boot=/dev/md1 lines to configure the root file system as the boot device (NB: if you use a RAID device other than /dev/md1 for the root file system then adjust the LILO configuration accordingly). The LILO configuration is in /etc/lilo.conf; to apply the changes run the lilo command with no parameters.
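A minimal /etc/lilo.conf along these lines might look as follows. This is only a sketch: the kernel image path (/vmlinuz) and label are examples and will differ per system.

```
# /etc/lilo.conf sketch -- image path and label are examples only
boot=/dev/md1
root=/dev/md1
image=/vmlinuz
        label=Linux
        read-only
```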
Finally you have to use the install-mbr command to set up a boot block that the BIOS can run to load the LILO block, use the following commands:
install-mbr /dev/sda
install-mbr /dev/sdb

This installs the Debian MBR on both hard drives so that whichever drive is removed, the other has a boot loader that can then load LILO to boot Linux.
If a new software RAID installation has to be done then get Paul or Russell to do it.