Re: jessie won't install/boot on a Dell Poweredge R815
I'd like to thank everyone for helping out.
Here is an update on installing jessie on R815s.
I succeeded in installing on three of my four R815s. But I am holding off on
the last because it is my file server and there are still issues. Please read
on. I don't believe that the problem is solved and there may be a bug lurking
that can lead to data loss.
Here is what I did.
1. Before the install, while still running wheezy, I upgraded the BIOS.
R815_BIOS_JF8YH_LN_3.2.2.BIN
This seemed to alleviate the problem of the jessie installer failing to
find the ISO. More on this later.
2. Before the install, while still running wheezy, I reduced the number of
components of md0 from 6 to 4. This was in response to Steve's suggestion.
mdadm /dev/md0 --fail /dev/sdf1
mdadm /dev/md0 --fail /dev/sde1
mdadm /dev/md0 --remove /dev/sdf1
mdadm /dev/md0 --remove /dev/sde1
3. I did a fresh USB install of jessie. More on this later.
4. When it asked which devices to install grub on, I answered "manual" and
then typed /dev/sdb. More on this later.
5. After the fresh install, I rebooted, and in grub, I added rootdelay=20.
This was in response to Don's suggestion.
6. After the reboot, I ran my standard post-install script. Among other
things, this installs numerous packages, makes a small number of mods to
/etc, and does a dpkg-reconfigure grub-pc. When it did that, I specified
only the 4 drives with active components of md0 and added rootdelay=20.
7. I rebooted. More on this later.
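For steps 5 and 6, here is a minimal sketch of making rootdelay=20 persistent by editing a grub defaults file directly. It assumes the file uses the usual Debian layout with a GRUB_CMDLINE_LINUX="..." line. The helper name add_kernel_param is hypothetical, and the demo works on a temporary file rather than the real /etc/default/grub. The parameter must not contain "/" or "&", since it is passed through sed unescaped.

```shell
#!/bin/sh
# Sketch: append a kernel parameter to GRUB_CMDLINE_LINUX in a grub
# defaults file. add_kernel_param is a hypothetical helper name.
add_kernel_param() {
    file=$1
    param=$2
    # Do nothing if the parameter is already on that line.
    if grep -q "^GRUB_CMDLINE_LINUX=.*$param" "$file"; then
        return 0
    fi
    # First expression: insert the parameter before the closing quote.
    # Second expression: drop the leading space left behind when the
    # value was empty to begin with.
    sed -e "s/^\(GRUB_CMDLINE_LINUX=\"[^\"]*\)\"/\1 $param\"/" \
        -e "s/^\(GRUB_CMDLINE_LINUX=\"\) /\1/" "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# Demo on a scratch copy (assumed file contents, not taken from my machines).
tmp=$(mktemp)
printf '%s\n' 'GRUB_CMDLINE_LINUX_DEFAULT="quiet"' 'GRUB_CMDLINE_LINUX=""' > "$tmp"
add_kernel_param "$tmp" rootdelay=20
cat "$tmp"
rm -f "$tmp"
```

On a real system the target would be /etc/default/grub, followed by update-grub as root. The dpkg-reconfigure grub-pc prompt mentioned in step 6 is another route to the same setting.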
Now for the issues.
A. Even after the BIOS upgrade, when it no longer fails to find the ISO,
during the installer phase where it searches for an ISO, I notice
nondeterministic behavior. Sometimes it searches sdb{1,2,3}, sdc{1,2,3},
sdd{1,2,3}, sde{1,2,3}, sdf{1,2,3}, sdg{1,2,3}, sd{a,b,c,d,e,f,g} and
eventually finds an ISO (sda is the USB dongle). Sometimes it finds the
ISO right away without any searching. This doesn't cause problems but I
believe that it is symptomatic of other problems.
B. I'm not sure that reducing the number of components of md0 to 4 and/or
adding rootdelay=20 really solved the problem. I think it just reduced the
likelihood of occurrence. On one of the machines (arivu), during the
reboot in step (7), at an early phase of the boot, the machine first
reported that it found all 4 components of md0 and all 6 components of md1.
Then at a later phase it reported that there were errors on 3 of the 4
components. After the machine came up, md0 had only one component. Three
of the four components were in failed (F) state. I ran mdadm --remove and
then mdadm --add on each of them. This doesn't happen on every boot, but it
does happen on some.
qobi@upplysingaoflun>all-n-3g dmesg --level=err
upplysingaoflun:
verstand:
arivu:
[ 28.012558] mpt2sas0: fault_state(0x265d)!
[ 29.231355] end_request: I/O error, dev sdb, sector 2056
[ 29.231600] end_request: I/O error, dev sdc, sector 2056
[ 29.231773] end_request: I/O error, dev sde, sector 2056
[ 29.232020] end_request: I/O error, dev sda, sector 2056
perisikan:
[ 13.035132] mpt2sas0: fault_state(0x265d)!
[ 28.600099] mpt2sas0: fault_state(0x265d)!
qobi@upplysingaoflun>
qobi@upplysingaoflun>all-n-3g "dmesg --level=warn|fgrep -i error|fgrep -v ACPI"
upplysingaoflun:
verstand:
arivu:
[ 29.231430] md: super_written gets error=-5, uptodate=0
[ 29.231670] md: super_written gets error=-5, uptodate=0
[ 29.231869] md: super_written gets error=-5, uptodate=0
[ 29.232117] md: super_written gets error=-5, uptodate=0
perisikan:
qobi@upplysingaoflun>
(These are my four R815s. upplysingaoflun is the file server that has not
been updated. The other three have.) Note that one machine reports no
"mpt2sas0: fault_state(0x265d)" errors, one machine reports one, and one
machine reports two. Note that the machine that dropped three components
of md0 during boot reported I/O errors on all 4 disks with the 4
components of md0. I don't believe that there really are faulty disks.
Whenever I observe any of the behavior reported in this email, it is
almost always associated with dmesg reporting the same error on the same
sector 2056 (sometimes 2058 or 2062). Given the dozens of attempted
reinstalls and reboots, at this point, I have seen this on almost all, if
not all, of the six disks on each of the four machines. I don't believe
that 24 disks all have the same bad sectors.
C. In step (3), sometimes, but not always, during the install, I get a screen
that says that some partition failed. It offers a menu of two options. I
select "retry". Sometimes, but not always, this causes md0 to drop
components in the installer, which I fix by going to ctrl-alt-f2 during
the install and doing mdadm --remove and mdadm --add.
D. In step (4), there appears to be nondeterminism in the serial numbers of
the disks that get reported in the menu of options of where to install
grub. Sometimes, the disks get reported as ata-*, sometimes as scsi-*.
Note that all of my disks are SATA so the ones reported as scsi-* are
clearly in error. If I do fresh installs multiple times on the same
machine, each time it reports different serial numbers for the disks.
E. In step (4), it appears that if I select the menu item "sdb", it reports
that it tries to install on "md0" and then gives a red error screen. At
that point, I go to ctrl-alt-f2 and observe that it has dropped many
components of md0, usually all but one.
F. In step (4), sometimes, but not always, I get warning screens about EFI.
G. In step (4), if I select "manual" and then type "sdb", it appears to work.
But sometimes, but not always, I get warning screens about EFI.
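The recovery used in issues B and C (mdadm --remove followed by mdadm --add on each dropped component) can be sketched as a dry run. This assumes the usual /proc/mdstat layout, in which a failed member is shown with an (F) suffix. The helper names failed_members and print_readd are hypothetical, the sample line is made up for illustration, and the script only prints commands rather than running them.

```shell
#!/bin/sh
# Sketch: list components marked (F) for one array in /proc/mdstat-style
# text, then print (not run) the mdadm commands to remove and re-add them.
# failed_members and print_readd are hypothetical helper names.
failed_members() {
    # $1 = array name such as md0; mdstat text is read from stdin.
    awk -v md="$1" '$1 == md && $2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                name = $i
                sub(/\[.*/, "", name)   # strip the "[2](F)" suffix
                print name
            }
    }'
}

print_readd() {
    # $1 = array name; mdstat text is read from stdin.
    for dev in $(failed_members "$1"); do
        echo "mdadm /dev/$1 --remove /dev/$dev"
        echo "mdadm /dev/$1 --add /dev/$dev"
    done
}

# Made-up sample line, not captured from the machines in this thread.
sample='md0 : active raid1 sdd1[3](F) sdc1[2](F) sdb1[1](F) sda1[0]'
echo "$sample" | print_readd md0
```

On a live system the input would be /proc/mdstat itself, for example print_readd md0 < /proc/mdstat, with the printed commands reviewed before running any of them as root.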
Note that there is a lot of nondeterministic behavior (all cases above where I
say "sometimes"). In all cases, I do exactly the same thing over and over to
the same machine and get different behavior.
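To compare boots, one way to check the sector pattern from issue B is to tally the I/O error lines by sector and device. This is a rough sketch that assumes the "end_request: I/O error, dev X, sector N" message format shown in the dmesg output above. The helper name tally_io_errors is hypothetical.

```shell
#!/bin/sh
# Sketch: tally "end_request: I/O error" lines by sector and device.
# tally_io_errors is a hypothetical helper; dmesg text is read from stdin.
tally_io_errors() {
    # Keep only the I/O error lines and reduce each to "sector device".
    sed -n 's/.*end_request: I\/O error, dev \([a-z]*\), sector \([0-9]*\).*/\2 \1/p' |
        sort | uniq -c |
        awk '{ print "sector " $2 " dev " $3 " count " $1 }'
}

# The error lines reported for arivu above.
tally_io_errors <<'EOF'
[ 28.012558] mpt2sas0: fault_state(0x265d)!
[ 29.231355] end_request: I/O error, dev sdb, sector 2056
[ 29.231600] end_request: I/O error, dev sdc, sector 2056
[ 29.231773] end_request: I/O error, dev sde, sector 2056
[ 29.232020] end_request: I/O error, dev sda, sector 2056
EOF
```

On a running machine the same function can be fed with dmesg --level=err | tally_io_errors. Saving the output after each reboot would show whether the same sectors recur across disks and machines.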
Jeff (http://engineering.purdue.edu/~qobi)