
Re: jessie won't install/boot on a Dell Poweredge R815



I'd like to thank everyone for helping out.

Here is an update on installing jessie on R815s.

I succeeded in installing on three of my four R815s. But I am holding off on
the last because it is my file server and there are still issues. Please read
on. I don't believe that the problem is solved and there may be a bug lurking
that can lead to data loss.

Here is what I did.

 1. Before the install, while still running wheezy, I upgraded the BIOS.
      R815_BIOS_JF8YH_LN_3.2.2.BIN
    This seemed to alleviate the problem of the jessie installer failing to
    find the ISO. More on this later.

 2. Before the install, while still running wheezy, I reduced the number of
    components of md0 from 6 to 4. This was in response to Steve's suggestion.
      mdadm /dev/md0 --fail /dev/sdf1
      mdadm /dev/md0 --fail /dev/sde1
      mdadm /dev/md0 --remove /dev/sdf1
      mdadm /dev/md0 --remove /dev/sde1

 3. I did a fresh USB install of jessie. More on this later.

 4. When it asked on which devices to install grub, I answered "manual" and
    then typed /dev/sdb. More on this later.

 5. After the fresh install, I rebooted, and in grub, I added rootdelay=20.
    This was in response to Don's suggestion.

 6. After the reboot, I ran my standard post-install script. Among other
    things, this installs numerous packages, makes a small number of mods to
    /etc, and does a dpkg-reconfigure grub-pc. When it did that, I specified
    only the 4 drives with active components of md0 and added rootdelay=20.

 7. I rebooted. More on this later.
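
For steps (5) and (6), one way to confirm that the running kernel actually
received the option is to inspect /proc/cmdline. What follows is a minimal
sketch, not the post-install script mentioned above. The function name
has_rootdelay is made up for illustration, and the /etc/default/grub line in
the comment assumes a stock Debian grub-pc setup; it is not taken from this
thread.

```shell
#!/bin/sh
# has_rootdelay (hypothetical helper): succeed if the kernel command line
# passed as $1 contains a rootdelay=<seconds> option.
has_rootdelay() {
    case " $1 " in
        *" rootdelay="[0-9]*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumed persistent setting on a stock Debian install (an assumption):
#   /etc/default/grub:  GRUB_CMDLINE_LINUX="rootdelay=20"   then run update-grub
if [ -r /proc/cmdline ] && has_rootdelay "$(cat /proc/cmdline)"; then
    echo "rootdelay is set on the running kernel"
else
    echo "rootdelay is NOT set on the running kernel"
fi
```

Running this after the reboot in step (7) shows whether the setting entered
during dpkg-reconfigure grub-pc survived into the booted kernel.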

Now for the issues.

 A. Even after the BIOS upgrade, when it no longer fails to find the ISO,
    during the installer phase where it searches for an ISO, I notice
    nondeterministic behavior. Sometimes it searches sdb{1,2,3}, sdc{1,2,3},
    sdd{1,2,3}, sde{1,2,3}, sdf{1,2,3}, sdg{1,2,3}, sd{a,b,c,d,e,f,g} and
    eventually finds an ISO (sda is the USB dongle). Sometimes it finds the
    ISO right away without any searching. This doesn't cause problems but I
    believe that it is symptomatic of other problems.

 B. I'm not sure that reducing the number of components of md0 to 4 and/or
    adding rootdelay=20 really solved the problem. I think it just reduced the
    likelihood of occurrence. On one of the machines (arivu), during the
    reboot in step (7), at an early phase of the boot, the machine first
    reported that it found all 4 components of md0 and all 6 components of md1.
    Then at a later phase it reported that there were errors on 3 of the 4
    components. After the machine came up, md0 had only one component. Three
    of the four components were in failed (F) state. I ran mdadm --remove and
    then mdadm --add on each of them. This doesn't happen all of the time. But
    it happens some of the time.


      qobi@upplysingaoflun>all-n-3g dmesg --level=err
      upplysingaoflun:
      verstand:
      arivu:
      [   28.012558] mpt2sas0: fault_state(0x265d)!
      [   29.231355] end_request: I/O error, dev sdb, sector 2056
      [   29.231600] end_request: I/O error, dev sdc, sector 2056
      [   29.231773] end_request: I/O error, dev sde, sector 2056
      [   29.232020] end_request: I/O error, dev sda, sector 2056
      perisikan:
      [   13.035132] mpt2sas0: fault_state(0x265d)!
      [   28.600099] mpt2sas0: fault_state(0x265d)!
      qobi@upplysingaoflun>

      qobi@upplysingaoflun>all-n-3g "dmesg --level=warn|fgrep -i error|fgrep -v ACPI"
      upplysingaoflun:
      verstand:
      arivu:
      [   29.231430] md: super_written gets error=-5, uptodate=0
      [   29.231670] md: super_written gets error=-5, uptodate=0
      [   29.231869] md: super_written gets error=-5, uptodate=0
      [   29.232117] md: super_written gets error=-5, uptodate=0
      perisikan:
      qobi@upplysingaoflun>

    (These are my four R815s. upplysingaoflun is the file server that has not
    been updated. The other three have.) Note that one machine reports no
    "mpt2sas0: fault_state(0x265d)" errors, one machine reports one, and one
    machine reports two. Note that the machine that dropped three components
    of md0 during boot reported I/O errors on all 4 disks with the 4
    components of md0. I don't believe that there really are faulty disks.
    Whenever I observe any of the behavior reported in this email, it is
    almost always associated with dmesg reporting the same error on the same
    sector 2056 (sometimes 2058 or 2062). Given the dozens of attempted
    reinstalls and reboots, at this point, I have seen this on almost all, if
    not all, of the six disks on each of the four machines. I don't believe
    that 24 disks all have the same bad sectors.

 C. In step (3), sometimes, but not always, during the install, I get a screen
    that says that some partition failed. It offers a menu of two options. I
    select "retry". Sometimes, but not always, this causes md0 to drop
    components in the installer, which I fix by going to ctrl-alt-f2 during
    the install and doing mdadm --remove and mdadm --add.

 D. In step (4), there appears to be nondeterminism in the serial numbers of
    the disks that get reported in the menu of options of where to install
    grub. Sometimes, the disks get reported as ata-*, sometimes as scsi-*.
    Note that all of my disks are SATA so the ones reported as scsi-* are
    clearly in error. If I do fresh installs multiple times on the same
    machine, each time it reports different serial numbers for the disks.

 E. In step (4), it appears that if I select the menu item "sdb", it reports
    that it tries to install on "md0" and then gives a red error screen. At
    that point, I go to ctrl-alt-f2 and observe that it has dropped many
    components of md0, usually all but one.

 F. In step (4), sometimes, but not always, I get warning screens about EFI.

 G. In step (4), if I select "manual" and then type "sdb", it appears to work.
    But sometimes, but not always, I get warning screens about EFI.
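
The recovery used in issues (B), (C), and (E), removing the components marked
failed and adding them back, can be sketched as a dry run that only prints the
commands. This is an illustration under assumptions: failed_members is a
made-up helper name, and it relies on /proc/mdstat marking failed members with
a trailing (F), as in "sdc1[2](F)".

```shell
#!/bin/sh
# failed_members (hypothetical helper): read /proc/mdstat-style text on stdin
# and print the members of array $1 that are flagged (F), one per line.
failed_members() {
    awk -v md="$1" '
        $1 == md && $2 == ":" {
            for (i = 3; i <= NF; i++)
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[.*/, "", dev)   # strip the "[n](F)" suffix
                    print dev
                }
        }'
}

# Dry run: print, but do not execute, the remove/add commands for md0.
if [ -r /proc/mdstat ]; then
    failed_members md0 < /proc/mdstat | while read -r dev; do
        echo "mdadm /dev/md0 --remove /dev/$dev"
        echo "mdadm /dev/md0 --add /dev/$dev"
    done
fi
```

Printing the commands instead of running them leaves room to check which
components were dropped before touching the array, which matters if the
failures are spurious.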

Note that there is a lot of nondeterministic behavior (all cases above where I
say "sometimes"). In all cases, I do exactly the same thing over and over to
the same machine and get different behavior.
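
To make the sector pattern from issue (B) easier to compare across reboots,
the I/O error lines can be tallied by sector. This is a rough sketch;
io_error_sectors is a made-up name, and the pattern assumes kernel messages of
the form "I/O error, dev sdb, sector 2056" as in the dmesg output quoted
above.

```shell
#!/bin/sh
# io_error_sectors (hypothetical helper): read dmesg text on stdin and print
# "<count> <sector>" for every sector named in an "I/O error" line.
io_error_sectors() {
    sed -n 's/.*I\/O error, dev [a-z]*, sector \([0-9][0-9]*\).*/\1/p' |
        sort -n | uniq -c | awk '{ print $1, $2 }'
}

# Tally the current boot; tolerate dmesg being unavailable or restricted.
{ dmesg --level=err 2>/dev/null || true; } | io_error_sectors
```

If the same few sectors dominate on every disk and every machine, that would
be consistent with the view that the disks themselves are not at fault.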

    Jeff (http://engineering.purdue.edu/~qobi)

