
Bug#776192: mptsas probe failure and crash, probably related to udev timeout



This reply is purely to assist any other poor souls like myself who have to suffer the fallout from this problem: a possible workaround (no code hacks required) for *SOME* configurations.

We have two mptsas desktop systems, one of which failed to boot after a Jessie FAI install.  The other had been running for weeks without incident.

working system:  Dell Precision T5400, SAS 6/iR in RAID1 with 2 Seagate SATA disks (ST3500641AS & ST3500418AS)
non-working system:  Dell Precision T3500, SAS 6/iR in RAID1 with 2 Fujitsu SAS drives (MBA3147RC) and one pass-thru Hitachi SAS drive (HUS154545VLS300)

It was discovered that the working system had its drives connected to the high-port connector (ports 4-7), while the non-working system had them connected, more obviously, to the low-port connector (ports 0-3).  After moving the cabling to the high ports on the non-working system, it now boots.  It still takes a LONG time to probe the disks (maybe 20+ secs), but that's apparently within limits.  The working system doesn't pause during probe at all, so this may be a SATA-vs-SAS thing, or a drive-vendor/model thing, too.

The only fallout (possibly unrelated) is that a shutdown now hangs after the final shutdown message, requiring a manual power-off (haven't done other tests, could be a fluke).  We used to have that issue on Precision 390/490/690 systems and had some success with the kernel bootline option 'reboot=bios' (IIRC; see the sketch below).  We haven't installed any other T3500s yet, so maybe this is common?
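
For reference, here's roughly how we set that on our Debian boxes with GRUB2 (just a sketch; 'reboot=bios' is a stock x86 kernel parameter, nothing Dell- or mptsas-specific):

    # append reboot=bios to the kernel bootline in /etc/default/grub:
    GRUB_CMDLINE_LINUX_DEFAULT="quiet reboot=bios"

    # then regenerate /boot/grub/grub.cfg and reboot:
    update-grub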

This, unfortunately, isn't a solution for the integrated mptsas cards in Dell's 1U 9th-gen servers (e.g. PE 1950).  While the card guts appear to be the same (just minus the L-bracket), those systems' BIOS throws an error, something like "Invalid Disk Configuration", if you attach the internal drive cabling to the high-port connector.  That seems like an arbitrary constraint Dell put in the BIOS.  (This is from memory: I once tried this config when a too-short lead cable to the drives wouldn't reach the low-port connector.)

It's also probably not a solution for folks who have > 4 drives, and I'm not sure whether adding another drive would make it suddenly stop working again.
But for anyone else out there with a similar configuration, this may save you.

IMHO -- this problem is going to hit a lot of people.  It affects at least 50 of my machines (I actually have several hundred PowerEdge server systems with SAS 5/6 controllers, but most of those are running RHEL5/6).  I think it is truly absurd that the systemd devs refuse to provide a bootline option to increase the event timeout, or even to raise the hardcoded value for a limited time.  The decree that "30 seconds is absolute and long enough" reminds me of such lack of insight as "no one will ever need more than 640K of RAM".  It is simple arrogance to pull a random number and refuse to budge on it, knowing full well that changing that number ever so slightly would enable a lot more systems to run, regardless of the bugs in the mptsas code.  The "problem" can then be revisited after the mptsas code gets fixed (and hopefully that actually happens soon).

Really, how much difference is 30 vs 60 vs 180 seconds going to make over the entire planet anyway?  If something is truly, really busted, you'll find out about it.  Having an arbitrary 30-second "limitation" isn't going to help those people with genuinely busted hardware/kernel modules -- it just hurts the audience of people stuck with non-optimal drivers/hardware that were working perfectly well (enough) prior to these changes.
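
For what it's worth, upstream systemd v216's release notes reportedly added an --event-timeout= option to systemd-udevd and a matching udev.event-timeout= kernel command line argument.  I haven't verified that against Jessie's systemd 215, and I have no idea if/when Debian will pick it up, but if you land on a systemd that honors it, the bootline change would look something like:

    # /etc/default/grub -- ASSUMES a systemd with udev.event-timeout
    # support (untested here); pick a value that covers your slowest probe:
    GRUB_CMDLINE_LINUX_DEFAULT="quiet udev.event-timeout=180"

    # then regenerate /boot/grub/grub.cfg and reboot:
    update-grub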

thanks
--stephen
--
Stephen Dowdy  -  Systems Administrator  -  NCAR/RAL
303.497.2869   -  sdowdy@ucar.edu        -  http://www.ral.ucar.edu/~sdowdy/

