Mark Morgan Lloyd wrote:
Mark Morgan Lloyd wrote:

What's acceptable cabling practice on this: it's been set up hung off a single controller with the two halves daisy-chained. Cables are Sun or (decent) IBM and it's a Sun differential terminator. I see no failures if the job count is <=4 (but I continue testing this; it's useful extra heat).

I think it's worth noting explicitly that there are 12x CPUs in this system. I note (but don't see as directly relevant) that the controller won't load the firmware during startup; that has to be done by a manual rmmod/modprobe.

Chris, I think something from you vanished into the spambin at about 20:30. Please could you resend it to the address below.

"My previous message said that best practice with SBUS is to use only half of the D1000 per controller channel. 6 fast drives is about the max that you can expect the bus speed limited controller to handle without congestion under heavy loads."

I can cope with limited performance; there are times when having plenty of slots into which arbitrary drives can be plugged (e.g. to fix a dud SILO) can be really useful. Having said which, I note that the A1000/D1000 "Just The Facts" explicitly shows the possibility of having both halves of the box connected to a single host controller.
..although that illustration was to an unidentified controller on an Ultra-60.
"OTOH, it seems that Linux may not be handling congestion as gracefully as Solaris."

Indeed. In fact, it doesn't appear to be "picking up the pieces" particularly successfully.

    toss_command:
        printk(KERN_EMERG "qlogicpti%d: request queue overflow\n",
               qpti->qpti_id);

        /* Unfortunately, unless you use the new EH code, which
         * we don't, the midlayer will ignore the return value,
         * which is insane.  We pick up the pieces like this.
         */
        Cmnd->result = DID_BUS_BUSY;
        done(Cmnd);
        return 1;
    }

I'm still working on it to see if I can track it down to a single drive or a particular slot in the rack.

Patrick, thanks for your comment about the firmware being at linux-2.6/firmware/qlogic/isp1000.bin.hex in the standard (i.e. non-Debian) kernel.
After much testing, I've tracked the problem down to two Sun/Fujitsu 18.2Gb drives which will kill the entire system fairly promptly if the qlogicpti module's brought up with them in certain slots, even if there are only 6x drives in the array rather than the full 12x. I speculate that there's a problem with SCSI address decoding or similar on the problematic SCA drives.
With these quarantined and replaced by known-good drives to take the array to its full complement of 12x, I can run any combination of up to 10x drives reliably in the array but not the full 12x: trying to do so still causes an eventual kernel panic. Pulling half the CPUs in a crude attempt to reduce concurrency doesn't improve things. The impression I get is that the controller (and/or its supporting firmware and Linux driver) isn't up to handling a full string of 12x drives with a heavy workload.
The test I'm using is to write random data to the start of each drive, then to dd this in blocks of approx 256M to the remainder.
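For anyone wanting to reproduce this kind of load, the test described above can be sketched roughly as follows. This is not the script actually used (that was done with dd); it is a minimal Python equivalent under some assumptions: the seed block is the same size as the copy block (approx 256M), each copy re-reads the seed from the start of the drive, and one job runs per drive. The names fill_with_seed and run_jobs are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor

SEED_SIZE = 256 * 1024 * 1024  # approx 256M, as in the test described above


def fill_with_seed(path, seed_size=SEED_SIZE):
    """Write random data to the start of `path`, then copy that block
    repeatedly over the remainder. DESTROYS whatever is on `path`.
    Returns the total number of bytes covered."""
    with open(path, "r+b") as dev:
        total = dev.seek(0, os.SEEK_END)   # size of the file or block device
        seed_size = min(seed_size, total)

        # Step 1: random data at the start of the drive.
        dev.seek(0)
        dev.write(os.urandom(seed_size))
        dev.flush()

        # Step 2: read the seed back and write it to each following block.
        offset = seed_size
        while offset < total:
            chunk = min(seed_size, total - offset)
            dev.seek(0)
            block = dev.read(chunk)
            dev.seek(offset)
            dev.write(block)
            offset += chunk
    return total


def run_jobs(paths, jobs, seed_size=SEED_SIZE):
    """Run fill_with_seed over `paths`, at most `jobs` at a time."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(lambda p: fill_with_seed(p, seed_size), paths))
```

Something like run_jobs(["/dev/sdb", "/dev/sdc"], jobs=2) would exercise two drives at once; the device names are only examples, and it's worth trying it on scratch files first since it overwrites its targets.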
-- Mark Morgan Lloyd markMLl .AT. telemetry.co .DOT. uk [Opinions above are the author's, not those of his employers or colleagues]