
Running sun4d (SPARCserver etc.) SMP



I previously reported that I'd got a SPARCserver 1000E running SMP with 8 CPUs
by means of a hack, upon which David Miller heaped an unjustified amount of
scorn and opprobrium.

Well, I think I've tracked down the underlying reason and thought it worth
writing up in case it's relevant to any other architectures. So far I'm working
with 2.2.20; I expect the fix is OK for 2.2.26 (the last version of 2.2), and
I've made some progress with 2.3.

To set the scene, here's my original hack:

> Looking at what other people had reported about this problem and using the
> LEDs and PROM debugging messages I eventually determined that the DAE can be
> avoided by making a one-line kernel change, at which point the SS1000E runs
> 8x CPU SMP reliably. Specifically, I have had more than one machine with up
> to 8x SuperSparc-50s running with firmware 2.23, however I've had problems
> with another firmware version where it gave a watchdog timeout /before/ the
> "Booting Linux" message: I think this is probably a different issue.
>
> In arch/sparc/kernel/sun4d_smp.c there is a call to calibrate_delay(): this
> should be commented out. As far as I can tell (and I stress that I am neither
> a Sun guru nor a kernel hacker) it is only used for the secondary CPUs which
> default to the same speed as the primary one- and who in their right mind
> would try to run dissimilar CPUs SMP?
>
> Furthermore, looking at the calibrate_delay() code I suspect that the way
> that the global loops_per_jiffy variable is being used as a scratchpad is
> unsafe. Specifically, if on a particular SMP architecture (here sun4d)
> interrupts are not fully disabled while calibrate_delay() is running then
> anything which inadvertently uses the value of loops_per_jiffy could get
> into trouble.
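
To make the concern in the quoted text concrete, here is a minimal user-space
sketch of the pattern. It is a simplified model and not the kernel source:
model_lpj stands in for loops_per_jiffy, and fits_in_jiffy(), observe() and the
true_lpj parameter are hypothetical helpers that replace the real timing loop so
that the result is deterministic. The first function searches directly in the
shared variable, so an observer sees every trial value; the second searches in
a local and publishes only the final result.

```c
/* Stand-in for the kernel's global loops_per_jiffy (model only). */
unsigned long model_lpj = 4096;

/* Largest value any observer has seen in the shared variable. */
unsigned long model_max_seen = 0;

/* Hypothetical replacement for "did a delay of `loops` finish within one
 * jiffy?". The real code times a busy loop against the clock tick. */
static int fits_in_jiffy(unsigned long loops, unsigned long true_lpj)
{
    return loops <= true_lpj;
}

/* Models another CPU or an interrupt handler reading the shared value. */
static void observe(void)
{
    if (model_lpj > model_max_seen)
        model_max_seen = model_lpj;
}

/* Searches directly in the shared variable: trial values are visible. */
unsigned long calibrate_in_place(unsigned long true_lpj)
{
    unsigned long bit;

    model_lpj = 1;
    while (fits_in_jiffy(model_lpj, true_lpj)) {   /* coarse: keep doubling */
        model_lpj <<= 1;
        observe();
    }
    model_lpj >>= 1;                               /* last value that fitted */
    for (bit = model_lpj >> 1; bit != 0; bit >>= 1) {
        model_lpj |= bit;                          /* fine: try each lower bit */
        observe();
        if (!fits_in_jiffy(model_lpj, true_lpj))
            model_lpj &= ~bit;
    }
    return model_lpj;
}

/* Same search in a local; the shared variable is written exactly once. */
unsigned long calibrate_local(unsigned long true_lpj)
{
    unsigned long lpj = 1, bit;

    while (fits_in_jiffy(lpj, true_lpj)) {
        lpj <<= 1;
        observe();                                 /* still sees the old value */
    }
    lpj >>= 1;
    for (bit = lpj >> 1; bit != 0; bit >>= 1) {
        lpj |= bit;
        observe();
        if (!fits_in_jiffy(lpj, true_lpj))
            lpj &= ~bit;
    }
    model_lpj = lpj;                               /* publish the final result */
    observe();
    return lpj;
}
```

Both functions return the same answer. The difference is only in what a
concurrent reader can see while the search is in progress, which is the hazard
the quoted text is guessing at.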

I still fire that machine up now and again; for various jobs it's useful having
that many CPUs even if they're slow. A few days ago I focussed on the fact that
it was assigning all IRQs to CPU 0 irrespective of which board the interrupting
device was on. This turned out to be because sun4d_distribute_irqs() in
arch/sparc/kernel/sun4d_irq.c assumes that SBus_chain is initialised, but this
isn't done until sbus_init() is called somewhat later.

Making sure that sbus_init() is called before SMP is set up not only fixes the
IRQ distribution problem but also eliminates the requirement for my earlier
hack, presumably because the kernel now knows how to protect the non-reentrant
part of calibrate_delay().
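
The ordering dependency can be sketched in a few lines of user-space C. This is
a model only and not the code in sun4d_irq.c: every model_* name is
hypothetical, model_sbus_chain stands in for SBus_chain, and the mapping of
board N to CPU 2*N is an assumption chosen purely for illustration. The point
is that walking an empty chain fails silently, leaving every bus at its default
target of CPU 0.

```c
#include <stddef.h>

#define MODEL_NBOARDS 4

/* Hypothetical per-board SBus record; not the kernel's structure. */
struct model_sbus {
    int board;                  /* board number the bus lives on */
    int target_cpu;             /* CPU that receives this bus's interrupts */
    struct model_sbus *next;
};

struct model_sbus model_buses[MODEL_NBOARDS];

/* Stand-in for SBus_chain: empty until model_sbus_init() has run. */
struct model_sbus *model_sbus_chain = NULL;

/* Stand-in for sbus_init(): builds the chain in board order. */
void model_sbus_init(void)
{
    int i;

    model_sbus_chain = NULL;
    for (i = MODEL_NBOARDS - 1; i >= 0; i--) {
        model_buses[i].board = i;
        model_buses[i].target_cpu = 0;          /* default: everything on CPU 0 */
        model_buses[i].next = model_sbus_chain;
        model_sbus_chain = &model_buses[i];
    }
}

/* Stand-in for sun4d_distribute_irqs(): returns how many buses it retargeted.
 * If the chain is still empty the loop body never runs and nothing changes. */
int model_distribute_irqs(void)
{
    struct model_sbus *bus;
    int retargeted = 0;

    for (bus = model_sbus_chain; bus != NULL; bus = bus->next) {
        bus->target_cpu = bus->board * 2;       /* assumed board-to-CPU mapping */
        retargeted++;
    }
    return retargeted;
}
```

Calling model_distribute_irqs() before model_sbus_init() returns 0 and raises
no error, which matches the symptom described: nothing fails visibly, the
interrupts simply all stay on CPU 0.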

Unfortunately I'm not able to test this on either of the larger sun4d systems,
the Sun SPARCcenter or the Cray CS6400. If anybody in the UK's got one of the
latter looking for a good home I'd like to know about it :-)

Incidentally, I came across some interesting material a couple of days ago that
indicates that the sun4d architecture actually originated at Xerox PARC in the
late 80s; see
http://www.fing.edu.uy/inco/grupos/cecal/hpc/proyectos/amstp/refs/Compcon-SC2000.pdf.gz
and http://www.perfdynamics.com/Bio/njg.html
Does anybody have any further information on this?

-- 
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]


