
Re: Frequent system crash -- with gcc3.2 (?)



Oliver Elphick <olly@lfix.co.uk> writes:
> 
> Since it is CPU0 every time, I guess the question is answered.  Is there
> any generic way to know which processor on the board is CPU0?

Not just by inspection.  But, if you boot once with both processors
installed and again with one processor removed, you'll be able to tell
by careful comparison of the kernel logs.

The processors are assigned "physical APIC IDs" which should
correspond to the physical slot/socket on the motherboard.  Linux
assigns a (logical) cpuid, too---on a two-processor machine, 0 is
given to the boot processor and 1 is given to the other processor.

Now, the numbers displayed in "/proc/interrupts" are the logical ids,
so we know logical CPU0, the boot CPU, is misbehaving.

Booting with both processors installed, you should see a message:

        Booting processor 1/0 eip 2000

This is printed when the kernel brings the second processor up: it
will either read "1/0" or "1/1".  The first number is the logical ID;
the second number is the physical ID.  In particular, if your machine
says "1/0" like mine does, you know that the logical and physical IDs
are switched, so logical CPU0 (the misbehaving boot CPU) is physical
CPU#1.
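
If you'd rather not read the log by eye, the same deduction can be
sketched in Python.  This is only a sketch: the function name is made
up for illustration, and the pattern assumes the message looks exactly
like the example line above, on a two-processor machine whose physical
IDs are 0 and 1.

```python
import re

# Assumed message shape, copied from the example line above:
#   "Booting processor <logical>/<physical> eip <addr>"
BOOT_RE = re.compile(r"Booting processor (\d+)/(\d+) eip")

def boot_cpu_physical_id(log_text):
    """Return the physical ID of logical CPU0 (the boot CPU), or None.

    Hypothetical helper: only handles the two-processor case, where the
    second processor is logical 1 and the physical IDs are 0 and 1.
    """
    m = BOOT_RE.search(log_text)
    if m is None:
        return None
    logical, physical = int(m.group(1)), int(m.group(2))
    if logical != 1 or physical not in (0, 1):
        return None  # not the simple two-processor layout assumed here
    # The boot CPU holds whichever physical ID the second CPU does not.
    return 1 - physical
```

Feed it the text of your kernel log (for example, saved dmesg output);
a result of 1 corresponds to the "1/0" case described above.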

Now, remove one of the processors and reboot.  It'll definitely be
assigned logical ID 0, since it's the only processor.  However, in the
kernel logs, you'll see the physical processor ID in the dump of the
MP table.  It'll either say:

        Processor #1 Pentium(tm) Pro APIC version 17

in which case physical CPU#1 (the misbehaving one, if your boot message
read "1/0") is still in the machine, or

        Processor #0 Pentium(tm) Pro APIC version 17

in which case the bad CPU is in your hand.
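
The single-processor check can be sketched the same way.  Again, this
is an illustration under assumptions: the function name is
hypothetical, the pattern assumes the MP table line looks like the two
samples above, and you have to supply the physical ID of the suspect
CPU yourself (1 in the "1/0" case).

```python
import re

# Assumed MP table line shape, copied from the samples above:
#   "Processor #<physical id> Pentium(tm) Pro APIC version 17"
PROC_RE = re.compile(r"Processor #(\d+)")

def bad_cpu_still_installed(log_text, bad_physical_id):
    """True if the one installed CPU has the suspect physical ID.

    Hypothetical helper: expects a log from a boot with exactly one
    processor installed and raises ValueError otherwise.
    """
    ids = {int(n) for n in PROC_RE.findall(log_text)}
    if len(ids) != 1:
        raise ValueError("expected exactly one processor in the MP table dump")
    return bad_physical_id in ids
```

If it returns False, the suspect CPU is the one you removed.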

At this point, I'd suggest marking the bad CPU (say with nail polish)
and *also* marking the physical slot/socket 0 on the motherboard
somehow.

You can't really tell from any of this that the "bad" CPU is flakey.
It could be a problem with the slot/socket, and it might be fixed by
simply reseating the CPU in its slot or switching the CPUs.  Even if
it is the CPU, it might be something you can fix by removing,
regreasing, and reattaching the heatsink (assuming it isn't
permanently bonded to the processor) or just attaching a better fan.

In any event, you really want to do some intensive testing before you
throw anything away:  try the "bad" CPU in both slots and make sure it
fails reliably.  Try the "good" CPU, too, and make sure it *works*
reliably.

Good luck!

-- 
Kevin Buhr <buhr@telus.net>


