On Sat, 2006-04-08 at 12:55 +1000, Paul Mackerras wrote:
This patch fixes it for me on my powerbook (1.5GHz albook).  The issue
seems to be that the cpu objects to HID0_NAP being cleared in HID0.
If I have this code in power_save_6xx_restore, it hangs:
_GLOBAL(power_save_6xx_restore)
	mfspr	r11,SPRN_HID0
	rlwinm	r11,r11,0,10,8		/* Clear NAP */
	mtspr	SPRN_HID0,r11
	b	transfer_to_handler_cont
If I take out that rlwinm, it boots.  Bizarre.
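
For anyone decoding the rlwinm operands above, here is a minimal user-space sketch in C of what "rlwinm r11,r11,0,10,8" computes. The helper names ppc_mask() and rlwinm() are made up for illustration and are not kernel functions; the HID0_NAP value is assumed to be 0x00400000 (bit 9 counting from the MSB), which is what the "Clear NAP" comment implies. Because MB (10) is greater than ME (8), the mask wraps around and leaves out only bit 9.

```c
#include <stdint.h>

/* Assumption: HID0_NAP is bit 9 in IBM numbering (MSB = bit 0). */
#define HID0_NAP 0x00400000u

/*
 * Hypothetical helper: build the rlwinm mask for bits MB..ME,
 * numbered from the most significant bit (bit 0) to bit 31.
 */
static uint32_t ppc_mask(unsigned mb, unsigned me)
{
	uint32_t from_mb = 0xFFFFFFFFu >> mb;        /* ones in bits mb..31 */
	uint32_t to_me   = 0xFFFFFFFFu << (31 - me); /* ones in bits 0..me  */

	/* mb <= me: one contiguous run; mb > me: the run wraps around */
	return (mb <= me) ? (from_mb & to_me) : (from_mb | to_me);
}

/* Hypothetical model of rlwinm: rotate left by sh, then AND with the mask. */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
	uint32_t rot = sh ? ((rs << sh) | (rs >> (32 - sh))) : rs;

	return rot & ppc_mask(mb, me);
}
```

With these definitions ppc_mask(10, 8) works out to 0xFFBFFFFF, i.e. ~HID0_NAP, so rlwinm(hid0, 0, 10, 8) returns hid0 with only the NAP bit cleared and every other bit untouched.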
If you do that, you cause the transfer_to_handler to always call
power_save_6xx_restore even when not coming back from idle.
I did a bit more tracking and it's very strange.... At first, I
discovered that adding a printk after the call to power_save fixed it. I
did all sorts of tests and if my memory serves me well, a simple mb()
there will fix it too. In fact, what I noticed is that if I do
 if (mfmsr() & MSR_POW)
	printk("GACK !\n");
after calling ppc_md.power_save() and before local_irq_enable(), it does
trigger! But with an mb() just before, it doesn't. In fact, you don't
need an mb()... all you need is another mfmsr(). Thus a dummy mfmsr()
"fixes" the stale MSR_POW in there.
That is very dodgy. Looks like we get a stale MSR_POW upon return from
the exception that interrupted sleep, causing the next
local_irq_enable() to block forever.
The next question is how does it get there... my idea at first was that
we get MSR_POW in SRR1 in that exception and put it back in with rfi
(and the CPU gets it that way instead of ignoring it). Sounds like a
lovely explanation if we also add that a sync or an mfmsr "clears" this
weird condition. However, I added clearing of MSR_POW in r9 in
EXCEPTION_PROLOG_2() and it didn't fix it (but maybe I did something
wrong, I was tired).