
Re: System freezes, time jumps 3.25 days (was Re: sparc buildd issues)



Dave,

Below is a patch that closes a race condition when reading the STICK
register on Hummingbird CPUs. Previously the code always incremented the
high word if the low word had wrapped, but we cannot know whether the
high word was read before or after the wrap.

This may (or may not :-) ) be the cause of a kernel lockup seen on a
number of Hummingbird systems running the Debian kernel 2.6.14 where
time seems to jump forward 3 days 6 hours shortly before the system
locks up.
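
For reference, the wrap arithmetic can be checked with a small standalone
sketch. The helper names are made up for illustration and are not kernel
code. At exactly 5.5 MHz a 32-bit word wraps after about 781 seconds, close
to the figure quoted below, and 2^48 nanoseconds comes to 3 days 6 hours
11 minutes 15 seconds, which matches the reported jump. Whether that match
is more than a coincidence is an assumption, not something established in
this thread.

```c
#include <assert.h>
#include <stdint.h>

/* A duration broken into days, hours, minutes and seconds. */
struct dhms {
    uint64_t days, hours, minutes, seconds;
};

/* Split a whole number of seconds into d/h/m/s. */
static struct dhms split_seconds(uint64_t total)
{
    struct dhms r;

    r.days = total / 86400;
    r.hours = (total % 86400) / 3600;
    r.minutes = (total % 3600) / 60;
    r.seconds = total % 60;
    return r;
}

/* Seconds, rounded to nearest, for a counter of `bits` bits (bits < 64)
 * to wrap when it advances `rate` times per second. */
static uint64_t wrap_seconds(unsigned bits, uint64_t rate)
{
    uint64_t span = (uint64_t)1 << bits;

    return (span + rate / 2) / rate;
}

static void check_wrap_arithmetic(void)
{
    /* 32-bit low word at an assumed 5.5 MHz tick rate. */
    assert(wrap_seconds(32, 5500000ULL) == 781);

    /* 2^48 nanoseconds expressed as d/h/m/s. */
    struct dhms j = split_seconds(wrap_seconds(48, 1000000000ULL));
    assert(j.days == 3 && j.hours == 6 &&
           j.minutes == 11 && j.seconds == 15);
}
```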

Patch below.

Richard

<-- snip -->

Ensure the STICK register value is read correctly when the register
rolls over.

Signed-off-by: Richard Mortimer <richm@oldelvet.org.uk>



--- linux-2.6-2.6.14+2.6.15-rc5/arch/sparc64/kernel/time.c.orig 2006-01-03 14:27:36.000000000 +0000
+++ linux-2.6-2.6.14+2.6.15-rc5/arch/sparc64/kernel/time.c      2006-01-03 22:20:19.000000000 +0000
@@ -280,9 +280,9 @@
  * Since STICK is constantly updating, we have to access it carefully.
  *
  * The sequence we use to read is:
- * 1) read low
- * 2) read high
- * 3) read low again, if it rolled over increment high by 1
+ * 1) read high
+ * 2) read low
+ * 3) read high again, if it rolled over re-read both low and high.
  *
  * Writing STICK safely is also tricky:
  * 1) write low to zero
@@ -295,18 +295,18 @@
 static unsigned long __hbird_read_stick(void)
 {
        unsigned long ret, tmp1, tmp2, tmp3;
-       unsigned long addr = HBIRD_STICK_ADDR;
+       unsigned long addr = HBIRD_STICK_ADDR+8;

        __asm__ __volatile__("ldxa      [%1] %5, %2\n\t"
-                            "add       %1, 0x8, %1\n\t"
-                            "ldxa      [%1] %5, %3\n\t"
+                            "1:\n\t"
                             "sub       %1, 0x8, %1\n\t"
+                            "ldxa      [%1] %5, %3\n\t"
+                            "add       %1, 0x8, %1\n\t"
                             "ldxa      [%1] %5, %4\n\t"
                             "cmp       %4, %2\n\t"
-                            "blu,a,pn  %%xcc, 1f\n\t"
-                            " add      %3, 1, %3\n"
-                            "1:\n\t"
-                            "sllx      %3, 32, %3\n\t"
+                            "bne,a,pn  %%xcc, 1b\n\t"
+                            " mov      %4, %2\n"
+                            "sllx      %4, 32, %4\n\t"
                             "or        %3, %4, %0\n\t"
                             : "=&r" (ret), "=&r" (addr),
                               "=&r" (tmp1), "=&r" (tmp2), "=&r" (tmp3)




On Sat, 2005-12-31 at 00:41 +0000, Richard Mortimer wrote:
> On Thu, 2005-12-29 at 16:46 -0800, Jurij Smakov wrote:
> > On Wed, 28 Dec 2005, Blars Blarson wrote:
> > 
> > > In article <3c3e3fca0512271004y7ee667f1tf6d4ef2ac72282b2@mail.gmail.com>
> > > bill@herrin.us writes:
> > >
> > >> On a Sparc netra X1, the system partially freezes (some stuff continues
> > >> running but at least one of the operations necessary to log in gets stuck).
> > >> The logs show that as of the moment of the freeze, the clock has jumped
> > >> forward exactly 3 days, 6 hours,
> > >> 11 minutes and 15 seconds. The change is not gradual; it jumps between
> > >> syslog marks set a minute apart.
> > >
> > > This is not what I have seen.  It sounds like an unrelated issue.
> > 
> > Yeah, I haven't heard about the jumping time issue before. It was reported 
> > that it is absent in 2.6.14, could you please test it? If it's really 
> > gone, it would be the easiest way out.
> 
> I have seen occasional freezes on a Netra X1 running 2.6.14-2.
> Previously I had just put it down to bad hardware or power supply
> glitches (I only use the machine occasionally for testing stuff out).
> Now having seen these discussions I have looked back in my logs and can
> see at least one occurrence of something that looks like the 3 days 6
> hours stuff. The syslog timestamps jump by roughly that amount for a
> couple of entries, and then the machine hangs until I notice and
> power-cycle it.
> 
> Anyway that got me thinking as to what may cause this sort of thing.
> Past experience suggests that it would probably be caused by an overflow
> of the low 32 bits of a 64-bit counter or something like that. I
> couldn't make the numbers work out sensibly for any of the clock
> frequencies that the machine claims to use.
> 
> But I did notice something in __hbird_read_stick in
> arch/sparc64/kernel/time.c.
> The comment (and indeed the code) says that it has to read two 32-bit
> registers in I/O space and that it has to take care of overflow using
> the following sequence.
> 
>  * The sequence we use to read is:
>  * 1) read low
>  * 2) read high
>  * 3) read low again, if it rolled over increment high by 1
> 
> Now to me it seems that if we see the low roll over then always
> incrementing high could be wrong because we could have read it before or
> after the roll over. I think a better solution would be as follows:
> 
> 1) read high
> 2) read low
> 3) read high, if high changed then start again
> 
> I do not know how this could cause the problem because the counter runs
> at 5.5 MHz and would overflow every 773 seconds or so. But maybe the
> possibility of hitting this every 10-15 minutes combined with another
> problem could give us the symptoms that we see.
> 
> Richard
> 
> P.S. Sorry but I don't have time to generate a patch today but will do
> so early next week unless someone else has beaten me to it.
> 
> -- 
> Richard Mortimer <richm@oldelvet.org.uk>
> 
> 
-- 
Richard Mortimer <richm@oldelvet.org.uk>


