
System freezes, time jumps 3.25 days (was Re: sparc buildd issues)



On Thu, 2005-12-29 at 16:46 -0800, Jurij Smakov wrote:
> On Wed, 28 Dec 2005, Blars Blarson wrote:
> 
> > In article <3c3e3fca0512271004y7ee667f1tf6d4ef2ac72282b2@mail.gmail.com>
> > bill@herrin.us writes:
> >
> >> On a Sparc netra X1, the system partially freezes (some stuff continues
> >> running but at least one of the operations necessary to log in gets stuck).
> >> The logs show that as of the moment of the freeze, the clock has jumped
> >> forward exactly 3 days, 6 hours,
> >> 11 minutes and 15 seconds. The change is not gradual; it jumps between
> >> syslog marks set a minute apart.
> >
> > This is not what I have seen.  It sounds like an unrelated issue.
> 
> Yeah, I haven't heard about the jumping time issue before. It was reported 
> that it is absent in 2.6.14, could you please test it? If it's really 
> gone, it would be the easiest way out.

I have seen occasional freezes on a Netra X1 running 2.6.14-2.
Previously I had just put it down to bad hardware or power supply
glitches (I only use the machine occasionally for testing stuff out).
Now, having seen these discussions, I have looked back in my logs and can
see at least one occurrence of something that looks like the 3 days 6
hours jump. The syslog timestamps jump by roughly that amount for a
couple of entries, and then the machine hangs until I notice and
power-cycle it.

Anyway, that got me thinking about what might cause this sort of thing.
Past experience suggests that it is probably caused by an overflow of
the low 32 bits of a 64-bit counter, or something like that. However, I
couldn't get any of the clock frequencies that the machine claims to use
to give an overflow period that looks at all sensible.

But I did notice something in __hbird_read_stick in
arch/sparc64/kernel/time.c.
The comment (and indeed the code) says that it has to read two 32-bit
registers in I/O space, and that it takes care of overflow using the
following sequence:

 * The sequence we use to read is:
 * 1) read low
 * 2) read high
 * 3) read low again, if it rolled over increment high by 1

Now to me it seems that if we see the low word roll over, then always
incrementing high could be wrong, because high could have been read
either before or after the rollover. I think a better solution would be
as follows:

1) read high
2) read low
3) read high again; if high changed then start again
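
A minimal user-space sketch of the two sequences follows, for comparison.
It is not the kernel code: the mock_stick struct and the mock_read_*
helpers are made up for illustration and stand in for the real I/O
register reads, and the first function is only a literal reading of the
comment quoted above.

```c
#include <stdint.h>

/* Hypothetical mock of the two 32-bit halves: every read advances the
 * underlying 64-bit counter by `step` ticks, to model time passing. */
struct mock_stick {
    uint64_t value;
    uint64_t step;
};

static uint32_t mock_read_low(struct mock_stick *s)
{
    uint32_t v = (uint32_t)s->value;
    s->value += s->step;
    return v;
}

static uint32_t mock_read_high(struct mock_stick *s)
{
    uint32_t v = (uint32_t)(s->value >> 32);
    s->value += s->step;
    return v;
}

/* Literal reading of the quoted comment: read low, read high, read low
 * again, and increment high if low went backwards. */
static uint64_t read_low_high_low(struct mock_stick *s)
{
    uint32_t lo = mock_read_low(s);
    uint32_t hi = mock_read_high(s);
    uint32_t lo2 = mock_read_low(s);

    if (lo2 < lo)
        hi++;   /* wrong if high was already read after the rollover */
    return ((uint64_t)hi << 32) | lo2;
}

/* Proposed sequence: read high, read low, read high again, and start
 * over if high changed between the two reads. */
static uint64_t read_high_low_high(struct mock_stick *s)
{
    uint32_t hi, lo, hi2;

    do {
        hi = mock_read_high(s);
        lo = mock_read_low(s);
        hi2 = mock_read_high(s);
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}
```

With the mock counter started at 0x00000001FFFFFFFF and one tick per
read, the first function returns a value a whole 2^32 ticks too large,
while the second retries once and returns a value that lies within the
window of its reads.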

I do not know how this could cause the problem, because the counter runs
at 5.5 MHz and its low word would overflow every 773 seconds or so. But
maybe the possibility of hitting this every 10-15 minutes, combined with
another problem, could give us the symptoms that we see.
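
The figures above can be checked with a little arithmetic. The sketch
below takes the 5.5 MHz quoted above as exact, which is an assumption
(the result shifts with the precise frequency), and the helper names are
made up for illustration.

```c
#include <stdint.h>

/* Seconds for a 32-bit word to wrap at a given tick rate in Hz. */
static double wrap_seconds_32(double hz)
{
    return 4294967296.0 / hz;   /* 2^32 ticks */
}

/* The reported jump of 3 days, 6 hours, 11 minutes and 15 seconds,
 * expressed in seconds. */
static uint32_t reported_jump_seconds(void)
{
    return ((3u * 24u + 6u) * 60u + 11u) * 60u + 15u;
}
```

At exactly 5.5 MHz the low word wraps after about 781 seconds, roughly
13 minutes, while the reported jump comes to 281475 seconds, several
hundred times longer than one wrap of the low word.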

Richard

P.S. Sorry but I don't have time to generate a patch today but will do
so early next week unless someone else has beaten me to it.

-- 
Richard Mortimer <richm@oldelvet.org.uk>


