
Bug#255175: kernel-image-2.4.26-1-686: system crash due to kernel bug



> I am not sure that would really help.
> Are you sure that it couldn't be a hardware problem?

I don't see any hardware problems in the log before the kernel oopses. If
there are hardware issues, then it's the kernel's fault that nothing gets
reported.

The only thing I can think of is that there might be some hard drive
problems that the kernel does not report, so that when it tries to use the
swap space it cannot read from or write to it, and this generates the
oopses. Isn't there a tool to test the swap space (besides 'mkswap -c')?

The one thing I'm surprised about is that the oopses vary somewhat in their 
messages:

 kernel BUG at mmap.c:1172!
 kernel BUG at page_alloc.c:152!
 kernel BUG at page_alloc.c:221!

Digging into the code for the first one, I find it in exit_mmap() in mm/mmap.c:
        /* This is just debugging */
        if (mm->map_count)
                BUG();

And the code for the page_alloc ones is:

mm/page_alloc.c:
     84 static void FASTCALL(__free_pages_ok (struct page *page, unsigned int order));
     85 static void __free_pages_ok (struct page *page, unsigned int order)
     86 {
(...)
    149                 buddy1 = base + (page_idx ^ -mask);
    150                 buddy2 = base + page_idx;
    151                 if (BAD_RANGE(zone,buddy1))
    152                         BUG();
    153                 if (BAD_RANGE(zone,buddy2))
    154                         BUG();
(...)
    203 static struct page * rmqueue(zone_t *zone, unsigned int order)
    204 {
(...)
    219                         page = list_entry(curr, struct page, list);
    220                         if (BAD_RANGE(zone,page))
    221                                 BUG();


I don't have in-depth knowledge of the kernel, but I don't believe that
hardware issues can make the above code hit those BUG() calls. It looks to
me as if the kernel is somehow not handling its swap definitions properly.
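
For reference, this is how I read the BAD_RANGE() check: as far as I can
tell, it tests that a page pointer falls inside the zone's slice of the page
array and that the page is marked as belonging to that zone. Below is a
user-space sketch of that reading only. The names sketch_page, sketch_zone
and sketch_bad_range are made up for illustration, and the real kernel
structures and macro differ.

```c
#include <stddef.h>

/* Simplified stand-ins for illustration; NOT the kernel's definitions. */
struct sketch_page {
    int zone_id;          /* zone this page claims to belong to */
};

struct sketch_zone {
    int id;               /* identifier of this zone */
    size_t start_idx;     /* index of the zone's first page in the page array */
    size_t size;          /* number of pages in the zone */
};

/*
 * Returns 1 when `page` does not belong to `zone`: its index in `map` is
 * below the zone start, at or past the zone end, or the page is tagged
 * with a different zone. Returns 0 otherwise.
 * Assumes `page` points into the array that starts at `map`.
 */
static int sketch_bad_range(const struct sketch_zone *zone,
                            const struct sketch_page *map,
                            const struct sketch_page *page)
{
    ptrdiff_t idx = page - map;

    if (idx < (ptrdiff_t)zone->start_idx)
        return 1;
    if ((size_t)idx >= zone->start_idx + zone->size)
        return 1;
    if (page->zone_id != zone->id)
        return 1;
    return 0;
}
```

If that reading is right, the BUG() at lines 152 and 221 fires whenever a
page pointer taken from the free lists lands outside its zone or carries the
wrong zone tag.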

Can you think of a way in which I could reproduce these errors and maybe
trace the kernel to see what's going on?

> It seems to be rather intermittent and do not have any
> other reports of similar failures.

The "intermittency" might be related to the fact that it's a problem in the
cleanup of swap pages: when swap is not used, the problem does not show
up. For what it's worth, on my system:

$ free
             total       used       free     shared    buffers     cached
Mem:        386156     381800       4356          0      15712     258080
-/+ buffers/cache:     108008     278148
Swap:       979956       1088     978868

So swap is not usually used. The oopses seem to appear when cron jobs make
intensive use of the system and swap usage goes up and down.
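
A crude way to push swap usage up and down on demand might be a small
user-space program that allocates more memory than is free, writes a known
pattern to it and reads it back. The sketch below is only an idea and I have
not checked that it triggers the oops; fill_and_verify is a hypothetical
helper name, and the pattern constant is arbitrary.

```c
#include <stdlib.h>
#include <stddef.h>

/*
 * Hypothetical helper: allocate `bytes` of memory and, for each pass, fill
 * it with an index-dependent pattern and then read it back.
 * Returns the number of words that did not read back as written, or -1 if
 * the buffer could not be allocated. When `bytes` exceeds free RAM, the
 * kernel has to move pages out to swap and bring them back in.
 */
static long fill_and_verify(size_t bytes, unsigned passes)
{
    size_t n = bytes / sizeof(unsigned long);
    volatile unsigned long *buf;
    long bad = 0;

    if (n == 0)
        return -1;
    buf = malloc(n * sizeof(unsigned long));
    if (buf == NULL)
        return -1;

    for (unsigned p = 0; p < passes; p++) {
        /* Write phase: touch every word so every page is dirtied. */
        for (size_t i = 0; i < n; i++)
            buf[i] = (unsigned long)i ^ (0x5a5a5a5aUL + p);
        /* Read phase: count words that came back different. */
        for (size_t i = 0; i < n; i++)
            if (buf[i] != ((unsigned long)i ^ (0x5a5a5a5aUL + p)))
                bad++;
    }

    free((void *)buf);
    return bad;
}
```

Calling it from main() with a size somewhat above the "free" figure shown
above and a handful of passes should force some swapping. A non-zero return
value would point at memory or disk corruption rather than at the kernel's
bookkeeping.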

> Pending a way to reliably reproduce the problem, 
> or at least some confirmation that it manifests on
> different hardware I have changed the severity to important.

I understand this, but I would appreciate some indication of how to debug
this issue myself if necessary and trace what the kernel is running
into.

Regards

Javier
