Bug#255175: kernel-image-2.4.26-1-686: system crash due to kernel bug

To: Horms <horms@verge.net.au>
Cc: 255175@bugs.debian.org
Subject: Bug#255175: kernel-image-2.4.26-1-686: system crash due to kernel bug
From: Javier Fernández-Sanguino Peña <jfs@computer.org>
Date: Thu, 29 Jul 2004 09:20:29 +0200
Message-id: <20040729072029.GA14201@dat.etsit.upm.es>
Reply-to: Javier Fernández-Sanguino Peña <jfs@computer.org>, 255175@bugs.debian.org
In-reply-to: <20040729020731.GB26307@verge.net.au>
References: <20040726185719.GA30849@dat.etsit.upm.es> <20040727014607.GD24601@verge.net.au> <20040727061854.GA7408@dat.etsit.upm.es> <20040727070411.GA28789@verge.net.au> <20040727233347.GC30468@dat.etsit.upm.es> <20040729020731.GB26307@verge.net.au>

> I am not sure that would really help.
> Are you sure that it couldn't be a hardware problem?

I don't see any hardware problems in the log before the kernel oopses. If
there are hardware issues, then it's the kernel's fault that nothing gets
reported.

The only thing I can think of is that there might be some hard drive
problems which don't get reported by the kernel, so that when it tries to
use the swap space it cannot read from or write to it, and this generates
the oopses. Isn't there a tool to test the swap space? (besides
'mkswap -c')
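
One non-destructive way to probe for unreported read errors is to read the
swap partition block by block and count the reads that fail. The following
is only a minimal sketch under assumptions: the function name
count_unreadable_blocks() is hypothetical, the path would be whatever device
holds the swap area, and a clean pass does not prove the disk is healthy
(writes are not tested, and some reads may be served from cache).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` read-only, read it in chunks of
 * `block_size` bytes and count the chunks for which pread() fails.
 * Returns the number of failed chunks, or -1 if the scan could not start.
 */
long count_unreadable_blocks(const char *path, size_t block_size)
{
    assert(path != NULL);
    if (block_size == 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char *buf = malloc(block_size);
    off_t end = lseek(fd, 0, SEEK_END);   /* size of the file or device */
    if (buf == NULL || end < 0) {
        free(buf);
        close(fd);
        return -1;
    }

    long bad = 0;
    for (off_t off = 0; off < end; off += (off_t)block_size) {
        /* pread() does not move the file offset, so a failed chunk
         * is simply skipped and the scan continues with the next one. */
        if (pread(fd, buf, block_size, off) < 0)
            bad++;
    }

    free(buf);
    close(fd);
    return bad;
}
```

A non-zero result from a call such as count_unreadable_blocks(path, 4096)
would point towards the disk; zero leaves the question open.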

The one thing I'm surprised about is that the oopses vary somewhat in their 
messages:

 kernel BUG at mmap.c:1172!
 kernel BUG at page_alloc.c:152!
 kernel BUG at page_alloc.c:221!

Digging into the code, I find the first one in exit_mmap() in mm/mmap.c:
        /* This is just debugging */
        if (mm->map_count)
                BUG();
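
To illustrate what this check guards against, here is a rough user-space
model. It assumes (from a reading of the 2.4 source, not verified against
2.4.26 specifically) that exit_mmap() decrements mm->map_count once for each
mapping it tears down, so a non-zero leftover means the counter and the
mapping list disagreed. The struct layouts and the name teardown_leftover()
are hypothetical simplifications, not the kernel's own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for vm_area_struct. */
struct vma {
    struct vma *next;
};

/* Hypothetical, heavily simplified stand-in for mm_struct. */
struct mm {
    struct vma *mmap;   /* head of the list of mappings */
    int map_count;      /* how many mappings the mm believes it has */
};

/*
 * Model of the teardown loop: walk the list and decrement the counter
 * once per mapping. Returns the leftover count; in the kernel a
 * non-zero value here is what triggers BUG().
 */
int teardown_leftover(struct mm *mm)
{
    assert(mm != NULL);

    struct vma *v = mm->mmap;
    mm->mmap = NULL;
    while (v != NULL) {
        struct vma *next = v->next;
        mm->map_count--;
        v = next;
    }
    return mm->map_count;
}
```

With a list of two mappings, a counter of 2 leaves nothing behind, while a
counter of 3 leaves 1 and would hit the BUG() in the real code.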

And the code for the page_alloc ones is:

mm/page_alloc.c:
     84 static void FASTCALL(__free_pages_ok (struct page *page, unsigned int order));
     85 static void __free_pages_ok (struct page *page, unsigned int order)
     86 {
(...)
    149                 buddy1 = base + (page_idx ^ -mask);
    150                 buddy2 = base + page_idx;
    151                 if (BAD_RANGE(zone,buddy1))
    152                         BUG();
    153                 if (BAD_RANGE(zone,buddy2))
    154                         BUG();
(...)
    203 static struct page * rmqueue(zone_t *zone, unsigned int order)
    204 {
(...)
    219                         page = list_entry(curr, struct page, list);
    220                         if (BAD_RANGE(zone,page))
    221                                 BUG();


I don't have in-depth knowledge of the kernel, but I don't believe that
hardware issues can make the above code trigger those BUG() calls. It looks
to me as if the kernel is somehow not handling its swap definitions properly.
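
For reference, a rough user-space sketch of what the two page_alloc checks
appear to test. It assumes that BAD_RANGE() in 2.4 verifies that the page's
index in mem_map falls inside the zone and that the page points back at the
same zone, and that mask is (~0UL) << order, so that -mask equals 1 << order.
The struct layouts and the names bad_range() and buddy_index() are
hypothetical simplifications.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's zone_t and struct page. */
struct zone {
    size_t start_mapnr;     /* index in mem_map of the zone's first page */
    size_t size;            /* number of pages in the zone */
};

struct page {
    struct zone *zone;      /* zone this page claims to belong to */
};

/*
 * Model of BAD_RANGE(): returns 1 if `page` lies outside the zone's
 * slice of mem_map or is tagged with a different zone, else 0.
 * `page` must point into the mem_map array.
 */
int bad_range(const struct zone *zone, const struct page *mem_map,
              const struct page *page)
{
    assert(zone != NULL && mem_map != NULL && page != NULL);

    size_t idx = (size_t)(page - mem_map);
    if (idx < zone->start_mapnr)
        return 1;
    if (idx >= zone->start_mapnr + zone->size)
        return 1;
    return page->zone != zone;
}

/*
 * Model of the buddy computation on line 149: with
 * mask = (~0UL) << order, -mask is 1 << order, so the XOR
 * flips exactly the bit that distinguishes the two buddies.
 */
unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
    unsigned long mask = (~0UL) << order;
    return page_idx ^ -mask;
}
```

In this model the check fires whenever a free-list entry or a computed buddy
ends up pointing at a page outside the zone, whatever the cause of the bad
value.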

Can you think of a way in which I could reproduce these errors and maybe
trace the kernel to see what's going on?

> It seems to be rather intermittent and I do not have any
> other reports of similar failures.

The "intermittency" might be related to the fact that it's a problem in the
cleanup of swap pages: when swap is not used, the problem does not show
up. For what it's worth, on my system:

$ free
             total       used       free     shared    buffers     cached
Mem:        386156     381800       4356          0      15712     258080
-/+ buffers/cache:     108008     278148
Swap:       979956       1088     978868

So swap is not usually used. The oopses seem to appear when cron jobs make
intensive use of the system and the swap usage goes up and down.
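
To check whether the oopses really coincide with swap activity, swap usage
could be logged at intervals and compared with the oops timestamps. A
minimal sketch of the parsing step, assuming the input is text in the format
of /proc/meminfo with its SwapTotal and SwapFree fields; the function name
swap_used_kb() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan text in /proc/meminfo format line by line
 * and return SwapTotal minus SwapFree in kB, or -1 if either field is
 * missing.
 */
long swap_used_kb(const char *meminfo)
{
    assert(meminfo != NULL);

    long total = -1, free_kb = -1;
    const char *line = meminfo;

    while (line != NULL && *line != '\0') {
        long v;
        if (sscanf(line, "SwapTotal: %ld", &v) == 1)
            total = v;
        else if (sscanf(line, "SwapFree: %ld", &v) == 1)
            free_kb = v;

        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }

    if (total < 0 || free_kb < 0)
        return -1;
    return total - free_kb;
}
```

Fed the swap figures from the 'free' output above (total 979956, free
978868), this gives 1088 kB used, matching the "used" column.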

> Pending a way to reliably reproduce the problem, 
> or at least some confirmation that it manifests on
> different hardware I have changed the severity to important.

I understand this, but I would appreciate some indication of how to debug
this issue myself if necessary and trace what the kernel is running
into.

Regards

Javier
