Bug#637085: linux-image-2.6.32-5-amd64: Hard hang following BUG: scheduling while atomic: swapper/0/0x10000100

To: Paul Elliott <paul.elliott@ynic.york.ac.uk>
Cc: 637085@bugs.debian.org
Subject: Bug#637085: linux-image-2.6.32-5-amd64: Hard hang following BUG: scheduling while atomic: swapper/0/0x10000100
From: Jonathan Nieder <jrnieder@gmail.com>
Date: Mon, 8 Aug 2011 20:36:06 +0200
Message-id: <20110808183606.GD4222@elie.gateway.2wire.net>
Reply-to: Jonathan Nieder <jrnieder@gmail.com>, 637085@bugs.debian.org
In-reply-to: <4E401891.3010002@ynic.york.ac.uk>
References: <20110808122424.2597.81303.reportbug@sulcus.ynic.york.ac.uk> <20110808151643.GA20726@elie.gateway.2wire.net> <4E401891.3010002@ynic.york.ac.uk>

Paul Elliott wrote:

> I'm no expert at reading these but I believe it is the same. Here's the
> trace after the next reboot/lock up cycle:
>
> kernel BUG at [...]/mm/slub.c:2969!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:06:00.1/host1/rport-1:0-3/target1:0:1/1:0:1:0/block/sdj/stat
> CPU 0
> Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs ext4 jbd2 crc16 ext2 dm_round_robin dm_multipath scsi_dh loop sd_mod crc_t10dif snd_pcm joydev snd_timer snd soundcore snd_page_alloc usbhid hid evdev pcspkr hpilo hpwdt psmouse power_meter container processor button serio_raw ext3 jbd mbcache dm_mod hpsa cciss uhci_hcd ehci_hcd qla2xxx usbcore scsi_transport_fc nls_base scsi_tgt scsi_mod be2net thermal thermal_sys [last unloaded: scsi_wait_scan]
> Pid: 1845, comm: ext4-dio-unwrit Not tainted 2.6.32-5-amd64 #1 ProLiant BL460c G7
> RIP: 0010:[<ffffffff810e730b>]  [<ffffffff810e730b>] kfree+0x55/0xcb

Not identical.  This time it is at mm/slub.c:2969, which is

	BUG_ON(!PageCompound(page));

checking that the result from virt_to_head_page(x) is sane in
kfree().
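
As a rough illustration of what that check guards against, here is a
small userspace model of the decision kfree() makes once it has the head
page. It is a sketch, not the kernel source: struct model_page, enum
kfree_path and model_kfree_path() are made-up names, and the two booleans
merely stand in for PageSlab(page) and PageCompound(page).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct page; only the two
 * page flags relevant to this check are modeled. */
struct model_page {
	bool slab;      /* models PageSlab(page) */
	bool compound;  /* models PageCompound(page) */
};

/* The three outcomes this model distinguishes. */
enum kfree_path {
	KFREE_SLAB_FREE,  /* ordinary slab object, freed back to its slab */
	KFREE_PUT_PAGE,   /* non-slab compound page, released as a page */
	KFREE_BUG         /* neither: the BUG_ON(!PageCompound(page)) case */
};

/* Classify a page the way the check described above would treat it. */
static enum kfree_path model_kfree_path(const struct model_page *page)
{
	if (page->slab)
		return KFREE_SLAB_FREE;
	if (!page->compound)
		return KFREE_BUG;
	return KFREE_PUT_PAGE;
}
```

In this model the trace above corresponds to the KFREE_BUG outcome: the
pointer handed to kfree() resolved to a page that was neither a slab
page nor a compound page, which is what one would expect from a stale or
corrupted pointer.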

But in both cases, ext4_end_aio_dio_work is at the top of the
stack.  That could be because your workload keeps it at the top of
the stack almost all the time, or because memory corruption happens
before it is called and only gets detected around then.

I don't have many ideas.  Would it be possible to try version 3.0.0-1
from sid to see if it exhibits the same problem, and if so, report
this upstream at bugzilla.kernel.org, product File System, component
ext4, and let us know the bug number?

Thanks and sorry for the trouble,
Jonathan
