
Bug#637085: linux-image-2.6.32-5-amd64: Hard hang following BUG: scheduling while atomic: swapper/0/0x10000100



Hi,

Paul Elliott wrote:

> We are experiencing hard lock ups when under heavy load. See below
> for the log entries we have managed to capture via remote syslog
> before the machine locks completely.

Thanks; this looks very useful.  Let's see.

> The machine is a BL460c G7 and
> is performing multiple I/O stress tests to ext4 filesystems
> presented over FC via a qlogic card connected to a P2000 MSA G3.
[...]
> kernel BUG at [...]/kernel/workqueue.c:287!
> invalid opcode: 0000 [#1] SMP 
> last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:06:00.1/host1/rport-1:0-0/target1:0:0/1:0:0:2/block/sdh/stat
> CPU 0 
> Modules linked in: xfs exportfs ext4 jbd2 crc16 ext2 dm_round_robin dm_multipath scsi_dh loop sd_mod crc_t10dif snd_pcm snd_timer snd soundcore hpwdt snd_page_alloc hpilo joydev psmouse power_meter evdev container serio_raw button pcspkr processor ext3 jbd mbcache usbhid hid dm_mod hpsa qla2xxx uhci_hcd scsi_transport_fc cciss scsi_tgt ehci_hcd usbcore nls_base scsi_mod thermal thermal_sys be2net [last unloaded: scsi_wait_scan]
> Pid: 1996, comm: ext4-dio-unwrit Not tainted 2.6.32-5-amd64 #1 ProLiant BL460c G7

First BUG.

[...]
> Code: 08 48 8b 50 08 48 89 51 08 48 89 0a 48 89 00 48 89 40 08 66 ff 45 00 fb 66 0f 1f 44 00 00 49 8b 45 f8 48 83 e0 fc 48 39 c5 74 04 <0f> 0b eb fe f0 41 80 65 f8 fe 4c 89 e7 ff 54 24 38 48 8b 44 24 
> RIP  [<ffffffff810618d6>] worker_thread+0x177/0x21d
>  RSP <ffff880587189e40>

scripts/decodecode tells us the invalid opcode is ud2 from

	BUG_ON(get_wq_data(work) != cwq);

tripping.  This sanity check was introduced in ancient times
(v2.5.41~34^2~1^2, Workqueue Abstraction, 2002-09-30).  It fires
when a work item taken off a cwq's worklist does not point back to
that same cwq, so its failure indicates that the cwq's worklist was
corrupted somehow.

I assume this is fairly reproducible even after a reboot?  Is
the stacktrace from the first sign of trouble in dmesg always
the same?  Did this machine work well with other kernels
before (and if so, which ones)?

If you get a chance to run memtest86+, that would also be
useful, of course.

Hope that helps,
Jonathan


