On Tue, 2009-08-18 at 20:38 +0400, Nikita V. Youshchenko wrote:
> Package: linux-image-2.6.26-2-xen-amd64
> Version: 2.6.26-17lenny2
> Severity: normal
>
> I'm running several servers (Dual Core AMD Opteron, 16G of ram) with
> Xen.
>
> Dom0 is debian lenny adm64.
>
> Today I tried to run a large backup in dom0 - copy NFS-mounted partition
> to a local volume group.
>
> buki:~# cd backup/
> buki:~/backup# mount homes:/srv/homes src
> buki:~/backup# mount /dev/vg/homes-backup dst
> buki:~/backup# (cd src && tar cpf - .) | (cd dst && tar xpf -)
>
> This resulted into server crash in a few minutes.
>
> Crash is reproducible - if after reboot I try the same, same crash
> happens soon with similary-looking backtrace.
>
> [ 712.450538] ------------[ cut here ]------------
> [ 712.450538] kernel BUG at drivers/xen/core/spinlock.c:74!
> [ 712.450538] invalid opcode: 0000 [1] SMP
> [ 712.450538] CPU 0
> [ 712.450538] Modules linked in: nfs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs sctp libcrc32c ipv6 lock_dlm gfs2 dlm configfs bridge 8021q dm_emc dm_round_robin dm_multipath loop psmouse pcspkr serio_raw k8temp i2c_amd8111 i2c_amd756 i2c_core button amd_rng rng_core shpchp pci_hotplug evdev reiserfs dm_mirror dm_log dm_snapshot dm_mod raid456 async_xor async_memcpy async_tx xor raid1 md_mod ide_cd_mod cdrom ide_pci_generic sd_mod amd74xx ide_core floppy lpfc scsi_transport_fc scsi_tgt tg3 e100 mii sata_sil ata_generic ohci_hcd libata scsi_mod dock thermal processor fan thermal_sys
> [ 712.450538] Pid: 5636, comm: tar Not tainted 2.6.26-2-xen-amd64 #1
> [ 712.450538] RIP: e030:[<ffffffff8037fa9c>] [<ffffffff8037fa9c>] xen_spin_wait+0x90/0x139
> [ 712.450538] RSP: e02b:ffffffff80595e38 EFLAGS: 00010246
> [ 712.450538] RAX: ffff88001b1adfd8 RBX: ffff880001bbda20 RCX: ffff88008162e000
> [ 712.450538] RDX: ffffffffff5f7000 RSI: 0000000000008483 RDI: 0000000000000000
> [ 712.450538] RBP: 0000000000000103 R08: 00000000dc38d979 R09: ffffffffffffffff
> [ 712.450538] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [ 712.450538] R13: ffff88001fe7f698 R14: ffff88001fe7f698 R15: 0000000000000000
> [ 712.450538] FS: 00007f3ef40eb6e0(0000) GS:ffffffff80539000(0000) knlGS:0000000000000000
> [ 712.450538] CS: e033 DS: 0000 ES: 0000
> [ 712.450538] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 712.450538] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 712.450538] Process tar (pid: 5636, threadinfo ffff88001b1ac000, task ffff88001e270280)
> [ 712.450538] Stack: ffff88001da47480 ffff88001fe7f698 0000000000008483 0000000000000000
> [ 712.450538] 0000000000000013 ffffffff80435886 0000000000000000 ffff88001d8f8700
> [ 712.450538] 0000000000000000 ffffffffa00854f4 0000000000000000 ffffc2000001ac00
> [ 712.450538] Call Trace:
> [ 712.450538] <IRQ> [<ffffffff80435886>] ? _spin_lock+0x3a/0x42
> [ 712.450538] [<ffffffffa00854f4>] ? :sata_sil:sil_interrupt+0x2b/0x275
> [ 712.450538] [<ffffffff8025f0e6>] ? handle_IRQ_event+0x47/0x8e
> [ 712.450538] [<ffffffff802605a2>] ? handle_level_irq+0xae/0x116
> [ 712.450538] [<ffffffff8020e13e>] ? do_IRQ+0x4e/0x9a
> [ 712.450538] [<ffffffff8037d4cc>] ? evtchn_do_upcall+0x13c/0x1fc
> [ 712.450538] [<ffffffff8020bbde>] ? do_hypervisor_callback+0x1e/0x30
> [ 712.450538] <EOI> [<ffffffff802e48d4>] ? dummy_inode_setattr+0x0/0x3
> [ 712.450538] [<ffffffff8037cbf9>] ? xen_poll_irq+0x67/0x74
> [ 712.450538] [<ffffffff8037fb06>] ? xen_spin_wait+0xfa/0x139
> [ 712.450538] [<ffffffff80435cfd>] ? lock_kernel+0x4f/0x63
> [ 712.450538] [<ffffffffa01dcd12>] ? :reiserfs:reiserfs_setattr+0x25/0x27d
> [ 712.450538] [<ffffffff8031156f>] ? __up_read+0x13/0x8a
> [ 712.450538] [<ffffffffa01f4064>] ? :reiserfs:reiserfs_getxattr+0xa9/0xbe
> [ 712.450538] [<ffffffff8029e7e3>] ? notify_change+0x174/0x2f5
> [ 712.450538] [<ffffffff80288db5>] ? chown_common+0x8c/0xa1
> [ 712.450538] [<ffffffff802a1923>] ? mnt_want_write+0x31/0x86
> [ 712.450538] [<ffffffff80288f83>] ? sys_chown+0x4b/0x6f
> [ 712.450538] [<ffffffff8029a9d8>] ? dput+0x35/0x13e
> [ 712.450538] [<ffffffff802ab10e>] ? sys_futimesat+0x88/0x9e
> [ 712.450538] [<ffffffff8020b528>] ? system_call+0x68/0x6d
> [ 712.450538] [<ffffffff8020b4c0>] ? system_call+0x0/0x6d
> [ 712.450538]
> [ 712.450538]
> [ 712.450538] Code: 30 fa 58 80 4c 39 2c 08 75 04 0f 0b eb fe 48 c7 c0 40 fa 58 80 eb 1f 65 48 8b 04 25 10 00 00 00 66 f7 80 44 e0 ff ff 00 ff 75 04 <0f> 0b eb fe 48 c7 c0 30 fa 58 80 48 8d 1c 08 48 83 3b 00 74 04
> [ 712.450538] RIP [<ffffffff8037fa9c>] xen_spin_wait+0x90/0x139
> [ 712.450538] RSP <ffffffff80595e38>
> [ 712.450538] ---[ end trace 73e60cdc01c1f34c ]---
> [ 712.450538] Kernel panic - not syncing: Aiee, killing interrupt handler!
[...]
The kernel was spinning in process context, was interrupted by the SATA
device, and its interrupt handler also started spinning.
The BUG line is here:
/* announce we're spinning */
spinning = &__get_cpu_var(spinning);
if (spinning->lock) {
BUG_ON(spinning->lock == lock);
if(raw_irqs_disabled()) {
BUG_ON(__get_cpu_var(spinning_bh).lock == lock);
spinning = &__get_cpu_var(spinning_irq);
} else {
-> BUG_ON(!in_softirq());
spinning = &__get_cpu_var(spinning_bh);
}
BUG_ON(spinning->lock);
}
spinning->ticket = token;
smp_wmb();
spinning->lock = lock;
This asserts that if we spin on a lock after interrupting another spin,
and interrupts are enabled, we must be in a softirq.
This seems bogus to me - in general, interrupts are enabled during
interrupt handlers once their specific IRQ has been masked.
I'll have a look at whether & how this code has changed upstream and in
other forward-ported branches.
Ben.
--
Ben Hutchings
The generation of random numbers is too important to be left to chance.
- Robert Coveyou
Attachment:
signature.asc
Description: This is a digitally signed message part