Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)

To: Artur Linhart - Linux communication <AL.LINUX@bcpraha.com>, 596419@bugs.debian.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)
From: Ian Campbell <ijc@hellion.org.uk>
Date: Mon, 13 Sep 2010 10:09:12 +0100
Message-id: <1284368952.14311.14312.camel@zakaz.uk.xensource.com>
Reply-to: Ian Campbell <ijc@hellion.org.uk>, 596419@bugs.debian.org
In-reply-to: <0A80B7B84B4F4DB5B83C6ECEA1BB815C@private.praha.bcpraha.com>
References: <B2AAFD68E3464B8A93C2224C75BA4AE3@private.praha.bcpraha.com> <handler.596419.B596419.128423333723472.ackinfo@bugs.debian.org> <0A80B7B84B4F4DB5B83C6ECEA1BB815C@private.praha.bcpraha.com>

(Konrad, this looks potentially swiotlb-related, what do you think? Full
bug log is at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=596419 )

On Sun, 2010-09-12 at 07:58 +0200, Artur Linhart - Linux communication wrote:
> Even after the downgrade of kernel and of the corresponding files to the
> version 2.6.32-18 and downgrade of mdadm the problem still persists, so it
> is not bound specifically to this package and to this version.

Thanks Artur, if possible can you reproduce with a serial console
connected so that you can capture precise logs? If not then it might be
worth posting a digital photograph of the stack trace somewhere -- there
is often a bunch of useful information preceding the actual stack trace,
you can usually use Shift-PgUp to go back to earlier messages.

Also, can you look in /var/log/kern.log* and see if any of the errors
made it there, which is possible if your root partition isn't on the
same device that is failing.

The reference to aac_build_sgraw and the BUG at
drivers/scsi/aacraid/aachba.c:2825 both point to:
        nseg = scsi_dma_map(scsicmd);
        BUG_ON(nseg < 0);

scsi_dma_map turns into dma_map_sg, which in turn probably goes via
SWIOTLB on Xen but possibly does not when running natively.
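
To make the suspected failure mode concrete, here is a small user-space
sketch. It is not the driver code: mock_pool and mock_dma_map are
hypothetical stand-ins for a bounce-buffer pool and for scsi_dma_map,
and the assumption (which I believe matches the kernel, but have not
checked against this exact tree) is that the mapping call returns a
negative value when the pool cannot satisfy the request. That negative
value is what BUG_ON(nseg < 0) would trip over.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bounce-buffer pool; not a kernel structure. */
struct mock_pool {
    size_t free_slabs;
};

/*
 * Hypothetical stand-in for scsi_dma_map(): returns the number of mapped
 * segments, or -1 when the pool cannot hold all of them.
 */
int mock_dma_map(struct mock_pool *pool, size_t segments,
                 size_t slabs_per_segment)
{
    size_t needed;

    assert(pool != NULL);
    needed = segments * slabs_per_segment;
    if (needed > pool->free_slabs)
        return -1;              /* pool exhausted: mapping fails */
    pool->free_slabs -= needed;
    return (int)segments;
}

/* Mirrors the driver's check: a negative segment count is the fatal case. */
int would_hit_bug_on(int nseg)
{
    return nseg < 0;
}
```

With a pool of 8 slabs, mapping 2 segments of 2 slabs each succeeds, and
a following request for 4 segments of 2 slabs each fails, which is the
case the BUG_ON catches.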

Perhaps your system is running out of swiotlb memory and adding
"swiotlb=<NN>" to the command line will help? You should see a log
message on boot telling you how big the swiotlb currently is, so
perhaps try doubling it. I'm not sure, but I think the default is 64M,
which is 32768 slabs, so perhaps try swiotlb=65536?
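
The slab arithmetic can be sketched as follows, assuming the usual 2 KiB
swiotlb slab size (IO_TLB_SHIFT of 11 in the kernel source). The helper
name swiotlb_slabs_for_mib is made up for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed slab size: 1 << IO_TLB_SHIFT bytes, i.e. 2 KiB per slab. */
#define IO_TLB_SHIFT 11

/* Hypothetical helper: convert a pool size in MiB to a slab count. */
uint64_t swiotlb_slabs_for_mib(uint64_t mib)
{
    uint64_t bytes = mib << 20;

    assert((bytes >> 20) == mib);   /* guard against shift overflow */
    return bytes >> IO_TLB_SHIFT;
}
```

Under that assumption 64 MiB works out to 32768 slabs and 128 MiB to
65536, which is where the doubled value comes from.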

I'm not aware of any swiotlb related fixes going into xen.git since
e73f4955a821f850f5b88c32d12a81714523a95f, which is what package
2.6.32-21 contains.

I'm not sure why any of this would tie in with shutting down domains
though.

Ian.

> I have identified now (after the downgrades to 2.6.32-18) the following
> initial stack trace (some lines are missing from the top, I think, they were
> no longer on the screen):
> 
> [<....>] ? bio_alloc_bioset+0x45/0xb7
> [<....>] ? submit_bio+0xd6/0xf2
> [<....>] ? md_super_write+0x84/0xb2 [md_mod]
> [<....>] ? xen_restore_fl_direct_end+0x0/0x1
> [<....>] ? md_update_sb+0x268/0x31e
> [<....>] ? md_check_recovery+0x1e2/0x4b9 [md_mod]
> [<....>] ? raid1d+0x42/0xe0b [raid1]
> [<....>] ? finish_task_switch+0x44/0xaf
> [<....>] ? schedule_timeout+0x2e/0xdd
> [<....>] ? xen_restore_fl_direct_end+0x0/0x1
> [<....>] ? xen_force_evtchn_callback+0x9/0xa
> [<....>] ? check_events+0x12/0x20
> [<....>] ? xen_restore_fl_direct_end+0x0/0x1
> [<....>] ? md_thread+0xf1/0x10f [md_mod]
> [<....>] ? autoremove_wake_function+0x0/0x2e
> [<....>] ? md_thread+0x0/0x10f [md_mod]
> [<....>] ? kthread+0x79/0x01
> [<....>] ? child_rip+0xa/0x20
> [<....>] ? int_ret_from_sys_call+0x7/0x1b
> [<....>] ? retint_restore_args+0x5/0x6
> [<....>] ? xen_restore_fl_direct_end+0x0/0x1
> [<....>] ? xen_restore_fl_direct_end+0x0/0x1
> [<....>] ? child_rip+0x0/0x20
> Code: 00 00 c7 46 0c 00 00 00 00 c7 46 10 00 00 00 00 c7 46 14 00
> 00 00 00 c7 46 18 00 00 00 00 e8 10 63 fa ff 83 f8 00 41 89 c6 7d 04 <0f> 0b
> eb fe 75 08 45 31 e4 e9 9c 00 00 00 49 8b 7f 58 48 89 eb
> RIP [<....>] aac_build_sgraw+0x51/0x10a [aacraid]
>  RSP <ffff88003cd998e0>
> --- [ end trace .... ] ---  
> 
> This stack trace now also stays on the screen, and nothing further
> happens even after a very long time (1 hour).
> 
> 
> 
> 

-- 
Ian Campbell
Current Noise: Raise Hell - Rising

Love is sentimental measles.
