
Bug#644550: linux-image-2.6.32-5-amd64: indefinite soft lockup on rm



On Thu, 2011-10-06 at 22:58 +0200, Egon Eckert wrote:
> Package: linux-2.6
> Version: 2.6.32-38
> Severity: important
> Tags: upstream
> 
> Hi,
> 
> the lockup happened on unlinking a large file (several GB) on XFS on fibre
> channel storage (qla2xxx module).  We experience very similar trouble on
> this machine though, using other kernels (from stable, stable-backports and
> sid as well) and other fs/storage combinations (ext4 on local Dell PERC
> controller).

How long did the soft lockup state persist before you rebooted?

> This would normally suggest a HW problem, but this time it's
> unlikely IMO.  The machine is new and survived many hours of memtest86-ing. 
> Unfortunately we have no second identical machine to replicate it.  The only
> HW issue I admit is the availability of newer BIOS, which will be flashed
> next week (sorry if you find this to be an obvious reason of the problem).
> 
> As I said, the other kernels crash too showing similar backtraces.  I think
> the (eventual) kernel bug may relate to relatively unusual configuration of
> this machine, which is a NUMA with 64GB RAM, getting probably not that much
> beating (I mean testing) as the more common desktop PCs.

That's not a particularly big or unusual server configuration today.
And among Linux systems, desktop PCs are the oddities!  So this really
ought to be well-tested.

> Could the unending find_get_pages indicate slab corruption perhaps?

That's actually part of the page cache lookup code (mm/filemap.c), not
the slab (heap) allocator.

> The full boot-time dmesg output may be found on
> 
> http://joni.heaven-industries.com/~egon/tornado-dmesg.txt
> 
> The bug is not too easy to trigger, so I hope there's another way to look
> for it than bisecting 5 years of kernel commits...  On the other hand, I'm
> of course ready to spend time on it if it helps to make the kernel better
> :).
[...]

When you say 'not too easy', do you mean that you don't know what
specific circumstances it occurs in, or that it occurs as part of a long
or complex sequence of operations?

What sort of applications or services is this machine running?

Given that you can reproduce this with the kernel version in sid, maybe
we should take this straight to the upstream developers - but it's hard
to guess which component or maintainer might be responsible.

Ben.

-- 
Ben Hutchings
For every action, there is an equal and opposite criticism. - Harrison


