
Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)



>-----Original Message-----
>From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com] 
>Sent: Monday, September 20, 2010 6:08 PM
>To: Artur Linhart - Linux communication
>Cc: 'Ian Campbell'; 596419@bugs.debian.org
>Subject: Re: Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)
>
>> So, it worked if I ran Dom0 in "balloon" mode by omitting the
>> specification of dom0_mem; or, if dom0_mem is specified, then
>> swiotlb=65536 must also be specified.
>
>Wow. That implies that AACRAID uses quite a lot of buffers, and looking at the driver
>there are a bunch of quirks where it can only do DMA up to 2GB, so that would explain
>why it relies on SWIOTLB that much.

Unfortunately, I did not try raising dom0_mem above 2 GB :-(.

>
>Based on what Ian analyzed, it really looks like we just ran out of DMA buffers and
>the driver didn't retry but just bailed out.
>
>We can narrow down who is using so many buffers by using the attached debug module
>that when loaded will print out who is using what buffers if
>CONFIG_DMA_API_DEBUG=y is set.
>
>But the proper workaround is the one you discovered - either raise the SWIOTLB buffer
>or raise the memory allocated for Dom0.
>
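
As a side note on the swiotlb=65536 value above: a minimal sketch of the arithmetic, assuming the parameter counts 2 KiB slabs (IO_TLB_SHIFT = 11 in the kernel's swiotlb code; this should be checked against the kernel actually running). The function names are made up for illustration.

```python
# Assumption: each swiotlb slab is 2 KiB (IO_TLB_SHIFT = 11).
# Verify this against the source of the kernel in use.
IO_TLB_SHIFT = 11


def swiotlb_bytes(nslabs: int) -> int:
    """Bounce-buffer size in bytes for a given swiotlb=<nslabs> value."""
    return nslabs << IO_TLB_SHIFT


def slabs_for_mib(mib: int) -> int:
    """Slab count needed for a bounce buffer of the given size in MiB."""
    return (mib * 1024 * 1024) >> IO_TLB_SHIFT


if __name__ == "__main__":
    # Size implied by the value used in this report.
    print(swiotlb_bytes(65536) // (1024 * 1024), "MiB")
    # Slab count corresponding to a 64 MiB buffer.
    print(slabs_for_mib(64), "slabs")
```

Under that assumption, swiotlb=65536 corresponds to a 128 MiB bounce buffer.
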
>> 
>> I have noticed one interesting behavior - during the successful suspension
>> of the domains at shutdown, the first one being suspended writes three
>> "dots" very fast, then stops writing dots for some time, and then after a
>> while writes a lot of (possibly all remaining) "dots" to the screen very
>> fast. The subsequent suspensions proceed smoothly, dot by dot, without any
>> delays. It looks like it waits for something during the first suspension
>> (memory allocation?).
>
>That usually means that it is stuck waiting for the disks to write out all the data.

OK, I thought so too, but when I omitted dom0_mem or specified the higher swiotlb value it behaved differently, and I think it
should behave the same way in both cases, shouldn't it? At least that is what I would guess...

>> 
>> Generally, I find it very surprising how the aacraid module works. I am
>> no C or kernel developer, but I would expect that something like this cannot
>> happen - the module should allocate the memory it needs at the start. I
>> would understand if some specific read or write operation failed because the
>> sw raid did not have enough memory to execute it, but I would never expect this
>> to lead to a hangup and freeze of the whole system. The probability of
>
> Well, to be honest, we engineers aren't known for testing all of the failure paths
> as well as we should. That is why folks like you are quite helpful in finding
> bugs :-)

I am always very pleased to be able to help all of You who are doing such a great job, at least with some small piece of
work, even if it cost me unexpectedly much time :-) I have actually started using HW RAID on that server instead of SW
raid, for other reasons. But for now I still have the HDD with the SW raid configuration, so I would be able to test
something if You have some ideas or want me to test something specific on my configuration.
If not, I want to remove the software raid completely sometime next week because I need this HDD, so let me know before
then if there is something You need tested. I do not know how difficult it would be for You to reproduce the error on
other machine(s). I think it should not be so difficult, but who knows...





