
Bug#596419: Acknowledgement (xen-linux-system-2.6.32-5-xen-amd64: causes a system hangup by the shutdown of the system, aacraid (sw raid) involved in hangup)

Hello, Ian,

	your out-of-memory theory seems to be a step in the
right direction.

	It looks like the problems did not really start with the
installation of the new packages, but with the dom0_mem option for the Xen
kernel, which I set at approximately the same time as the upgrades. I have
now removed this option, so Dom0 has the complete 12GB available, and the
problem does not occur anymore. The domains are also suspended correctly
after calling
/etc/init.d/xendomains stop

This is possibly also the reason why I could not reproduce the problem
with the non-Xen kernel: in that case the memory was not reduced to 1GB
either, and the complete 12GB memory pool was used without any
restrictions, so the error may simply have had no chance to occur.
Using dom0_mem=2048 is not enough to fix the problem for me, either. I have
tried dom0_mem=2048, but it also leads to the hangup at shutdown during
the domain suspension. It works correctly only if I omit the dom0_mem
parameter completely.
Free memory after increasing dom0_mem to 2048M:
             total       used       free     shared    buffers     cached
Mem:       2090832     448092    1642740          0     111600      90908
-/+ buffers/cache:     245584    1845248
Swap:       999416          0     999416
- so there is basically no problem with the base amount of memory; there is
enough memory for everything.
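
As a quick sanity check of the figures above, here is a minimal sketch
that recomputes the "-/+ buffers/cache" row from the "Mem:" row. The
variable names are only chosen for this illustration, and all values are
copied from the output above (in KiB):

```python
# Values copied from the "Mem:" row of the `free` output above (KiB).
total, used, free, buffers, cached = 2090832, 448092, 1642740, 111600, 90908

# Memory really held by applications: "used" minus the reclaimable
# buffers and page cache.
used_by_apps = used - buffers - cached

# Memory available to applications: "free" plus the same reclaimable memory.
available = free + buffers + cached

print(used_by_apps)  # 245584, matches the "-/+ buffers/cache" used column
print(available)     # 1845248, matches the "-/+ buffers/cache" free column
```

Both results agree with the "-/+ buffers/cache" row, so roughly 1.8GB of
the 2GB given to Dom0 is still available.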

Regarding the swiotlb parameter: I have found the following lines in
kern.log from the previous reboots:

Sep 13 17:15:13 alg-puv-xen-1 kernel: [    3.105461] xen_swiotlb_fixup:
buf=ffff880005711000 size=67108864
Sep 13 17:15:13 alg-puv-xen-1 kernel: [    3.126345] xen_swiotlb_fixup:
buf=ffff880009771000 size=32768

- (so the 64MB should be there), but these lines are repeated there,
always with the same values, regardless of whether dom0_mem was set to
1024M, 2048M or left unset completely. After I specified swiotlb=65536 on
the line with the Xen kernel, I got the same thing in the log as if I had
done nothing (and also the hangups during domain suspension). When I added
this parameter to the parameters on the Linux kernel module line instead,
it also did not change the value in the log:
Sep 13 18:15:32 alg-puv-xen-1 kernel: [    3.856096] Kernel command line:
root=/dev/md0 ro console=tty0 vga=773 swiotlb=65536
Sep 13 18:15:32 alg-puv-xen-1 kernel: [    3.856129] PID hash table entries:
4096 (order: 3, 32768 bytes)
Sep 13 18:15:32 alg-puv-xen-1 kernel: [    3.856512] Initializing CPU#0
Sep 13 18:15:32 alg-puv-xen-1 kernel: [    3.873864] DMA: Placing 128MB
software IO TLB between ffff880005711000 - ffff88000d711000
Sep 13 18:15:32 alg-puv-xen-1 kernel: [    3.873868] DMA: software IO TLB at
phys 0x5711000 - 0xd711000
Sep 13 18:15:32 alg-puv-xen-1 kernel: [    3.873871] xen_swiotlb_fixup:
buf=ffff880005711000 size=134217728
Sep 13 18:15:32 alg-puv-xen-1 kernel: [    3.915338] xen_swiotlb_fixup:
buf=ffff88000d7d1000 size=32768
Sep 13 18:15:32 alg-puv-xen-1 kernel: [    3.924636] Memory:
1891528k/2097152k available (3141k kernel code, 432k absent, 205192k
reserved, 1905k data, 592k init)

But the reboot came through without the crash! :-)
Where does the swiotlb parameter have to be applied to see some effect of
the swiotlb memory change in the logs?
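
A rough cross-check of the sizes in the log lines above, assuming that the
swiotlb= value counts I/O TLB slabs of 2 KiB each (this is how the kernel
parameter documentation describes it; I have not verified it for this
particular kernel build). The names SLAB_BYTES, slabs_to_bytes and
bytes_to_mib are made up for this sketch only:

```python
# Assumption: one swiotlb slab is 2 KiB (IO_TLB_SHIFT = 11 in the kernel).
SLAB_BYTES = 2048


def slabs_to_bytes(slabs):
    """Convert a swiotlb= slab count into a buffer size in bytes."""
    return slabs * SLAB_BYTES


def bytes_to_mib(size):
    """Convert a byte count into whole MiB."""
    return size // (1024 * 1024)


# Sizes reported by xen_swiotlb_fixup in the two boots quoted above.
size_first_boot = 67108864    # boot at 17:15:13
size_second_boot = 134217728  # boot at 18:15:32, with swiotlb=65536

print(bytes_to_mib(size_first_boot))        # 64
print(bytes_to_mib(size_second_boot))       # 128
print(size_first_boot // SLAB_BYTES)        # 32768
print(slabs_to_bytes(65536) == size_second_boot)  # True
```

If that assumption holds, the size=134217728 line from the 18:15:32 boot
would correspond to swiotlb=65536 (128MB), and the earlier size=67108864
to 32768 slabs (64MB).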

So, it worked if Dom0 ran in the "balloon" mode, with dom0_mem omitted, or,
if dom0_mem is specified, then swiotlb=65536 must be specified as well.

I have noticed one interesting behavior: during a successful suspension
of the domains at shutdown, the first domain being suspended writes three
"dots" very quickly, then stops writing dots for some time, and after a
while a lot of (possibly all remaining) "dots" are written to the screen
very quickly. The subsequent suspensions proceed smoothly, dot by dot,
without any delays. It looks like the first suspension waits for
something (memory allocation?).

Generally, I find it very surprising how the aacraid module works. I am
no C or kernel developer, but I would expect that something like this
cannot happen: the module should allocate the memory it needs at startup.
I would also understand if some specific read or write operation failed
because the sw raid did not have enough memory to execute it, but I would
never expect this to lead to a hangup and freeze of the whole system. This
drastically increases the probability of data corruption, especially with
raid1, which in most cases is set up to achieve more data safety :-).

With regards, Artur
