Hi Ian,

On 05/04/12 01:00, Ian Campbell wrote:
> Hi Quintin,
>
> Thanks for your report.
>
> On Wed, 2012-04-04 at 13:54 +1200, Quintin Russ wrote:
>> Package: linux-image-2.6.32-5-xen-amd64
>> Version: 2.6.32-39
>> Severity: important
>>
>> We have observed an issue where a Xen dom0 is removing a snapshot of a
>> logical volume and another process comes along to create a snapshot of
>> that same device (under a different name), causing the server to kernel
>> oops. According to my logs, removing the snapshot can sometimes pause or
>> take a while, which contributes to the issue.
>>
>> Attempts to add locking code (using dotlockfile) have not so far been
>> successful in mitigating this bug, but we are still exploring this option.
>>
>> Nodes are affected intermittently, and we have been unable to reproduce
>> this issue in the lab (on either the same model of hardware or hardware
>> that has crashed in production). From our logs we can see that every time
>> this issue occurs, one process has been removing a snapshot while another
>> has been creating a snapshot shortly after (normally within seconds). We
>> are currently seeing about a 5% chance of a crash per month (assuming our
>> nodes are equal).
>>
>> This bug looks similar to a number of bugs that have already been filed
>> related to this issue:
>>
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=614400
>>
>> A quick Google search shows many more (which have mostly been merged):
>>
>> https://www.google.co.nz/webhp?q=site%3Abugs.debian.org%20xen%20snapshot%20kernel%20oops%20squeeze
>
> Those issues were believed to be fixed in 2.6.32-34 and you are running
> 2.6.32-39, so either this is a different issue (perhaps with similar
> symptoms) or the issue isn't really fixed. Either way I think we need to
> see your kernel logs containing the actual oops in order to make any
> progress.
Yes, we have been having this problem since before 2.6.32-34 and were very hopeful that change would fix it. This sadly was not the case. Unfortunately there isn't anything in the logs for this, but I have a screenshot from the console, which I have attached.
I also had an idle shell at the time the server crashed and this is what I saw:
Message from syslogd@dom0 at Apr 4 01:37:22 ... kernel:[4805213.000629] Oops: 0000 [#1] SMP
Message from syslogd@dom0 at Apr 4 01:37:22 ... kernel:[4805213.000661] last sysfs file: /sys/devices/virtual/block/dm-49/removable
Message from syslogd@dom0 at Apr 4 01:37:22 ... kernel:[4805213.001891] Stack:
Message from syslogd@dom0 at Apr 4 01:37:22 ... kernel:[4805213.002101] Call Trace:
Message from syslogd@dom0 at Apr 4 01:37:22 ... kernel:[4805213.002540] Code: 66 ff 05 c9 83 58 00 48 89 ef e8 db 7a f7 ff 48 89 df e8 7f fe ff ff e8 51 b0 21 00 48 c7 c7 e0 99 67 81 e8 3b c0 21 00 48 8b 1b <48> 8b 03 48 81 fb 90 d1 48 81 0f 18 08 0f 85 64 ff ff ff 66 ff
Message from syslogd@dom0 at Apr 4 01:37:22 ... kernel:[4805213.002901] CR2: 0000000000000000

Please let me know if there is anything further I can provide.
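
For reference, the kind of serialization we have been attempting looks roughly like the sketch below. It is only an illustration, not our production code: it uses Python's fcntl.flock rather than dotlockfile, and the lock path and function names are hypothetical. It assumes every script that creates or removes snapshots goes through the same lock file, and it can only keep the two LVM commands from overlapping in userspace; it would not help if the kernel is still tearing down the old snapshot after lvremove has returned.

```python
import fcntl
import os
import subprocess
from contextlib import contextmanager

# Hypothetical lock path; any location that all snapshot scripts agree on would do.
DEFAULT_LOCK_PATH = "/var/lock/lvm-snapshot.lock"


@contextmanager
def exclusive_lock(lock_path=DEFAULT_LOCK_PATH):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Blocks until no other process holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)


def run_serialized(cmd, lock_path=DEFAULT_LOCK_PATH):
    """Run cmd (a list of arguments) while holding the lock; return its exit status."""
    with exclusive_lock(lock_path):
        return subprocess.call(cmd)
```

With something like this, the removal and creation would be wrapped as run_serialized(["lvremove", "-f", "vg0/snap-old"]) and run_serialized(["lvcreate", "-s", "-n", "snap-new", "-L", "1G", "vg0/origin"]), where the volume group and volume names are placeholders.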
Description: PNG image