
Re: VM crashes with linux-image-4.5.0-0.bpo.2-amd64:amd64/jessie-backports on KVM-Host



Hi Nicholas,
thanks for your mail and sorry for my late answer.
I did some more testing, and in the process the
problem somehow went away.
On 30/05/16 22:43 -0400, Nicholas D Steeves wrote:
On 30 May 2016 at 03:32, Rolf Kutz <rk@vzsze.de> wrote:
Going back to linux-image-4.4.0-0.bpo.1-amd64 on the KVM-Server makes the
problem disappear. Other VMs on that server run fine with both kernels.
The VM in question is an NFS4-Server running Debian Jessie. The KVM-Images
are on a BTRFS-Partition.

Just to clarify: 1. What kernel version is the VM host system?  2.
Have you installed mcelog and edac-utils on the host system?  3. Are
there any kernel, mce, or edac errors on the *host* system when you
encounter this crash?

1.) linux-image-4.5.0-0.bpo.2-amd64 when the problem was
happening, linux-image-4.4.0-0.bpo.1-amd64 when
not.

2.) I had those utilities installed and they
didn't report a problem.
3.) No errors on the host system.

[    1.794611] virtio: module verification failed: signature and/or required
key missing - tainting kernel

Tainted kernel.  How did this key verification failure of virtio module happen?

I don't know. I no longer see it for virtio, but
I now see it for scsi_mod on all my VMs, like this:

[    1.747969] scsi_mod: module verification failed: signature and/or
required key missing - tainting kernel
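
To see which modules trigger this message on a given VM, the kernel log can be filtered for it. The following is only a minimal sketch: it assumes log lines in the format quoted above, and the names `TAINT_RE`, `tainting_modules` and `SAMPLE` are made up for illustration. The sample text reuses the two messages from this thread; on a real system the text would come from dmesg output saved to a file.

```python
import re

# Matches the start of lines such as:
# [    1.747969] scsi_mod: module verification failed: signature and/or ...
# Only the part before "signature" is matched, so wrapped lines still work.
TAINT_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(\S+): module verification failed")


def tainting_modules(dmesg_text):
    """Return (timestamp, module) pairs for each verification failure found."""
    hits = []
    for line in dmesg_text.splitlines():
        m = TAINT_RE.match(line)
        if m:
            hits.append((float(m.group(1)), m.group(2)))
    return hits


# Sample input built from the two messages quoted in this thread.
SAMPLE = """\
[    1.747969] scsi_mod: module verification failed: signature and/or required key missing - tainting kernel
[    1.794611] virtio: module verification failed: signature and/or required key missing - tainting kernel
"""

if __name__ == "__main__":
    for ts, mod in tainting_modules(SAMPLE):
        print(f"{ts:>12.6f}  {mod}")
```

The sketch only reports which module logged the failure; it does not say why the signature check failed.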

[...]

[    3.162613]  [<ffffffff810f6ee2>] ?

This makes me wonder if your RAM is bad, or if your VM image is corrupt.


A bunch of pci-related errors.  Are you using virtio and/or pcie
passthrough?  I wonder if there are complementary errors in the dmesg
of the host kernel?

No, there are no messages in dmesg. I'm using
virtio, but no passthrough.

Well that's odd...again, I suspect VM image corruption.
My gut feeling is your VM images are corrupted, and/or that you have
bad ram.  I also wonder if it could be a virtio and/or pcie
passthrough issue if either are enabled...

RAM seems to be fine. I am using ECC RAM and no errors
were reported. I also wouldn't expect the same
outcome every time with a RAM error.

Could you please tell me a bit about your btrfs topology and which
features you've enabled?

Some information about the btrfs topology:

# mount |grep btrfs
/dev/sdc1 on /srv/storage type btrfs (rw,noatime,compress=lzo,space_cache,autodefrag,subvolid=5,subvol=/)

# btrfs filesystem show
Label: 'BTRFS_RAID' uuid: 206a4530-6aff-4383-bb84-8cbf740eac1d
        Total devices 2 FS bytes used 1.03TiB
        devid    1 size 3.64TiB used 1.07TiB path /dev/sdc1
        devid    2 size 3.64TiB used 1.07TiB path /dev/sdd1

# btrfs fi df /srv/storage/
Data, RAID1: total=1.06TiB, used=1.03TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=8.00MiB, used=176.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=3.00GiB, used=1.28GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=448.00MiB, used=0.00B
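
The numbers above are easier to compare once converted to bytes. The following is a rough sketch, assuming the `btrfs fi df` output format shown above (other btrfs-progs versions may print it differently). The names `parse_fi_df`, `to_bytes` and `SAMPLE` are hypothetical helpers for illustration, not part of btrfs-progs.

```python
import re

# Binary unit suffixes as they appear in the output above.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

# One line looks like: "Data, RAID1: total=1.06TiB, used=1.03TiB"
LINE_RE = re.compile(
    r"^(\w+), (\w+): total=([\d.]+)(B|KiB|MiB|GiB|TiB), "
    r"used=([\d.]+)(B|KiB|MiB|GiB|TiB)$"
)


def to_bytes(value, unit):
    """Convert a value such as ('1.06', 'TiB') to a byte count."""
    return float(value) * UNITS[unit]


def parse_fi_df(text):
    """Return (type, profile, total_bytes, used_bytes, used_fraction) rows."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a block-group line
        kind, profile, tv, tu, uv, uu = m.groups()
        total = to_bytes(tv, tu)
        used = to_bytes(uv, uu)
        rows.append((kind, profile, total, used, used / total if total else 0.0))
    return rows


# Sample input copied from the output in this mail.
SAMPLE = """\
Data, RAID1: total=1.06TiB, used=1.03TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=8.00MiB, used=176.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=3.00GiB, used=1.28GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=448.00MiB, used=0.00B
"""

if __name__ == "__main__":
    for kind, profile, total, used, frac in parse_fi_df(SAMPLE):
        print(f"{kind:<14}{profile:<8}{frac:7.1%} of allocated space used")
```

With this data, the RAID1 rows hold everything that is actually in use, while every `single` row reports `used=0.00B`.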

Also, with what version of btrfs-progs did
you create the volume?

ii  btrfs-tools    3.17-1.1     amd64 Checksumming Copy on Write Files

I just noticed that btrfs-tools has been replaced by btrfs-progs, which I have now installed.

Have you ever run a btrfs-scrub, btrfs-check,
or btrfs-balance on the host? (running which kernel, with which
version of btrfs-progs...)  If you haven't yet, please don't run
either until we discuss this some more.  Was your btrfs volume created
with btrfs-convert?  Are you running the host's btrfs volume on top of
LVM, MD, and/or in combination with a caching layer?

I did run a scrub with the above tools. I don't
know the exact kernel version, but it was a kernel
from jessie-backports, probably 4.3.x. No errors
were reported. I never did a balance or convert.

Do you think I should do it again with the newer
btrfs-progs?

In addition to answering all of these questions, could you please
trigger this again and send the host dmesg?

I did some testing, and after installing some
different kernel versions on the VM, I couldn't
trigger the problem anymore. The system has been
stable for weeks now. I suspect the VM image got
corrupted. I still can't explain why it always
booted without problems the first time and crashed
on reboot.

Finally, because you're testing btrfs, you have recent backups, and
backups from before this error manifested, right?

Yes, I have backups of my data. :)

thanks and best regards
Rolf

--
People should not be afraid of their governments, governments should be
afraid of their people. - V

