Bug#705124: base: Filesystem corruption issue

To: Anthony Sheetz <sheetzam@inspire.com>
Cc: 705124@bugs.debian.org
Subject: Bug#705124: base: Filesystem corruption issue
From: Ian Campbell <ijc@hellion.org.uk>
Date: Mon, 15 Apr 2013 09:54:44 +0100
Message-id: <1366016084.15783.141.camel@zakaz.uk.xensource.com>
Reply-to: Ian Campbell <ijc@hellion.org.uk>, 705124@bugs.debian.org
In-reply-to: <20130410121755.30398.37911.reportbug@who.inspire.com>
References: <20130410121755.30398.37911.reportbug@who.inspire.com>

On Wed, 2013-04-10 at 08:17 -0400, Anthony Sheetz wrote:
> Steps to reproduce:
> Install Debian Testing from Netinstall CD, amd64.
> Choose LVM and Full Disk Encryption, with a separate /home
> Resize /home to be 80GB
> Install openswan, connect to remote network
> Install xen
> Set up a virtual machine with Debian Stable using logical volumes as the backing store.
> 	fs: ext3
> 	network: NAT
> transfer a large (multigigabyte) file from a remote server over the internet to the virtual machine
> 
> Expected behavior: File transfers fine, md5sum agrees with remote system
> Observed behavior: md5sum never matches, done enough times, the ext3 fs becomes corrupted

Can I just confirm a few things please:

The VM disk backend is an LVM volume which is included in the full disk
encryption? I suppose it is using dm-crypt?

The ext3 fs which becomes corrupted is the guest VM filesystem, not the
dom0 filesystem nor a filesystem image contained in the large
multigigabyte file which is transferred over the network?

On the face of it it sounds to me like the network corruption (md5sum
issue) and the eventual ext3 corruption must be separate issues. Or I
suppose it is possible that the file is received correctly but is
corrupted when written to the disk, but it's probably better to consider
them separately until we know one way or the other.

WRT the file transfer corruption: Is the file being transferred over the
openswan link? Did you ever happen to try a transfer over a
non-tunnelled connection? Were you able to successfully transfer the
file to the dom0 filesystem or to any other system (e.g. one not running
Xen) on this end of the openswan link? I'm not sure what error
detection/correction scp/rsync do, or whether they have any additional
verification options which could be tried. Perhaps it is possible to
run md5sum on the stream before it hits the disk (can one rsync/scp to
stdout? I doubt it). If you can transfer to dom0 OK then it might be
interesting to try turning off the various offloads (GSO, SG etc) on the
vif link.
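
As a rough sketch of the "md5sum on the stream" idea, something like the
following could sit at the receiving end of a pipe, hashing the bytes as
they arrive and before they are written out. The script name
(tee_md5.py), the function name and the use of "ssh ... cat" to produce
the stream are all assumptions made for illustration, not something
scp or rsync provide themselves.

```python
import hashlib
import sys


def copy_with_md5(src, dst, chunk_size=1 << 20):
    """Copy src to dst in chunks; return the MD5 hex digest of the bytes read."""
    digest = hashlib.md5()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # Hash the data as received, then hand it on to the disk.
        digest.update(chunk)
        dst.write(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Hypothetical usage: ssh remote 'cat bigfile' | python3 tee_md5.py /path/to/out
    with open(sys.argv[1], "wb") as out:
        print(copy_with_md5(sys.stdin.buffer, out))
```

If the digest printed here matches the remote md5sum but a later md5sum
of the file on disk does not, that would point at the storage path
rather than the network.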

WRT the filesystem corruption: How did the ext3 corruption manifest
itself? I wonder if the layering of crypto+lvm+xen-blkback is causing
the barriers which ext3 requires to function correctly to not occur in
the right places. Does something need to be manually configured to
enable barriers at some layer? (or perhaps I am thinking of DISCARD
support). If you were able to attempt to reproduce without the crypto
bit in dom0 for the VM disk that would be really useful. It might also
be interesting to try using the ext3 barrier mount option in the guest
to switch barriers either off or on (I can't remember what the default
was for Squeeze).
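
To see which barrier option (if any) is explicitly in effect in the
guest, one could look at the options column of /proc/mounts. Below is a
minimal sketch; the function name is made up for illustration, and it
only reports an option that is actually listed. I have not checked
whether the kernel lists the option when it is left at its default, so
a result of None should not be read as "barriers off".

```python
def barrier_setting(mounts_text, mount_point="/"):
    """Return the barrier-related mount option listed for mount_point, or None.

    mounts_text is the content of /proc/mounts. If the mount point appears
    more than once (e.g. a rootfs entry followed by the real device), the
    last entry wins.
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        result = None
        for opt in fields[3].split(","):
            if opt in ("barrier", "nobarrier") or opt.startswith("barrier="):
                result = opt
    return result


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print(barrier_setting(f.read()))
```

Remounting the guest filesystem with barrier=1 or barrier=0 and
re-running the transfer would then show whether the setting makes any
difference.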

I appreciate that you may have redeployed/downgraded the systems, so some
of the above experiments might be quite hard to try out, but if you could
set up a spare system or something it would be very much appreciated.

Ian.
