Bug#705124: base: Filesystem corruption issue

To: Anthony Sheetz <sheetzam@inspire.com>
Cc: 705124@bugs.debian.org
Subject: Bug#705124: base: Filesystem corruption issue
From: Ian Campbell <ijc@hellion.org.uk>
Date: Mon, 15 Apr 2013 14:09:37 +0100
Message-id: <1366031377.4963.148.camel@zakaz.uk.xensource.com>
Reply-to: Ian Campbell <ijc@hellion.org.uk>, 705124@bugs.debian.org
In-reply-to: <CAP8K4vLSW0bUhJCLaK56PL4i6ZV+wm8EaUciMmkTaW4F8mw__g@mail.gmail.com>
References: <20130410121755.30398.37911.reportbug@who.inspire.com> <1366016084.15783.141.camel@zakaz.uk.xensource.com> <CAP8K4vLSW0bUhJCLaK56PL4i6ZV+wm8EaUciMmkTaW4F8mw__g@mail.gmail.com>

On Mon, 2013-04-15 at 08:19 -0400, Anthony Sheetz wrote:


>         Did you ever happen to try a transfer over a
>         non-tunnelled connection?
> 
> 
> Yes, tried file transfers from another machine on the local network -
> never had a problem with those. 

So this issue isn't the tunnel, good.

>         Were you able to successfully transfer the file to the dom0
>         filesystem or to any other system (e.g. one not running Xen) on
>         this end of the openswan link?
>  
> Yes - tried that several times, and was able to do the transfer with
> no corruption, and md5sum matched. 
> 
> 
>         I'm not sure what error detection/correction scp/rsync do, or
>         if they have any additional verification options which could
>         be tried, or perhaps it is possible to run md5sum on the stream
>         before it hits the disk (can one rsync/scp to stdout? I doubt
>         it).
> 
> 
> Tried doing 'scp file.sql | md5sum' on DomU which resulted in a
> matching md5sum. We decided this eliminated the openswan link as the
> culprit.

This was in the domU? That would, I think, eliminate corruption in the
network at every stage including the dom0->domU link.

That would suggest that the md5sum failures you saw before were
introduced when the file was written to disk and read back, i.e.
somewhere in the storage path rather than the network (which does at
least mean we only have one bug to deal with...).
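
To make that comparison in a single pass, here is a minimal sketch,
assuming Python 3 is available in the domU. It hashes the bytes as they
arrive on a stream, writes them to a file, then hashes the file as read
back. The function names and the target path are mine, purely for
illustration.

```python
import hashlib
import os
import sys

CHUNK = 1024 * 1024  # read in 1 MiB pieces


def hash_stream_to_file(stream, path):
    """Copy a binary stream to path; return the MD5 of the bytes as received."""
    digest = hashlib.md5()
    with open(path, "wb") as out:
        while True:
            chunk = stream.read(CHUNK)
            if not chunk:
                break
            digest.update(chunk)
            out.write(chunk)
        out.flush()
        os.fsync(out.fileno())  # ask the kernel to push the data to the device
    return digest.hexdigest()


def hash_file(path):
    """Return the MD5 of the file contents as read back from path."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Only runs when called as: <producer> | python3 script.py <target-path>
if __name__ == "__main__" and len(sys.argv) == 2:
    in_flight = hash_stream_to_file(sys.stdin.buffer, sys.argv[1])
    on_disk = hash_file(sys.argv[1])
    print("stream:", in_flight)
    print("disk:  ", on_disk)
    print("match" if in_flight == on_disk else "MISMATCH")
```

One caveat: a read straight after the write may be served from the
page cache, so a match here does not prove the on-disk copy is good.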
 
>         If you can transfer to dom0 OK then it might be interesting to
>         try turning off the various offloads (GSO, SG etc) on the vif
>         link.
> 
> 
> Any instructions on doing that?

The above makes me suspect this isn't a worthwhile experiment but in any
case:

"ethtool -k <device>" shows the current offload settings and
"ethtool -K <device> <offload> off" turns an individual offload off.
I'd do it both on the device inside the guest and on the associated
vifX.Y in dom0.
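
If you want to turn several offloads off in one go, a rough sketch
(Python 3) follows. It only builds and prints the "ethtool -K" command
lines unless dry_run is set to False. The device names used with it
(eth0 in the guest, vif1.0 in dom0) are placeholders, and the offload
list is my assumption of a reasonable set; not every device supports
every name.

```python
import subprocess

# Offload names as accepted by "ethtool -K"; support varies per device.
OFFLOADS = ["rx", "tx", "sg", "tso", "gso", "gro"]


def offload_commands(devices, offloads=OFFLOADS):
    """Return one 'ethtool -K <device> <offload> off' argv per device/offload pair."""
    return [["ethtool", "-K", dev, name, "off"]
            for dev in devices
            for name in offloads]


def run(commands, dry_run=True):
    """Print each command, and execute it as well when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=False: an unsupported offload should not stop the rest
            subprocess.run(argv, check=False)
```

The guest and dom0 are separate systems, so this would be run once in
each, e.g. run(offload_commands(["eth0"])) in the guest and
run(offload_commands(["vif1.0"])) in dom0, then checked with
"ethtool -k".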

>         I wonder if the layering of crypto+lvm+xen-blkback is causing
>         the barriers which ext3 requires to function correctly to not
>         occur in the right places. Does something need to be manually
>         configured to enable barriers at some layer? (or perhaps I am
>         thinking of DISCARD support). If you were able to attempt to
>         reproduce without the crypto bit in dom0 for the VM disk that
>         would be really useful. It might also be interesting to try
>         using the ext3 barrier mount option in the guest to switch
>         barriers either off or on (I can't remember what the default
>         was for Squeeze).
> 
> 
> Google led me to try mounting the file system with barriers=0, and no
> luck.

How did you do this? IIRC getting mount options for the root filesystem
to take effect involves more than just editing fstab (rootflags= on the
kernel command line, I think? No idea how one inserts a space there).
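
One way to check what actually took effect is to look at the options
the kernel reports for / in /proc/mounts. A minimal sketch (Python 3)
follows. Note that the ext3 documentation spells the option barrier=0 /
barrier=1, and that rootflags= takes the same comma-separated string as
"mount -o", e.g. rootflags=barrier=0. Whether a barrier setting is
listed in /proc/mounts at all depends on the kernel version, so an
empty result is not conclusive.

```python
def mount_options(mounts_text, mount_point="/"):
    """Return the option list of the last /proc/mounts entry for mount_point."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Later entries win, e.g. "rootfs" first and the real root after it.
            options = fields[3].split(",")
    return options


def barrier_options(options):
    """Return only the options that mention barriers (barrier=0, nobarrier, ...)."""
    return [opt for opt in (options or []) if "barrier" in opt]
```

In the guest this would be used as
barrier_options(mount_options(open("/proc/mounts").read())).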

For experimentation it might be useful to attach an xvdb to the domain
and use that as the write target. It'll allow easier experimentation
with mount options, and as a bonus you won't keep hosing your root
filesystem (which I imagine is getting pretty tedious...).
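
With a scratch disk like that, a test file with a known pattern makes
it possible to see which blocks come back wrong, not just that the
checksum differs. A rough sketch (Python 3) follows; the 4 KiB block
size and the pattern scheme are arbitrary choices of mine, not
anything the filesystem requires.

```python
import hashlib
import os

BLOCK = 4096  # bytes per test block (arbitrary choice)


def pattern_block(index):
    """Deterministic block derived from its index, so it can be regenerated."""
    seed = hashlib.md5(str(index).encode()).digest()  # 16 bytes
    return seed * (BLOCK // len(seed))


def write_pattern(path, blocks):
    """Write `blocks` pattern blocks to path and fsync the result."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(pattern_block(i))
        f.flush()
        os.fsync(f.fileno())


def bad_blocks(path, blocks):
    """Return the indices of blocks that differ from the expected pattern."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(BLOCK) != pattern_block(i):
                bad.append(i)
    return bad
```

The idea would be to call write_pattern() on a file under the scratch
mount, unmount and remount the filesystem so that the reads have to
come from the disk, and then call bad_blocks() with the same block
count.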

>         I appreciate that you may have redeployed/downgraded the
>         systems so some of the above experiments might be quite hard to
>         try out but if you could set up a spare system or something it
>         would be very much appreciated.
> 
> 
> We planned for this, and once we have some ideas to try (with some
> detailed instructions for trying them) we'll be purchasing a spare
> hard drive to try them out. We'd like this problem solved, and we're
> willing to spend a little to do it. 

Other than the barriers thing I think the most worthwhile thing to try
would be a Wheezy domU kernel.

Ian.
