
Bug#724454: fio reports data errors with 512B blocks on wheezy+xen



Package: linux-image-3.9-0.bpo.1-amd64
Version: 3.9.6-1~bpo70+1

Other possibly relevant packages: Xen, lvm (2.02.95-7), fio (2.0.8-2),
libaio1:amd64 (0.3.109-3).

I've been seeing "fio" (the flexible I/O tester) fail to read back the
expected data from a storage device when using 512B blocks in a Wheezy
domU under Xen (with an Ubuntu 12.04 dom0, or on Amazon EC2). The
problem doesn't seem to show up if the guest is running Squeeze, if it
isn't running under Xen, or if the block size is at least 4KB.

fio --bs=512 --rw=randwrite --filename=/dev/scratch \
  --name=foo --direct=1 --iodepth=1024 --iodepth_batch_submit=1 \
  --ioengine=libaio --size=2M --do_verify=1 --verify=meta \
  --verify_dump=1 --verify_fatal=1 --verify_pattern=0 -dmem

Translation: write each 512B block in the first 2MB of /dev/scratch in
randomized order, filling each with a recognizable pattern that includes
a magic number and the block's offset. Use libaio with O_DIRECT, submit
one block at a time, and keep up to 1024 I/O operations in flight at any
given time. Then read the blocks back and verify the stored fields. If
any block fails verification, dump its expected and received contents to
two files in the current directory. Enable the debug option that prints
the location of each buffer in memory.

Most of the time, fio reports that verification of some of the written
data fails because the offset is wrong; examining the saved
"received" block, I find the correct magic number but an offset that
belongs elsewhere in the file. Sometimes it's just a couple of blocks;
sometimes it's dozens. In very rare cases, the magic number appears to
be incorrect.

This happens both with the Debian-distributed fio binary and with
locally built fio 2.0.7 binaries compiled on squeeze.

To be precise, on the Xen guests we're running kernels rebuilt from the
Debian kernel sources, with the configuration modified only to select
gzip compression. The dom0 for most of my tests is Ubuntu 12.04 (Xen
4.1.2-2ubuntu2.8), but we're also seeing the problem on Amazon EC2.

This does not happen with the squeeze-built fio 2.0.7 binaries on:

 Wheezy on real hardware (direct device access or via LVM)
 Wheezy on VMware (direct device access)
 squeeze (3.2.0-0.bpo.3-amd64)
 CentOS 6.3 (2.6.32-358.18.1.el6.x86_64)
 SLES 11 SP3 (3.0.76-0.11-default)

It does also fail on a Wheezy domU running a 3.10 kernel with a couple
of local patches. We don't have any other systems handy with post-3.2
kernels.

The block device /dev/scratch is a logical volume on /dev/xvda2. The
rest of xvda2 holds another logical volume mounted as a file system, so
it's hard to test directly against xvda2 on these domUs. In
/sys/block/dm-1/queue, hw_sector_size, logical_block_size,
minimum_io_size, and physical_block_size are all 512, and
max_segment_size is 4096.
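
For reference, those values can be read back with a loop like the
following; dm-1 here is just whatever device-mapper node backs
/dev/scratch on a given system:

for f in hw_sector_size logical_block_size minimum_io_size \
         physical_block_size max_segment_size; do
    printf '%s: ' "$f"; cat "/sys/block/dm-1/queue/$f"
done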

The Xen device xvda2, in turn, is a logical volume defined on the
host. (The host is running Ubuntu 12.04, but we've seen the same
problem on Amazon EC2 guests also running Wheezy.)


It also fails with bs=1024 and bs=2048, but generally with fewer
verification failures reported. At bs=4096, it passes.

Changing the ioengine setting to psync (which uses pwrite and pread
instead of libaio) or posixaio (glibc's POSIX AIO implementation) makes
the problem go away.
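
For concreteness, the psync variant is just the same command with the
engine swapped and the libaio-specific depth options dropped (psync is
synchronous, so the effective queue depth is 1); roughly:

fio --bs=512 --rw=randwrite --filename=/dev/scratch \
  --name=foo --direct=1 --ioengine=psync --size=2M \
  --do_verify=1 --verify=meta --verify_dump=1 \
  --verify_fatal=1 --verify_pattern=0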

Sequential write tests ("--rw=write") show fewer errors but still fail.

If I use "dd iflag=direct bs=512 count=..." to read the first 2MB from
the device, and examine the offsets myself, it all looks fine.
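
Something along these lines, for example (count=4096 covers the first
2MB at 512B per block; hexdump is just one way to eyeball the stored
offsets):

dd if=/dev/scratch iflag=direct bs=512 count=4096 2>/dev/null \
  | hexdump -C | less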

If I use a recent patch off the fio mailing list, which implements a
"verify-only" option that skips the writes and just verifies previously
written data, verification still fails well after the initial writing.
So the failure doesn't seem to be tied to being the process that wrote
the data, or to verifying immediately after issuing the writes.
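
With that patch applied, the second pass amounts to the original
command run again with the new option turned on, something like this
(assuming the option ends up spelled --verify_only):

fio --bs=512 --rw=randwrite --filename=/dev/scratch \
  --name=foo --direct=1 --iodepth=1024 --iodepth_batch_submit=1 \
  --ioengine=libaio --size=2M --verify=meta --verify_only \
  --verify_dump=1 --verify_fatal=1 --verify_pattern=0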

So it looks like the random-access small reads are the problem; they
sometimes seem to receive the wrong blocks.

Assuming the underlying storage device is a modern advanced-format hard
drive with 4KB physical sectors that merely emulates 512B sectors, this
testing involves not only partial-sector I/O on the device but also,
often, concurrent I/Os to the same page of memory.

So I tried the upstream development version of fio from its git
repository. (The test at the top still fails.) There's a new option to
specify block sizes not just for read and write, but also for trim
(--bs=X,Y,Z; the version in Debian doesn't complain if you pass three
values either, but it seems to set read=X and write=Z). If I use
"--bs=512,512,4096", this keeps the 512B block size for both reads and
writes and specifies a trim size of 4KB. This alters the buffer
allocation to use an array of 4KB blocks, with only the first 512B of
each actually used for I/O, as the output with the "-dmem" option
confirms. The data written and the pattern of offsets selected should
be unchanged, but each data buffer now starts at the beginning of its
own page. With that layout, verification passes.
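
For reference, the passing run is the command from the top of this
report (using the git build) with only the bs argument changed:

fio --bs=512,512,4096 --rw=randwrite --filename=/dev/scratch \
  --name=foo --direct=1 --iodepth=1024 --iodepth_batch_submit=1 \
  --ioengine=libaio --size=2M --do_verify=1 --verify=meta \
  --verify_dump=1 --verify_fatal=1 --verify_pattern=0 -dmem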

With --bs=512:

[...]
mem      23064 io_u alloc 0x1272520, index 1018
mem      23064 io_u 0x1272520, mem 0x7f5a7a41f400
mem      23064 io_u alloc 0x1272800, index 1019
mem      23064 io_u 0x1272800, mem 0x7f5a7a41f600
[...]
meta: verify failed at file /dev/scratch offset 1512960, length 512
       received data dumped as scratch.1512960.received
       expected data dumped as scratch.1512960.expected
[...]

With --bs=512,512,4096:

[...]
mem      23068 io_u alloc 0x1f6f100, index 1011
mem      23068 io_u 0x1f6f100, mem 0x7f67093e1000
mem      23068 io_u alloc 0x1f6f420, index 1012
mem      23068 io_u 0x1f6f420, mem 0x7f67093e2000
[...]

... and no failures. Running btrace confirms that it's still issuing I/O
operations of one sector; the trim block size doesn't affect the I/O
directly.
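
The tracing itself is nothing fancy; something like this on
/dev/scratch shows the dispatched request sizes (assuming blkparse's
default output, where the action code is the sixth field and 'D' marks
a dispatch to the driver):

btrace /dev/scratch | awk '$6 == "D"'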

So, at this point I'm thinking there's some issue in 3.9 (and maybe
earlier) kernels relating to multiple outstanding direct-I/O reads
under Xen into the same page of domU process memory. Why dd with direct
I/O doesn't see the problem too, I don't know; perhaps the sequential
nature of its I/Os avoids it, maybe through merging of I/O requests.

A bit more experimentation seems to suggest that the "verify" phase uses
the "write" block size to issue its reads; presumably the "read" block
size is only used for tests mixing reads and writes (--rw=rw or
--rw=randrw), not for the post-writing verification phase. But changing
the "read" block size does seem to affect the I/O pattern for
verification:

With "--bs=512,1024", it succeeds, though the memory buffers are
allocated at 1KB intervals, just as they are with "--bs=1024" (which
fails). So that may poke a hole in my "multiple I/Os per page"
hypothesis. However, btrace shows quite different behavior for these two
cases. I'm seeing reads of two sectors at a time in both cases, but
"--bs=512,1024" alternates short bursts of writes and short bursts of
reads, whereas "--bs=1024" issues lots of writes and then lots of reads;
the latter may give more chances for I/Os to different parts of a given
page to run concurrently.
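
A crude way to see that difference in interleaving is to collapse the
dispatch stream into runs of reads and writes, something like this
(again assuming blkparse's default fields, with the R/W flags in the
seventh column):

btrace /dev/scratch | awk '$6 == "D" { print substr($7, 1, 1) }' | uniq -c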

If I copy the libaio.so.1 from squeeze (where the test passes) and load
it into the wheezy fio run via LD_PRELOAD, the test still fails on
wheezy.
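
For completeness, the preload test amounts to something like this (the
library path is illustrative):

LD_PRELOAD=/tmp/libaio-squeeze/libaio.so.1 \
  fio --bs=512 --rw=randwrite --filename=/dev/scratch \
    --name=foo --direct=1 --iodepth=1024 --iodepth_batch_submit=1 \
    --ioengine=libaio --size=2M --do_verify=1 --verify=meta \
    --verify_dump=1 --verify_fatal=1 --verify_pattern=0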

At this point I could use help from someone more familiar with the Xen
I/O code, but it seems pretty clear there's a bug here, and my best
guess is that it's not in fio or libaio but somewhere in the kernel or
Xen. I'm happy to run more experiments if they'll help.

Ken

