
Re: LVM write performance



On 08/20/2011 12:53 AM, Stan Hoeppner wrote:
On 8/19/2011 4:38 PM, Dion Kant wrote:

I now think I understand the "strange" behaviour for block sizes not an
integral multiple of 4096 bytes. (Of course you guys already knew the
answer but just didn't want to make it easy for me to find the answer.)

Newer disks have a sector size of 4096 bytes. They may still report 512 bytes, but this is just to keep some ancient OSes working.

When a write is not an integral multiple of 4096 bytes, for example 512,
4095 or 8191 bytes, the driver must first read the sector, modify
it and finally write it back to the disk. This explains the bi (blocks in)
and the increased number of interrupts.

I did some Google searches but did not find much. Can someone confirm
this hypothesis?

The read-modify-write performance penalty of unaligned partitions on the
"Advanced Format" drives (4KB native sectors) is a separate unrelated issue.

As I demonstrated earlier in this thread, the performance drop seen when
using dd with block sizes less than 4KB affects traditional 512B/sector
drives as well.  If one has a misaligned partition on an Advanced Format
drive, one takes a double performance hit when dd bs is less than 4KB.

Again, everything in (x86) Linux is optimized around the 'magic' 4KB
size, including page size, filesystem block size, and LVM block size.
Ok, I have done some browsing through the kernel sources and understand the VFS a bit better now. When a read or write is issued on a block device file, the block size is 4096 bytes, i.e. reads and writes to the disk are done in blocks equal to the page cache size: the magic 4 KB.

Submitting a request with a block size which is not an integral multiple of 4096 bytes results in a call to ll_rw_block(READ, 1, &bh), which reads 4096-byte blocks, one by one, into the page cache. This must be done before the user data can be used to partially update the corresponding buffer page in the cache. After being updated, the buffer is flagged dirty and finally written to disk (8 sectors of 512 bytes).
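The read-before-write behaviour can be modelled with a short sketch (a hypothetical model, not kernel code): each write() that only partially covers a 4096-byte page forces a read of that page first, unless the page is already in the page cache. For dd bs=4095 count=2 this predicts exactly the two READs logged below (block_dump reports 512-byte sector numbers, so page 1 shows up as block 8):

```python
PAGE = 4096

def rmw_pages(writes):
    """Given a sequence of (offset, length) write() calls, return the
    page numbers that must first be read from disk: pages the write
    only partially covers and that are not yet in the page cache."""
    cached = set()
    reads = []
    for off, length in writes:
        end = off + length
        for p in range(off // PAGE, (end - 1) // PAGE + 1):
            page_start, page_end = p * PAGE, (p + 1) * PAGE
            partial = off > page_start or end < page_end
            if partial and p not in cached:
                reads.append(p)
            cached.add(p)
    return reads

# dd bs=4095 count=2: two 4095-byte writes -> both pages need a read first
print(rmw_pages([(0, 4095), (4095, 4095)]))    # -> [0, 1]
# dd bs=4096: page-aligned writes need no prior reads
print(rmw_pages([(0, 4096), (4096, 4096)]))    # -> []
```

Note that the second 4095-byte write also touches page 0 only partially, but by then page 0 is already resident, so only page 1 triggers a read.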

I found a nice debugging switch which helps monitoring the process.

echo 1 > /proc/sys/vm/block_dump

makes all bio requests be logged to the kernel log.

Example:

dd of=/dev/vg/d1 if=/dev/zero bs=4095 count=2 conv=sync

[  239.977384] dd(6110): READ block 0 on dm-3
[  240.026952] dd(6110): READ block 8 on dm-3
[  240.027735] dd(6110): WRITE block 0 on dm-3
[  240.027754] dd(6110): WRITE block 8 on dm-3

The ll_rw_block(READ, 1, &bh) calls cause the reads which can be seen when monitoring with vmstat. The tests given below (as you requested) were carried out before I gained a better understanding of the VFS. The questions I still have are:

1. Why are the partial block updates (through
ll_rw_block(READ, 1, &bh)) so dramatically slow compared to other reads from the disk?

2. Recall the much better performance I reported when mounting a file system on the block device first, before accessing the disk through the block device file. If I find some more spare time I will do some more digging in the kernel. Maybe I will find that the Virtual Filesystem Switch then uses a different set of f_ops for accessing the raw block device.


BTW, did you run your test with each of the elevators, as I recommended? Do the following, testing dd after each change.

$ echo 128 > /sys/block/sdc/queue/read_ahead_kb

dom0-2:~ # echo deadline > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq 
dom0-2:~ # ./bw 
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s) 
       512   54.0373   19.8704            1024
      1024   54.2937   19.7765            1024
      2048   52.1781   20.5784            1024
      4096    13.751   78.0846            1024
      8192   13.8519   77.5159            1024

dom0-2:~ # echo noop > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq 
dom0-2:~ # ./bw 
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s) 
       512   53.9634   19.8976            1024
      1024   52.0421   20.6322            1024
      2048   54.0437    19.868            1024
      4096   13.9612   76.9088            1024
      8192   13.8183   77.7043            1024

dom0-2:~ # echo cfq > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop deadline [cfq] 
dom0-2:~ # ./bw 
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s) 
       512   56.0087    19.171            1024
      1024    56.345   19.0565            1024
      2048   56.0436    19.159            1024
      4096   15.1232   70.9999            1024
      8192   15.4236   69.6168            1024

Also, just for fun, and interesting results, increase your read_ahead_kb
from the default 128 to 512.

$ echo 512 > /sys/block/sdX/queue/read_ahead_kb
$ echo deadline > /sys/block/sdX/queue/scheduler
dom0-2:~ # ./bw
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s)
       512   54.1023   19.8465            1024
      1024   52.1824   20.5767            1024
      2048   54.3797   19.7453            1024
      4096   13.7252   78.2315            1024
      8192    13.727   78.2211            1024

$ echo noop > /sys/block/sdX/queue/scheduler
dom0-2:~ # ./bw
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s)
       512   54.0853   19.8527            1024
      1024    54.525   19.6927            1024
      2048   50.6829   21.1855            1024
      4096   14.1272   76.0051            1024
      8192    13.914   77.1701            1024

$ echo cfq > /sys/block/sdX/queue/scheduler
dom0-2:~ # ./bw
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s)
       512   56.0274   19.1646            1024
      1024   55.7614    19.256            1024
      2048   56.5394    18.991            1024
      4096   16.0562   66.8739            1024
      8192   17.3842   61.7654            1024

Differences between deadline and noop are on the order of 2 to 3% in favour of deadline. The run with the cfq elevator is remarkable: it clearly performs worse, about 20% less (compared to the highest result) for the 512 read_ahead_kb case. Another try with the same settings:

dom0-2:~ # ./bw
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s)
       512   56.8122   18.8999            1024
      1024   56.5486   18.9879            1024
      2048   56.2555   19.0869            1024
      4096    14.886   72.1311            1024
      8192    15.461   69.4486            1024

so it looks like the previous result was at the low end of the statistical variation.



These changes are volatile so a reboot clears them in the event you're
unable to change them back to the defaults for any reason.  This is
easily avoidable if you simply cat the files and write down the values
before changing them.  After testing, echo the default values back in.
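For instance, the save-and-restore workflow might look like this (a sketch; sdc is the device tested above, and the values shown are examples, not guaranteed defaults):

```shell
# Record the current (default) values before experimenting
cat /sys/block/sdc/queue/scheduler       # e.g. noop [deadline] cfq
cat /sys/block/sdc/queue/read_ahead_kb   # e.g. 128

# ... run the dd / bw tests with different settings ...

# Afterwards, echo the recorded defaults back in
echo deadline > /sys/block/sdc/queue/scheduler
echo 128 > /sys/block/sdc/queue/read_ahead_kb
```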

I did some testing on a newer system with an AOC-USAS-S4i Adaptec AACRAID controller on a Supermicro board, using the aacraid driver. This controller supports RAID0, 1 and 10, but by configuring it to publish the disks as four single-disk RAID0 arrays to Linux (the controller cannot do JBOD), we obtained much better performance with Linux software RAID0, striping with LVM, or LVM on top of RAID0, than with RAID0 managed by the controller: 300 to 350 MB/s sustained write performance versus about 150 MB/s when the controller does the striping.

We use 4 ST32000644NS drives.
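As a sketch of the layout described above (device names, chunk size and LV size are assumptions, not values taken from this thread), the four controller-exported disks could be striped either with md or with LVM:

```shell
# Four single-disk RAID0 volumes exported by the controller,
# striped into one md RAID0 array (hypothetical device names)
mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=64 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Alternatively, stripe across the same disks with LVM
pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
vgcreate vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
lvcreate -i 4 -I 64 -L 100G -n d1 vg
```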

Repeating the tests on this system gives similar results, except that the 2 TB drives have about 50% better write performance.


capture4:~ # cat  /sys/block/sdc/queue/read_ahead_kb
128
capture4:~ # cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq

capture4:~ # ./bw /dev/sdc1
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s)
      8192    8.5879    125.03            1024
      4096   8.54407   125.671            1024
      2048   65.0727   16.5007            1024

Note the performance drop by roughly a factor of 8 when halving the bs from 4096 to 2048.
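The actual factor can be read off the table above (rates in the bw tool's units; a quick arithmetic check, nothing more):

```python
# Write rates for bs=4096 and bs=2048, from the bw output above
rate_4096 = 125.671
rate_2048 = 16.5007

print(round(rate_4096 / rate_2048, 1))  # -> 7.6, close to a factor of 8
```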

Reading the drive is 8.8% faster and performs the same at all block sizes:

capture4:~ # ./br /dev/sdc1
Reading 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s)
       512   7.86782   136.473            1024
      1024   7.85202   136.747            1024
      2048   7.85979   136.612            1024
      4096   7.86932   136.447            1024
      8192    7.8509   136.767            1024
 
dd gives similar results:
capture4:~ # dd if=/dev/sdc1 of=/dev/null bs=512 count=2097152
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB) copied, 7.85281 s, 137 MB/s

Dion
