
Re: LVM write performance



On 08/20/2011 12:53 AM, Stan Hoeppner wrote:
On 8/19/2011 4:38 PM, Dion Kant wrote:

I now think I understand the "strange" behaviour for block sizes not an
integral multiple of 4096 bytes. (Of course you guys already knew the
answer but just didn't want to make it easy for me to find the answer.)

Newer disks have a sector size of 4096 bytes. They may still report 512 bytes, but this is just to keep some ancient OSes working.

When a write is not an integral multiple of 4096 bytes, for example 512,
4095 or 8191 bytes, the driver must first read the sector, modify
it and finally write it back to the disk. This explains the bi (blocks in)
and the increased number of interrupts.

I did some Google searches but did not find much. Can someone confirm
this hypothesis?

The read-modify-write performance penalty of unaligned partitions on the
"Advanced Format" drives (4KB native sectors) is a separate unrelated issue.

As I demonstrated earlier in this thread, the performance drop seen when
using dd with block sizes less than 4KB affects traditional 512B/sector
drives as well.  If one has a misaligned partition on an Advanced Format
drive, one takes a double performance hit when dd bs is less than 4KB.

Again, everything in (x86) Linux is optimized around the 'magic' 4KB
size, including page size, filesystem block size, and LVM block size.
Ok, I have done some browsing through the kernel sources and understand the VFS a bit better now. When a read or write is issued on a block device file, the block size is 4096 bytes, i.e. reads and writes to the disk are done in blocks equal to the page cache size: the magic 4 KB.

Submitting a request with a block size which is not an integral multiple of 4096 bytes results in a call to ll_rw_block(READ, 1, &bh), which reads 4096-byte blocks, one by one, into the page cache. This must be done before the user data can be used to partially update the corresponding buffer page in the cache. After being updated, the buffer is flagged dirty and finally written to disk (8 sectors of 512 bytes).
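The read-before-write behaviour can be modelled with a short sketch (a hypothetical model, not kernel code): each write() that only partially covers a 4096-byte page forces a read of that page first, unless the page is already in the page cache. For dd bs=4095 count=2 this predicts exactly the two READs logged below (block_dump reports 512-byte sector numbers, so page 1 shows up as block 8):

```python
PAGE = 4096

def rmw_pages(writes):
    """Given a sequence of (offset, length) write() calls, return the
    page numbers that must first be read from disk: pages the write
    only partially covers and that are not yet in the page cache."""
    cached = set()
    reads = []
    for off, length in writes:
        end = off + length
        for p in range(off // PAGE, (end - 1) // PAGE + 1):
            page_start, page_end = p * PAGE, (p + 1) * PAGE
            partial = off > page_start or end < page_end
            if partial and p not in cached:
                reads.append(p)
            cached.add(p)
    return reads

# dd bs=4095 count=2: two 4095-byte writes -> both pages need a read first
print(rmw_pages([(0, 4095), (4095, 4095)]))    # -> [0, 1]
# dd bs=4096: page-aligned writes need no prior reads
print(rmw_pages([(0, 4096), (4096, 4096)]))    # -> []
```

Note that the second 4095-byte write also touches page 0 only partially, but by then page 0 is already resident, so only page 1 triggers a read.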

I found a nice debugging switch which helps monitoring the process.

echo 1 > /proc/sys/vm/block_dump

makes all bio requests be logged to the kernel log.

Example:

dd of=/dev/vg/d1 if=/dev/zero bs=4095 count=2 conv=sync

[  239.977384] dd(6110): READ block 0 on dm-3
[  240.026952] dd(6110): READ block 8 on dm-3
[  240.027735] dd(6110): WRITE block 0 on dm-3
[  240.027754] dd(6110): WRITE block 8 on dm-3

The ll_rw_block(READ, 1, &bh) calls cause the reads which can be seen when monitoring with vmstat. The tests given below (as you requested) were carried out before I gained a better understanding of the VFS. The questions I still have are:

1. Why are the partial block updates (through
ll_rw_block(READ, 1, &bh)) so dramatically slow compared to other reads from the disk?

2. Recall the much better performance I reported when mounting a file system on the block device first, before accessing the disk through the block device file. If I find some more spare time I will do some more digging in the kernel. Maybe I will find that the Virtual Filesystem Switch then uses a different set of f_ops for accessing the raw block device.


BTW, did you run your test with each of the elevators, as I recommended? Do the following, testing dd after each change.

$ echo 128 > /sys/block/sdc/queue/read_ahead_kb

dom0-2:~ # echo deadline > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq 
dom0-2:~ # ./bw 
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s) 
       512   54.0373   19.8704            1024
      1024   54.2937   19.7765            1024
      2048   52.1781   20.5784            1024
      4096    13.751   78.0846            1024
      8192   13.8519   77.5159            1024

dom0-2:~ # echo noop > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq 
dom0-2:~ # ./bw 
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s) 
       512   53.9634   19.8976            1024
      1024   52.0421   20.6322            1024
      2048   54.0437    19.868            1024
      4096   13.9612   76.9088            1024
      8192   13.8183   77.7043            1024

dom0-2:~ # echo cfq > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop deadline [cfq] 
dom0-2:~ # ./bw 
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s) 
       512   56.0087    19.171            1024
      1024    56.345   19.0565            1024
      2048   56.0436    19.159            1024
      4096   15.1232   70.9999            1024
      8192   15.4236   69.6168            1024

Also, just for fun, and interesting results, increase your read_ahead_kb
from the default 128 to 512.

$ echo 512 > /sys/block/sdX/queue/read_ahead_kb
$ echo deadline > /sys/block/sdX/queue/scheduler
dom0-2:~ # ./bw
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s)
       512   54.1023   19.8465            1024
      1024   52.1824   20.5767            1024
      2048   54.3797   19.7453            1024
      4096   13.7252   78.2315            1024
      8192    13.727   78.2211            1024

$ echo noop > /sys/block/sdX/queue/scheduler
dom0-2:~ # ./bw
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s)
       512   54.0853   19.8527            1024
      1024    54.525   19.6927            1024
      2048   50.6829   21.1855            1024
      4096   14.1272   76.0051            1024
      8192    13.914   77.1701            1024

$ echo cfq > /sys/block/sdX/queue/scheduler
dom0-2:~ # ./bw
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s)
       512   56.0274   19.1646            1024
      1024   55.7614    19.256            1024
      2048   56.5394    18.991            1024
      4096   16.0562   66.8739            1024
      8192   17.3842   61.7654            1024

Differences between deadline and noop are on the order of 2 to 3% in favour of deadline. The run with the cfq elevator is remarkable: it clearly performs worse, about 20% less (compared to the highest result) for the 512 read_ahead_kb case. Another try with the same settings:

dom0-2:~ # ./bw
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s)
       512   56.8122   18.8999            1024
      1024   56.5486   18.9879            1024
      2048   56.2555   19.0869            1024
      4096    14.886   72.1311            1024
      8192    15.461   69.4486            1024

so it looks like the previous result was at the low end of the statistical variation.



These changes are volatile so a reboot clears them in the event you're
unable to change them back to the defaults for any reason.  This is
easily avoidable if you simply cat the files and write down the values
before changing them.  After testing, echo the default values back in.
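For instance, the save-and-restore workflow might look like this (a sketch; sdc is the device tested above, and the values shown are examples, not guaranteed defaults):

```shell
# Record the current (default) values before experimenting
cat /sys/block/sdc/queue/scheduler       # e.g. noop [deadline] cfq
cat /sys/block/sdc/queue/read_ahead_kb   # e.g. 128

# ... run the dd / bw tests with different settings ...

# Afterwards, echo the recorded defaults back in
echo deadline > /sys/block/sdc/queue/scheduler
echo 128 > /sys/block/sdc/queue/read_ahead_kb
```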

I did some testing on a newer system with an AOC-USAS-S4i Adaptec AACRAID controller on a Supermicro board, using the aacraid driver. This controller supports RAID0, 1 and 10, but by configuring it to publish the disks as four single-disk RAID0 arrays to Linux (the controller cannot do JBOD), we obtained much better performance with Linux software RAID0, striping with LVM, or LVM on top of RAID0, than with RAID0 managed by the controller: 300 to 350 MB/s sustained write performance versus about 150 MB/s when the controller does the striping.

We use 4 ST32000644NS drives.
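As a sketch of the layout described above (device names, chunk size and LV size are assumptions, not values taken from this thread), the four controller-exported disks could be striped either with md or with LVM:

```shell
# Four single-disk RAID0 volumes exported by the controller,
# striped into one md RAID0 array (hypothetical device names)
mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=64 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Alternatively, stripe across the same disks with LVM
pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
vgcreate vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
lvcreate -i 4 -I 64 -L 100G -n d1 vg
```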

Repeating the tests on this system gives similar results, except that the 2 TB drives have about 50% better write performance.


capture4:~ # cat  /sys/block/sdc/queue/read_ahead_kb
128
capture4:~ # cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq

capture4:~ # ./bw /dev/sdc1
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s)
      8192    8.5879    125.03            1024
      4096   8.54407   125.671            1024
      2048   65.0727   16.5007            1024

Note the performance drop by roughly a factor of 8 when halving the bs from 4096 to 2048.
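The actual factor can be read off the table above (rates in the bw tool's units; a quick arithmetic check, nothing more):

```python
# Write rates for bs=4096 and bs=2048, from the bw output above
rate_4096 = 125.671
rate_2048 = 16.5007

print(round(rate_4096 / rate_2048, 1))  # -> 7.6, close to a factor of 8
```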

Reading the drive is 8.8% faster and performs the same at all block sizes:

capture4:~ # ./br /dev/sdc1
Reading 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s)
       512   7.86782   136.473            1024
      1024   7.85202   136.747            1024
      2048   7.85979   136.612            1024
      4096   7.86932   136.447            1024
      8192    7.8509   136.767            1024
 
dd gives similar results:
capture4:~ # dd if=/dev/sdc1 of=/dev/null bs=512 count=2097152
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB) copied, 7.85281 s, 137 MB/s

Dion
