
Re: LVM write performance



On 08/13/2011 03:55 PM, Stan Hoeppner wrote:
> My explanation to you wasn't fully correct.  I confused specifying no
> block size with specifying an insanely large block size.  The other post
> I was referring to dealt with people using a 1GB (or larger) block size
> because it made the math easier for them when wanting to write a large
> test file.
Ok, that makes sense. It is the large specified bs that makes dd buffer
the data first.

> Instead of dividing their total file size by 4096 and using the result
> for "bs=4096 count=X" (which is the proper method I described to you)
> they were simply specifying, for example, "bs=2G count=1" to write a 2
> GB test file.  Doing this causes the massive buffering I described, and
> consequently, horrible performance, typically by a factor of 10 or more,
> depending on the specific system.
>
> The horrible performance with bs=512 is likely due to the LVM block size
> being 4096, and forcing block writes that are 1/8th normal size, causing
> lots of merging.  If you divide 120MB/s by 8 you get 15MB/s, which IIRC
> from your original post, is approximately the write performance you were
> seeing, which was 19MB/s.
Recall that I have already taken LVM out of the loop, so now I am doing
the experiment writing data straight to the block device, in my case
/dev/sdb4. (If writing at the block device level does not perform well,
how will LVM be able to perform?)

Inspired by your advice, I did some more investigation on this. I wrote
a small test program, i.e. taking dd out of the loop as well. It writes
1 GB of test data with increasing block sizes directly to /dev/sdb4.
Here are some results:

root@dom0-2:~# ./bw
Writing 1 GB
     bs         time    rate
   (bytes)       (s)   (MiB/s)
         1    85.476   12.5619
         2    33.016   32.5218
         4   23.6675   45.3679
         8    20.112   53.3881
        16     18.76   57.2356
        32    17.872   60.0795
        64    17.636   60.8834
       128    17.096   62.8064
       256    17.188   62.4704
       512   16.8482   63.7303
      1024   57.6053   18.6396
      2048     57.94    18.532
      4096    17.016   63.1019
      8192    16.604   64.6675
     16384    16.452   65.2649
     32768    17.132   62.6748
     65536    16.256    66.052
    131072     16.44   65.3127
    262144    16.264   66.0194
    524288    16.388   65.5199

The good and problematic block sizes do not really coincide with the
ones I observe with dd, but the odd behaviour is there. There are some
magic block sizes {1, 1024, 2048} which cause a drop in performance.
Looking at the vmstat output at the same time I see unexpected bi
(blocks read in) and the interrupt rate goes sky high.

In my case it is the ahci driver handling the writes. Here is the vmstat
trace belonging to the bs=1 write; I add some more observations below it:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 6379780  23820 112616    0    0     0     0   78   82  0  0 100  0
 0  0      0 6379780  23820 112616    0    0     0     0   77   80  0  0 100  0
 0  0      0 6379780  23820 112616    0    0     0     0   79   80  0  0 100  0
 0  0      0 6379780  23820 112616    0    0     0     0   78   82  0  0 100  0
 0  0      0 6379780  23820 112616    0    0     0     0   76   80  0  0 100  0
 0  0      0 6379780  23820 112616    0    0     0     0   77   83  0  0 100  0
 0  0      0 6379780  23820 112616    0    0     0     0   75   80  0  0 100  0
 0  0      0 6379780  23820 112616    0    0     0     0   82   82  0  0 100  0
 0  0      0 6379780  23820 112616    0    0     0     0   90   93  0  0 100  0
 1  0      0 6376796  27132 112524    0    0   828     4  400  531  0  0 100  0
 1  0      0 6346416  57496 112560    0    0  7590     0 2408 3877  5  0 92  3
 1  0      0 6315788  88048 112580    0    0  7638     0 2435 3903  7  0 90  3
 1  1      0 6284416 118548 112540    0    0  7624     0 2428 3903  6  0 91  3
 1  0      0 6253168 148896 112564    0    0  7586     0 2403 3875  6  0 91  3
 1  0      0 6221920 179284 112484    0    0  7596     0 2414 3884  5  0 93  2
 1  0      0 6190672 209648 112540    0    0  7590     0 2417 3877  5  0 93  2
 0  1      0 6160540 239796 112536    0    0  7540     0 6240 3851  6  0 76 18
 0  1      0 6129540 269952 112584    0    0  7538     0 6255 3856  6  0 86  8
 1  0      0 6098292 300116 112504    0    0  7540     0 6233 3853  5  0 89  6
 1  0      0 6067540 330280 112552    0    0  7538     0 6196 3857  6  0 87  7
 1  0      0 6036540 360452 112536    0    0  7542     0 6281 3868  5  0 89  6
 1  0      0 6005540 390608 112464    0    0  7540     0 6268 3856  6  0 85  8
 1  0      0 5974292 420788 112516    0    0  7542     0 6246 3865  6  0 86  7
 1  0      0 5943416 450952 112444    0    0  7540     0 6253 3860  5  0 88  6
 1  0      0 5912540 481128 112488    0    0  7546     0 6226 3861  6  0 86  7
 1  0      0 5881292 511300 112472    0    0  7540     0 6225 3860  5  0 89  6
 1  0      0 5850292 541456 112464    0    0  7538     0 6192 3858  6  0 86  7
 0  2      0 5817664 570260 112516    0    0  7200 40706 5990 4820  6  0 81 13
 0  2      0 5789268 597752 112472    0    0  6870     0 5775 5251  5  0 80 15
 1  1      0 5760996 625164 112676    0    0  6854  8192 5795 5248  5  0 73 21
 1  1      0 5732476 653232 112572    0    0  7014  8192 5285 5362  5  0 82 13
 1  1      0 5704080 680924 112676    0    0  6922     0 2340 5290  3  0 92  5
 1  1      0 5674504 709444 112540    0    0  7130  8192 2404 5469  5  0 71 24
 1  1      0 5646184 737144 112484    0    0  6924     0 2320 5293  5  0 85 10
 1  1      0 5617460 765004 112484    0    0  6966  8192 5844 5329  5  0 75 20
 2  2      0 5588264 793288 112500    0    0  7068  8192 5313 5404  4  0 85 11
 1  1      0 5559556 821084 112628    0    0  6948     0 2326 5309  8  0 78 14
 1  1      0 5530468 849304 112476    0    0  7054  8192 2374 5395  5  0 75 20
 1  1      0 5501892 876956 112464    0    0  6912  8192 2321 5285  5  0 85 10
 0  2      0 5472936 905044 112584    0    0  7024     0 5889 5370  5  0 70 25
 0  2      0 5444476 933096 112596    0    0  7010  8192 5874 5360  4  0 82 13
 0  2      0 5415520 960924 112476    0    0  6960     0 5841 5323  6  0 70 24
 1  1      0 5386580 989096 112696    0    0  7038  8192 5282 5384  6  0 69 25
 2  2      0 5357624 1017164 112688    0    0  7016     0 2358 5362  4  0 89  7
 1  1      0 5328428 1045280 112580    0    0  7028  8192 2356 5379  5  0 80 15
 0  2      0 5296688 1072396 112540    0    0  6778 50068 2314 5194  0  0 99  1
 0  2      0 5297044 1072396 112616    0    0     0 64520  317  176  0  0 75 24
 0  2      0 5297044 1072396 112616    0    0     0 64520  310  175  0  0 77 23
 0  2      0 5297044 1072396 112616    0    0     0 64520  300  161  0  0 85 15
 0  2      0 5297052 1072396 112616    0    0     0 72204  317  180  0  0 77 22
 0  2      0 5297052 1072396 112616    0    0     0 64520  307  170  0  0 84 16
 0  1      0 5300540 1072396 112616    0    0     0 21310  309  203  0  0 98  2
 1  0      0 6351440  52252 112680    0    0    54    25  688  343  1  1 63 35
 1  0      0 6269720 133036 112600    0    0     0     0  575   88  7  0 93  0
 1  0      0 6186516 213812 112560    0    0     0     0  568   83  9  0 91  0
 1  0      0 6103560 294588 112512    0    0     0     0  569   85  6  0 94  0
 1  0      0 6020852 375428 112688    0    0     0     0  571   84  9  0 90  0
 1  0      0 5937896 456244 112664    0    0     0     0  571   86  7  0 93  0

Writing to /dev/sdb4 starts where there is a step in the interrupt
column. As long as the interrupt rate stays high there is bi activity
related to this write. After the initial buffering there is a first
write to the disk at 40 MB/s averaged over 2 seconds. Then there are
only a couple of 8 MB/s writes, and in the meantime the (kernel) buffer
grows up to 1072396 kB. Then the driver starts writing at the expected
rate (bo around 64520, i.e. roughly the 63 MiB/s from the table) and the
interrupt rate drops to a reasonable level. It is only at the end of the
write that the ahci driver gives back its buffer memory. After this,
when the interrupt rate settles at a level of about 570, the ahci driver
is swallowing the second write iteration, with a block size of 2 bytes.

Here is the code fragment responsible for writing and measuring:

        sync();                           // flush anything still dirty before starting the clock
        gettimeofday(&tstart, &tz);
        for (int i = 0; i < Ntot/N; ++i)  // Ntot bytes total, N bytes per write
           sdb4->write(buf, N);
        sdb4->flush();                    // flush the ofstream's own buffer
        sdb4->close();
        sync();                           // wait for the data to be written out before stopping the clock
        gettimeofday(&tstop, &tz);

N is the block size and sdb4 is

        ofstream* sdb4 = new ofstream("/dev/sdb4", ofstream::binary);
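
For anyone who wants to reproduce this, below is a stripped-down sketch
of such a test program. It is not my exact code -- the 1 GiB total, the
1 MiB zero-filled buffer and the doubling sweep from 1 byte up to 512 KiB
are just reasonable assumptions -- but it shows the structure:

    // Illustrative sketch only: sweep the block size, write ~1 GB per
    // step through an ofstream, and print the elapsed time and rate.
    #include <fstream>
    #include <iostream>
    #include <unistd.h>       // sync()
    #include <sys/time.h>     // gettimeofday()

    int main()
    {
        const long long Ntot = 1LL << 30;      // ~1 GB of test data
        static char buf[1 << 20];              // zero-filled source buffer

        std::cout << "Writing 1 GB\n";
        for (long long N = 1; N <= (1 << 19); N *= 2) {
            std::ofstream sdb4("/dev/sdb4", std::ofstream::binary);

            sync();                            // flush dirty pages before timing
            timeval tstart, tstop;
            gettimeofday(&tstart, 0);

            for (long long i = 0; i < Ntot / N; ++i)
                sdb4.write(buf, N);
            sdb4.flush();
            sdb4.close();
            sync();                            // include the time to get it on disk

            gettimeofday(&tstop, 0);
            double t = (tstop.tv_sec - tstart.tv_sec)
                     + (tstop.tv_usec - tstart.tv_usec) / 1e6;
            std::cout << N << "\t" << t << "\t"
                      << Ntot / t / (1 << 20) << " MiB/s\n";
        }
        return 0;
    }

Note that this still writes through the page cache (hence the final
sync()), so the kernel buffering visible in the vmstat trace above is
included in the measured time.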


I think Stan is right that this may be something in the ahci kernel driver.

I have a 3ware controller lying around. I might repeat the experiments
with it and post the results here if anyone is interested.

Dion

> If my explanation doesn't seem thorough enough that's because I'm not a
> kernel expert.  I just have a little better than average knowledge/
> understanding of some aspects of the kernel.
>
> If you want a really good explanation of the reasons behind this dd
> block size behavior while writing to a raw LVM device, try posting to
> lkml proper or one of the sub lists dealing with LVM and the block
> layer.  Also, I'm sure some of the expert developers on the XFS list
> could answer this as well, though it would be a little OT there, unless
> of course your filesystem test yielding the 120MB/s was using XFS. ;)
>
> -- Stan

