
Re: LVM write performance



On 8/14/2011 2:14 AM, Dion Kant wrote:
> On 08/13/2011 03:55 PM, Stan Hoeppner wrote:
>> My explanation to you wasn't fully correct.  I confused specifying no
>> block size with specifying an insanely large block size.  The other post
>> I was referring to dealt with people using a 1GB (or larger) block size
>> because it made the math easier for them when wanting to write a large
>> test file.
> Ok, that makes sense. It is the large specified bs that makes dd
> buffer the data first.

Yep, it's really dramatic on machines with low memory due to swapping.
When I first tested this phenomenon with a 1GB dd block size on my
machine with only 384 MB RAM and a 1GB swap partition, it took many
minutes to complete, vs tens of seconds using a 4kB block size.  Almost
all of the 2GB test file was being pushed into swap, then read back from
swap and written to the file--swap and file on the same physical disk.
This is one of the reasons I keep this old machine around--problems of
this nature show up more quickly and are more easily identified.
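
To make the difference concrete, here are the two invocation styles side
by side, writing a small scratch file (paths and sizes are illustrative;
the tests discussed in this thread wrote 2GB to a raw device):

```shell
# Proper method: modest block size, count scaled so bs*count = total size.
dd if=/dev/zero of=/tmp/dd-proper.img bs=4096 count=2048 conv=fsync

# Problematic method: one huge block. dd allocates the entire 8 MiB
# (or 2 GiB, in the cases discussed) in user space before writing,
# which pushes a low-memory machine straight into swap.
dd if=/dev/zero of=/tmp/dd-huge.img bs=8M count=1 conv=fsync
```

Both commands write the same 8 MiB; only the buffering behavior differs.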

>> Instead of dividing their total file size by 4096 and using the result
>> for "bs=4096 count=X" (which is the proper method I described to you)
>> they were simply specifying, for example, "bs=2G count=1" to write a 2
>> GB test file.  Doing this causes the massive buffering I described, and
>> consequently, horrible performance, typically by a factor of 10 or more,
>> depending on the specific system.
>>
>> The horrible performance with bs=512 is likely due to the LVM block
>> size being 4096, forcing block writes 1/8th the normal size and
>> causing lots of merging.  Dividing 120MB/s by 8 gives 15MB/s, close
>> to the ~19MB/s write rate you reported in your original post.
> Recall that I took LVM out of the loop already. So now I am doing the
> experiment with writing data straight to the block device. In my case
> /dev/sdb4. (If writing on the block device level does not perform, how
> will LVM be able to perform?)
> 
> Inspired by your advice, I did some more investigation on this. I wrote
> a small test program, i.e. taking dd out of the loop as well. It writes
> 1 GB test data with increasing block sizes directly to /dev/sdb4. Here
> are some results:
> 
> root@dom0-2:~# ./bw
> Writing 1 GB
>      bs         time    rate
>    (bytes)       (s)   (MiB/s)
>          1    85.476   12.5619
>          2    33.016   32.5218
>          4   23.6675   45.3679
>          8    20.112   53.3881
>         16     18.76   57.2356
>         32    17.872   60.0795
>         64    17.636   60.8834
>        128    17.096   62.8064
>        256    17.188   62.4704
>        512   16.8482   63.7303
>       1024   57.6053   18.6396
>       2048     57.94    18.532
>       4096    17.016   63.1019
>       8192    16.604   64.6675
>      16384    16.452   65.2649
>      32768    17.132   62.6748
>      65536    16.256    66.052
>     131072     16.44   65.3127
>     262144    16.264   66.0194
>     524288    16.388   65.5199

The dips at 1024 & 2048 are strange, but not entirely unexpected.

> The good and problematic block sizes do not really coincide with the
> ones I observe with dd, but the odd behaviour is there. There are some
> magic block sizes {1, 1024, 2048} which cause a drop in performance.
> Looking at vmstat output at the same time I see unexpected bi and the
> interrupt rate goes sky high.
> 
> In my case it is the ahci driver handling the writes. Here is the vmstat
> trace belonging to the bs=1 write and I add some more observations below:

Yeah, every platform will have quirks.

> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  0  0      0 6379780  23820 112616    0    0     0     0   78   82  0  0 100  0
>  0  0      0 6379780  23820 112616    0    0     0     0   77   80  0  0 100  0
>  0  0      0 6379780  23820 112616    0    0     0     0   79   80  0  0 100  0
>  0  0      0 6379780  23820 112616    0    0     0     0   78   82  0  0 100  0
>  0  0      0 6379780  23820 112616    0    0     0     0   76   80  0  0 100  0
>  0  0      0 6379780  23820 112616    0    0     0     0   77   83  0  0 100  0
>  0  0      0 6379780  23820 112616    0    0     0     0   75   80  0  0 100  0
>  0  0      0 6379780  23820 112616    0    0     0     0   82   82  0  0 100  0
>  0  0      0 6379780  23820 112616    0    0     0     0   90   93  0  0 100  0
>  1  0      0 6376796  27132 112524    0    0   828     4  400  531  0  0 100  0
>  1  0      0 6346416  57496 112560    0    0  7590     0 2408 3877  5  0 92  3
>  1  0      0 6315788  88048 112580    0    0  7638     0 2435 3903  7  0 90  3
>  1  1      0 6284416 118548 112540    0    0  7624     0 2428 3903  6  0 91  3
>  1  0      0 6253168 148896 112564    0    0  7586     0 2403 3875  6  0 91  3
>  1  0      0 6221920 179284 112484    0    0  7596     0 2414 3884  5  0 93  2
>  1  0      0 6190672 209648 112540    0    0  7590     0 2417 3877  5  0 93  2
>  0  1      0 6160540 239796 112536    0    0  7540     0 6240 3851  6  0 76 18
>  0  1      0 6129540 269952 112584    0    0  7538     0 6255 3856  6  0 86  8
>  1  0      0 6098292 300116 112504    0    0  7540     0 6233 3853  5  0 89  6
>  1  0      0 6067540 330280 112552    0    0  7538     0 6196 3857  6  0 87  7
>  1  0      0 6036540 360452 112536    0    0  7542     0 6281 3868  5  0 89  6
>  1  0      0 6005540 390608 112464    0    0  7540     0 6268 3856  6  0 85  8
>  1  0      0 5974292 420788 112516    0    0  7542     0 6246 3865  6  0 86  7
>  1  0      0 5943416 450952 112444    0    0  7540     0 6253 3860  5  0 88  6
>  1  0      0 5912540 481128 112488    0    0  7546     0 6226 3861  6  0 86  7
>  1  0      0 5881292 511300 112472    0    0  7540     0 6225 3860  5  0 89  6
>  1  0      0 5850292 541456 112464    0    0  7538     0 6192 3858  6  0 86  7
>  0  2      0 5817664 570260 112516    0    0  7200 40706 5990 4820  6  0 81 13
>  0  2      0 5789268 597752 112472    0    0  6870     0 5775 5251  5  0 80 15
>  1  1      0 5760996 625164 112676    0    0  6854  8192 5795 5248  5  0 73 21
>  1  1      0 5732476 653232 112572    0    0  7014  8192 5285 5362  5  0 82 13
>  1  1      0 5704080 680924 112676    0    0  6922     0 2340 5290  3  0 92  5
>  1  1      0 5674504 709444 112540    0    0  7130  8192 2404 5469  5  0 71 24
>  1  1      0 5646184 737144 112484    0    0  6924     0 2320 5293  5  0 85 10
>  1  1      0 5617460 765004 112484    0    0  6966  8192 5844 5329  5  0 75 20
>  2  2      0 5588264 793288 112500    0    0  7068  8192 5313 5404  4  0 85 11
>  1  1      0 5559556 821084 112628    0    0  6948     0 2326 5309  8  0 78 14
>  1  1      0 5530468 849304 112476    0    0  7054  8192 2374 5395  5  0 75 20
>  1  1      0 5501892 876956 112464    0    0  6912  8192 2321 5285  5  0 85 10
>  0  2      0 5472936 905044 112584    0    0  7024     0 5889 5370  5  0 70 25
>  0  2      0 5444476 933096 112596    0    0  7010  8192 5874 5360  4  0 82 13
>  0  2      0 5415520 960924 112476    0    0  6960     0 5841 5323  6  0 70 24
>  1  1      0 5386580 989096 112696    0    0  7038  8192 5282 5384  6  0 69 25
>  2  2      0 5357624 1017164 112688    0    0  7016     0 2358 5362  4  0 89  7
>  1  1      0 5328428 1045280 112580    0    0  7028  8192 2356 5379  5  0 80 15
>  0  2      0 5296688 1072396 112540    0    0  6778 50068 2314 5194  0  0 99  1
>  0  2      0 5297044 1072396 112616    0    0     0 64520  317  176  0  0 75 24
>  0  2      0 5297044 1072396 112616    0    0     0 64520  310  175  0  0 77 23
>  0  2      0 5297044 1072396 112616    0    0     0 64520  300  161  0  0 85 15
>  0  2      0 5297052 1072396 112616    0    0     0 72204  317  180  0  0 77 22
>  0  2      0 5297052 1072396 112616    0    0     0 64520  307  170  0  0 84 16
>  0  1      0 5300540 1072396 112616    0    0     0 21310  309  203  0  0 98  2
>  1  0      0 6351440  52252 112680    0    0    54    25  688  343  1  1 63 35
>  1  0      0 6269720 133036 112600    0    0     0     0  575   88  7  0 93  0
>  1  0      0 6186516 213812 112560    0    0     0     0  568   83  9  0 91  0
>  1  0      0 6103560 294588 112512    0    0     0     0  569   85  6  0 94  0
>  1  0      0 6020852 375428 112688    0    0     0     0  571   84  9  0 90  0
>  1  0      0 5937896 456244 112664    0    0     0     0  571   86  7  0 93  0
> 
> Writing to /dev/sdb4 starts when there is a step in the interrupt
> column. As long as the interrupts are high, there is bi related to this
> writing. After initial buffering there is a first write to the disk at
> 40MB/s averaged over 2 seconds. Then there are only a couple of 8MB/s
> writes while the (kernel) buffer grows to 1072396 kB. Then the driver
> starts writing at the expected rates and the interrupt rate drops to a
> reasonable level. Only at the end of the write does the ahci driver
> give back its buffer memory. After this, when the interrupt rate goes
> to a level of about 570, the ahci driver is swallowing the second
> write iteration with a block size of 2 bytes.

What do you see when you insert large delays between iterations, or run
each iteration after clearing the caches, i.e.
# echo 3 > /proc/sys/vm/drop_caches
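
A sketch of what I mean, as a loop (the target path here is just an
example scratch file--point it at the raw device when running as root;
the drop_caches write is skipped when not permitted):

```shell
# Hedged sketch: time a few block sizes with dd, syncing and dropping
# caches between runs so each iteration starts cold.
TARGET=/tmp/bw-test.img          # example path; use /dev/sdb4 as root
TOTAL=$((16 * 1024 * 1024))      # 16 MiB per run, kept small here
for BS in 512 1024 2048 4096; do
    sync
    { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true  # needs root
    dd if=/dev/zero of=$TARGET bs=$BS count=$((TOTAL / BS)) conv=fsync \
        2>&1 | tail -n 1         # last line of dd's output is the rate
done
```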

> Here is the code fragment responsible for writing and measuring:
> 
>         sync();
>         gettimeofday(&tstart, &tz);
>         for (int i=0; i<Ntot/N; ++i)
>            sdb4->write(buf,N);
>         sdb4->flush();
>         sdb4->close();
>         sync();
>         gettimeofday(&tstop, &tz);
> 
> N is the block size and sdb4 is
> 
>         ofstream* sdb4 = new ofstream("/dev/sdb4", ofstream::binary);
> 
> 
> I think Stan is right that this may be something in the ahci kernel driver.

Could be.  Could just need tweaking, say queue_depth, elevator, etc.
Did you test with all 3 elevators or just one?
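
For reference, the knobs I mean live in sysfs. The device name is the
one from your mails; substitute your own, and note that changing either
setting requires root:

```shell
DEV=sdb                                  # device under test
cat /sys/block/$DEV/queue/scheduler      # e.g. "noop deadline [cfq]"
cat /sys/block/$DEV/device/queue_depth   # current NCQ queue depth
# As root, switch elevators between runs, e.g.:
# echo deadline > /sys/block/$DEV/queue/scheduler
```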

> I have some 3ware controller lying around. I might repeat the
> experiments with this and post them here if someone is interested.

If it doesn't have the hardware write cache enabled you will likely see
worse performance than with the current drive/controller.

-- 
Stan

