
precalculated checksums for rsync [increases mirrors a bit]



Hi,

I'm working a bit more on rsync and precalculated checksums.

The idea is to calculate the checksums of each block during upload and
store them in an accompanying file (.rsyncsum.<file>?).

This will greatly reduce the load generated by rsync, since it no
longer has to go through each file on the fly to calculate checksums,
but it will increase the mirror size.
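
A minimal sketch of the upload-time step, to make the idea concrete.
Assumptions, none of them from rsync itself: the function names are
hypothetical, the 20 bytes per block are built from a 4-byte Adler-32
plus a 16-byte MD5 as stand-ins for rsync's weak and strong checksums,
and the sidecar is a bare concatenation of records with no header.

```python
import hashlib
import zlib
from pathlib import Path
from typing import BinaryIO, Iterator


def iter_block_checksums(stream: BinaryIO, block_size: int) -> Iterator[bytes]:
    """Yield one 20-byte record per block read from `stream`."""
    while True:
        block = stream.read(block_size)
        if not block:
            return
        # Stand-ins (assumption): Adler-32 as the weak checksum,
        # MD5 as the strong one. 4 + 16 = 20 bytes per block.
        weak = zlib.adler32(block).to_bytes(4, "big")
        strong = hashlib.md5(block).digest()
        yield weak + strong


def write_sidecar(path: Path, block_size: int) -> Path:
    """Write the records for `path` to .rsyncsum.<file> next to it."""
    sidecar = path.with_name(".rsyncsum." + path.name)
    with path.open("rb") as src, sidecar.open("wb") as dst:
        for record in iter_block_checksums(src, block_size):
            dst.write(record)
    return sidecar
```

The file is streamed block by block, so even very large files never
have to be held in memory while the sidecar is written.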

The crucial factor is of course the blocksize used. I suggest a
blocksize that increases with the file size. That way small diff.gz
files and small debs will be finely grained, while big files don't
waste that much space.

Out of thin air I used "blocksize^3/1024 == filesize", which would
give the following sizes (each block gives a 20-byte checksum; the
checksum size is rounded up):

File   Block    #Blocks   cksum   overhead
  1K      1K          1      1K
  2K      1K          2      1K
  4K      1K          4      1K
  8K      1K          8      1K
 16K      1K         16      1K
 32K      1K         32      1K
 64K      1K         64      2K
128K      1K        128      3K
256K      1K        256      6K
512K      1K        512     11K
  1M      1K       1024     21K   2%
  2M      2K       1024     21K   1%
  4M      2K       2048     41K
  8M      2K       4096     81K
 16M      4K       4096     81K   0.5%
 32M      4K       8192    161K
 64M      4K      16384    321K
128M      8K      16384    321K   0.25%
256M      8K      32768    641K
512M      8K      65536   1281K
  1G     16K      65536   1281K   0.12%
  2G     16K     131072   2561K
  4G     16K     262144   5121K
  8G     32K     262144   5121K   0.06%

This would mean an increase in mirror size of at most about 2% for
files < 2M, at most about 1% for files of 2-16M, and less above that.
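
The arithmetic behind the table can be sketched as follows. Two things
here are assumptions read off the table rather than stated rules: the
blocksize is rounded up to the next power of two, and it never drops
below 1K. The function names are made up for illustration.

```python
from typing import Iterable

CHECKSUM_BYTES = 20   # per block, as stated above
MIN_BLOCK = 1024      # assumption: 1K floor, as the table suggests


def block_size(file_size: int) -> int:
    """Smallest power-of-two blocksize b >= 1K with b^3/1024 >= file_size."""
    block = MIN_BLOCK
    while block ** 3 < 1024 * file_size:
        block *= 2
    return block


def checksum_bytes(file_size: int) -> int:
    """Exact size in bytes of the checksums for one file (no rounding to K)."""
    blocks = -(-file_size // block_size(file_size))  # ceiling division
    return blocks * CHECKSUM_BYTES


def total_overhead(sizes: Iterable[int]) -> float:
    """Checksum bytes as a fraction of the total size of all files."""
    sizes = list(sizes)
    total = sum(sizes)
    if total == 0:
        return 0.0
    return sum(checksum_bytes(s) for s in sizes) / total
```

For a 1M file this gives a 1K block and 20480 bytes of checksums, a bit
under 2% of the file. Feeding total_overhead() the sizes of all files
on a mirror would give the overall increase.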

Does this sound small enough to get accepted into the Debian archive?
And is the blocksize small enough for your taste?

There could also be multiple blocksizes precalculated, for example
1K and 64K blocks. That way one could do a rough sync first and then
download the smaller-block checksums only for the unmatched blocks.
That would of course need further changes in the rsync client. Does
anyone think that's worthwhile? (I actually don't think it's worth it
for files < 64M, and for bigger files the checksum file is
comparatively negligible anyway.)
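
A rough sketch of what the second pass of such a two-level scheme might
look like on the client. This is purely hypothetical, since nothing like
it exists in rsync today: given the indices of the 64K blocks that did
not match, it works out which 1K-block checksums would still have to be
downloaded. All names are invented for the example.

```python
from typing import Iterable, List, Tuple

COARSE = 64 * 1024    # rough-sync blocksize from the example above
FINE = 1024           # fine blocksize from the example above
CHECKSUM_BYTES = 20   # per block


def fine_ranges(unmatched_coarse: Iterable[int], file_size: int) -> List[Tuple[int, int]]:
    """Map unmatched coarse blocks to half-open ranges of fine-block indices."""
    per_coarse = COARSE // FINE
    total_fine = -(-file_size // FINE)  # ceiling division
    ranges: List[Tuple[int, int]] = []
    for idx in sorted(set(unmatched_coarse)):
        start = idx * per_coarse
        if start >= total_fine:
            continue  # index lies past the end of the file
        stop = min(start + per_coarse, total_fine)
        if ranges and ranges[-1][1] == start:
            ranges[-1] = (ranges[-1][0], stop)  # merge adjacent ranges
        else:
            ranges.append((start, stop))
    return ranges


def fine_checksum_bytes(ranges: List[Tuple[int, int]]) -> int:
    """Bytes of fine checksums the client would still need to fetch."""
    return sum(stop - start for start, stop in ranges) * CHECKSUM_BYTES
```

If only a few 64K blocks fail to match, the client fetches only a small
slice of the 1K checksums instead of the whole fine-grained file.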

What do you think?

Can anyone sum up the total increase this would give? I suspect it's
somewhere between 1% and 2%.

May the Source be with you.
                        Goswin


