precalculated checksums for rsync [increases mirrors a bit]
I'm working a bit more on rsync and precalculated checksums.
The idea is to calculate the checksums of each block during upload and
store them in an accompanying file (.rsyncsum.<file>?).
This will greatly reduce the load generated by rsync, since it no longer
has to go through each file on the fly to calculate checksums, but it
will increase the mirror size.
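To make the idea concrete, here is a minimal sketch of what the upload step could do. It is only an illustration, not rsync's actual code: I assume the 20 bytes per block are a 4-byte weak checksum plus a 16-byte strong digest, and I use zlib.adler32 and MD5 as stand-ins for whatever rsync really computes. The helper names and the ".rsyncsum." prefix are hypothetical.

```python
import hashlib
import os
import struct
import zlib

RECORD_SIZE = 20  # assumed layout: 4-byte weak checksum + 16-byte strong digest


def block_checksums(data: bytes, block_size: int) -> bytes:
    """Return one 20-byte record per block of `data` (last block may be short)."""
    out = bytearray()
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        weak = zlib.adler32(block) & 0xFFFFFFFF   # stand-in for rsync's rolling sum
        strong = hashlib.md5(block).digest()      # stand-in 16-byte strong hash
        out += struct.pack(">I", weak) + strong
    return bytes(out)


def sidecar_name(path: str) -> str:
    """Hypothetical naming scheme: .rsyncsum.<file> next to the original."""
    directory, name = os.path.split(path)
    return os.path.join(directory, ".rsyncsum." + name)


def write_sidecar(path: str, block_size: int) -> str:
    """Compute the checksums of `path` once and store them beside it."""
    with open(path, "rb") as src:
        records = block_checksums(src.read(), block_size)
    target = sidecar_name(path)
    with open(target, "wb") as dst:
        dst.write(records)
    return target
```

The server would then read the stored records instead of rereading the file for every client.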
The crucial factor is of course the blocksize used. I suggest a
blocksize that increases with the file size, so small diff.gz files and
small debs will be finely grained while big files don't waste that much
space.
Out of thin air I used "blocksize^3/1024 == filesize" (with the
blocksize rounded up to a power of two and never below 1K), which would
give the following sizes: (a block gives a 20 byte checksum)
File  Block  #Blocks  cksum-size (rounded up)  overhead
1K    1K     1        1K
2K    1K     2        1K
4K    1K     4        1K
8K    1K     8        1K
16K   1K     16       1K
32K   1K     32       1K
64K   1K     64       2K
128K  1K     128      3K
256K  1K     256      6K
512K  1K     512      11K
1M    1K     1024     21K                      2%
2M    2K     1024     21K                      1%
4M    2K     2048     41K
8M    2K     4096     81K
16M   4K     4096     81K                      0.5%
32M   4K     8192     161K
64M   4K     16384    321K
128M  8K     16384    321K                     0.25%
256M  8K     32768    641K
512M  8K     65536    1281K
1G    16K    65536    1281K                    0.12%
2G    16K    131072   2561K
4G    16K    262144   5121K
8G    32K    262144   5121K                    0.06%
This would increase the mirror size by about 2% for files < 2M, about
1% for 2-16M, and less above that.
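For anyone who wants to check the numbers, the blocksize rule can be sketched as follows. This is my reading of the table, assuming the blocksize is the smallest power of two (at least 1K) that satisfies blocksize^3/1024 >= filesize; the function names are made up for illustration. The exact byte counts can differ by 1K from the rounded cksum-size column.

```python
CHECKSUM_BYTES = 20   # per block, as stated above
MIN_BLOCK = 1024      # assumed lower limit of 1K


def block_size_for(file_size: int) -> int:
    """Smallest power-of-two blocksize b >= 1K with b**3 / 1024 >= file_size."""
    block = MIN_BLOCK
    while block ** 3 < file_size * 1024:
        block *= 2
    return block


def checksum_overhead(file_size: int):
    """Return (blocksize, number of blocks, exact checksum bytes) for one file."""
    block = block_size_for(file_size)
    blocks = -(-file_size // block)   # ceiling division: a short last block counts
    return block, blocks, blocks * CHECKSUM_BYTES


if __name__ == "__main__":
    for exp in range(10, 34):         # 1K .. 8G in powers of two
        size = 1 << exp
        block, blocks, cksum = checksum_overhead(size)
        print(f"{size:>11} {block:>6} {blocks:>7} {cksum:>8} "
              f"{100.0 * cksum / size:.2f}%")
```

Running it prints one row per file size in bytes, which can be compared against the table.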
Does this sound reasonably small to get accepted into the Debian
archive? And is the blocksize small enough for your taste?
There could also be multiple blocksizes precalculated, for example for
1K and 64K blocks. That way one could do a rough sync and then
download the smaller checksums for unmatched blocks. That would of
course need further changes in the rsync client. Does anyone think
that's worthwhile? (I actually don't think it's worth it for files < 64M,
and for bigger files the checksum file is comparatively negligible.)
What do you think?
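To illustrate the two-level idea, here is a rough sketch of the bookkeeping a modified client would need: given the 64K blocks that failed to match in the rough pass, which byte ranges of the 1K checksum file should it fetch? This is purely hypothetical, since no rsync client works this way today. It assumes fixed 20-byte records stored in block order, and the function name is invented.

```python
RECORD = 20          # bytes per checksum record
COARSE = 64 * 1024   # rough-pass blocksize
FINE = 1024          # fine-pass blocksize


def fine_ranges(unmatched_coarse, file_size):
    """Byte ranges (start, end) in the 1K checksum file that cover the
    given unmatched 64K block indices. Adjacent ranges are merged."""
    per_coarse = COARSE // FINE             # 64 fine blocks per coarse block
    total_fine = -(-file_size // FINE)      # number of 1K records in the file
    ranges = []
    for index in sorted(set(unmatched_coarse)):
        first = index * per_coarse
        last = min(first + per_coarse, total_fine)
        if first >= last:
            continue                        # index lies beyond the end of the file
        start, end = first * RECORD, last * RECORD
        if ranges and ranges[-1][1] == start:
            ranges[-1] = (ranges[-1][0], end)   # extend the previous range
        else:
            ranges.append((start, end))
    return ranges
```

For a 200K file where only the second and third 64K blocks changed, the client would fetch 2560 bytes of fine checksums instead of all 4000.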
Can anyone sum up the total increase this would give? I suspect it's
somewhere between 1% and 2%.
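Someone with a local mirror could get the total with a small script along these lines. It is only a sketch under the same assumptions as the table (20 bytes per block, power-of-two blocksize of at least 1K), it counts exact checksum bytes without any filesystem block rounding, and the function names are made up.

```python
import os
import sys

CHECKSUM_BYTES = 20
MIN_BLOCK = 1024


def block_size_for(file_size: int) -> int:
    """Smallest power-of-two blocksize b >= 1K with b**3 / 1024 >= file_size."""
    block = MIN_BLOCK
    while block ** 3 < file_size * 1024:
        block *= 2
    return block


def mirror_overhead(root: str):
    """Walk `root` and return (total file bytes, total checksum bytes)."""
    total_size = 0
    total_cksum = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue                      # skip symlinks and special files
            size = os.path.getsize(path)
            blocks = -(-size // block_size_for(size))   # ceiling division
            total_size += size
            total_cksum += blocks * CHECKSUM_BYTES
    return total_size, total_cksum


if __name__ == "__main__":
    size, cksum = mirror_overhead(sys.argv[1] if len(sys.argv) > 1 else ".")
    percent = 100.0 * cksum / size if size else 0.0
    print(f"files: {size} bytes, checksums: {cksum} bytes ({percent:.2f}%)")
```

Called with the top directory of a mirror as its argument, it prints the combined size and the percentage the checksum files would add.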
May the Source be with you.