
RFDiscussion: Big Packages.gz, Statistics and Comparing Solutions



Hi,

[Sorry for breaking the thread; my POP3 provider stopped working.]
[Please Cc: me! <zhaoway@public1.ptt.js.cn>. Sorry! ;-)]

1. RFDiscussion on big Packages.gz

1.1. Some statistics

% grep-dctrl -P -sPackage,Priority,Installed-Size,Version,Depends,Provides,Conflicts,Filename,Size,MD5sum -r '.*' ftp.jp.debian.org_debian_dists_unstable_main_binary-i386_Packages | gzip -9 > test.pkg.gz
% gzip -9 ftp.jp.debian.org_debian_dists_unstable_main_binary-i386_Packages 
% ls -alF *.gz
-rw-r--r--    1 zw       zw        1157494 Jan  7 21:20 ftp.jp.debian.org_debian_dists_unstable_main_binary-i386_Packages.gz
-rw-r--r--    1 zw       zw         341407 Jan  7 21:23 test.pkg.gz
% 

This approach is simple, straightforward and almost compatible. Yet it
could accommodate 10K more packages coming into Debian with little loss.
Worth considering, IMHO.

Better still, `Description:' etc. could go into a separate gzipped file
shipped alongside the Debian package.
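
As a rough check on the sizes listed above, here is a small sketch in shell
arithmetic. It only reuses the two file sizes from the `ls' output and the
61 bytes per package figure from section 1.2 below; the variable names are
made up for illustration.

```shell
old_size=1157494   # full Packages.gz, from the ls output
new_size=341407    # test.pkg.gz, trimmed fields only
per_pkg_new=61     # bytes per package in the new format (figure from 1.2)

# Size of the trimmed index as a percentage of the full one (integer division).
ratio=$((new_size * 100 / old_size))

# Trimmed index after 10000 more packages, at 61 bytes each.
grown=$((new_size + 10000 * per_pkg_new))

echo "trimmed/full: ${ratio}%"                           # trimmed/full: 29%
echo "trimmed index with 10000 more packages: ${grown}"  # 951407
```

Under these figures the trimmed index stays below the size of today's full
index even with 10000 more packages in it.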

1.2. Little math

Suppose: 1) Site A gets K hits of `apt-get update' per day. With each day
            that passes, M extra hits are added, as Debian grows more popular.
         2) N new packages come into Debian every day. After `gzip -9',
            each contributes 206 bytes to the old package index file, and
            61 to the new format index file. The current number of packages
            is P.
         3) X is the number of days passed (the X axis).
         4) B is the size in bytes of the data flow for `apt-get update' on
            that day, on the server side. (On the client side, K = 1, M = 0.)

  B = (K + M*X) * (P + N*X) * 206       is for old format package index
  B = (K + M*X) * (P + N*X) * 61        is for new format package index
  
[It's still a function of X^2 either way, so in theory it's not a big deal. ;-)]
[That would only change if we could eliminate the need for the package index. That is possible.]

  For K = 500, P = 6000, X = 0, the server-side B is:
  zw@q ~/tmp % echo $((6000*500*206))
  618000000
  zw@q ~/tmp % echo $((6000*500*61))
  183000000
  zw@q ~/tmp % 
  
[Caches could help servers a great deal in such cases, though.]
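
To look at days other than X = 0, the formula can be wrapped in a small
shell function. This is only a sketch: the name `daily_bytes' is made up,
and the values M = 1 and N = 10 in the second pair of calls are illustrative
guesses, not measurements. It needs a shell with 64-bit arithmetic, since
the results pass 2^31.

```shell
# daily_bytes K M P N X BYTES_PER_PKG
# Evaluates B = (K + M*X) * (P + N*X) * bytes-per-package.
daily_bytes() {
    echo $(( ($1 + $2 * $5) * ($3 + $4 * $5) * $6 ))
}

# X = 0 reproduces the two figures above.
daily_bytes 500 0 6000 0 0 206    # 618000000
daily_bytes 500 0 6000 0 0 61     # 183000000

# One year on, with M = 1 and N = 10 (guesses, for illustration only).
daily_bytes 500 1 6000 10 365 206
daily_bytes 500 1 6000 10 365 61
```

Whatever M and N turn out to be, the two results always differ by the same
factor 206/61, which is why the new format does not change the X^2 growth.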
 
2. Comparison with the DIFF and RSYNC methods for APT

2.1. They need server support. (More than a change of directory layout and
     client tools.)

2.2. If you don't update for a long time, DIFF won't help, and RSYNC helps
     less.

3. Additional benefits

Separating changelog.Debian, `Description:' etc. out into a meta-info file
could help users: 1) reduce the bandwidth consumed, and 2) make their
upgrade decisions more easily.
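
One way the `Description:' part of such a meta-info file might be produced
from an existing Packages file is sketched below. It uses plain awk as a
stand-in rather than grep-dctrl, the function name `split_meta' is made up,
and it assumes the usual layout: stanzas separated by blank lines, with
continuation lines starting with a space or tab. changelog.Debian is not
covered, since it is not part of the Packages file.

```shell
# split_meta FILE
# Prints only the Package: and Description: fields of each stanza,
# including the continuation lines of the long description.
split_meta() {
    awk '
        /^$/                      { print ""; keep = 0; next }  # stanza separator
        /^[ \t]/                  { if (keep) print; next }     # continuation line
        /^(Package|Description):/ { keep = 1; print; next }     # wanted field
                                  { keep = 0 }                  # any other field
    ' "$1"
}
```

Used as `split_meta Packages | gzip -9 > meta.gz', where both file names are
only examples.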

-- 
echo <<EOF |cpp - -|egrep -v '(^#|^$)'
/*   =|=X ++
 *   /\+_ p7 <zw@debian.org> */
EOF


