
Re: proposal for a more efficient download process



hi

by quite a coincidence, while you people were discussing this idea, I was 
already implementing it, in a package called 'debdelta' : see
 http://lists.debian.org/debian-devel/2006/05/msg03120.html

Moreover, by some telepathy :-) , I already included features you were
proposing, and addressed problems you were discussing
(and other problems you were not discussing, since you did not
try implementing it  :-) 

Here are the replies:

To curt manucredo : while the implementation is not exactly what you
were suggesting in your original email, it still achieves all desired
goals; moreover, it is alive and kicking.

'debdelta' differs from your implementation in this respect:
- it does not use dpkg-repack (for many good reasons, see below)
- it recreates the new .deb , and guarantees that it is equal to the 
  one in archives, so archive signatures can be verified;
  currently it does not patch into the filesystem 
  (although this would be an easy adaptation, if anybody wishes for it)

'debdelta' conforms to your requests, in that 
- it can recreate the new .deb, using either the installed version of
 the old .deb or the old .deb file.

On the bright side, everything is already working, there is already
a repository of patches available, and a method of downloading them.

To Tyler MacDonald :
 - 'debdelta' uses 'bsdiff' , or 'xdelta' as a fallback, see below
 - regarding this:
> Some work will have to go into the math to determine when it's
> actually more efficient to download the latest archive, etc.... just a
> fleeting mental note, the threshold should not be 100% of the full archives
> size, it should be 90 or 80% due to the CPU/RAM overhead of patching and the
> bandwidth/latency overhead of requesting multiple patch files vs. one
> stream of data.
This math must go on the client side, and it is in my TODO list
(see the end of the README); it is a nice exercise in Dynamic Programming.

Anyway, currently the archive discards deltas that exceed ~50% of the
new .deb, just as a heuristic, and to keep disk usage low.
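
[A minimal sketch of the size heuristic described above, not the actual
archive code. The function name and the 'max_ratio' parameter are made up
for illustration; the 288 and 42388 figures are the 'ls -s' sizes of the
tetex-doc example in this message, and 30000 is a hypothetical delta size.]

```python
def worth_keeping(delta_size: int, new_deb_size: int, max_ratio: float = 0.5) -> bool:
    """Return True if the delta is small enough, relative to the new .deb, to keep.

    max_ratio=0.5 mirrors the "~50% of the new .deb" rule of thumb; it is an
    illustrative parameter, not a real debdelta option.
    """
    if new_deb_size <= 0:
        raise ValueError("new_deb_size must be positive")
    return delta_size <= max_ratio * new_deb_size


# Sizes in 1K blocks, as printed by 'ls -s'.
print(worth_keeping(288, 42388))    # True: the delta is well under half the .deb
print(worth_keeping(30000, 42388))  # False: hypothetical delta larger than 50%
```

A client could apply the same test with a higher ratio (for example the
80-90% suggested in the quote above) when deciding whether to fetch a delta
or the full .deb.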

To Goswin von Brederlow :
>| bsdiff is quite memory-hungry. It requires max(17*n,9*n+m)+O(1)

Ah, so this is the correct formula! The man page just says '17*n'.

But in my stats that is not the case: they estimate that the memory
used is about '12*n', so that is what I use.

>| bytes of memory, where n is the size of the old file and m is the
>| size of the new file. bspatch requires n+m+O(1) bytes.
> That is quite unacceptable. We have debs in debian up to 160Mb

'debdelta' has an option '-M ' to choose between 'xdelta' and 'bsdiff' ;
by default, it uses 'xdelta' when memory usage would exceed 50Mb ;
but in the server, I set '-M 200' since I have 1GB RAM there.
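
[A rough sketch of the backend choice described above, assuming the
decision is made by comparing an estimated bsdiff memory footprint with the
limit given in MB. All names are hypothetical and this is not debdelta's
actual code; the factor 12 is the empirical estimate mentioned earlier, and
the documented bound max(17*n, 9*n+m) is included for comparison.]

```python
MB = 1024 * 1024


def bsdiff_memory_bound(old_size: int, new_size: int) -> int:
    """Documented worst case for bsdiff, ignoring the O(1) term."""
    return max(17 * old_size, 9 * old_size + new_size)


def bsdiff_memory_estimate(old_size: int, factor: int = 12) -> int:
    """Empirical estimate (about 12*n) quoted in this message."""
    return factor * old_size


def choose_backend(old_size: int, max_memory_mb: int = 50) -> str:
    """Pick 'bsdiff' if the estimate fits within the limit, else 'xdelta'."""
    if bsdiff_memory_estimate(old_size) <= max_memory_mb * MB:
        return "bsdiff"
    return "xdelta"


print(choose_backend(4 * MB))        # bsdiff: 48 MB estimated, limit 50 MB
print(choose_backend(16 * MB))       # xdelta: 192 MB estimated, limit 50 MB
print(choose_backend(16 * MB, 200))  # bsdiff: 192 MB estimated, limit 200 MB
```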
 
> Seems to be quite useless for patching full debs. One would have to
> limit it to a file-by-file approach.

This is in my TODO list. Actually, I have in mind a scheme to
break TARs at suitable points, I have to check if it is 
worthwhile ; I can discuss details.

To: Tyler MacDonald again:
>	True.. It'd probably only be efficient if the deltas were based on
> the contents of the .deb's before they're packed.

Absolutely true... and this is the reason why I do not use dpkg-repack:
why repack the data when I need them unpacked ?   :-)

Look at this:

$ ls -s tetex-doc_3.0-17_all.deb tetex-doc_3.0-18_all.deb
 42388 tetex-doc_3.0-18_all.deb 42340 tetex-doc_3.0-17_all.deb

$ bsdiff tetex-doc_3.0-17_all.deb tetex-doc_3.0-18_all.deb brutal.bsdiff
$ ls -s brutal.bsdiff
 10092 brutal.bsdiff            

Hat tip to 'bsdiff', but we can do better...

$ ar p tetex-doc_3.0-17_all.deb data.tar.gz | zcat >  /tmp/17.tar
$ ar p tetex-doc_3.0-18_all.deb data.tar.gz | zcat >  /tmp/18.tar
$ ls -s /tmp/17.tar /tmp/18.tar

53532 /tmp/17.tar  53580 /tmp/18.tar

$ time bsdiff /tmp/17.tar /tmp/18.tar /tmp/tar.bsdiff

times: 
 real    2m4.994s user    2m3.947s
memory:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9784 debdev    25   0  471m 470m 1384 T  0.0 46.5   1:18.82 bsdiff
size:
  92 /tmp/tar.bsdiff 

so as you see, the reduction in size is impressive, 
but it uses too much memory  and takes too much time.

$ time xdelta delta -m 50M -9  /tmp/17.tar /tmp/18.tar /tmp/tar.xdelta
times:
 real    0m1.728s user    0m1.660s
memory: it finishes too fast to measure
size:
  236 /tmp/tar.xdelta

still good enough for our goal

----

Comparing to the above

$ ls -s pool/main/t/tetex-base/tetex-doc_3.0-17_3.0-18_all.debdelta

288 pool/main/t/tetex-base/tetex-doc_3.0-17_3.0-18_all.debdelta

(the extra 35kB are the script that 'debpatch' uses  :-( 
 actually, I told 'debdelta' to use 'bzip' instead of gzip
 in such cases, but it did not... just found another bug :-)  )

To:  Marc 'HE' Brockschmidt <he@ftwca.de>:
> Now the interesting questions: How many diffs do you keep?

very few, currently, due to space constraints; moreover, suppose that

 you have a_1.deb installed, a_1_2.debdelta and a_2_3.debdelta are in
 the pool of deltas, and you want to upgrade to a_3.deb

This would work if done by hand, just doing
$ debpatch  a_1_2.debdelta / /tmp/a_2.deb
$ debpatch  a_2_3.debdelta  /tmp/a_2.deb  /tmp/a_3.deb

but 'debdelta-upgrade' is currently unable to exploit this situation;
so I keep only one delta for each deb
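
[A sketch of how a client might exploit such chains; this is not something
'debdelta-upgrade' does. It assumes the pool can be described as a mapping
from (old version, new version) to delta size, and it uses Dijkstra's
shortest-path search rather than a dynamic-programming formulation. All
names, the sizes in the example, and the 0.8 threshold are illustrative
assumptions.]

```python
import heapq


def cheapest_delta_chain(deltas, installed, target):
    """Find the chain of deltas with the smallest total download size.

    deltas maps (old_version, new_version) -> delta size.
    Returns (total_size, [installed, ..., target]) or None if no chain exists.
    """
    graph = {}
    for (old, new), size in deltas.items():
        graph.setdefault(old, []).append((new, size))

    queue = [(0, installed, [installed])]
    done = set()
    while queue:
        cost, version, path = heapq.heappop(queue)
        if version == target:
            return cost, path
        if version in done:
            continue
        done.add(version)
        for nxt, size in graph.get(version, []):
            if nxt not in done:
                heapq.heappush(queue, (cost + size, nxt, path + [nxt]))
    return None


def plan_upgrade(deltas, installed, target, full_deb_size, threshold=0.8):
    """Use the delta chain only if it costs at most threshold * full .deb size."""
    chain = cheapest_delta_chain(deltas, installed, target)
    if chain is not None and chain[0] <= threshold * full_deb_size:
        return "deltas", chain[1]
    return "full", [target]


# Hypothetical pool: a_1_2.debdelta and a_2_3.debdelta, sizes made up.
pool = {("1", "2"): 100, ("2", "3"): 150}
print(plan_upgrade(pool, "1", "3", full_deb_size=1000))  # ('deltas', ['1', '2', '3'])
print(plan_upgrade(pool, "1", "3", full_deb_size=300))   # ('full', ['3'])
```

The threshold below 1.0 reflects the point quoted earlier in this message:
patching costs CPU, RAM and extra requests, so a chain that saves only a few
percent is probably not worth it.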

>  How do you
> integrate this approach with the minimal security Release files give us
> today?

recreated debs are identical to the originals in the archive, so the
existing checksums and signatures still apply.
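
[A small sketch of checking a recreated .deb against the checksum published
for the original. The function names are hypothetical, the expected digest
would come from the archive's package index, and the choice of SHA-256 is
an assumption made for illustration.]

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so that large .debs are not loaded into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_archive(path, expected_hex, algorithm="sha256"):
    """True if the file at 'path' has the digest listed for the archive copy."""
    return file_digest(path, algorithm) == expected_hex.lower()
```

If the recreated file differed from the archive copy by even one byte, this
check would fail, which is why recreating the .deb exactly matters.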

Currently the best way to use my package is:

$ apt-get update
$ su nobody -c debdelta-upgrade 
$ mv /tmp/archives/*deb  /var/cache/apt/archives
$ apt-get upgrade

(By default , debdelta-upgrade puts the resulting .deb in /tmp/archives;
 use --dir to your taste, though )

As you see, I suggest running debdelta-upgrade as a non-root user, since
it is still in development. 

> What about the kind of signatures dpkg-sig provides?  

Those are supported.
  'debdelta' reproduces everything it sees in the .deb file,
treating it as an 'ar' archive (although it is not exactly an 'ar'
archive, since 'ar' adds a '/' in the header, while 'dpkg' does not);
it just treats control.tar.gz and data.tar.gz in a smarter way.
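
[To illustrate the 'ar' layout mentioned above, here is a minimal reader
for the common ar format: an 8-byte magic string followed by 60-byte member
headers. It is a sketch, not debdelta's code; the function names are made
up, and GNU-style long file names are not handled.]

```python
import io

AR_MAGIC = b"!<arch>\n"


def list_ar_members(stream):
    """Yield (raw_name, size) for each member of an ar archive.

    'stream' must be a seekable binary file object. The raw name is
    returned unchanged, so a trailing '/' (if any) stays visible.
    """
    if stream.read(8) != AR_MAGIC:
        raise ValueError("not an ar archive")
    while True:
        header = stream.read(60)
        if not header:
            break
        if len(header) < 60 or header[58:60] != b"`\n":
            raise ValueError("corrupt ar member header")
        name = header[0:16].decode("ascii").rstrip(" ")
        size = int(header[48:58].decode("ascii").strip())
        yield name, size
        # Member data is padded to an even length.
        stream.seek(size + (size % 2), io.SEEK_CUR)


def normalize_member_name(raw_name):
    """Strip the trailing '/' so both naming styles compare equal."""
    return raw_name.rstrip("/")


# Usage on a real file (the file name is hypothetical):
# with open("some_package.deb", "rb") as f:
#     for raw_name, size in list_ar_members(f):
#         print(normalize_member_name(raw_name), size)
```

Keeping the raw name around matters here: to recreate a byte-identical
.deb, the headers have to be written back exactly as they were found.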


----- other FAQ I made up for you

Q: What about .debs where the data part is compressed with bzip ?

A: currently it is unsupported (I never found one :-) ),
  but I did write some code to support it.


Q: can 'debpatch' recreate the new .deb using the installed old .deb, even when
  -  there are dpkg-diversions ?
  - conf files were modified ?

A: yes, yes.


Q: can 'debpatch' recreate the new .deb using the installed old .deb, 
  when 'prelink' is used in the host?

A: currently, no.

a.

-- 
Andrea Mennucc


