[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Proposal: A new approach to differential debs

On Sun, Aug 13, 2017 at 10:53:16AM -0400, Peter Silva wrote:
> You are assuming the savings are substantial.  That's not clear.  When
> files are compressed, if you then start doing binary diffs, well it
> isn't clear that they will consistently be much smaller than plain new
> files.  it also isn't clear what the impact on repo disk usage would
> be.

The suggestion is for it to be opt-in, so packages can optimize their file
layout when opting in - for example, the gzipped changelogs and stuff can be
the rsyncable ones (this might need some changes in debhelper to generate
diffable files if the opt-in option is set).

> The most straigforward option:
> The least intrusive way to do this is to add differential files in
> addition to the existing binaries, and any time the differential file,
> compared to a new version exceeds some threhold size (for example:
> 50%) of the original file, then you end up adding the sum total of the
> diff files in addition to the regular files to the repos.  I haven't
> done the math, but it's clear to me that it ends up being about double
> the disk space with this approach. it's also costly in that all those
> files have to be built and managed, which is likely a substantial
> ongoing load (cpu/io/people)  I think this is what people are
> objecting to.

I would throw away patches that are too large, obviously. I think
that twice the amount of space seems about reasonable, but then we'll
likely restrict that to

(a) packages where it's actually useful
(b) to updates and their base distributions

So we don't keep oldstable->stable diffs, or unstable->unstable
diffs around. It's questionable if we want them for unstable at
all, it might just be too much there, and we should just use them
for stable-(proposed-)updates (and point releases) and stable/updates;
and experimental so we don't accidentally break it during the
dev cycle.

> A more intrusive and less obvious way to do this is to use zsync (
> http://zsync.moria.org.uk/. ) With zsync, you build tables of content
> for each file, using the same 4k blocking that rsync does. To handle
> compression efficiently, there needs to be an understanding of
> blocking, so a customized gzip needs to be used.  With such a format,
> you produce the same .deb's as today, with the .zsyncs (already in
> use?) and the author already provides some debian Packages files as
> examples.  The space penalty here is probably only a few percent.
> the resource penalty is one read through each file to build the
> indices, and you can save that by combining the index building with
> the compression phase.  To get differential patches, one just fetches
> byte-ranges in the existing main files, so no separate diff files
> needed.  And since the same mechanisms can (should?) be used for repo
> replication, the added cost is likely a net savings in bandwidth
> usage, and relatively little complexity.
> The steps would be:
>    -- add gzip-ware flavour to package creation logic
>    -- add zsync index creation on repos. (potentially combined with first step.)
>    -- add zsync client to apt-* friends.
> To me, this makes a lot of sense to do just for repo replication,
> completely ignoring the benefits for people on low bandwidth lines,
> but it does work for both.

It does not work for client use. apt by default automatically deletes
packages files after a successful install, and upgrades are fairly infrequent
(so an expiration policy would not help much either); so we can't rely on old
packages to be available for upgradexs.

Debian Developer - deb.li/jak | jak-linux.org - free software dev
                  |  Ubuntu Core Developer |
When replying, only quote what is necessary, and write each reply
directly below the part(s) it pertains to ('inline').  Thank you.

Reply to: