Re: New CD image creation tool
Preface: VERY interesting piece of work! I've added some comments below.
(I read the thing top-down and inserted comments as I went, so some of
them turned out to be a bit redundant.)
On Sat, 16 Dec 2000, Richard Atterer wrote:
> this is my first post here - I've been lurking for some time, though.
Don't feel ashamed -- according to http://lists.debian.org/stats/ we have
nearly 600 lurkers here. (m-w.com: lurk: to be concealed but capable of being
discovered; specifically: to constitute a latent threat ;-)
> My own experience with the pseudo image kit
> I used the pseudo image kit about a week after Potato rev0 had been
> released in October, to download the three i386 and the first arm CD
> image. For various reasons I couldn't use Linux, so I tried WinNT and
> Solaris. The main problems I encountered:
> - rsync fetched around 200MB per CD image, in comparison to the
> "advertised" 6MB. If you consider that an awful lot of checksum info
> is also transferred during the rsync, this is completely
> unacceptable and counter-productive to the goal of the whole scheme.
Last week I pseudo-imaged/rsynced all i386 and source images, and the
rsync traffic was always less than 10MB. IIRC one of the source images even
needed as little as 400kB.
> - rsync servers are slow. The German mirror needed >2 hours per CD
> image, cdimage.debian.org in the UK was more than twice as fast.
A long time ago I did a bit of work on a caching rsync proxy that would
overcome these problems. It is possible, but a bit complicated. I haven't had
the time or incentive to work on it further for a long while, but if someone
knows how to program sockets, selects with timeouts, hashing and locking or
multithreading, I can fill him/her in on the details I already know.
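To give would-be volunteers an idea: the select-with-timeout part is the easy
bit. A minimal sketch in Python (function name and timeout are my own
invention, not anything from the existing proxy code):

```python
import select
import socket

def wait_readable(sock, timeout):
    """Return True when `sock` becomes readable within `timeout` seconds,
    False on timeout -- the pattern a caching proxy needs so that a
    stalled upstream rsync server cannot hang the whole process."""
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable)
```

The hashing/locking and multithreading parts are where the real complexity
lives; this just shows that the socket side is routine.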
> I'm sorry to say this, but the current pseudo image kit, together with
> the relatively complicated instructions to make it work, are the best
> way of ensuring that everyone except for the most determined people
> will give up and just use another Linux distribution, with ISO images
> readily available!
Frankly, I don't have many problems with that. I'm somewhat afraid that if we
provided some very fancy graphical-and-all CD downloading tool, many
people would give up right _after_ downloading the CDs, and I'd rather
have that happen _before_. Not only to save (much!) bandwidth, but also to
prevent unfounded criticism of the distribution itself.
> Why not to use rsync
> - rsync is not designed for "one server, many clients", it's too CPU
> intensive. (A special "server version" with cached checksums etc.
> might improve that, but who is going to write it?)
See above ;-)
> New proposed scheme
> [Of course, I'm very open to discussion about this!]
> My basic idea is to create a kind of "binary diff" between the package
> data and the ISO image, and to store that on Debian mirrors, ready for
> download via HTTP or FTP.
> In detail, this is how it might work:
> --- 1 ---
> Someone uses debian-cd to create an ISO image. However, instead of the
> actual image, mkhybrid creates a diff-like file, which consists of
-- (a heavily patched version, you mean)
> entries like:
> Copy xxx bytes directly to output file [followed by data]
> Insert contents of a file with checksum xxx here. 512k chunks
> of the file have the checksums xxx, yyy, ...
> Subdividing the files into fixed-size chunks allows for detecting
> early if you're downloading the wrong file, resuming from an
> interrupted download, even for concurrent download of parts of the
> same file from different servers.
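(Side note: the per-chunk checksums are cheap to compute at image-creation
time. A sketch, with MD5 as a stand-in for whatever checksum is actually
chosen and the proposed 512k chunk size:)

```python
import hashlib

CHUNK = 512 * 1024  # 512k chunks, as proposed

def chunk_checksums(data):
    """Whole-file checksum plus one checksum per 512k chunk.

    The chunk list is what lets a client resume an interrupted download,
    or fetch different parts of the same file from different servers.
    MD5 is only a placeholder for the real choice of checksum."""
    whole = hashlib.md5(data).hexdigest()
    chunks = [
        hashlib.md5(data[i:i + CHUNK]).hexdigest()
        for i in range(0, len(data), CHUNK)
    ]
    return whole, chunks
```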
> This "diff" file is not intended to be human-readable.
Actually, I would very much like it to be human-readable, except for the
"literal data" parts of course. That would give me (and other people) a better
idea of what's going on (and that there's nothing secret/mysterious about it),
and anyway ...
> [I've already had a look at the mkhybrid sources; implementing this
> does not seem to be a lot of work.]
> The file is gzipped and put on the Debian servers.
..it shouldn't make much difference when gzipped. Actually, you might even
try to ??encode the literal data parts and see if that would make much
difference when gzipped. (And/or bzip2'd)
> --- 2 ---
> A second, human-readable file is created. Apart from a reference to
> the above file, it contains information about which checksums map to
> which filenames, and a list of mirrors. This is what it might look like:
> name: Debian GNU/Linux 2.2 r35 _Potato_ - ...
(I surely hope less than 35 are needed ;-)
> diff: ftp://ftp.debian.org/pub/debian/cd-images/diffs/cd-diff-2.2r35.gz
- Maybe 'diff' is the wrong name here, but I can't think of a better one at
  the moment.
- Don't mention a server here, since the cd-diff-or-whatever should be
mirrored as widely as possible. Or.. do you mean that this file is
mirror-specific? (Would get quite complex for mirror maintainers.)
> outputname: binary-i386-1.iso
> outputhash: 12345678
> debian: ftp://ftp.leo.org/pub/comp/os/unix/linux/Debian/debian/
> nonUS: http://some.mirror.net/debian-nonUS/
> nonUS: ftp://ftp.debian.org/pub/debian/non-US/
You can also have README.mirrors and README.non-US downloaded automagically
and parse those for the correct info.
> # Either indirection through mirrored "server dir":
> de0281a4: debian:potato/r35/Contents.i386.gz
> d02a49b7: nonUS:main/binary-i386/ssh_1.2.3_i386.deb
> # ...or directly insert mirrored URL:
> d02a49b7: http://f.net/debian-nonUS/main/binary-i386/ssh_1.2.3_i386.deb
> # 2 different leafnames for same file (checksum is the same):
> e1ee7000: directoryA:foo-0.1.tar.gz
> e1ee7000: directoryB:foo-0.1.tgz
> An initial version of the [parts] section is also output by mkhybrid.
> This file is also gzipped and put on the Debian server.
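Resolving those entries against the mirror list is straightforward; a sketch
of what the client might do (function and variable names are mine, and the
duplicate-checksum case from the example simply yields alternative URLs):

```python
def resolve_parts(mirrors, parts):
    """Expand checksum->path entries into candidate download URLs.

    `mirrors` maps a prefix like "debian" or "nonUS" to a list of base
    URLs; `parts` is a list of (checksum, path) pairs. A path that is
    already a full URL is passed through unchanged; a checksum listed
    twice (same file under two leafnames) accumulates alternatives."""
    urls = {}
    for checksum, path in parts:
        if "://" in path:
            candidates = [path]
        else:
            prefix, rest = path.split(":", 1)
            candidates = [base + rest for base in mirrors[prefix]]
        urls.setdefault(checksum, []).extend(candidates)
    return urls
```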
> --- 3 ---
> Anyone wishing to download the CD image only needs to tell a
> yet-to-be-written tool the location of the file from part 2 above.
> Using some heuristics to choose one or more servers to download from,
> it can fetch the parts and assemble the ISO image without further
> user intervention.
> Advantages of this scheme:
> - CD images need very little space on any server. This allows for
> weekly snapshots, people releasing personal versions, etc. etc.
> - For the person downloading the data, it doesn't temporarily need
> twice the amount of disc space of the final image.
> - By querying servers before it starts to download, the tool can
> determine whether all files are actually available.
You mean, downloading an ls-lR? Or querying for each individual file? (The
latter wouldn't be advantageous since _if_ they exist we'll be downloading
them later anyway.)
Maybe you're thinking about Packages, but that won't work nicely any more when
using package pools.
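In any case, per-file querying needn't mean downloading anything: one HTTP
HEAD (or FTP SIZE) per file is cheap compared to the files themselves. A
sketch of the pre-flight check, with the actual probe injected as a callable
so the transport can be swapped (all names are hypothetical):

```python
def missing_parts(urls_by_checksum, exists):
    """Return the checksums for which no candidate URL is reachable.

    `urls_by_checksum` maps checksum -> list of URLs; `exists(url)` would
    be an HTTP HEAD or FTP SIZE probe in the real tool. Detecting holes
    before fetching a single byte is the point of the check."""
    return sorted(
        checksum
        for checksum, urls in urls_by_checksum.items()
        if not any(exists(u) for u in urls)
    )
```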
> - It avoids all the problems associated with rsync
> - If done properly, still has all the advantages of the old kit, e.g.
> you can interrupt the download, choose a different server at any
> time, etc.
> Big disadvantage: If for whatever reason one or more files can't be
> fetched, the user ends up with an incomplete ISO image. (In that case,
> we'd probably still have to revert to rsync.) But one should make sure
> that this situation will never arise under normal circumstances. In
> any case, this error can be detected before the download begins.
Actually, with the package pools we might be able to solve that problem; it's
quite easy to have an "old-stable" distribution that always contains all
packages needed for the latest CD images.
Other thing: this patched mkhybrid/mkisofs should know that things like
Packages(.gz) are CD-specific, and include them as literal data in the patch.
That's probably only detectable by checking the number of hard links, or by
checking for a non-symlink in the symlink-farm case.
> Implementation of this scheme
> These are my requirements for the language used to implement the
> client-side download tool:
> 1) Should run on as many platforms as possible
> 2) Should support both GUI facilities and command-line-only operation
> 3) Should provide libraries for socket programming
> 4) Should not require the user to install additional software to run
I'd say 4) isn't a requirement. We'll ship binaries for Win platforms anyway,
and on other platforms people will know how to type `make'.
> I initially thought of Tcl/Tk, like Anne, but IMNSHO Tcl is a really
> horrible language (apologies to any Tcl fans out there!), and I doubt
> it provides networking support. Maybe Perl+Tk might work, but
> including a Perl interpreter would make the client quite large to
> download.
> The only candidate that can satisfy all 4 requirements is Java.
> Initially, I thought I had found the perfect solution: A Java 1.1
> applet running in the user's web browser is just what we need. Very
> unfortunately, this doesn't seem to work AFAICT, because it's
> difficult for an applet to write to any files:
> - In Netscape 4.x, the applet needs to be signed to write to files.
> Does anybody know how much Netscape/VeriSign/... want for a
> certification signature?
> - What about Mozilla/Netscape 6?
> - MSIE has a different security scheme. I have no idea whether it's
> possible to allow an applet to write to disc - anybody? I /believe/
Oh I think so -- It's not a bug, it's a _feature_!! ;-))
> that Java 1.1 stuff works in MSIE - is this correct or are there any
> important incompatibilities?
> Unless someone comes up with a solution to the browser security
> problem, I'll just write the program as a standalone Java application.
> This violates my 4th requirement (users will have to download and
> install the Java runtime), but I can see no alternatives.
...which is non-free? (I know absolutely nothing about Java, please bear
with me ;-)
And how about a Windows runtime env.? Availability, size?
Any important platforms that do _not_ have a free(!!) Java runtime env.?
> BTW, I'm not a Debian developer, but if I ever get as far as
> implementing all of the above, I want to become one.
Then you should start your New Maintainer application right now, since it
might well take a long long time to get in ;-)