
New CD image creation tool


This is my first post here - I've been lurking for some time, though.

On Fri, Dec 15, 2000 at 12:21:14AM +0100, J.A. Bezemer wrote:
> Tcl/Tk is available on cygwin, so it should be possible to have a
> portable graphical front-end for the Kit to be used by lazy people. 
> But since I don't have any experience in writing graphical things
> and/or using Tcl/Tk (and don't plan to get that experience anytime
> soon), there won't be such a thing unless someone else writes it. I
> think the main problem would be interfacing with the shell scripts
> and executables like rsync (capturing and processing their output),
> but I wouldn't mind anyone adapting things for that purpose, or even
> writing something better entirely from scratch. The concept of using
> .list files to create a pseudo-image is very simple and there are
> endless possibilities to implement it. I consider the Kit in its
> current form to be just an example implementation; you could write
> an ftp/http retriever in C, add relevant bits of rsync and a simple
> textmode menu, and you'd probably have something more efficient and
> usable than what's available now. I'd be delighted to hear about
> such things ;-)

I have the intention of doing just that, i.e. to write a graphical
application from scratch which reduces the process of downloading
Debian CD images to a few mouse clicks.

Based on my own experience with the pseudo image kit, I've spent some
time thinking about how to best implement a download tool which
doesn't suffer from the problems of the current kit. IMHO, some
changes in the whole scheme would improve things a lot - see below.

This mail is a bit long, so I'll subdivide it into sections:

My own experience with the pseudo image kit

I used the pseudo image kit about a week after Potato rev0 had been
released in October, to download the three i386 and the first arm CD
image. For various reasons I couldn't use Linux, so I tried WinNT and
Solaris. The main problems I encountered:

- Downloading packages is very slow under NT (3 hours for one CD's
  worth of data, from a nearby mirror <ftp.leo.org> which will deliver
  around 8.5Mbps). I did run the kit full-screen.
- Downloading packages is veeeeeeeeeeeery slooooooooow under Solaris. 
  :-) I eventually noticed something was wrong, and traced it to
  "wc -c". Yes, Solaris wc /will/ read each and every byte before
  reporting the size... After that was fixed, the download was as fast
  as I expected.
- rsync fetched around 200MB per CD image, in comparison to the
  "advertised" 6MB. If you consider that an awful lot of checksum info
  is also transferred during the rsync, this is completely
  unacceptable and counter-productive to the goal of the whole scheme.
- rsync servers are slow. The German mirror needed >2 hours per CD
  image, cdimage.debian.org in the UK was more than twice as fast.
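
The "wc -c" slowdown described above can be avoided by asking the
filesystem for the size instead of reading the data. A minimal Python
sketch of that idea follows; the helper name file_size is made up for
illustration and is not part of the kit:

```python
import os
import tempfile

def file_size(path):
    """Return the size in bytes from file metadata, without reading the contents."""
    return os.stat(path).st_size

# Small demonstration with a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4096)
    demo_path = tmp.name

demo_size = file_size(demo_path)
os.remove(demo_path)
```

The cost of the stat call does not depend on the file length, whereas
a byte-counting read grows with every package already downloaded.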

I'm sorry to say this, but the current pseudo image kit, together with
the relatively complicated instructions to make it work, are the best
way of ensuring that everyone except for the most determined people
will give up and just use another Linux distribution, with ISO images
readily available!

Why not to use rsync

In my opinion, the main change to improve the speed, apart from a
better implementation of the HTTP/FTP fetcher, is to get rid of rsync. 
This is sad, because I think using rsync was a brilliant idea. 
However, in practice there are a few problems:

- The whole CD images need to be stored on the rsync server. Few
  people are ready to dedicate so much disc space, consequently there
  are too few rsync servers, and as a consequence of that, these are
  too slow.
  Additionally, from the point of view of a server admin it'll take
  much longer to set up an rsync server than to add another server to
  the list of servers to mirror.
- rsync is not designed for "one server, many clients", it's too CPU
  intensive. (A special "server version" with cached checksums etc.
  might improve that, but who is going to write it?)
- Conceptually, rsync is not designed for our application. It may
  sound strange, but rsync is too flexible! This flexibility is bought
  with additional overhead (both in terms of CPU and network).
- Being a "non-standard" protocol, rsync will not be let through most
  corporate firewalls. I guess quite a few people would prefer to
  download this amount of data at work...

I think the pseudo image kit should try to generate a 100% correct ISO
image itself first and (if at all) only fall back to rsync if that
fails.

New proposed scheme

[Of course, I'm very open to discussion about this!]

My basic idea is to create a kind of "binary diff" between the package
data and the ISO image, and to store that on Debian mirrors, ready for
download via HTTP or FTP.

In detail, this is how it might work:

--- 1 ---
Someone uses debian-cd to create an ISO image. However, instead of the
actual image, mkhybrid creates a diff-like file, which consists of
entries like:

        Copy xxx bytes directly to output file [followed by data]
        Insert contents of a file with checksum xxx here. 512k chunks
          of the file have the checksums xxx, yyy, ...

Subdividing the files into fixed-size chunks allows for detecting
early if you're downloading the wrong file, resuming from an
interrupted download, even for concurrent download of parts of the
same file from different servers.
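
A rough Python sketch of the per-chunk checking described above. The
proposal does not say which checksum is used, so SHA-256 here is purely
an assumption, and the function names chunk_checksums and
first_bad_chunk are hypothetical:

```python
import hashlib

CHUNK_SIZE = 512 * 1024  # 512k chunks, as in the proposal

def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """Return one hex digest per fixed-size chunk of `data`.

    SHA-256 is an assumption; the proposal leaves the algorithm open.
    """
    return [
        hashlib.sha256(data[off:off + chunk_size]).hexdigest()
        for off in range(0, len(data), chunk_size)
    ]

def first_bad_chunk(partial, expected, chunk_size=CHUNK_SIZE):
    """Check the complete chunks of a partial download against `expected`.

    Returns the index of the first mismatching chunk, or None if every
    complete chunk received so far matches.
    """
    complete = len(partial) // chunk_size
    for i in range(min(complete, len(expected))):
        chunk = partial[i * chunk_size:(i + 1) * chunk_size]
        if hashlib.sha256(chunk).hexdigest() != expected[i]:
            return i
    return None

# Tiny demonstration with 10-byte chunks instead of 512k.
original = b"a" * 10 + b"b" * 10 + b"c" * 5
sums = chunk_checksums(original, 10)
good_prefix = first_bad_chunk(original[:20], sums, 10)
wrong_file = first_bad_chunk(b"a" * 10 + b"x" * 10, sums, 10)
```

A mismatch in an early chunk means the download can be aborted long
before the whole file has arrived, and the number of verified chunks
gives the offset from which an interrupted download could resume.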

This "diff" file is not intended to be human-readable.
[I've already had a look at the mkhybrid sources; implementing this
does not seem to be a lot of work.]

The file is gzipped and put on the Debian servers.

--- 2 ---
A second, human-readable file is created. Apart from a reference to
the above file, it contains information about which checksums map to
which filenames, and a list of mirrors. This is what it might look
like:

    name: Debian GNU/Linux 2.2 r35 _Potato_ - ...
    diff: ftp://ftp.debian.org/pub/debian/cd-images/diffs/cd-diff-2.2r35.gz
    outputname: binary-i386-1.iso
    outputhash: 12345678

    debian: ftp://ftp.leo.org/pub/comp/os/unix/linux/Debian/debian/
    nonUS: http://some.mirror.net/debian-nonUS/
    nonUS: ftp://ftp.debian.org/pub/debian/non-US/

    # Either indirection through mirrored "server dir":
    de0281a4: debian:potato/r35/Contents.i386.gz
    d02a49b7: nonUS:main/binary-i386/ssh_1.2.3_i386.deb
    # ...or directly insert mirrored URL:
    d02a49b7: http://f.net/debian-nonUS/main/binary-i386/ssh_1.2.3_i386.deb
    # 2 different leafnames for same file (checksum is the same):
    e1ee7000: directoryA:foo-0.1.tar.gz
    e1ee7000: directoryB:foo-0.1.tgz

An initial version of the [parts] section is also output by mkhybrid.

This file is also gzipped and put on the Debian server.
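
To illustrate how the "label:path" indirection in the example file
could be resolved, here is a minimal Python sketch. It assumes the
mirror lines and the checksum lines have already been parsed into
separate structures; the function name resolve and the variable names
are hypothetical:

```python
def resolve(location, servers):
    """Expand a part location into candidate URLs.

    `location` is either a full URL or "label:relative/path"; `servers`
    maps each label to the list of mirror base URLs given for it.
    """
    if "://" in location:
        return [location]  # already a complete URL
    label, _, path = location.partition(":")
    return [base + path for base in servers.get(label, [])]

# Mirror labels and part entries taken from the example file.
servers = {
    "debian": ["ftp://ftp.leo.org/pub/comp/os/unix/linux/Debian/debian/"],
    "nonUS": ["http://some.mirror.net/debian-nonUS/",
              "ftp://ftp.debian.org/pub/debian/non-US/"],
}
parts = [
    ("de0281a4", "debian:potato/r35/Contents.i386.gz"),
    ("d02a49b7", "nonUS:main/binary-i386/ssh_1.2.3_i386.deb"),
    ("d02a49b7", "http://f.net/debian-nonUS/main/binary-i386/ssh_1.2.3_i386.deb"),
]

# Collect every candidate URL per checksum.
candidates = {}
for checksum, location in parts:
    candidates.setdefault(checksum, []).extend(resolve(location, servers))
```

With this data, the checksum d02a49b7 ends up with three candidate
URLs: one per nonUS mirror plus the directly listed one.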

--- 3 ---
Anyone wishing to download the CD image only needs to tell a
yet-to-be-written tool the location of the file from part 2 above.
Using some heuristics to choose one or more servers to download from,
it can fetch the parts and assemble the ISO image without further
user intervention.
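
The assembly step could be sketched in Python roughly as follows. The
entry layout ("copy" and "file" tuples), the fetch callback and the use
of SHA-256 are all assumptions made for illustration; the actual diff
format is not specified yet:

```python
import hashlib
import io

def assemble(entries, fetch, out):
    """Write the image described by `entries` to the binary stream `out`.

    Each entry is ("copy", data) for literal bytes stored in the diff, or
    ("file", checksum) for a file whose contents must be fetched.
    `fetch(checksum)` is a hypothetical callback returning the file's bytes.
    """
    for kind, value in entries:
        if kind == "copy":
            out.write(value)
        elif kind == "file":
            data = fetch(value)
            # Refuse to write data that does not match the expected checksum.
            if hashlib.sha256(data).hexdigest() != value:
                raise ValueError("checksum mismatch for " + value)
            out.write(data)
        else:
            raise ValueError("unknown entry type: " + kind)

# Demonstration with an in-memory "mirror" holding a single file.
payload = b"package contents"
digest = hashlib.sha256(payload).hexdigest()
store = {digest: payload}
entries = [("copy", b"HEADER"), ("file", digest), ("copy", b"PADDING")]

image = io.BytesIO()
assemble(entries, store.__getitem__, image)
```

Because the output is written strictly in order, only the final image
has to fit on disc; no intermediate copy of the packages is needed.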

Advantages of this scheme:

- CD images need very little space on any server. This allows for
  weekly snapshots, people releasing personal versions, etc. etc.
- The person downloading the data doesn't temporarily need twice the
  disc space of the final image.
- By querying servers before it starts to download, the tool can
  determine whether all files are actually available.
- It avoids all the problems associated with rsync.
- If done properly, it still has all the advantages of the old kit, e.g. 
  you can interrupt the download, choose a different server at any
  time, etc.

Big disadvantage: If for whatever reason one or more files can't be
fetched, the user ends up with an incomplete ISO image. (In that case,
we'd probably still have to revert to rsync.) But one should make sure
that this situation will never arise under normal circumstances. In
any case, this error can be detected before the download begins.
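
A minimal Python sketch of such a check before the download starts. The
probe is_available stands in for whatever HTTP/FTP query the tool would
really make, and all names and URLs below are hypothetical:

```python
def missing_parts(candidates, is_available):
    """Return the checksums for which no candidate URL is reachable.

    `candidates` maps checksum -> list of URLs; `is_available(url)` is a
    hypothetical probe that returns True if the server has the file.
    """
    return sorted(
        checksum
        for checksum, urls in candidates.items()
        if not any(is_available(url) for url in urls)
    )

# Demonstration with a fixed set of reachable URLs instead of real queries.
candidates = {
    "de0281a4": ["ftp://mirror-a.example/Contents.i386.gz"],
    "d02a49b7": ["http://mirror-b.example/ssh.deb",
                 "ftp://mirror-a.example/ssh.deb"],
}
reachable = {"ftp://mirror-a.example/ssh.deb"}
missing = missing_parts(candidates, reachable.__contains__)
```

If the returned list is non-empty, the tool can warn the user (or fall
back to rsync) before a single byte of package data is transferred.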

Implementation of this scheme

These are my requirements for the language used to implement the
client-side download tool:

1) Should run on as many platforms as possible
2) Should support both GUI facilities and command-line-only operation
3) Should provide libraries for socket programming
4) Should not require the user to install additional software to run

I initially thought of Tcl/Tk, like Anne, but IMNSHO Tcl is a really
horrible language (apologies to any Tcl fans out there!), and I doubt
it provides networking support. Maybe Perl+Tk might work, but
including a Perl interpreter would make the client quite large to
download.

The only candidate that can satisfy all 4 requirements is Java.

Initially, I thought I had found the perfect solution: A Java 1.1
applet running in the user's web browser is just what we need. Very
unfortunately, this doesn't seem to work AFAICT, because it's
difficult for an applet to write to any files:
- In Netscape 4.x, the applet needs to be signed to write to files. 
  Does anybody know how much Netscape/VeriSign/... want for a
  certification signature?
- What about Mozilla/Netscape 6?
- MSIE has a different security scheme. I have no idea whether it's
  possible to allow an applet to write to disc - anybody? I /believe/
  that Java 1.1 stuff works in MSIE - is this correct or are there any
  important incompatibilities?

Unless someone comes up with a solution to the browser security
problem, I'll just write the program as a standalone Java application. 
This violates my 4th requirement (users will have to download and
install the Java runtime), but I can see no alternatives.

Any comments on all this?

BTW, I'm not a Debian developer, but if I ever get as far as
implementing all of the above, I want to become one.



  __   _
  |_) /|  Richard Atterer
  | \/¯|  http://atterer.net
  ¯ ´` ¯
If they give you ruled paper, write the other way    -- Juan Ramón Jiménez
