
Re: New CD image creation tool

On Sun, Dec 17, 2000 at 01:14:06AM +0100, J.A. Bezemer wrote:
> > this is my first post here - I've been lurking for some time, though.
> Don't feel ashamed -- according to http://lists.debian.org/stats/ we
> have nearly 600 lurkers here. (m-w.com: lurk: to be concealed but
> capable of being discovered; specifically: to constitute a latent
> threat ;-)

Um... OK! :-]

[rsync fetched around 200MB per CD image]
> Last week I pseudo-imaged/rsynced all i386 and source images, and
> the rsync traffic was always less than 10MB. IIRC one of the source
> images even had as little as 400kB.

Hm - I did make an error specifying the mirror, so the non-US packages
weren't found. That would explain the additional traffic for *one* CD
image - but it happened for *all* of them!?!

[Current pseudo image kit might cause people to use another distribution]
> Frankly, I don't have many problems with that. I have some fear that
> if we would provide some very-fancy graphical-and-all CD downloading
> tool, many people would start giving up right _after_ downloading
> CDs, and I'd rather have that _before_. Not only to save (much!) 
> bandwidth, but also to prevent unfounded criticism of the
> distribution itself.

I don't agree! I'd like an easy download *and* easy installation *and*
an easily maintainable Linux system!-) Sure, you have to solve one of
these at a time, but since there already seems to be an effort to
improve the installer, why not address the CD download problem as
well?

[Proposed new scheme]

> > In detail, this is how it might work:
> > 
> > --- 1 ---
> > Someone uses debian-cd to create an ISO image. However, instead of the
> > actual image, mkhybrid creates a diff-like file, which consists of
>                 -- (a heavily patched version, you mean)

<fx: optimist mode>
I mean a new version of mkhybrid containing my patches. ;-)

> > entries like:
> > 
> >         Copy xxx bytes directly to output file [followed by data]
> >         Insert contents of a file with checksum xxx here. 512k chunks
> >           of the file have the checksums xxx, yyy, ...
> > 
> > Subdividing the files into fixed-size chunks allows for detecting
> > early if you're downloading the wrong file, resuming from an
> > interrupted download, even for concurrent download of parts of the
> > same file from different servers.
> > 
> > This "diff" file is not intended to be human-readable.
> Actually, I would very much like it to be human-readable, except for
> the "literal data" parts of course. That would give me (and other
> people) a better idea of what's going on (and that there's nothing
> secret/mysterious about it), and anyway ...

Yes, I also considered making it human-readable somehow - but with all
the literal data, it would look extremely messy even if you uu-encoded
that. In fact, that's what in the end led me to split up the
information into two parts; one binary part and one readable part
containing the info on where to get the packages. (At first, I wanted
to put everything into one file.)
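
To make the fixed-size chunk idea concrete, here is a minimal Python
sketch. It is only an illustration, not part of the proposal: SHA-256
as the checksum, the function names and the tiny chunk size in the
usage lines are all assumptions.

```python
import hashlib

CHUNK_SIZE = 512 * 1024  # 512k chunks, as in the proposal


def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """Return one hex digest per fixed-size chunk of data.

    SHA-256 is an assumption; the proposal does not name a checksum.
    """
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def verified_prefix(data, expected, chunk_size=CHUNK_SIZE):
    """Return how many leading bytes of data match the expected chunks.

    Works on a partial download: checking stops at the first chunk that
    is missing, incomplete or wrong, so the result is the offset from
    which the download can be resumed.
    """
    verified = 0
    for index, digest in enumerate(expected):
        chunk = data[index * chunk_size:(index + 1) * chunk_size]
        if not chunk or hashlib.sha256(chunk).hexdigest() != digest:
            break
        verified += len(chunk)
    return verified


# Usage with a deliberately tiny chunk size:
original = b"a" * 10 + b"b" * 10 + b"c" * 5
sums = chunk_checksums(original, chunk_size=10)
resume_at = verified_prefix(original[:17], sums, chunk_size=10)
```

A wrong file is detected after the first chunk (the verified prefix
is 0), and an interrupted download resumes at the last chunk boundary
that checked out.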

> > The file is gzipped and put on the Debian servers.
> ..it shouldn't make much difference when gzipped. Actually, you
> might even try to ??encode the literal data parts and see if that
> would make much difference when gzipped. (And/or bzip2'd)

I forgot to mention another goal: the tool should not be limited to
just Debian CDs; it should also work well for other, similar tasks
(ones where the "diff" contains lots of data). The tool
should know nothing about a CD image's layout, and that means that all
the zero-padding of files will have to go into the "diff".
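
As a rough sketch of a layout-agnostic client, the following Python
rebuilds an output file from the two kinds of entries described
above. The in-memory entry representation ("data"/"file" tuples), the
function name and the use of SHA-256 are assumptions made for the
example only; the real file format is not defined yet.

```python
import hashlib
import io


def apply_template(entries, files_by_checksum, out):
    """Write the reassembled image to the binary file object out.

    entries is a list of tuples (hypothetical representation):
      ("data", literal_bytes)  - copy bytes directly to the output
      ("file", checksum)       - insert the file with that checksum
    files_by_checksum maps a hex checksum to the downloaded contents.
    """
    for kind, value in entries:
        if kind == "data":
            out.write(value)
        elif kind == "file":
            content = files_by_checksum[value]
            # Never trust a download: verify before inserting.
            if hashlib.sha256(content).hexdigest() != value:
                raise ValueError("checksum mismatch for %s" % value)
            out.write(content)
        else:
            raise ValueError("unknown entry kind: %r" % (kind,))


# Usage: directory data and zero padding are plain literal data, so
# the code needs no knowledge of the image layout.
package = b"contents of some .deb"
checksum = hashlib.sha256(package).hexdigest()
entries = [
    ("data", b"DIRECTORY-DATA"),
    ("file", checksum),
    ("data", b"\0" * 4),
]
image = io.BytesIO()
apply_template(entries, {checksum: package}, image)
```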

We could introduce a special command "insert xxx zero bytes here", but
why not use compression instead? I think the image's directory data
will compress very well, too.
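
A quick way to check the claim about padding, using Python's gzip
module. The 2048-byte sector size and the amount of padding are just
example values.

```python
import gzip

# 100 sectors of zero padding (2048 bytes per sector is an example).
padding = bytes(2048 * 100)
compressed = gzip.compress(padding)

print(len(padding), "bytes of zeros ->", len(compressed), "bytes gzipped")
```

The compressed size is a tiny fraction of the input, so a dedicated
"insert zero bytes" command would save very little on top of gzip.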

> > --- 2 ---
> > A second, human-readable file is created. Apart from a reference to
> > the above file, it contains information about which checksums map to
> > which filenames, and a list of mirrors. This is what it might look
> > like:
> > 
> >     [info]
> >     name: Debian GNU/Linux 2.2 r35 _Potato_ - ...
> (I surely hope less than 35 are needed ;-)
> >     diff: ftp://ftp.debian.org/pub/debian/cd-images/diffs/cd-diff-2.2r35.gz
> - Maybe 'diff' is the wrong name here, but I can't think of a better one at
>   the moment. 

I don't like it, either - hmm... maybe "image template" sounds better? 
While we're at it, I'll call the human-readable file "location list"
from now on, until someone comes up with something better. ;)

> - Don't mention a server here, since the cd-diff-or-whatever should be
>   mirrored as widely as possible. Or.. do you mean that this file is
>   mirror-specific? (Would get quite complex for mirror maintainers.)

No, you're right. The line should read something like:

    diff: debian:cd-images/diffs/cd-diff-2.2r35.gz

> >     outputname: binary-i386-1.iso
> >     outputhash: 12345678
> > 
> >     [serverdirs]
> >     debian: ftp://ftp.leo.org/pub/comp/os/unix/linux/Debian/debian/
> >     nonUS: http://some.mirror.net/debian-nonUS/
> >     nonUS: ftp://ftp.debian.org/pub/debian/non-US/
> You can also have README.mirrors and README.non-US downloaded automagically
> and parse those for the correct info.

I would like to avoid doing that, to make the tool suitable for a
range of problems rather than just that of downloading CD images from
Debian mirrors. Additionally, this would prevent people from releasing
personal editions, where just a few files have been replaced with
things from their own homepages, or whatever.

IMHO, it ought to be the other way round; the /location list/ should
be updated whenever a server is added. (Or, alternatively, there could
be an #include directive for this.)

> >     [parts]
> >     # Either indirection through mirrored "server dir":
> >     de0281a4: debian:potato/r35/Contents.i386.gz
> >     d02a49b7: nonUS:main/binary-i386/ssh_1.2.3_i386.deb
> >     # ...or directly insert mirrored URL:
> >     d02a49b7: http://f.net/debian-nonUS/main/binary-i386/ssh_1.2.3_i386.deb
> >     # 2 different leafnames for same file (checksum is the same):
> >     e1ee7000: directoryA:foo-0.1.tar.gz
> >     e1ee7000: directoryB:foo-0.1.tgz
> > 
> > An initial version of the [parts] section is also output by mkhybrid.
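
To show that the format above is easy to process, here is a small
Python sketch of a parser. It assumes exactly the syntax of the
example (section headers in brackets, "key: value" lines, "#"
comments, repeated keys allowed); the function names are made up for
illustration.

```python
from collections import defaultdict

SAMPLE = """\
[info]
outputname: binary-i386-1.iso

[serverdirs]
debian: ftp://ftp.leo.org/pub/comp/os/unix/linux/Debian/debian/
nonUS: http://some.mirror.net/debian-nonUS/
nonUS: ftp://ftp.debian.org/pub/debian/non-US/

[parts]
# Either indirection through mirrored "server dir":
de0281a4: debian:potato/r35/Contents.i386.gz
d02a49b7: nonUS:main/binary-i386/ssh_1.2.3_i386.deb
# ...or directly insert mirrored URL:
d02a49b7: http://f.net/debian-nonUS/main/binary-i386/ssh_1.2.3_i386.deb
"""


def parse_location_list(text):
    """Return {section: {key: [values...]}}; repeated keys accumulate."""
    sections = defaultdict(lambda: defaultdict(list))
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        key, sep, value = line.partition(":")
        if current is None or not sep:
            raise ValueError("cannot parse line: %r" % raw)
        sections[current][key.strip()].append(value.strip())
    return sections


def candidate_urls(sections, checksum):
    """Expand every location of a part into full URLs, in file order."""
    urls = []
    for location in sections["parts"].get(checksum, []):
        label, _, rest = location.partition(":")
        if rest.startswith("//"):
            urls.append(location)      # already a complete URL
        else:
            for base in sections["serverdirs"].get(label, []):
                urls.append(base + rest)   # one URL per mirror
    return urls


locations = parse_location_list(SAMPLE)
ssh_urls = candidate_urls(locations, "d02a49b7")
```

A hand-written parser is used because the same key may appear more
than once (several mirrors per label, several locations per
checksum).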


> > - By querying servers before it starts to download, the tool can
> >   determine whether all files are actually available.
> You mean, downloading an ls-lR? Or querying for each individual
> file? (The latter wouldn't be advantageous since _if_ they exist
> we'll be downloading them later anyway.)

I was thinking of individual queries, although directory scans are
probably better. Why would an initial check not be an advantage? It
allows you to abort straight away, instead of downloading, say, a
hundred packages before encountering one which doesn't exist.
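
The up-front check could look roughly like this in Python. The names
find_missing and http_exists are hypothetical, and http_exists only
covers HTTP(S) mirrors via a HEAD request; FTP mirrors or directory
scans would need a different probe.

```python
import urllib.error
import urllib.request


def http_exists(url, timeout=10):
    """Return True if a HEAD request for url succeeds (HTTP(S) only)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def find_missing(parts, exists=http_exists):
    """Return the checksums for which no candidate URL is available.

    parts maps a checksum to its list of candidate URLs; exists is any
    callable taking a URL and returning True or False.
    """
    return [
        checksum
        for checksum, urls in parts.items()
        if not any(exists(url) for url in urls)
    ]


# Usage with a fake probe, so that nothing is actually downloaded:
parts = {
    "de0281a4": ["http://mirror-a.example/Contents.i386.gz"],
    "d02a49b7": ["http://mirror-a.example/ssh.deb",
                 "http://mirror-b.example/ssh.deb"],
}
available = {"http://mirror-b.example/ssh.deb"}
missing = find_missing(parts, exists=lambda url: url in available)
```

If the returned list is not empty, the tool can abort before
fetching a single package.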

> Other thing: this patched mkhybrid/mkisofs should know that things
> like Packages(.gz) are CD-specific, and include them as literal data
> in the patch. That's probably only detectable with checking the
> #hardlinks, or !symlink in the symlink-farm case.

I hadn't thought of this problem. :-/ Your solution would probably
work, but maybe my initial idea (which I dropped later on) of how to
generate the image template is the better one after all:

You first create the standard ISO image with the current debian-cd. 
Then you run a special "image template creator" tool and pass it the
ISO image and a list of files as parameters. It uses an rsync-like
algorithm to detect which files are contained in the image at which
offsets. The above problem could then be solved just by specifying a
directory with a standard Debian mirror.
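
To give an idea of what "rsync-like" means here, a simplified Python
sketch follows. It slides a rolling weak checksum over the image and
confirms candidate matches with a strong hash. The block size, the
checksum details, SHA-256 and all names are assumptions for the
example; files shorter than one block are simply not searched for
(they would end up as literal data).

```python
import hashlib

MOD = 1 << 16


def weak_sum(block):
    """rsync-style weak checksum of a block, as a pair (a, b)."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def find_files(image, files, block=16):
    """Return [(offset, name), ...] for every file found in image."""
    index = {}
    for name, content in files.items():
        if len(content) >= block:
            entry = (name, len(content), hashlib.sha256(content).digest())
            index.setdefault(weak_sum(content[:block]), []).append(entry)

    matches = []
    pos = 0
    state = None  # weak checksum of image[pos:pos + block]
    while pos + block <= len(image):
        if state is None:
            state = weak_sum(image[pos:pos + block])
        hit = None
        for name, length, digest in index.get(state, []):
            candidate = image[pos:pos + length]
            # The weak sum only hints at a match; confirm it.
            if (len(candidate) == length
                    and hashlib.sha256(candidate).digest() == digest):
                hit = (name, length)
                break
        if hit:
            matches.append((pos, hit[0]))
            pos += hit[1]   # skip over the file just found
            state = None    # recompute the checksum at the new position
            continue
        if pos + block < len(image):
            # Roll the window one byte to the right in O(1).
            a, b = state
            a = (a - image[pos] + image[pos + block]) % MOD
            b = (b - block * image[pos] + a) % MOD
            state = (a, b)
        pos += 1
    return matches


# Usage: two "packages" embedded in padding at arbitrary offsets.
first = bytes(range(40))
second = b"debian-package-payload-0123456789"
image = b"\0" * 7 + first + b"\0" * 9 + second + b"pad"
found = find_files(image, {"a.deb": first, "b.deb": second})
```

Everything between the matches becomes literal data in the image
template, and a CD-specific file like Packages(.gz) is never matched
as long as it is not in the list of files.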

[Implementation of new scheme]

> > These are my requirements for the language used to implement the
> > client-side download tool:
> > 
> > 1) Should run on as many platforms as possible
> > 2) Should support both GUI facilities and command-line-only operation
> > 3) Should provide libraries for socket programming
> > 4) Should not require the user to install additional software to run
> I'd say 4) isn't a requirement. We'll ship binary for Win platforms
> anyway and on other platforms people will know how to type `make'.

Yes. I had hoped to avoid it, but it'll be necessary even with Java. 

> > Unless someone comes up with a solution to the browser security
> > problem, I'll just write the program as a standalone Java application. 
> > This violates my 4th requirement (users will have to download and
> > install the Java runtime), but I can see no alternatives.
> ...which is non-free? (I know absolutely nothing about Java, please
> bear with me ;-)

In terms of Free Java compilers, there's guavac and jikes. They seem
to work well. Additionally, gcj will compile Java directly into native
code - however, it only supports Java 1.0 (?).

With kaffe, there's also a Free JVM available. I'm not too sure of how
well it works - I tried to run javac on it once and it dumped core...

Sun's Java runtime is equivalent to a JDK without the development
stuff, and is non-free.

> And how about a Windows runtime env.? Availability, size?

Not too sure; there's probably only Sun's. It's not quite as large as
the whole JDK - the Debian info for "jdk1.1" claims it's nearly 5MB.

> Any important platforms that do _not_ have a free(!!) Java runtime env.?

Do you mean free or Free? There might be problems with the latter.

> > BTW, I'm not a Debian developer, but if I ever get as far as
> > implementing all of the above, I want to become one.
> Then you should start your New Maintainer application right now,
> since it might well take a long long time to get in ;-)

You're right. I'll probably already have signed up by the time you
read this! :)



  __   _
  |_) /|  Richard Atterer
  | \/¯|  http://atterer.net
  ¯ ´` ¯
If they give you ruled paper, write the other way    -- Juan Ramón Jiménez
