Re: New CD image creation tool

To: Richard Atterer <richard@atterer.net>
Cc: debian-cd@lists.debian.org
Subject: Re: New CD image creation tool
From: "J.A. Bezemer" <costar@panic.et.tudelft.nl>
Date: Tue, 19 Dec 2000 15:36:52 +0100 (CET)
Message-id: <Pine.LNX.3.96.1001219142341.13639I-100000@panic.et.tudelft.nl>
In-reply-to: <20001217165557.A20172@atterer.net>

On Sun, 17 Dec 2000, Richard Atterer wrote:
> On Sun, Dec 17, 2000 at 01:14:06AM +0100, J.A. Bezemer wrote:

> [Current pseudo image kit might cause people to use another distribution]
> > Frankly, I don't have many problems with that. I have some fear that
> > if we would provide some very-fancy graphical-and-all CD downloading
> > tool, many people would start giving up right _after_ downloading
> > CDs, and I'd rather have that _before_. Not only to save (much!) 
> > bandwidth, but also to prevent unfounded criticism of the
> > distribution itself.
> 
> I don't agree! I'd like an easy download *and* easy installation *and*
> an easily maintainable Linux system!-) Sure, you have to solve one of
> these at a time, but since there already seems to be an effort of
> improving the installer, why not address the CD download problem as
> well?

Okay, you have a point here ;-)

> [Proposed new scheme]
> 
> > > In detail, this is how it might work:
> > > 
> > > --- 1 ---
> > > Someone uses debian-cd to create an ISO image. However, instead of the
> > > actual image, mkhybrid creates a diff-like file, which consists of
> >                 -- (a heavily patched version, you mean)
> 
> <fx: optimist mode>
> I mean a new version of mkhybrid containing my patches. ;-)
> 
> > > entries like:
> > > 
> > >         Copy xxx bytes directly to output file [followed by data]
> > > 	Insert contents of a file with checksum xxx here. 512k chunks
> > > 	  of the file have the checksums xxx, yyy, ...
> > > 
> > > Subdividing the files into fixed-size chunks allows for detecting
> > > early if you're downloading the wrong file, resuming from an
> > > interrupted download, even for concurrent download of parts of the
> > > same file from different servers.
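
The per-chunk checksums described above can be sketched as follows. This is
only an illustration: the proposal does not name a checksum algorithm, so MD5
is an assumption, and chunk_checksums() and first_bad_chunk() are hypothetical
helper names.

```python
import hashlib
import io

CHUNK_SIZE = 512 * 1024  # the 512k chunk size from the proposal


def chunk_checksums(stream, chunk_size=CHUNK_SIZE):
    """Yield the MD5 hex digest of each fixed-size chunk read from `stream`."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield hashlib.md5(chunk).hexdigest()


def first_bad_chunk(stream, expected, chunk_size=CHUNK_SIZE):
    """Return the index of the first missing or wrong chunk, or None if all match."""
    actual = chunk_checksums(stream, chunk_size)
    for index, want in enumerate(expected):
        if next(actual, None) != want:
            return index
    return None


# Tiny demonstration with 4-byte chunks instead of 512k ones.
data = b"0123456789"
expected = list(chunk_checksums(io.BytesIO(data), 4))
partial = data[:6]  # pretend the download was interrupted here
resume_chunk = first_bad_chunk(io.BytesIO(partial), expected, 4)
print("resume at byte offset", resume_chunk * 4)
```

A client could resume at the start of the first bad chunk, or give up early
if chunk 0 already fails to match.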
> > > 
> > > This "diff" file is not intended to be human-readable.
> > 
> > Actually, I would very much like it to be human-readable, except for
> > the "literal data" parts of course. That would give me (and other
> > people) a better idea of what's going on (and that there's nothing
> > secret/mysterious about it), and anyway ...
> 
> Yes, I also considered making it human-readable somehow - but with all
> the literal data, it would look extremely messy even if you uu-encoded
> that. In fact, that's what in the end led me to split up the
> information into two parts; one binary part and one readable part
> containing the info on where to get the packages. (At first, I wanted
> to put everything into one file.)

Okay. But then I'd suggest cutting it a little more to the left (or whatever
side ;-), namely that only the literal binary data is in one file, and the "CD
creating recipe" along with all other human-readable data in another file (or
maybe even two/three files). Besides readability it would also allow "cooking"
the CD image with a shell script (parsing with "read", creating with
"cat >> .iso" and "tail -c +N | head -c M >> .iso"; okay, crude and
inefficient, but should work).
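
A minimal sketch of such "cooking", in Python rather than shell. The recipe
syntax used here ("copy N" to take N bytes from the literal data file,
"file CHECKSUM N" to insert a downloaded file) is invented for illustration
and is not a proposed format; MD5 is likewise an assumption, and cook() is a
hypothetical name.

```python
import hashlib
import io


def cook(recipe_lines, literal, ingredients, out):
    """Write the image to `out` by following the recipe line by line.

    literal:     binary stream holding the literal data, in recipe order
    ingredients: dict mapping MD5 hex digest -> file contents (bytes)
    """
    for line in recipe_lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        if fields[0] == "copy":
            nbytes = int(fields[1])
            data = literal.read(nbytes)
            if len(data) != nbytes:
                raise ValueError("literal data file is too short")
        elif fields[0] == "file":
            checksum, nbytes = fields[1], int(fields[2])
            data = ingredients[checksum]
            if len(data) != nbytes or hashlib.md5(data).hexdigest() != checksum:
                raise ValueError("wrong ingredient for " + checksum)
        else:
            raise ValueError("unknown recipe command: " + fields[0])
        out.write(data)


# Tiny demonstration with in-memory "files".
package = b"pretend this is a .deb"
digest = hashlib.md5(package).hexdigest()
recipe = ["copy 4", "file %s %d" % (digest, len(package)), "copy 3"]
image = io.BytesIO()
cook(recipe, io.BytesIO(b"HEADEND"), {digest: package}, image)
```

Because the recipe is line-oriented text, the same loop could be written with
"read" and "head"/"tail" in a shell script, only more slowly.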

[...]
> We could introduce a special command "insert xxx zero bytes here", but
> why not use compression instead? I think the image's directory data
> will compress very well, too.

I expect that an "insert xxx YY bytes here" command would compress a little
better; besides, the uncompressed variant would require less
memory/diskspace/time. Of course you don't have to actually _use_ it if that
would be too difficult, but I feel it should at least be in the file format
specification.
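
To illustrate the difference, here is a rough sketch. The command spelling
"fill COUNT HEXBYTE" and the helper write_fill() are made up for this example;
zlib stands in for whatever compression would actually be used.

```python
import io
import zlib


def write_fill(out, count, byte_value, chunk_size=64 * 1024):
    """Write `count` copies of one byte without building them all in memory."""
    chunk = bytes([byte_value]) * min(count, chunk_size)
    remaining = count
    while remaining > 0:
        step = min(remaining, len(chunk))
        out.write(chunk[:step])
        remaining -= step


# Compare one megabyte of zero padding stored either way.
padding = io.BytesIO()
write_fill(padding, 1000000, 0x00)
command = b"fill 1000000 00\n"
compressed = zlib.compress(padding.getvalue(), 9)
print(len(command), "bytes as a recipe command")
print(len(compressed), "bytes as compressed literal data")
```

The command form also lets the client produce the padding in small pieces,
whereas the literal form has to be decompressed first.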

> > > --- 2 ---
> > > A second, human-readable file is created. Apart from a reference to
> > > the above file, it contains information about which checksums map to
> > > which filenames, and a list of mirrors. This is what it might look
> > > like:
> > > 
> > >     [info]
> > >     name: Debian GNU/Linux 2.2 r35 _Potato_ - ...
> > (I surely hope less than 35 are needed ;-)
> > >     diff: ftp://ftp.debian.org/pub/debian/cd-images/diffs/cd-diff-2.2r35.gz
> > 
> > - Maybe 'diff' is the wrong name here, but I can't think of a better one at
> >   the moment. 
> 
> I don't like it, either - hmm... maybe "image template" sounds better? 
> While we're at it, I'll call the human-readable file "location list"
> from now on, until someone comes up with something better. ;)

How about the concept of "cooking" a CD image? We have "special ingredients"
(i.e. the literal binary data), a "recipe" (image template) and a "grocery
list" of places (i.e. FTP/HTTP sites) to get the "standard ingredients" from.
Should be understandable even to people who have never touched a computer
before ;-)

[...]
> > >     [serverdirs]
> > >     debian: ftp://ftp.leo.org/pub/comp/os/unix/linux/Debian/debian/
> > >     nonUS: http://some.mirror.net/debian-nonUS/
> > >     nonUS: ftp://ftp.debian.org/pub/debian/non-US/
> > 
> > You can also have README.mirrors and README.non-US downloaded automagically
> > and parse those for the correct info.
> 
> I would like to avoid doing that, to make the tool suitable for a
> range of problems rather than just that of downloading CD images from
> Debian mirrors. Additionally, this would disallow people to release
> personal editions, where just a few files have been replaced with
> things from their own homepages, or whatever.

Okay, so there could be things like
  personal: http://my-server.com/~whoever/MyDebianPackages
(or MyWhateverFiles). Good.

> IMHO, it ought to be the other way round; the /location list/ should
> be updated whenever a server is added. (Or, alternatively, there could
> be an #include directive for this.)

Frankly, I doubt that the ftp-masters would enjoy yet another file to
maintain. An #include would be very helpful indeed.

> > >     [parts]
> > >     # Either indirection through mirrored "server dir":
> > >     de0281a4: debian:potato/r35/Contents.i386.gz
> > >     d02a49b7: nonUS:main/binary-i386/ssh_1.2.3_i386.deb
> > >     # ...or directly insert mirrored URL:
> > >     d02a49b7: http://f.net/debian-nonUS/main/binary-i386/ssh_1.2.3_i386.deb

You can probably omit this feature; using a personal "grocery chain" is easier
(and less error-prone).

> > >     # 2 different leafnames for same file (checksum is the same):
> > >     e1ee7000: directoryA:foo-0.1.tar.gz
> > >     e1ee7000: directoryB:foo-0.1.tgz
> > > 
> > > An initial version of the [parts] section is also output by mkhybrid.
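
A sketch of how a client might read the [serverdirs] and [parts] sections
shown above and turn a checksum into candidate URLs. The function names are
hypothetical, the "personal" entry is the one suggested earlier in this
message, and the #include idea is not handled here.

```python
def parse_location_list(text):
    """Return (serverdirs, parts) parsed from a location list."""
    serverdirs = {}  # label -> list of base URLs
    parts = {}       # checksum -> list of "label:path" entries or full URLs
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if section == "serverdirs":
            serverdirs.setdefault(key, []).append(value)
        elif section == "parts":
            parts.setdefault(key, []).append(value)
    return serverdirs, parts


def candidate_urls(checksum, serverdirs, parts):
    """List every URL from which the file with this checksum may be fetched."""
    urls = []
    for source in parts.get(checksum, []):
        label, _, path = source.partition(":")
        if label in serverdirs:
            urls.extend(base.rstrip("/") + "/" + path for base in serverdirs[label])
        else:
            urls.append(source)  # assumed to be a complete URL already
    return urls


example = """\
[serverdirs]
debian: ftp://ftp.leo.org/pub/comp/os/unix/linux/Debian/debian/
nonUS: http://some.mirror.net/debian-nonUS/
nonUS: ftp://ftp.debian.org/pub/debian/non-US/
personal: http://my-server.com/~whoever/MyDebianPackages

[parts]
d02a49b7: nonUS:main/binary-i386/ssh_1.2.3_i386.deb
"""
serverdirs, parts = parse_location_list(example)
for url in candidate_urls("d02a49b7", serverdirs, parts):
    print(url)
```

Listing a label more than once simply yields more candidate URLs for every
file under that label, which is what makes mirrors cheap to add.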
> 
> [snip]
> 
> [Advantages]
> > > - By querying servers before it starts to download, the tool can
> > >   determine whether all files are actually available.
> > 
> > You mean, downloading an ls-lR? Or querying for each individual
> > file? (The latter wouldn't be advantageous since _if_ they exist
> > we'll be downloading them later anyway.)
> 
> I was thinking of individual queries, although directory scans are
> probably better. Why would an initial check not be an advantage? It
> allows you to abort straight away, instead of downloading, say, a
> hundred packages before encountering one which doesn't exist.

I agree that checking would be an advantage, but it should not cost too much.
ls-lR.gz is quite big (1.3M on http://ftp.us.debian.org/debian/) (and some
mirrors have US and non-US combined in one ls-lR, others don't); scanning
directories doesn't work with HTTP servers and with FTP (using package pools)
it will transfer about the same size as the UNzipped ls-lR (~10M). Checking
every single file is only easy with HTTP and will still cost ~500 bytes(?) *
2000 files/CD = ~1MB per CD.
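
The rough figures above, side by side. All numbers are the estimates from
this message, with "M" read as 10^6 bytes (an assumption).

```python
FILES_PER_CD = 2000          # rough figure from the message
BYTES_PER_HTTP_CHECK = 500   # rough per-file cost, marked "(?)" in the message

costs = {
    "ls-lR.gz download": 1.3e6,
    "FTP directory scan": 10e6,
    "per-file HTTP check": FILES_PER_CD * BYTES_PER_HTTP_CHECK,
}
# Print the cheapest method first.
for method, size in sorted(costs.items(), key=lambda item: item[1]):
    print("%-20s ~%.1f MB" % (method, size / 1e6))
```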

Furthermore, when using pools checking is not a really big issue since
everything should (at least _can_) still be available. Don't concentrate on it
now, you can very well add this "new feature!" later (and I guess making it
optional and "off-by-default" would be best for most users).

> > Other thing: this patched mkhybrid/mkisofs should know that things
> > like Packages(.gz) are CD-specific, and include them as literal data
> > in the patch. That's probably only detectable with checking the
> > #hardlinks, or !symlink in the symlink-farm case.
> 
> I hadn't thought of this problem. :-/ Your solution would probably
> work, but maybe my initial idea (which I dropped later on) of how to
> generate the image template is the better one after all:
> 
> You first create the standard ISO image with the current debian-cd. 
> Then you run a special "image template creator" tool and give to it as
> parameters the ISO image and a list of files. It uses an rsync-like
                                 -- "one example location of each
known grocery chain to check which ingredients can be fetched where" ;-)

> algorithm to detect which files are contained in the image at which
> offsets. The above problem could then be solved just by specifying a
> directory with a standard Debian mirror.
> 
> 
> [Implementation of new scheme]
> 
> > > These are my requirements for the language used to implement the
> > > client-side download tool:
> > > 
> > > 1) Should run on as many platforms as possible
> > > 2) Should support both GUI facilities and command-line-only operation
> > > 3) Should provide libraries for socket programming
> > > 4) Should not require the user to install additional software to run

[Java is also not "The Perfect Solution For Everything"(tm) ;-]

Don't worry about the clients too much at this point. Just like the
Pseudo-Image Kit, the file formats are ("should be") easy to interpret/use in
any language. Start with a simple implementation in whatever language you like
best and write fastest ;-)

The main problem at this point is the "recipe generator"; you should probably
start working on that first. That would at least allow us to see how well
things work, what amounts of diskspace etc. we're talking about; things we'll
need when discussing details with the ftpmasters.
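
As a starting point, a very simplified sketch of such a generator. It is not
the rsync-like algorithm mentioned above: it only looks at 2048-byte sector
boundaries (where ISO 9660 file data starts), assumes zero padding after each
file, and compares against complete local copies of the files. make_recipe()
and the ("copy", ...) / ("file", ...) entries are hypothetical.

```python
import hashlib

SECTOR = 2048  # ISO 9660 logical sector size


def make_recipe(image, known_files):
    """Describe `image` (bytes) in terms of `known_files` (iterable of bytes).

    Returns (recipe, literal): a list of ("copy", nbytes) and
    ("file", md5hex, nbytes) entries, plus the concatenated literal data.
    """
    # Index non-empty files by the checksum of their zero-padded first sector.
    by_first_sector = {}
    for data in known_files:
        if data:
            first = data[:SECTOR].ljust(SECTOR, b"\0")
            by_first_sector.setdefault(hashlib.md5(first).digest(), []).append(data)

    recipe = []
    literal = bytearray()
    literal_start = 0
    offset = 0
    while offset < len(image):
        key = hashlib.md5(image[offset:offset + SECTOR]).digest()
        candidates = sorted(by_first_sector.get(key, []), key=len, reverse=True)
        match = next((d for d in candidates
                      if image[offset:offset + len(d)] == d), None)
        if match is None:
            offset += SECTOR
            continue
        if offset > literal_start:
            recipe.append(("copy", offset - literal_start))
            literal += image[literal_start:offset]
        recipe.append(("file", hashlib.md5(match).hexdigest(), len(match)))
        literal_start = offset + len(match)
        # Continue at the next sector boundary after the file; the padding
        # in between becomes part of the following "copy" entry.
        offset += -(-len(match) // SECTOR) * SECTOR
    if len(image) > literal_start:
        recipe.append(("copy", len(image) - literal_start))
        literal += image[literal_start:]
    return recipe, bytes(literal)


# Tiny demonstration: two header sectors, one 3000-byte file, one trailer sector.
header = b"H" * (2 * SECTOR)
package = (bytes(range(256)) * 12)[:3000]
padding = bytes(-len(package) % SECTOR)
image = header + package + padding + b"T" * SECTOR
recipe, literal = make_recipe(image, [package])
for entry in recipe:
    print(entry)
```

Running something like this against a real image and a local mirror would
give a first idea of how large the literal data part turns out to be.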


Regards,
  Anne Bezemer