file order in archives, dar, special tar/debs/udebs
Debian-devel seems like the best place to archive this conversation and
get more input. This conversation is no longer relevant to the bug that
started this discussion.
On Fri, 21 Feb 2003, Christian Fasshauer wrote:
> Drew Scott Daniels wrote:
> >I like the extra information that tar includes, more so in source files,
> >and especially in "original" source files.
> >
> Me too, but there is nothing great to see in generated deb files. After
> compilation, most files have the same time, and those that do not are
> listed in the source package again with the original time.
>
Still, the option to have this may be useful to other potential users of
dar. Perhaps by default having it off, and creating an option to include
these time stamps? Is there any tar archiver that can do this already?
> >When looking through source
> >code, seeing the date that a file was changed/modified/created and its
> >attributes can tell a lot about the history of the file. For binary
> >packages, the value of the information is debatable. I'm on the fence
> >about removing things like the number of user/group entries
> >available and other things suggested in
> >http://lists.debian.org/debian-dpkg/2003/debian-dpkg-200301/msg00049.html
> >but I'm leaning towards their inclusion in standard debs, and their
> >removal in udebs.
> >
> There was never a plan to remove user information and file attributes.
> dar supports even more file types than tar. Actually no information is
> lost, except the file creation and last access time. But as
> I said above, this information is mostly not interesting in deb files
> because source files show it better.
>
Well, I should have said metadata. I'm glad to hear that you're planning
to include almost all the same metadata that gtar does. It'd be nice to be
able to use dar for more than debs.
> >To save even more space, a special compressor with a dictionary of Debian
> >blocks and/or statistics could be used. Thus if few users/groups etc were
> >used then such a dictionary would pick up on this and be used. As far as
> >compression goes, if there's a common dictionary to all the files, then
> >the change in space based on what kind of tar/tar features are used would
> >be minimal. The trouble is that gzip, bzip2, PPMd and the like won't work
> >from common dictionaries. Dictionary importing, choosing etc is something
> >that I am going to be writing into my compressor.
> >
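To illustrate the common-dictionary idea, here is a minimal sketch using the preset-dictionary feature of zlib's deflate (the library, not the gzip tool or file format). The dictionary contents, the sample data and the helper names deflate/inflate are assumptions made up for this example; this is not the compressor discussed in this thread.

```python
import zlib

# Hypothetical shared dictionary: strings expected to recur in many packages.
SHARED_DICT = b"/usr/share/doc/ root root changelog.Debian.gz copyright README.Debian"

def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress with deflate, optionally primed with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()

def inflate(blob: bytes, zdict: bytes = b"") -> bytes:
    """Decompress; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()

# Made-up sample resembling a small listing from a package.
sample = (b"/usr/share/doc/foo/copyright root root\n"
          b"/usr/share/doc/foo/changelog.Debian.gz root root\n")

plain = deflate(sample)
primed = deflate(sample, SHARED_DICT)
print(len(sample), len(plain), len(primed))
```

The dictionary is not stored in the output, so both sides have to agree on it in advance, which is why choosing and distributing the dictionary matters.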
> sounds great, is this compressor available somewhere?
>
Mine certainly isn't. I've only generated an inefficient table generator
with block sizes of one bit. It's fairly useless as is. I only seem to
have small chunks of time to work on my compression algorithm, but I've
got a huge document discussing it. In my document I discuss enough to
hopefully make my implementation a superset of all PPM algorithms.
I have to decide which pieces of my algorithm are useless or have little
value and whether to remove such features.
I have to decide on:
- a format.
- useful blocksizes and whether to disallow other sizes. (Incl instruction
set issues)
- several context issues.
- possible hardware implementations.
- what standard files for building dictionaries should be used (large
corpuses of free data)
- what the most useful dictionaries would be
...
>...
> Actually dar removes no information except the creation and
> modification time. But this feature could be added. I have one problem
> with tar:
>
> If you compress a tree like this:
>
> /usr/share/doc/debian/*
>
> tar puts the following file information in the output:
>
> /usr/share/doc/debian/bug-log-access.txt <file attributes and a lot of
> zeros> <file contents>
> /usr/share/doc/debian/bug-log-mailserver.txt <file attributes> <file
> contents> ... and so on
>
> dar works like this:
>
> <all groups and owners> <dir + attributes> usr <dir + attributes> share
> <dir + attributes> doc <dir + attributes> debian <file + attributes>
> bug-log-access.txt <file + attributes> bug-log-mailserver.txt <dir back>
> <dir back> <dir back> <dir back> <file contents of each file>
> The attribute string of each entry contains a pointer to the group and
> owner (2 bytes), tar writes the user/owner as text.
>
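The metadata-first layout described above can be sketched roughly as follows. This is not dar's actual on-disk format: the field widths, the function name pack_metadata_first, and storing full paths instead of the enter-directory / "dir back" markers are all simplifying assumptions. The "2 bytes" pointer is assumed here to be one byte for the owner index and one for the group index.

```python
import struct

def pack_metadata_first(files):
    """files: list of (path, owner, group, content) tuples.

    Returns (name_table, catalog, payload): a shared owner/group table,
    then all per-file metadata, then all file contents back to back.
    """
    names = []  # shared table of owner/group names, referenced by index

    def index_of(name):
        if name not in names:
            names.append(name)
        return names.index(name)

    catalog = bytearray()
    payload = bytearray()
    for path, owner, group, content in files:
        raw_path = path.encode()
        # owner index, group index (1 byte each), path length, content length
        catalog += struct.pack(">BBHI", index_of(owner), index_of(group),
                               len(raw_path), len(content))
        catalog += raw_path
        payload += content  # contents are kept apart from the metadata
    name_table = b"\0".join(n.encode() for n in names) + b"\0\0"
    return bytes(name_table), bytes(catalog), bytes(payload)

# Made-up example entries.
table, catalog, payload = pack_metadata_first([
    ("a.txt", "root", "root", b"AAA"),
    ("b.txt", "root", "staff", b"BBB"),
])
```

With tar, by contrast, each header carries the owner and group names as text and is immediately followed by that file's contents.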
Why do you assume all those directories are going to be there? For debs I
don't know if they have to be, but again, it'd be nice to be able to use
dar for more than debs.
> For each file, tar writes the entire path in an array of static size and
> if that is exhausted, tar even opens a second one which is large enough.
> The catch is that tar writes the file contents directly after the file
> information. So file contents and file information are mixed together.
> That may disturb your sorted contents.
>
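The fixed-size name field can be observed with Python's tarfile module. A tar header is a 512-byte block with a 100-byte name field; in GNU format, a longer name is written as an extra entry placed before the real header. The sketch below only measures where the file data starts; the helper data_offset and the file names are made up for the example.

```python
import io
import tarfile

def data_offset(name: str) -> int:
    """Write one small file under `name` to an in-memory GNU-format tar
    and return the byte offset at which its contents start."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        payload = b"hello"
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        return tar.getmembers()[0].offset_data

short_offset = data_offset("doc/readme.txt")          # fits in the name field
long_offset = data_offset("d/" * 60 + "readme.txt")   # 130 characters, too long
print(short_offset, long_offset)
```

In both cases the contents follow directly after the header blocks, which is the interleaving of metadata and data described above.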
Do star and/or all other implementations of tar do this? I.e., is this
part of the tar standard, or is this one of the hacks that the star vs tar
bug/flame war/discussion refers to?
> I think separating file informations and file contents should lead to
> better compressing results.
>
Possibly and usually. Existing implementations of compression algorithms
would most definitely benefit, as they are not able to see the pattern
of metadata and data. Although, ones that can switch contexts or start new
windows might fare better by having metadata between files. bzip2 (and I
think gzip) have fixed size windows at static intervals so they both would
be hindered by having metadata stored in between file content.
I would guess that my prediction algorithm will perform better if it has
a predictor like a file name close to the data. That way the metadata can
be used to predict what kind of table or dictionary to use without
requiring keeping track of the information way back at the beginning of the
archive.
> >I'm suggesting that
> >the files be passed to tar in a different or several different orders so
> >that they may be stored in different orders. Although technically the
> >order in which files are stored is a piece of information (thus
> >technically this idea is lossy) it's extremely unimportant and when played
> >with can dramatically change the resulting archive. The input and output
> >to the tar files would be the same (lossless), the only difference would
> >be the order in which files were stored.
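A small experiment along these lines, as a sketch only: the file names and contents are synthetic, and zlib's deflate stands in for gzip. Two identical 20 KB files are stored either next to each other or separated by an unrelated 40 KB file. Deflate can only refer back 32 KB, so only the adjacent order lets it reuse the first copy.

```python
import io
import random
import tarfile
import zlib

def tar_bytes(files, order):
    """Build an in-memory tar holding `files`, stored in the given order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in order:
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()

def extract_all(archive):
    """Return {name: contents} for every member of an in-memory tar."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}

rng = random.Random(1)  # fixed seed so the synthetic data is reproducible

def noise(n):
    return rng.getrandbits(8 * n).to_bytes(n, "big")

block = noise(20 * 1024)
files = {"a.txt": block, "b.bin": noise(40 * 1024), "c.txt": block}

scattered = tar_bytes(files, ["a.txt", "b.bin", "c.txt"])
grouped = tar_bytes(files, ["a.txt", "c.txt", "b.bin"])

scattered_size = len(zlib.compress(scattered, 9))
grouped_size = len(zlib.compress(grouped, 9))
print(scattered_size, grouped_size)
```

Both archives extract to the same set of files; only the stored order, and with it the compressed size, differs.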
In dar, reordering files may not be the only thing that can be done.
Reordering of the metadata might be possible if the location of file
content is stored in pointers. I don't know if you're going to use
pointers or sentinels, although your non-interrupted stream of file content
ideas seem to indicate that sentinels in the file would be undesirable.
Drew Daniels