
Re: file order in archives, dar, special tar/debs/udebs



Drew Scott Daniels wrote:

> On Fri, 21 Feb 2003, Christian Fasshauer wrote:
> >   Drew Scott Daniels wrote:
> > >I like the extra information that tar includes, more so in source files,
> > >and especially in "original" source files.
> > >
> > Me too, but there is not much to see in generated deb files. After
> > compilation most files have the same timestamp, and those that don't are
> > listed in the source package with their original times anyway.
> >
> Still, the option to have this may be useful to other potential users of
> dar. Perhaps have it off by default, with an option to include these time
> stamps? Is there any tar archiver that can do this already?

see more below!

> > >When looking through source
> > >code, seeing the date that a file was changed/modified/created and its
> > >attributes can tell a lot about the history of the file. For binary
> > >packages, the value of the information is debatable. I'm on the fence
> > >about removing things like the number of user/group entries available
> > >and other things suggested in
> > >http://lists.debian.org/debian-dpkg/2003/debian-dpkg-200301/msg00049.html
> > >but I'm leaning towards their inclusion in standard debs, and their
> > >removal in udebs.
> > >
> > There was never a plan to remove user information and file attributes.
> > dar supports even more file types than tar. Actually no information is
> > lost, except the file creation and last access times. But as I said
> > above, this information is mostly not interesting in deb files because
> > the source files show it better.
> >
> Well, I should have said metadata. I'm glad to hear that you're planning
> to include almost all the same metadata that gtar does.

Actually, not planning; the first version of dar already preserved this
information.

> It'd be nice to be
> able to use dar for more than debs.

Something like a dar package?
Well, I haven't planned this. But dar is not finished at all; some features
are still to be implemented. Perhaps a library package would be a solution,
allowing dpkg as well as other applications to use this system.

> > >To save even more space, a special compressor with a dictionary of Debian
> > >blocks and/or statistics could be used. Thus if few users/groups etc. were
> > >used, such a dictionary would pick up on this and be used. As far as
> > >compression goes, if there's a common dictionary for all the files, then
> > >the change in space based on what kind of tar/tar features are used would
> > >be minimal. The trouble is that gzip, bzip2, PPMd and the like won't work
> > >from common dictionaries. Dictionary importing, choosing etc. is something
> > >that I am going to be writing into my compressor.
> > >
> > sounds great, is this compressor available somewhere?
> >
> Mine certainly isn't. I've only generated an inefficient table generator
> with block sizes of one bit. It's fairly useless as is. I only seem to
> have small chunks of time to work on my compression algorithm, but I've
> got a huge document discussing it. In my document I discuss enough to
> hopefully make my implementation a superset of all PPM algorithms.
>
> I have to decide which pieces of my algorithm are useless or have little
> value and whether to remove such features.
> I have to decide on:
> - a format.
> - useful blocksizes and whether to disallow other sizes. (Incl instruction
> set issues)
> - several context issues.
> - possible hardware implementations.
> - what standard files for building dictionaries should be used (large
> corpora of free data)
> - what the most useful dictionaries would be
> ...

Once there is a working compression algorithm, it could be combined with
the dar archiving system, if you want....
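
By the way, the preset-dictionary idea can already be tried out at the raw
deflate level: zlib's deflateSetDictionary() lets compressor and decompressor
agree on a dictionary out of band (the gzip file format itself has no field
to carry one). A rough sketch only; the function name and the idea of a
"shared deb dictionary" are just assumptions for illustration:

#include <string.h>
#include <zlib.h>

/* Compress `src' using a dictionary that is shared out of band.
 * The decompressor must call inflateSetDictionary() with the same bytes. */
static int deflate_with_dict(const unsigned char *src, size_t srclen,
                             const unsigned char *dict, size_t dictlen,
                             unsigned char *dst, size_t *dstlen)
{
    z_stream zs;
    memset(&zs, 0, sizeof zs);
    if (deflateInit(&zs, Z_BEST_COMPRESSION) != Z_OK)
        return -1;

    /* Seed the 32 KB sliding window with the shared dictionary,
     * e.g. strings that occur in nearly every deb. */
    deflateSetDictionary(&zs, dict, (uInt)dictlen);

    zs.next_in   = (Bytef *)src;
    zs.avail_in  = (uInt)srclen;
    zs.next_out  = dst;
    zs.avail_out = (uInt)*dstlen;
    if (deflate(&zs, Z_FINISH) != Z_STREAM_END) {
        deflateEnd(&zs);
        return -1;
    }
    *dstlen = zs.total_out;
    return deflateEnd(&zs) == Z_OK ? 0 : -1;
}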

> >...
> > The attribute string of each entry contains a pointer to the group and
> > owner (2 bytes); tar writes the user/owner as text.
> >
> Why do you assume all those directories are going to be there? For debs I
> don't know if they have to be, but again, it'd be nice to be able to use
> dar for more than debs.

Do you need an entire archiving program like tar or is a library enough?
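
Concerning the 2-byte pointer from above, just to make the size difference
concrete (this is only an illustrative layout, not dar's actual on-disk
format):

#include <stdint.h>

/* Illustrative per-entry metadata: owner and group are 2-byte indices
 * into a user/group table that is stored once per archive. */
struct entry_attr {
    uint16_t owner_idx;   /* index into the archive's owner table */
    uint16_t group_idx;   /* index into the archive's group table */
    uint32_t mode;        /* permission bits                      */
    uint64_t mtime;       /* last modification time               */
};

/* tar, by comparison, repeats the names as text in every header:
 *     char uname[32];
 *     char gname[32];
 */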

>
> > For each file, tar writes the entire path into an array of static size,
> > and if that is exhausted, tar even adds a second one which is large
> > enough. The catch is that tar writes the file contents directly after
> > the file information, so file contents and file information are mixed
> > together. That may disturb your sorted contents.
> >
> Do star and/or other implementations of tar do this?

A check with star showed output similar to that produced by tar.
I think there is no great format difference among the tar derivatives.
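
For reference, the header both tar and star emit is the POSIX ustar layout;
the 100-byte name field plus the 155-byte prefix field are the static array
and the larger second one I mentioned, and the member's data blocks follow
each header directly, which is exactly why metadata and contents end up
interleaved. Roughly (field widths in bytes, values stored as octal text):

/* POSIX ustar header, one per archive member; the member's data follows
 * immediately after, padded to a multiple of 512 bytes. */
struct ustar_header {
    char name[100];     /* file path (the fixed-size array)      */
    char mode[8];
    char uid[8];
    char gid[8];
    char size[12];
    char mtime[12];
    char chksum[8];
    char typeflag;      /* regular file, directory, symlink, ... */
    char linkname[100];
    char magic[6];      /* "ustar"                               */
    char version[2];
    char uname[32];     /* owner written out as text             */
    char gname[32];     /* group written out as text             */
    char devmajor[8];
    char devminor[8];
    char prefix[155];   /* the larger, second path array         */
    char pad[12];       /* zero padding up to the 512-byte block */
};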

> ie, is this
> part of the tar standard or is this one of the hacks that the star vs tar
> bug/flame war/discussion refers to.

No, I haven't read the entire discussion, but it shouldn't be.
dar is something entirely different from tar.

>
> > I think separating file information and file contents should lead to
> > better compression results.
> >
> Possibly, and usually. Existing implementations of compression algorithms
> would most definitely benefit, as they would not be able to see the pattern
> of metadata and data. Though ones that can switch contexts or start new
> windows might fare better with metadata between files. bzip2 (and I
> think gzip) have fixed-size windows at static intervals, so they both would
> be hindered by having metadata stored in between file content.

That should be a matter of the amount of data to compress. It shouldn't
matter much for compression of large masses of data like the kernel sources.
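
If someone wants to measure this on a real package, the quickest check is to
compress the same bytes in both layouts and compare the sizes. A small sketch
with zlib's one-shot compress2(); building the two buffers (identical
information, different ordering) is left out:

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

/* `interleaved' and `separated' are assumed to contain the same metadata
 * and file contents, once mixed together and once grouped apart. */
static void compare_layouts(const unsigned char *interleaved,
                            const unsigned char *separated, uLong len)
{
    uLongf out_a = compressBound(len);
    uLongf out_b = compressBound(len);
    unsigned char *buf_a = malloc(out_a);
    unsigned char *buf_b = malloc(out_b);

    compress2(buf_a, &out_a, interleaved, len, Z_BEST_COMPRESSION);
    compress2(buf_b, &out_b, separated, len, Z_BEST_COMPRESSION);

    printf("interleaved: %lu bytes, separated: %lu bytes\n",
           (unsigned long)out_a, (unsigned long)out_b);

    free(buf_a);
    free(buf_b);
}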

> I would guess that my prediction algorithm will perform better if it has
> a predictor like a file name close to the data.

I suggest combining our ideas: a concentrated TOC at the archive beginning,
while the contents could be placed wherever you want.

> That way the metadata can
> be used to predict what kind of table or dictionary to use, without having
> to keep track of the information way back at the beginning of the
> archive.

Why? The entire archive structure could be read in first.
This would speed up the archive listing, and all the required information
would be present too.
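
What I mean by reading the structure first, very roughly (an invented layout
for illustration, not dar's real format): with a table of contents of
fixed-size records at the front, listing the archive is one sequential read
and never touches the contents.

#include <stdint.h>
#include <stdio.h>

/* Invented layout: a TOC of fixed-size records at the very beginning,
 * content segments anywhere behind it. */
struct toc_entry {
    char     path[256];
    uint64_t content_offset;   /* where this member's bytes start */
    uint64_t content_size;     /* how many bytes they occupy      */
    uint16_t owner_idx;        /* index into a shared owner table */
    uint16_t group_idx;        /* index into a shared group table */
};

/* Listing walks only the TOC (byte order and struct padding are
 * ignored here for brevity). */
static void list_archive(FILE *f, uint32_t entry_count)
{
    struct toc_entry e;
    for (uint32_t i = 0; i < entry_count; i++) {
        if (fread(&e, sizeof e, 1, f) != 1)
            break;
        printf("%-40s %12llu bytes\n", e.path,
               (unsigned long long)e.content_size);
    }
}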

>
> > >I'm suggesting that
> > >the files be passed to tar in a different or several different orders so
> > >that they may be stored in different orders. Although technically the
> > >order in which files are stored is a piece of information (thus
> > >technically this idea is lossy) it's extremely unimportant and when played
> > >with can dramatically change the resulting archive. The input and output
> > >to the tar files would be the same (lossless), the only difference would
> > >be the order in which files were stored.
>
> Reordering of the metadata might be possible if the location of file
> content is stored in pointers.

That's exactly what I mean.

> I don't know if you're going to use
> pointers or sentinels,

dar creates the entire file structure and saves it. The content layout is
derived from the file structure, which provides the usage order, and from
the file size, which is stored in the TOC too. An additional pointer would
need to be added to satisfy your desire for content ordering.
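
That extra pointer is also what would make free content ordering cheap: as
long as every TOC record carries an explicit offset, the content segments can
be written in whatever order you like while the TOC itself stays as it is.
Continuing the invented toc_entry layout from above (same includes):

/* Write the content segments in an arbitrary order (for instance grouped
 * so that similar data sits together), recording in each TOC record where
 * its bytes actually landed. The TOC order never changes, so nothing
 * about the original file list is lost. */
static void write_contents(FILE *out, struct toc_entry *toc,
                           const unsigned char *const *data,
                           const size_t *sizes,
                           const size_t *write_order, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t k = write_order[i];          /* member chosen for this slot */
        toc[k].content_offset = (uint64_t)ftell(out);
        toc[k].content_size   = sizes[k];
        fwrite(data[k], 1, sizes[k], out);
    }
    /* Afterwards seek back to the front and rewrite the TOC with the
     * final offsets. */
}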

Did you get my first mail with the sample attachment?

> although your non-interrupted stream of file content
> ideas seem to indicate that sentinels in the file would be undesirable.
>
>      Drew Daniels
>
christian fasshauer




