
Re: file order in archives, dar, special tar/debs/udebs



Drew Scott Daniels wrote:

> On Fri, 21 Feb 2003, Christian Fasshauer wrote:
> >   Drew Scott Daniels wrote:
> > >I like the extra information that tar includes, more so in source files,
> > >and especially in "original" source files.
> > >
> > Me too, but there is not much to see in generated deb files. After
> > compilation most files have the same timestamp, and those that don't are
> > listed in the source package with their original times anyway.
> >
> Still, the option to have this may be useful to other potential users of
> dar. Perhaps have it off by default, with an option to include these time
> stamps? Is there any tar archiver that can do this already?

see more below!

> > >When looking through source
> > >code, seeing the date that a file was changed/modified/created and its
> > >attributes can tell a lot about the history of the file. For binary
> > >packages, the value of the information is debatable. I'm on the fence
> > >about removing things like the number of user/group entries available
> > >and other things suggested in
> > >http://lists.debian.org/debian-dpkg/2003/debian-dpkg-200301/msg00049.html
> > >but I'm leaning towards their inclusion in standard debs, and their
> > >removal in udebs.
> > >
> > There was never a plan to remove user information and file attributes.
> > dar supports even more file types than tar. Actually no information is
> > lost, except the file creation and last access times. But as I said
> > above, this information is mostly not interesting in deb files because
> > the source files show it better.
> >
> Well, I should have said metadata. I'm glad to hear that you're planning
> to include almost all the same metadata that gtar does.

Actually, not planning; the first version of dar already preserved this
information.

> It'd be nice to be
> able to use dar for more than debs.

Something like a dar package?
Well, I haven't planned this. But dar is not finished at all; some features
are still to be implemented. Perhaps a library package would be a solution,
allowing dpkg as well as other applications to use this system.

> > >To save even more space, a special compressor with a dictionary of Debian
> > >blocks and/or statistics could be used. Thus if few users/groups etc. were
> > >used, such a dictionary would pick up on this and be used. As far as
> > >compression goes, if there's a common dictionary for all the files, then
> > >the change in space based on what kind of tar/tar features are used would
> > >be minimal. The trouble is that gzip, bzip2, PPMd and the like won't work
> > >from common dictionaries. Dictionary importing, choosing etc. is something
> > >that I am going to be writing into my compressor.
> > >
> > sounds great, is this compressor available somewhere?
> >
> Mine certainly isn't. I've only generated an inefficient table generator
> with block sizes of one bit. It's fairly useless as is. I only seem to
> have small chunks of time to work on my compression algorithm, but I've
> got a huge document discussing it. In my document I discuss enough to
> hopefully make my implementation a superset of all PPM algorithms.
>
> I have to decide which pieces of my algorithm are useless or have little
> value and whether to remove such features.
> I have to decide on:
> - a format.
> - useful blocksizes and whether to disallow other sizes. (Incl instruction
> set issues)
> - several context issues.
> - possible hardware implementations.
> - what standard files for building dictionaries should be used (large
> corpora of free data)
> - what the most useful dictionaries would be
> ...

Once there is a working compression algorithm, it could be combined with
the dar archiving system, if you want....
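
By the way, the preset-dictionary idea can already be tried out at the raw
deflate level: zlib's deflateSetDictionary() lets compressor and decompressor
agree on a dictionary out of band (the gzip file format itself has no field
to carry one). A rough sketch only; the function name and the idea of a
"shared deb dictionary" are just assumptions for illustration:

#include <string.h>
#include <zlib.h>

/* Compress `src' using a dictionary that is shared out of band.
 * The decompressor must call inflateSetDictionary() with the same bytes. */
static int deflate_with_dict(const unsigned char *src, size_t srclen,
                             const unsigned char *dict, size_t dictlen,
                             unsigned char *dst, size_t *dstlen)
{
    z_stream zs;
    memset(&zs, 0, sizeof zs);
    if (deflateInit(&zs, Z_BEST_COMPRESSION) != Z_OK)
        return -1;

    /* Seed the 32 KB sliding window with the shared dictionary,
     * e.g. strings that occur in nearly every deb. */
    deflateSetDictionary(&zs, dict, (uInt)dictlen);

    zs.next_in   = (Bytef *)src;
    zs.avail_in  = (uInt)srclen;
    zs.next_out  = dst;
    zs.avail_out = (uInt)*dstlen;
    if (deflate(&zs, Z_FINISH) != Z_STREAM_END) {
        deflateEnd(&zs);
        return -1;
    }
    *dstlen = zs.total_out;
    return deflateEnd(&zs) == Z_OK ? 0 : -1;
}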

> >...
> > The attribute string of each entry contains a pointer to the group and
> > owner (2 bytes); tar writes the user/owner as text.
> >
> Why do you assume all those directories are going to be there? For debs I
> don't know if they have to be, but again, it'd be nice to be able to use
> dar for more than debs.

Do you need an entire archiving program like tar or is a library enough?
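
Concerning the 2-byte pointer from above, just to make the size difference
concrete (this is only an illustrative layout, not dar's actual on-disk
format):

#include <stdint.h>

/* Illustrative per-entry metadata: owner and group are 2-byte indices
 * into a user/group table that is stored once per archive. */
struct entry_attr {
    uint16_t owner_idx;   /* index into the archive's owner table */
    uint16_t group_idx;   /* index into the archive's group table */
    uint32_t mode;        /* permission bits                      */
    uint64_t mtime;       /* last modification time               */
};

/* tar, by comparison, repeats the names as text in every header:
 *     char uname[32];
 *     char gname[32];
 */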

>
> > For each file, tar writes the entire path into an array of static size,
> > and if that is exhausted, tar even adds a second one which is large
> > enough. The catch is that tar writes the file contents directly after
> > the file information, so file contents and file information are mixed
> > together. That may disturb your sorted contents.
> >
> Do star and/or other implementations of tar do this?

A check with star showed output similar to that produced by tar.
I think there is no great format difference among the tar derivatives.
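
For reference, the header both tar and star emit is the POSIX ustar layout;
the 100-byte name field plus the 155-byte prefix field are the static array
and the larger second one I mentioned, and the member's data blocks follow
each header directly, which is exactly why metadata and contents end up
interleaved. Roughly (field widths in bytes, values stored as octal text):

/* POSIX ustar header, one per archive member; the member's data follows
 * immediately after, padded to a multiple of 512 bytes. */
struct ustar_header {
    char name[100];     /* file path (the fixed-size array)      */
    char mode[8];
    char uid[8];
    char gid[8];
    char size[12];
    char mtime[12];
    char chksum[8];
    char typeflag;      /* regular file, directory, symlink, ... */
    char linkname[100];
    char magic[6];      /* "ustar"                               */
    char version[2];
    char uname[32];     /* owner written out as text             */
    char gname[32];     /* group written out as text             */
    char devmajor[8];
    char devminor[8];
    char prefix[155];   /* the larger, second path array         */
    char pad[12];       /* zero padding up to the 512-byte block */
};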

> ie, is this
> part of the tar standard or is this one of the hacks that the star vs tar
> bug/flame war/discussion refers to.

No, I haven't read the entire discussion, but it shouldn't be.
dar is something entirely different from tar.

>
> > I think separating file information and file contents should lead to
> > better compression results.
> >
> Possibly, and usually. Existing implementations of compression algorithms
> would most definitely benefit, as they would not be able to see the pattern
> of metadata and data. Though ones that can switch contexts or start new
> windows might fare better with metadata between files. bzip2 (and I
> think gzip) have fixed-size windows at static intervals, so they both would
> be hindered by having metadata stored in between file content.

That should be a matter of the amount of data to compress. It shouldn't
matter much for compression of large masses of data like the kernel sources.
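
If someone wants to measure this on a real package, the quickest check is to
compress the same bytes in both layouts and compare the sizes. A small sketch
with zlib's one-shot compress2(); building the two buffers (identical
information, different ordering) is left out:

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

/* `interleaved' and `separated' are assumed to contain the same metadata
 * and file contents, once mixed together and once grouped apart. */
static void compare_layouts(const unsigned char *interleaved,
                            const unsigned char *separated, uLong len)
{
    uLongf out_a = compressBound(len);
    uLongf out_b = compressBound(len);
    unsigned char *buf_a = malloc(out_a);
    unsigned char *buf_b = malloc(out_b);

    compress2(buf_a, &out_a, interleaved, len, Z_BEST_COMPRESSION);
    compress2(buf_b, &out_b, separated, len, Z_BEST_COMPRESSION);

    printf("interleaved: %lu bytes, separated: %lu bytes\n",
           (unsigned long)out_a, (unsigned long)out_b);

    free(buf_a);
    free(buf_b);
}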

> I would guess that my prediction algorithm will perform better if it has
> a predictor like a file name close to the data.

I suggest combining our ideas: a concentrated TOC at the archive beginning,
while the contents could be placed wherever you want.

> That way the metadata can
> be used to predict what kind of table or dictionary to use, without having
> to keep track of the information way back at the beginning of the
> archive.

Why? The entire archive structure could be read in first.
This would speed up the archive listing, and all the required information
would be present too.
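
What I mean by reading the structure first, very roughly (an invented layout
for illustration, not dar's real format): with a table of contents of
fixed-size records at the front, listing the archive is one sequential read
and never touches the contents.

#include <stdint.h>
#include <stdio.h>

/* Invented layout: a TOC of fixed-size records at the very beginning,
 * content segments anywhere behind it. */
struct toc_entry {
    char     path[256];
    uint64_t content_offset;   /* where this member's bytes start */
    uint64_t content_size;     /* how many bytes they occupy      */
    uint16_t owner_idx;        /* index into a shared owner table */
    uint16_t group_idx;        /* index into a shared group table */
};

/* Listing walks only the TOC (byte order and struct padding are
 * ignored here for brevity). */
static void list_archive(FILE *f, uint32_t entry_count)
{
    struct toc_entry e;
    for (uint32_t i = 0; i < entry_count; i++) {
        if (fread(&e, sizeof e, 1, f) != 1)
            break;
        printf("%-40s %12llu bytes\n", e.path,
               (unsigned long long)e.content_size);
    }
}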

>
> > >I'm suggesting that
> > >the files be passed to tar in a different or several different orders so
> > >that they may be stored in different orders. Although technically the
> > >order in which files are stored is a piece of information (thus
> > >technically this idea is lossy) it's extremely unimportant and when played
> > >with can dramatically change the resulting archive. The input and output
> > >to the tar files would be the same (lossless), the only difference would
> > >be the order in which files were stored.
>
> Reordering of the metadata might be possible if the location of file
> content is stored in pointers.

That's exactly what I mean.

> I don't know if you're going to use
> pointers or sentinels,

dar creates the entire file structure and saves it. The content layout is
derived from the file structure, which provides the usage order, and from
the file size, which is stored in the TOC too. An additional pointer would
need to be added to satisfy your desire for content ordering.
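
That extra pointer is also what would make free content ordering cheap: as
long as every TOC record carries an explicit offset, the content segments can
be written in whatever order you like while the TOC itself stays as it is.
Continuing the invented toc_entry layout from above (same includes):

/* Write the content segments in an arbitrary order (for instance grouped
 * so that similar data sits together), recording in each TOC record where
 * its bytes actually landed. The TOC order never changes, so nothing
 * about the original file list is lost. */
static void write_contents(FILE *out, struct toc_entry *toc,
                           const unsigned char *const *data,
                           const size_t *sizes,
                           const size_t *write_order, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t k = write_order[i];          /* member chosen for this slot */
        toc[k].content_offset = (uint64_t)ftell(out);
        toc[k].content_size   = sizes[k];
        fwrite(data[k], 1, sizes[k], out);
    }
    /* Afterwards seek back to the front and rewrite the TOC with the
     * final offsets. */
}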

Did you get my first mail with the sample attachment?

> although your non-interrupted stream of file content
> ideas seem to indicate that sentinels in the file would be undesirable.
>
>      Drew Daniels
>
christian fasshauer




