
Re: Problem making large .deb files



Hi!

On Thu, 2016-05-19 at 16:55:29 +0100, Ian Jackson wrote:
> Guillem Jover writes ("Re: Problem making large .deb files"):
> > Yes, and a known one:
> >   <https://wiki.debian.org/Teams/Dpkg/TimeTravelFixes>
> > also mentioned in the format spec:
> >   <http://man7.org/linux/man-pages/man5/deb.5.html>
> 
> A friend reports:
> 
> 16:32 <ewx> largest file in my debian mirror
> 16:33 <ewx> 1037152 ./pool/main/n/ns3/ns3-doc_3.17+dfsg-1_all.deb
> 16:34 <ewx> so debian is only a factor of 10 away from the limit
>             Diziet refers to

We've got way bigger binaries than that:

,---
$ egrep -h '^(Package|Version|Size):' /var/lib/apt/lists/*_Packages | \
  egrep -h -B2 '^Size: [0-9]{10,}$'
Package: mongodb-clients-dbgsym
Version: 1:2.6.11-1
Size: 1098615276
--
Package: mongodb-clients-dbgsym
Version: 1:2.6.11-1
Size: 1058537938
--
Package: insighttoolkit4-python-dbgsym
Version: 4.9.1-1
Size: 1023222166
--
Package: mame-dbgsym
Version: 0.173-6
Size: 1081054402
`---
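
For anyone who would rather script this check, here is a minimal sketch
of the same filter in Python. It assumes the input is the text of an
uncompressed Packages index; the function name large_packages is made
up for illustration. A Size with ten or more digits is at least 1 E9
bytes, i.e. within a factor of 10 of what the 10-digit ar size field
can hold.

```python
import re


def large_packages(packages_text, min_digits=10):
    """Return (package, version, size) for stanzas with a large Size field.

    Hypothetical helper: packages_text is the content of a Packages index,
    with stanzas separated by blank lines.
    """
    results = []
    for stanza in re.split(r"\n\s*\n", packages_text.strip()):
        fields = {}
        for line in stanza.splitlines():
            # Skip continuation lines (leading whitespace) and malformed lines.
            if line[:1].isspace() or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key] = value.strip()
        size = fields.get("Size", "")
        if size.isdigit() and len(size) >= min_digits:
            results.append(
                (fields.get("Package"), fields.get("Version"), int(size))
            )
    return results
```

Only the first colon on a line separates the field name from its value,
so versions with an epoch such as 1:2.6.11-1 come through intact.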

> This suggests that we should start planning for a transition to
> increase this limit, right away.
> 
> The wiki suggests three solutions:
> 
>   * Use some other container format. But this would break detection as
>     a non-deb format. Painful.
> 
>   * Bump major version to 3, and split the large ar members into
>     different tar slices. For example: control.tar.xz, data-1.tar.xz,
>     data-2.tar.xz. Complex.
> 
>   * Bump major version to 3, and use an ar container, concatenated
>     with something else. Non-standard.
> 
> I would like to suggest consideration of a fourth:
> 
>   * Split long tarballs into separate byte streams which
>     are to be concatenated at read time.  For example:
>        data.tar.xz+aa
>        data.tar.xz+ab
>        data.tar.xz+ac
>        data.tar.xz+ad
>     Do this only for .debs where it's needed.  There is no need
>     to bump the major version, as this extension will already be
>     rejected by existing dpkg's.

To me this looks just like a variation on the second option listed
above. But it's true that the major version does not need to be changed.

(I'll revise the wiki page.)
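
To make the slicing idea concrete, here is a rough sketch in Python of
splitting one compressed tarball into suffixed slices and joining them
again at read time. It is not how dpkg-deb does or would do it; the
function names and the two-letter suffix width are assumptions picked
to match the data.tar.xz+aa example.

```python
import itertools
import string


def suffixes(width=2, alphabet=string.ascii_lowercase):
    """Generate fixed-width suffixes in sorted order: aa, ab, ..., zz."""
    for combo in itertools.product(alphabet, repeat=width):
        yield "".join(combo)


def split_member(name, payload, chunk_size, width=2):
    """Split payload into ordered (member_name, data) slices.

    Slices are named name+aa, name+ab, and so on (hypothetical scheme).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    names = suffixes(width)
    members = []
    for start in range(0, len(payload), chunk_size):
        try:
            suffix = next(names)
        except StopIteration:
            raise ValueError("payload needs more slices than the suffix allows")
        members.append((f"{name}+{suffix}", payload[start:start + chunk_size]))
    return members


def join_members(members):
    """Concatenate slices in name order to recover the original stream."""
    ordered = sorted(members, key=lambda member: member[0])
    return b"".join(data for _, data in ordered)
```

Because the suffixes have a fixed width, plain lexicographic ordering of
the member names gives the right concatenation order, which is also why
cat and split can do the job by hand.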

> This is probably not too hard to retrofit into the dpkg-deb ar parser.
> Manipulating (and creating) such archives by hand can be done with cat
> and split.

I guess. The biggest problem is that this requires considerable
changes to anything that currently handles .deb archives directly. Even
trivial changes like adding support for uncompressed data.tar or
control.tar.xz are not yet widely available. :/

  <https://wiki.debian.org/Teams/Dpkg/DebSupport>

So if we undertake this kind of format revision, it had better be
worth it, and we should choose the format wisely. I'd prefer that
we use something which is also easy to implement and easy to
reason about.

> Doing this (with the longest possible suffix) for data.tar.xz will
> cope with up to data.tar.xz+zzzz, improving the limit by a factor of
> 26^4 for a maximum file length of 4569759999543024 bytes (4.6 E15
> bytes or 4156Tb).

We could also use 0-9a-z.
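
As a quick check of the arithmetic, here is a small Python sketch. It
assumes, as the quoted figures do, that a member name may fill the whole
16-byte ar name field, that one byte goes to the '+', and that each
slice can hold up to 9999999999 bytes (the largest 10-digit decimal
size). The function name is made up.

```python
AR_NAME_WIDTH = 16                # bytes in the ar header's name field
AR_MAX_MEMBER_SIZE = 10**10 - 1   # largest value of a 10-digit size field


def max_total_size(base_name, alphabet_size):
    """Upper bound on total bytes across all slices of base_name.

    Assumes fixed-width suffixes that use every byte left in the name
    field after the base name and the '+' separator.
    """
    suffix_len = AR_NAME_WIDTH - len(base_name) - 1
    if suffix_len < 1:
        # No room for a suffix: only a single unsliced member is possible.
        return AR_MAX_MEMBER_SIZE
    return alphabet_size ** suffix_len * AR_MAX_MEMBER_SIZE
```

With a-z this gives 4569759999543024 bytes for data.tar.xz and
259999999974 bytes for control.tar.gz, matching the figures quoted
above. With 0-9a-z the factor for data.tar.xz grows from 26^4 to 36^4.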

> Because the control.tar.gz filename is longer, it provides only an
> additional factor of 26, up to control.tar.gz+z (259 E9 bytes or
> 242 Gb).  But is anyone anywhere near the limit for control archives ?
> It seems likely that there are other problems with gigantic control
> archives.

I don't think we should bother with large control members.


Yesterday I had an inspiration for a crazy proposal related to the
PAX stuff. :) We could switch from an ar container to an uncompressed
PAX container, which has no such limits. To preserve backward
compatibility, at least when it comes to detecting that this is a .deb
format, we could use the first PAX header's name field to store the ar
magic and the first ar header and contents. This works because the PAX
header's filename is supposed to be ignored anyway by archivers
supporting the PAX format, although it might be used as the extracted
filename by ones that do not.

The nice thing is that the ustar header has a name field which is 100
chars long, and the ar magic + entire header is 68 chars long, which
both start at offset 0, and both are ASCII. Of course this might actually
confuse file detectors and archiving tools quite a bit, but seems like an
interesting hack. :)
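
To illustrate the size argument only, here is a sketch in Python that
builds the 8-byte ar magic plus a 60-byte ar member header and pads the
result out to the 100-byte ustar name field. It does not produce a valid
PAX archive (the other ustar fields and the checksum are left out), and
the helper names and default field values are assumptions.

```python
AR_MAGIC = b"!<arch>\n"     # 8-byte global ar magic
USTAR_NAME_FIELD = 100      # width of the ustar header's name field


def ar_header(name, size, mtime=0, uid=0, gid=0, mode=0o100644):
    """Build a 60-byte ar member header with space-padded ASCII fields."""
    fields = f"{name:<16}{mtime:<12}{uid:<6}{gid:<6}{mode:<8o}{size:<10}"
    header = fields.encode("ascii") + b"`\n"
    if len(header) != 60:
        raise ValueError("a field overflowed its fixed width")
    return header


def ustar_name_field(first_member="debian-binary", size=4):
    """Place the ar magic and first ar header at offset 0 of a name field."""
    prefix = AR_MAGIC + ar_header(first_member, size)
    if len(prefix) > USTAR_NAME_FIELD:
        raise ValueError("ar prefix does not fit in the ustar name field")
    # NUL-pad the remainder, as ustar does for short names.
    return prefix.ljust(USTAR_NAME_FIELD, b"\0")
```

The prefix comes to 68 bytes, which leaves 32 bytes spare in the name
field, so the 4-byte contents of debian-binary would fit as well.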

(This reminded me of the multiple executable formats in the same file
hacks. :)

Thanks,
Guillem

