
Re: dpkg: The Installed-Size estimate can be wrong by a factor of 8 or a difference of 100MB



[ Reincluding dpkg-bugs. :) ]

Hi!

On Wed, 2015-01-07 at 12:22:47 +0100, Johannes Schauer wrote:
> On Sat, 26 Nov 2011 12:06:42 +0100 Helmut Grohne <helmut@subdivi.de> wrote:
> > Discussion
> > ~~~~~~~~~~
> > In the example of libjs-mathjax the reason for the huge difference is
> > the inclusion of a large number of very small files. Some filesystems
> > allocate a block for each of these files and others are able to store
> > multiple files in a block. A simple approach could be to include an
> > additional field ("Installed-Files"?) that returns the number of files
> > in the package. A second estimate for the Installed-Size would then be
> > given by the number of files times the block size. The maximum of both
> > estimates could be used. It would solve the immediate symptoms with
> > libjs-mathjax. It is not without problems though. For instance I
> > did not explain what block size to use. An administrator may have
> > different file systems set up for / and /usr. Also the question remains
> > whether this feature is worth the associated effort.
> > 
> > To get discussion going I pull in debian-policy@l.d.o.
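
(For reference, the estimate described above boils down to something
like the following sketch, with pkgdir standing for the unpacked tree
and the 4KiB block size being exactly the open question:)

    # All values in KiB; take the maximum of the apparent size and
    # number-of-files * assumed block size.
    apparent=$(du -k -s --apparent-size pkgdir | cut -f1)
    nfiles=$(find pkgdir -type f | wc -l)
    echo $(( apparent > nfiles * 4 ? apparent : nfiles * 4 ))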
> 
> we did some brainstorming in #debian-reproducible over the past days. I'll try
> to summarize the discussion and Helmut can chip in if I missed something.
> 
> The fundamental problem is that there are many ways in which the target file
> system on which the binary package gets installed can influence the size that
> installing the package requires. This includes but is not limited to:
> 
>  - support for sparse files
>  - inlining data inside the inode
>  - compression
>  - block-level or file-level deduplication
   - block and inode size

Consider for example filesystems with 1KiB blocks or smaller, or with
32KiB blocks or bigger.
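
To put some made-up numbers on it (a hypothetical package with 10000
files of ~200 bytes each):

    apparent size:      10000 * 200 B  = ~1.9 MiB
    with 1KiB blocks:   10000 * 1 KiB  = ~9.8 MiB
    with 32KiB blocks:  10000 * 32 KiB = ~312 MiB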

> Additionally, disk usage can even grow when files are removed due to:
> 
>  - snapshots
>  - overlay file system

> Helmut argues that an additional field like Installed-Files can improve the
> approximation for file systems with different block sizes, or for those that
> can store multiple small files in a single block.
> 
> This solution could be extended by grouping files into exponentially growing
> size intervals (like 4^(n-1) <= size < 4^n) and then storing the number of
> files and the cumulative number of bytes occupied by the files in each of
> these sets.

This seems to me like it would bloat both the control file and the
Packages files.
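
To give an idea of the amount of data involved, per package that would
mean shipping roughly what this sketch computes (just an illustration,
with pkgdir standing for the unpacked tree):

    # Bucket n holds files with 4^(n-1) <= size < 4^n (bucket 0: empty files).
    find pkgdir -type f -printf '%s\n' | awk '
        { n = 0; b = 1
          while ($1 >= b) { b *= 4; n++ }
          count[n]++; bytes[n] += $1 }
        END { for (n in count)
                printf "bucket %d: %d files, %d bytes\n", n, count[n], bytes[n] }'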

> But this still cannot account for sparse files, compression or deduplication.

As filesystems improve there will be an increasing delta between the
actual size of files and the space they take on disk. Trying to find a
portable solution to this seems like a lost cause to me, or one that
would lead to much code for not much gain.

> It is also worth asking what functionality the Installed-Size field is supposed
> to have when looking for a solution. Its primary purpose is probably to give
> apt a clue of whether or not there is enough free space to install a certain
> package.

Personally, I've always taken it as a small hint of the approximate size
of the package, but the most interesting use case, which is always
accurate, is spotting size differences between the previous and the next
package version, for example from the aptitude TUI.

Take into account that other disk usage is usually not accounted for,
like files generated at run-time, caches, logs and similar. There's the
Extra-Size substvar just for that, but I don't think any package is
actually making use of it (at least according to codesearch.d.o), so…

Also, apt (and cupt) only use that information to print the size deltas;
apt only actually checks for available disk space for the downloaded
data, which is the only thing that actually makes sense to do.
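
(Roughly speaking, that check amounts to comparing the free space in the
download directory against the sum of the package Size fields, e.g.:)

    # /var/cache/apt/archives is apt's default download location.
    df -B1 --output=avail /var/cache/apt/archives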

> Helmut notes that other uses of the Installed-Size field are made by
> debian-reference, popularity-contest, deborphan and cupt.
> 
> I would argue that the only way to reliably solve this problem is either by:

>  1) an overapproximation of the actual value, which will be larger than the
>     actual file system usage on any common file system

What's a common filesystem? To me this means making the provided
information even worse.

One of the reasons I was convinced to switch to using --apparent-size
was that the information is then more correct relative to the source.
But trying to approximate the destination system is never going to be
satisfactory.
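
For reference, these are the two measurements at build time (say on a
debian/tmp staging tree); the first is what dpkg uses now, the second
depends entirely on the build filesystem:

    du -k -s --apparent-size debian/tmp    # content size
    du -k -s debian/tmp                    # allocated blocks on the build fs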

>  2) a way for apt or dpkg to ask the file system whether there is enough space
>     to store a certain file/directory structure. Few file systems (if any)
>     offer this, though.

As mentioned above, I think this is too much work for not much gain (if
any at all).

> I think that an overapproximation would be the right way to go, because it is
> better to wrongly warn the user that a binary package might not be installable
> due to insufficient remaining disk space, than to install a package without
> sufficient remaining disk space and only fail once there actually is no more
> space.

Given that (at least) apt and cupt are not actually comparing the
available disk space with the accumulated Installed-Size of the packages,
there will be no warning anyway, and to me just making dpkg fail
gracefully on ENOSPC is the best option; anything else will just be
wrong somewhere. It used to be the case that dpkg misbehaved on
ENOSPC, but that should have been fixed a long time ago (AFAIR during
the dpkg 1.14.x cycle).

In any case, I'm open to being convinced by compelling arguments to
use some other approximation, but I've not yet seen any that would
do that. Hmm, does anyone know what is done in the rpm world, probably
by yum? (Otherwise I might take a look.)

> But the --apparent-size argument is not sufficient to provide this consistency.
> Running `du -k -s --apparent-size` (the command currently used by dpkg) on an
> unpacked mathjax source on ext4 and btrfs file systems will report different
> values for them. This is detrimental to the goals of the ReproducibleBuilds
> efforts.

This is then a different issue from the one reported, and one that
should be fixed. ISTR that at some point I noticed that at least
directory entries were a problem due to their varying size across
different filesystems, but fixing that also has the problem of needing
to use an arbitrary size.
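
A quick way to see that effect (the directory's own apparent size will
differ between, say, ext4 and btrfs; this uses bash brace expansion):

    mkdir /tmp/direnttest
    touch /tmp/direnttest/file{0001..2000}
    du -k -s --apparent-size /tmp/direnttest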

> I thus propose that dpkg implements something like the following functionality,
> which (if I didn't forget to test something) will give an overapproximation for
> Installed-Size and at the same time will be reproducible across different file
> systems:
> 
> 	( find mathjax-2.4 -type f -print0 \
> 		| du --files0-from=- -b; \
> 		find mathjax-2.4 \! -type f -printf "1\n" ) \
> 	| awk '{total += (int($1 / 4096) + 1) * 4096} END {print total}'

Symlink sizes are also accurate; besides directories, the rest just
store their information in the inode. Directories are the trickiest, as
they are also a resource shared between packages, and they change size
depending on their contents.

So I'd say: for nodes with content (regular files and symlinks), use
their apparent size; for containers (dirs) and metadata-only nodes, use
either 0 or 1KiB as the minimal unit size. That should give a consistent
size regardless of the build filesystem.
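
Something along these lines (just a sketch of the idea, using a
hypothetical debian/tmp staging tree and 1KiB per non-content inode):

    # Apparent size for regular files and symlinks, a flat 1KiB for
    # everything else (dirs, fifos, devices); result rounded up to KiB.
    ( find debian/tmp \( -type f -o -type l \) -printf '%s\n'; \
      find debian/tmp \! \( -type f -o -type l \) -printf '1024\n' ) \
        | awk '{ total += $1 } END { printf "%d\n", (total + 1023) / 1024 }'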

> I think that finding the right solution to this problem requires defining what
> the purpose of the Installed-Size field is. If it is to prevent package
> installations on systems where there is not enough space, then I think an
> overapproximation is the right way to go. More complicated measures will still
> not be able to give a good approximation, given the feature richness of
> today's file systems.

I think preventing package installation through Installed-Size would
be a bad idea and quite annoying, because it could disallow installations
that would actually succeed, which would be even more confusing.

Thanks,
Guillem

