
Re: dpkg: The Installed-Size estimate can be wrong by a factor of 8 or a difference of 100MB



Hi,

I'm reviving this old bug because it recently came up again in the context of
ReproducibleBuilds.

On Sat, 26 Nov 2011 12:06:42 +0100 Helmut Grohne <helmut@subdivi.de> wrote:
> The actual problem
> ~~~~~~~~~~~~~~~~~~
> Problems with Installed-Size are not exactly new as discussion in
> http://bugs.debian.org/534408 (unit for Installed-Size) and
> http://bugs.debian.org/630533 (usage of du --apparent-size) have shown.
> So what is different this time? Installing the very same package on a
> btrfs yields a size that is much closer to the listed Installed-Size. (I
> don't have any numbers on this.) So whatever dpkg puts into this field,
> it *will* be wrong somewhere. The policy already mentions that this
> estimate cannot be accurate everywhere, but in fact it will be wrong by
> a factor of at least 2.5 (=sqrt(8)) or a difference of at least 50MB
> (=100MB/2) somewhere. Any attempt to change the computation of this
> value thus cannot fix this bug.
> 
> Discussion
> ~~~~~~~~~~
> In the example of libjs-mathjax the reason for the huge difference is
> the inclusion of a large number of very small files. Some filesystems
> allocate a block for each of these files and others are able to store
> multiple files in a block. A simple approach could be to include an
> additional field ("Installed-Files"?) that returns the number of files
> in the package. A second estimate for the Installed-Size would then be
> given by the number of files times the block size. The maximum of both
> estimates could be used. It would solve the immediate symptoms with
> libjs-mathjax. It is not without problems though. For instance I
> did not explain what block size to use. An administrator may have
> different file systems set up for / and /usr. Also the question remains
> whether this feature is worth the associated effort.
> 
> To get discussion going I pull in debian-policy@l.d.o.

We did some brainstorming in #debian-reproducible over the past days. I'll try
to summarize the discussion and Helmut can chip in if I missed something.

The fundamental problem is that the target file system on which the binary
package gets installed can influence, in many ways, the space that installing
the package requires. This includes but is not limited to:

 - support for sparse files
 - inlining data inside the inode
 - compression
 - block-level or file-level deduplication

Additionally, disk usage can even grow when files are removed due to:

 - snapshots
 - overlay file systems

Helmut argues that an additional field like Installed-Files can improve the
approximation for file systems with different block sizes, and for file systems
that differ in whether they can store multiple small files in a single block.
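
As a rough sketch of that idea (not a proposal for dpkg; the function name and
the numbers in the usage note are made up for illustration), the second
estimate is the file count times an assumed block size, and the larger of the
two estimates is reported:

```shell
# Hypothetical helper: larger of the apparent size and (file count x block size).
# Arguments: apparent size in KiB, number of files, assumed block size in bytes.
installed_size_estimate() {
	apparent_kib=$1
	nfiles=$2
	block_bytes=$3
	# Second estimate: every file occupies at least one full block (rounded up to KiB).
	by_blocks_kib=$(( (nfiles * block_bytes + 1023) / 1024 ))
	if [ "$by_blocks_kib" -gt "$apparent_kib" ]; then
		echo "$by_blocks_kib"
	else
		echo "$apparent_kib"
	fi
}
```

For example, `installed_size_estimate 20000 30000 4096` prints 120000: many
small files make the block-based estimate dominate. Which block size to assume
remains the open question Helmut raised.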

This solution could be extended by grouping files by size into exponentially
growing intervals (for example 4^(n-1) <= size < 4^n) and storing, for each
interval, the number of files and the cumulative number of bytes occupied by
these files.
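
To make the grouping concrete, here is a minimal awk sketch. The function name
and the output format (interval index n, file count, cumulative bytes) are my
own choices, and putting empty files into an extra interval 0 is an assumption,
since the formula above does not cover size 0:

```shell
# Hypothetical helper: read one file size in bytes per line on stdin and
# print "n count bytes" for every non-empty interval 4^(n-1) <= size < 4^n.
# Size 0 is put into interval 0 (an assumption, see the text above).
size_buckets() {
	awk '
	{
		n = 0; upper = 1                 # invariant: upper == 4^n
		while ($1 >= upper) { n++; upper *= 4 }
		count[n]++; bytes[n] += $1
		if (n > max) max = n
	}
	END {
		for (n = 0; n <= max; n++)
			if (count[n]) print n, count[n], bytes[n]
	}'
}
```

With GNU find, the sizes could be fed in with something like
`find mathjax-2.4 -type f -printf '%s\n' | size_buckets`.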

But this still cannot account for sparse files, compression or deduplication.

When looking for a solution, it is also worth asking what purpose the
Installed-Size field is supposed to serve. Its primary purpose is probably to
give apt a clue of whether or not there is enough free space to install a
certain package.

Helmut notes that other uses of the Installed-Size field are made by
debian-reference, popularity-contest, deborphan and cupt.

I would argue that the only way to reliably solve this problem is either:

 1) an over-approximation of the actual value, which will be larger than the
    actual file system usage on any common file system

 2) a way for apt or dpkg to ask the file system whether there is enough space
    to store a certain file/directory structure. Few file systems (if any)
    offer this, though.

I think that an over-approximation would be the right way to go because it is
better to wrongly warn the user that a binary package might not be installable
due to insufficient remaining disk space than to start installing a package
without sufficient remaining disk space and only fail once there actually is no
more space.
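
To illustrate how such an over-approximation could be used (this is only a
sketch, not how apt does its check; the function name is made up, and I assume
the estimate is given in KiB), one can compare it against the available space
that df reports for the target directory:

```shell
# Hypothetical check: succeed if the file system holding DIR reports at least
# NEEDED_KIB KiB available. "df -Pk" selects the POSIX output format in
# 1 KiB units, where the fourth column of the second line is "Available".
fits_on() {
	dir=$1
	needed_kib=$2
	avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
	[ "$avail_kib" -ge "$needed_kib" ]
}
```

A caller could then write `fits_on /usr 120000 || echo "not enough space"`.
With an over-approximation a failed check may be a false alarm, but a passed
check is much less likely to end in a half-installed package.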

The addition of the `--apparent-size` argument to the du call in dpkg in
response to bug #630533 made the value of the Installed-Size field too small in
some situations, as can be seen in this bug report. The report in #630533
argues that --apparent-size should be used precisely because there are file
systems that can store many small files more efficiently. Because of my
argument in the previous paragraph, I'd argue the opposite. The change was then
applied, with guillem arguing that --apparent-size should be used for
consistency between package rebuilds.

But the --apparent-size argument is not sufficient to provide this consistency.
Running `du -k -s --apparent-size` (the command currently used by dpkg) on an
unpacked mathjax source reports different values on ext4 and on btrfs. This is
detrimental to the goals of the ReproducibleBuilds effort.

I thus propose that dpkg implement something like the following, which (unless
I missed a case when testing) gives an over-approximation for Installed-Size
and at the same time is reproducible across different file systems:

	( find mathjax-2.4 -type f -print0 \
		| du --files0-from=- -b; \
		find mathjax-2.4 \! -type f -printf "1\n" ) \
	| awk '{total += int(($1 + 4095) / 4096) * 4096} END {print total}'

I'm not proposing this code to be part of dpkg but I'm posting it because code
is precise but words are not. The same can probably easily be implemented in
perl.

The above snippet gets the number of bytes of each regular file and treats
every non-regular file entry as being 1 byte in size. The awk call then rounds
each of these values up to a multiple of an (arbitrarily picked) block size of
4096 bytes and sums the results.
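
To check the rounding in isolation, here is a small sketch with the block size
as a parameter (the function name is hypothetical and the sizes in the usage
note are made up):

```shell
# Hypothetical helper: read one size in bytes per line on stdin and print the
# sum of all sizes, each rounded up to a multiple of the block size given as
# the first argument (default: 4096 bytes).
rounded_total() {
	awk -v bs="${1:-4096}" '
		{ total += int(($1 + bs - 1) / bs) * bs }
		END { print total + 0 }'
}
```

For instance, `printf '1\n4096\n4097\n1\n' | rounded_total` prints 20480, that
is 4096 + 4096 + 8192 + 4096: a 1-byte entry counts as a full block, a file of
exactly one block stays at one block, and one byte more costs a second block.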

I think that finding the right solution to this problem requires defining what
the purpose of the Installed-Size field is. If it is to prevent package
installations on systems where there is not enough space, then I think an
over-approximation is the right way to go. More complicated measures will still
not be able to give a good approximation, given the feature richness of
today's file systems.

What do you think?

Thanks!

cheers, josch

