
Re: Cross-directory hard links in Debian packages



On Fri, Nov 15, 2013 at 01:50:05PM +0000, Jonathan Dowland wrote:
> I'm not sure that making a general rule based on an edge-case is a
> good idea.  Publican is not very popular at all, it's quite likely
> that none of the 70 or so people who have installed it have done
> anything unusual with mounts around /usr.

publican is just an example. You can find more packages employing the
same technique at
http://lintian.debian.org/tags/package-contains-hardlink.html.

But we should look not only at the packages already doing this, but
also at the packages that are wasting precious mirror and disk space[1]:

binary package                  #files  #bytes

wims-extra-all			11057	44415092
mixxx-data			7302	8055125
widelands-data			6692	12953306
code-aster-test			3225	59938595
sofia-sip-doc			3146	6848743
mailman				1745	2007439
texlive-lang-cjk		1619	4986872
spikeproxy			1602	5934959
acl2-doc			1598	7209512
freefoam-dev-doc		1495	3145120
wims				1458	2125970
triplea				1340	8641063
libqt4-dev			1337	5003042
libboost1.54-doc		1240	4131392
libgrib-api-1.10.4		1210	1678922
lazarus-doc-1.0.10		1174	10734571
python-matplotlib-doc		1172	24691971
fonts-mathjax-extras		1136	141683
libboost1.53-doc		1097	3717938
dotlrn				1096	5046637
libboost1.49-doc		1091	3578000
gnat-4.4			1083	10643007
openclipart2-libreoffice	1046	2142208
sql-ledger			1041	9248930
esys-particle			1025	8243181
typo3-src-4.5			1019	1528729
texlive-fonts-extra		998	4687576
moodle				959	6392249
openbox-themes			926	200312
xfwm4-themes			890	412192
grass-dev-doc			832	1124116
phpbb3-l10n			825	623634
fillets-ng-data			818	2712929
tuxpaint-stamps-default		813	2824876
optgeo				793	2681882
libbcel-java-doc		760	17640174
publican			750	5283082
msp430mcu			737	14475576
freegish-data			691	1252457
collabtive			687	1419645
fp-docs-2.6.2			683	2111629
libmapi-dev			681	31188
libnb-platform13-java-doc	678	1349378
murrine-themes			656	255650
ctpp2-doc			642	699880
fvwm-crystal			634	800295
pacemaker-dev			628	1399352
libknopflerfish-osgi-java-doc	598	4134711
libreoffice-dmaths		588	905010
freefoam-user-doc		587	883850

The numbers above are the savings achievable by using links. A few of
those files cannot be hard linked, because the links would cross
commonly used file system boundaries. Still, the projected savings are
significant. Clearly, a
generic solution is desirable. If you are interested in details on the
savings of a particular package, visit
http://dedup.debian.net/compare/<package>/<package>. Roughly every 25th
file in the archive is duplicated within the same package. That's almost
1% of the uncompressed archive size.
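
To make "achievable savings" concrete, here is a minimal Python sketch
of how such numbers could be estimated for an unpacked package tree. It
is not the dedup.debian.net implementation: the function names and the
choice to count every copy after the first as saved are assumptions
made for illustration. It hashes with SHA-512, as the query in [1] does.

```python
import hashlib
import os
import stat
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-512 hex digest of a file's contents."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_savings(root):
    """Return (files, bytes) that linking duplicates under root could save.

    Assumption: for each group of identical files, one copy is kept and
    every further copy counts as saved.
    """
    sizes_by_digest = defaultdict(list)
    seen_inodes = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, devices, and so on
            inode = (st.st_dev, st.st_ino)
            if inode in seen_inodes:
                continue  # already a hard link to a file we counted
            seen_inodes.add(inode)
            sizes_by_digest[file_digest(path)].append(st.st_size)

    saved_files = 0
    saved_bytes = 0
    for sizes in sizes_by_digest.values():
        extra_copies = len(sizes) - 1
        saved_files += extra_copies
        saved_bytes += sizes[0] * extra_copies
    return saved_files, saved_bytes
```

Calling duplicate_savings() on the directory produced by unpacking a
binary package gives a pair comparable in spirit to the #files and
#bytes columns above, though the exact counting rules may differ.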

> Looking at publican a number of questions occur to me
> 
>  * why hardlink all of the contents of
>    /usr/share/doc/publican/Users_Guide/desktop/$LOCALE/Common_Content
>    together rather than symlink them to some common directory like
>    /usr/share/publican/Common_Content? Is it because there might be
>    additions or omissions across locales?

Because it is more work to do so. One of the big advantages of using
hard links is that you don't have to choose a "primary location". These
hard links are generated at package build time.
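
As an illustration of why no "primary location" has to be chosen, a
build-time step could look roughly like the Python sketch below. This
is not what publican's packaging actually runs; hardlink_duplicates and
the temporary file suffix are hypothetical names. Whichever copy the
directory walk meets first becomes the link target, and files on
different devices are never linked to each other.

```python
import hashlib
import os
import stat


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-512 hex digest of a file's contents."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace duplicate regular files under root with hard links.

    Returns the number of files that were replaced. Assumption: only
    contents are compared; a real tool would probably also compare
    mode, owner, and timestamps, since hard links share all of them.
    """
    first_seen = {}  # (device, digest) -> path of the first copy found
    linked = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # leave symlinks and special files alone
            # Keying on the device keeps links within one file system.
            key = (st.st_dev, file_digest(path))
            original = first_seen.setdefault(key, path)
            if original == path or os.path.samefile(original, path):
                continue  # first copy, or already linked to it
            tmp = path + ".hardlink-tmp"  # hypothetical temporary name
            os.link(original, tmp)
            os.replace(tmp, path)  # swap the link into place
            linked += 1
    return linked
```

All names in a group end up equal, so no file has to be singled out as
the one the others point to, which is the difference from symlinks.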

>    * Can/should that not be handled within the tool itself (implement
>      a multi-directory lookup process)

Again this is more work. It might be possible in the case of publican,
but if you look at the list above, you'll quickly notice that this
approach doesn't scale.

Is there any technical reason for rejecting the usage of hard links in
binary packages, other than the risk of crossing common file system
boundaries?

In any case, clarifying and documenting whether cross-directory hard
links are a tool to be used seems worthwhile to me:
 * Either they are to be avoided at all costs, in which case we have a
   handful of violations to fix,
 * or they are a tool that can be used to significantly shrink mirror
   and installation size with very little effort.

Helmut

[1] ssh delfin.debian.org sqlite3 /srv/dedup.debian.org/dedup.sqlite3
    '"SELECT package.name, sharing.files, sharing.size FROM package JOIN
    sharing JOIN function WHERE sharing.pid1 = package.id AND
    sharing.pid2 = package.id AND sharing.fid1 = function.id AND
    sharing.fid2 = function.id AND function.name = \"sha512\" ORDER BY
    sharing.files DESC LIMIT 50;"'

