Re: Dropping awk?
Hi Simon,
On Thu, Apr 17, 2025 at 08:23:18PM +0200, Simon Josefsson wrote:
> I noticed that Fedora 42 was released and their docker images lack a
> 'awk' tool. Debian trixie images ship with 'mawk' pre-installed right
> now. While I'm not convinced the removal game is necessarily a good
> one, I can see that it does have some advantages. Is it possible to
> drop 'mawk' from the set of default tools in trixie? If not, what are
> the blockers? What is the method to find out what the blockers are?
Shrinking essential/minbase/container images is generally a worthwhile
goal, as you saw from the existing replies. What is not as useful is
asking "can we drop XXX?" with little context, because (as others
indicated) answering it is a ton of work. The way to advance these
matters is to do the research.
One of the first aspects is what "dropping" means. Typical answers:
 * Removing "Essential: yes"
   * e2fsprogs, mount and a few more used to be essential.
 * Removing dependencies
   * apt (not essential, but close) used to depend on adduser.
 * Reducing the Priority value
   * We've been debating this for ifupdown.
 * Removing dependencies within the build-essential set
   * I recently proposed removing libcrypt-dev from build-essential.
In this case, the immediate meaning must be getting it out of essential.
However, that alone does not remove it from container images; doing so
incurs further work and also raises the user impact (see Sean's mail).
Next, there is a question of what we gain. Essential weighs in at
roughly 100MB (depending on how you count it). So regarding awk, we're
talking about a size reduction of about 0.3%. For comparison, being able
to substitute toybox for coreutils has the potential to reduce size by
more than 10%. Removing bash (keeping dash) would save around 7%. Whilst
those other gains are significantly higher, so are their impact and
effort. Picking a sensible candidate is the difficult part here.
This leads us to analyzing the effort and impact. Being in the essential
set means that dependencies are not spelled out. So the first step is
locating those dependencies. As we will likely not be able to audit
Debian's source code for awk uses in a reasonable amount of time,
empirical methods are likely needed.
 * Rebuild the archive with awk dropped and see what fails
 * Consider using reproducible builds to additionally see what packages
   change as a result of dropping awk (for those that happen to be
   reproducible)
 * Search for awk usage in maintainer scripts
   https://binarycontrol.debian.net/?q=awk&path=unstable%2F.*%2Fp
   Note that postrm scripts cannot express dependencies and need to be
   rewritten without awk. It also means that if you assume that people
   always purge their packages, we may remove awk in forky+1 at best,
   provided we manage to fix all postrm scripts in forky.
 * Download all Debian binary packages and search for awk uses in the
   installed files using regular expressions.
 * Run autopkgtests with awk removed
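To give an idea of what the regular expression search could look like,
here is a minimal Python sketch. The names find_awk_uses and scan_tree,
the regular expression and the assumption that maintainer scripts have
already been unpacked into a directory tree are all mine and purely
illustrative. The pattern knows nothing about shell quoting, so expect
both false positives and misses; it is a starting point for triage, not
an audit.

```python
import re
from pathlib import Path

# Matches awk or a common variant used as a command word, optionally
# with an absolute path. Deliberately crude (assumption: good enough for
# a first pass over shell scripts).
AWK_RE = re.compile(
    r"(?<![\w./-])(?:/usr/bin/|/bin/)?(?:[mg]awk|original-awk|awk)(?![\w-])"
)

# File names of maintainer scripts to look at.
SCRIPT_NAMES = {"preinst", "postinst", "prerm", "postrm", "config"}


def find_awk_uses(script_text):
    """Return (line_number, line) pairs that appear to invoke awk."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip comments, but keep shebang lines such as "#!/usr/bin/awk -f".
        if stripped.startswith("#") and not stripped.startswith("#!"):
            continue
        if AWK_RE.search(stripped):
            hits.append((number, stripped))
    return hits


def scan_tree(root):
    """Map each maintainer script below root to its suspicious lines."""
    results = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.name in SCRIPT_NAMES:
            hits = find_awk_uses(path.read_text(errors="replace"))
            if hits:
                results[str(path)] = hits
    return results
```

Restricting SCRIPT_NAMES to {"postrm"} narrows the output to the scripts
that have to be rewritten rather than given a dependency.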
Doing this is a ton of work. Doing that work and presenting the results
is what makes "can we drop awk?" a useful question as it answers the
cost part.
This is not meant to discourage you. Quite to the contrary. Reducing
implicit software dependencies has lots of other benefits such as easing
architecture bootstrapping and a smaller trusted computing base. It is
not something you can do in a spare evening, though.
For instance, I'd like to propose making coreutils substitutable in
essential like awk is substitutable. However, that question is not
presently "useful" in the sense that it lacks a sound implementation.
I pondered this with Jochen and Johannes back in Würzburg, and now
Julian has picked up the question and arrived at a promising prototype
based on feedback from Guillem. I hope that we will be discussing
coreutils soon, but that discussion will be so much more useful when it
comes with a prototype and an impact analysis.
Helmut