
Re: [idea]: Switch default compression from "xz" to "zstd" for .deb packages

On Sat, Sep 16, 2023 at 10:31:20AM +0530, Hideki Yamane wrote:
>  Today I want to propose changing the default compression format in .deb,
>  {data,control}.tar."xz" to ."zst".

>  According to https://www.speedtest.net/global-index, broadband bandwidth
>  in Nicaragua has grown almost 10x:
>  - 2012: 1.7Mbps
>  - 2023: 17.4Mbps
>  10x faster than before: it means that file size is not such a problem for us

That's broadband; a lot of folks have nothing but crappy 5G.

I just happen to have a package converted to multiple formats on disk
because I tested/benchmarked format 0.939 vs 2.0[1].  And:

	  -h	     bytes
tar	5.5G	5839735844
gz	897M	 939926960
xz	375M	 392874208
zst	774M	 811105258

For this particular package, zst produces a file over twice as large.  You
can pick a stronger compression level, but at that point we're just climbing
the tradeoff curve.
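That tradeoff curve is easy to see with stock compressors.  A sketch only:
zstd isn't in the Python stdlib, so zlib (gzip's algorithm) and lzma (the
algorithm behind xz) stand in, and the payload is synthetic rather than a
real .deb:

```python
# Sketch of the size-vs-effort tradeoff curve discussed above.
# zlib and lzma stand in for gzip/xz; the payload is synthetic.
import lzma
import zlib

data = b"some moderately repetitive payload " * 4096

for level in (1, 6, 9):
    print(f"zlib level {level}: {len(zlib.compress(data, level))} bytes")

# lzma is slower, but typically wins on size -- the same shape of
# tradeoff as picking a stronger zstd level vs staying with xz
print(f"lzma preset 6: {len(lzma.compress(data, preset=6))} bytes")
```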

> ## More CPUs
>  2012: ThinkPad L530 has Core i5-3320M (2 cores, 4 threads)
>  2023: ThinkPad L15 has Core i5-1335U (10 cores, 12 threads)
>  https://www.cpubenchmark.net/compare/817vs5294/Intel-i5-3320M-vs-Intel-i5-1335U
>   - i5-3320M: single 1614, multicore 2654
>   - i5-1335U: single 3650, multicore 18076 points.
>  And, xz cannot use multi-core CPUs for decompression, but zstd can.
>  It means that if we stay with xz, we only get 2x CPU power, but if we
>  change to zst, we can get more than 10x the CPU power of 2012.

As someone with a 64-way amd64 desktop, and a purchased-but-not-delivered
64-core riscv64 box on the way, I understand the sentiment -- but, what
about parallelizing by unpacking multiple packages at the same time instead?
That's safer and doesn't cost compression ratio[2].  I've prototyped this,
and even with current dpkg internals it shouldn't be hard to do (even if
dpkg runs keep switching between unpacking and configuring too often).
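A minimal sketch of that per-package parallelism, with a stub standing in
for the actual unpack step (dpkg exposes no such API; all names here are
illustrative):

```python
# Sketch: parallelize across packages rather than inside one stream.
# unpack_one() is a stand-in for the real extraction of data.tar.
from concurrent.futures import ThreadPoolExecutor

def unpack_one(pkg: str) -> str:
    # placeholder: real code would extract the package's payload here
    return f"unpacked {pkg}"

packages = ["libfoo1", "libbar2", "baz-common"]  # illustrative names

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order even though work runs concurrently
    results = list(pool.map(unpack_one, packages))

print(results)
```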

>  It reduces a lot of time for package installation.

There's a lot lot lot of other places in dpkg that could use a speedup, and
they don't come with such a tradeoff.  Especially fsync abuse: dpkg writes
all of its status every. single. step., fully. flushing. it. to. persistent.
storage. even. if. it's. a. dingy. SD. card.  It does the same for every file
it unpacks; to a semi-ignorant onlooker it seems as if it uses some sort of
range coder just so it can fsync between fractional bits.
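One common alternative, sketched here with made-up paths and record format
(not dpkg's actual layout): batch the changes for a whole transaction, then
do a single write-temp/fsync/rename, which is atomic on POSIX filesystems.

```python
# Sketch: one fsync per transaction instead of one per step.
# Path and contents are illustrative, not dpkg's real status format.
import os
import tempfile

def write_status_atomically(path: str, contents: str) -> None:
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(contents)     # the whole batch of accumulated changes
        f.flush()
        os.fsync(f.fileno())  # one flush for the entire batch
    os.replace(tmp, path)     # atomic: readers see old or new, never half

path = os.path.join(tempfile.mkdtemp(), "status")
write_status_atomically(path, "libfoo1 installed\n")
```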

Even though there's no good generic way to ensure consistency of the
extracted payload (POSIX lacks such an API, though you can use btrfs
snapshots), the dpkg state could still win a lot by no longer assuming the
limitations of ext2 apply to other filesystems.  On ext2 a crash may do
unbounded damage to the filesystem, so flat text files with fsyncs between
every operation improve recoverability -- but any filesystem newer than
that gives better guarantees.  There are so many techniques that would
avoid rewriting and flushing the full state at every step.
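An append-only journal is one such technique: each step appends a single
record instead of rewriting everything, and the current state is rebuilt by
replay.  The record format below is invented purely for illustration:

```python
# Sketch: append-only journal instead of rewriting the full status file.
# The "package status" record format here is made up for illustration.
def replay(journal: list[str]) -> dict[str, str]:
    state: dict[str, str] = {}
    for record in journal:
        pkg, status = record.split(" ", 1)
        state[pkg] = status  # later records win
    return state

journal = [
    "libfoo1 unpacked",
    "libfoo1 configured",
    "baz-common unpacked",
]
print(replay(journal))
```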

> ## More storage bandwidth
>  SSD + PCIe 3/4/5 is fast enough not to be a bottleneck for decompression now.

So wishing Optane NVDIMMs didn't get cancelled... :/

On the other hand, we could switch the compression for _some_ packages.
There's stuff that gets unpacked by buildds over and over.  Compilers and
library headers are not used much by end-users on dingy connections (and
we hackers tend to prioritize spending on computing devices compared to
regular people), so what about switching stuff that's 1. not in
build-essential but 2. in a set shared by many packages' Build-Depends?
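That selection rule could be sketched like this (all package names and the
popularity threshold are invented for illustration):

```python
# Sketch of the rule: not in build-essential, but in many Build-Depends.
from collections import Counter

build_essential = {"gcc", "make", "dpkg-dev"}  # illustrative subset
build_deps = {                                 # illustrative data
    "pkgA": {"gcc", "libfoo-dev"},
    "pkgB": {"libfoo-dev", "cmake"},
    "pkgC": {"libfoo-dev", "cmake"},
}

# count how many packages' Build-Depends each dependency appears in
counts = Counter(dep for deps in build_deps.values() for dep in deps)
candidates = {dep for dep, n in counts.items()
              if n >= 2 and dep not in build_essential}
print(sorted(candidates))
```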


[1]. https://lists.debian.org/debian-dpkg/2023/09/msg00014.html
[2]. Parallel compression, and especially decompression, is done by
     flushing and dropping old state at every block boundary.
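In code, the idea in [2] looks roughly like this (zlib standing in for
zstd's frame format; the block size is arbitrary):

```python
# Sketch of footnote [2]: compress as independent blocks (state dropped
# at every boundary), so decompression can run in parallel.
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 1 << 16  # arbitrary block size; real formats tune this

def compress_blocks(data: bytes) -> list[bytes]:
    # each block is a self-contained stream: no history shared across blocks
    return [zlib.compress(data[i:i + BLOCK])
            for i in range(0, len(data), BLOCK)]

def decompress_parallel(blocks: list[bytes]) -> bytes:
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(zlib.decompress, blocks))

payload = b"abcdefgh" * 50_000  # ~400 kB, spans several blocks
assert decompress_parallel(compress_blocks(payload)) == payload
```

The cost is exactly the footnote's point: dropping history at each boundary
sacrifices some compression ratio in exchange for parallelism.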
⣾⠁⢠⠒⠀⣿⡁ Bestest pickup line:
⢿⡄⠘⠷⠚⠋⠀ "Cutie, your name must be Suicide, cuz I think of you every day."
