Re: request for review: lbzip2
On Mon, 5 Oct 2009, Justin B Rye wrote:
Meanwhile the archives already have pbzip2, "parallel bzip2
implementation". As far as I can see lbzip2 is just another
pthreads-using bzip2 indistinguishable from pbzip2.
This is factually wrong. If this was the case, I would have never asked
for a sponsor, and lbzip2 wouldn't have gotten past my sponsor. See [0].
I did read the "vanity packages" paragraph in the debian-mentors FAQ.
There's nothing in either package description that would help me decide
which to install, except that pbzip2 gives me the hint that there's no
point having either on my old uniprocessor desktop.
The long description of lbzip2 details the exact performance gap left by
pbzip2 that lbzip2 covers.
There's a use for lbzip2 on single-core machines too, because it contains
internal buffering for single-worker modes too. Try to tar+bz2 a large
source tree, like the kernel tree, and watch the processor load on some
desktop applet closely. bzip2 reads (per default) in blocks of around
900K, then goes to work for a long time and doesn't read, doesn't write
during that period. Then it emits the compressed data.
Since the pipe buffer size is 4K under Linux (most of the time, I guess)
and unchangeable by way of "ulimit -p", this leads to tar (IO) effectively
excluding bzip2 (CPU) and vice versa. While bzip2 is working, tar quickly
fills the 4K pipe buffer and blocks on writing. When bzip2 finally awakes,
it suddenly wants to slurp 900K of data, which tar is unable to produce
immediately, since it was blocked on writing, and it doesn't read ahead.
Thus bzip2 lollops in an idle read loop while tar hunts together the next
900K from blocks scattered around the disk. The processor load will
resemble a square wave.
Thus the full duration is
tar time + bzip2 time
If you can increase the pipe buffer size or insert a buffering app between
tar and bzip2, then you can overlap IO with CPU, and the full duration
will be
max {tar time, bzip2 time}
For your uniprocessor machine, lbzip2 would do just that. I use lbzip2 on
a single-core cygwin installation when the need arises.
The ChangeLog entry for lbzip2-0.06, dated 16-Sep-2008, reads, in
part,
"When decompressing with a single worker thread, lbzip2 was previously 45%
slower than standard bzip2. The new, dedicated single-worker decompressor
is only 3% slower, and provides input and output buffering, which is
useful in pipelines and on network file systems. Hence using lbzip2 incurs
virtually no performance penalty over bzip2 even on a single-core
machine."
(It only talks about the single-worker *de*compressor because the
compressor inherently works okay on single core.)
Notice how your (perfectly valid) remark:
There's nothing in either package description that would help me decide
which to install
could be fixed by copying *more* technical information from the
documentation into the long package description, while Ben advises exactly
against that.
Of course, there's a root cause for this: the bz2 file format was never
designed for multi-threaded usage, the bzip2 program wasn't even designed
as a library originally, see [1]. Thus the multi-threaded implementation
landscape is fragmented (there are also clustered, multi-node versions),
and users have a very hard time choosing. I really recommend reading [0]
for comparison, as well as [2]. I was very careful to include the raison
d'etre of lbzip2 into the long description.
If I manage to create a patch for tar to check for / use lbzip2/pbzip2,
with the help of the GNU Tar maintainer, as Paul Wise suggested, maybe
that would ease this burden.
Seconded. The users installing the binary don't care *how* it
works; they may never have heard of POSIX threads. Focus more on
what the program is useful for - it's a compression tool, compatible
with normal bzip2, but designed to take advantage of the features of
multi-core CPUs.
Thank you for this good idea, now I'm conviced the package desc needs a
better introduction.
Is there any hope of the lbzip2 and pbzip2 projects joining forces?
Or at least synchronising their efforts via mutexes?
I don't think so. They are different multithreaded applications that
happen to use the bzip2 library.
Oh; going to the homepage (which I notice redirects me to
http://lacos.hu),
Yes, I moved my page from an ad-financed free public provider to my own
domain. I figure I still need the old site for a link, for a while.
Dangling pointers lead to undefined behavior.
I see it's described there as "a multi-threaded
bzip2/bunzip2 filter that doesn't depend on the lseek() system call
and so isn't restricted to regular files." It had never occurred to
me that ordinary bzip couldn't compress block devices (etc); if
that's an important difference between lbzip2 and other
implementations it should probably be emphasised.
It is, quoting from the long desc:
"It isn't restricted to regular files on input, nor output."
And this is not a distinguishing feature in contrast to standard bzip2, it
is one in contrast to the other parallel bzip2 implementations. They
cannot decompress with multiple threads from a pipe. See [0].
Cheers,
lacos
[0] http://lists.debian.org/debian-mentors/2009/02/msg00135.html
[1] http://bzip.org/1.0.5/bzip2-manual-1.0.5.html#limits
[2] http://www.mediawiki.org/wiki/Dbzip2#Feature_comparison
Reply to: