[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: request for review: lbzip2

On Mon, 5 Oct 2009, Justin B Rye wrote:

Meanwhile the archives already have pbzip2, "parallel bzip2
implementation".  As far as I can see lbzip2 is just another
pthreads-using bzip2 indistinguishable from pbzip2.

This is factually wrong. If this was the case, I would have never asked for a sponsor, and lbzip2 wouldn't have gotten past my sponsor. See [0].

I did read the "vanity packages" paragraph in the debian-mentors FAQ.

There's nothing in either package description that would help me decide which to install, except that pbzip2 gives me the hint that there's no point having either on my old uniprocessor desktop.

The long description of lbzip2 details the exact performance gap left by pbzip2 that lbzip2 covers.

There's a use for lbzip2 on single-core machines too, because it contains internal buffering for single-worker modes too. Try to tar+bz2 a large source tree, like the kernel tree, and watch the processor load on some desktop applet closely. bzip2 reads (per default) in blocks of around 900K, then goes to work for a long time and doesn't read, doesn't write during that period. Then it emits the compressed data.

Since the pipe buffer size is 4K under Linux (most of the time, I guess) and unchangeable by way of "ulimit -p", this leads to tar (IO) effectively excluding bzip2 (CPU) and vice versa. While bzip2 is working, tar quickly fills the 4K pipe buffer and blocks on writing. When bzip2 finally awakes, it suddenly wants to slurp 900K of data, which tar is unable to produce immediately, since it was blocked on writing, and it doesn't read ahead. Thus bzip2 lollops in an idle read loop while tar hunts together the next 900K from blocks scattered around the disk. The processor load will resemble a square wave.

Thus the full duration is

  tar time + bzip2 time

If you can increase the pipe buffer size or insert a buffering app between tar and bzip2, then you can overlap IO with CPU, and the full duration will be

  max {tar time, bzip2 time}

For your uniprocessor machine, lbzip2 would do just that. I use lbzip2 on a single-core cygwin installation when the need arises.

The ChangeLog entry for lbzip2-0.06, dated 16-Sep-2008, reads, in part,

  "When decompressing with a single worker thread, lbzip2 was previously 45%
  slower than standard bzip2. The new, dedicated single-worker decompressor
  is only 3% slower, and provides input and output buffering, which is
  useful in pipelines and on network file systems. Hence using lbzip2 incurs
  virtually no performance penalty over bzip2 even on a single-core

(It only talks about the single-worker *de*compressor because the compressor inherently works okay on single core.)

Notice how your (perfectly valid) remark:

There's nothing in either package description that would help me decide
which to install

could be fixed by copying *more* technical information from the documentation into the long package description, while Ben advises exactly against that.

Of course, there's a root cause for this: the bz2 file format was never designed for multi-threaded usage, the bzip2 program wasn't even designed as a library originally, see [1]. Thus the multi-threaded implementation landscape is fragmented (there are also clustered, multi-node versions), and users have a very hard time choosing. I really recommend reading [0] for comparison, as well as [2]. I was very careful to include the raison d'etre of lbzip2 into the long description.

If I manage to create a patch for tar to check for / use lbzip2/pbzip2, with the help of the GNU Tar maintainer, as Paul Wise suggested, maybe that would ease this burden.

Seconded.  The users installing the binary don't care *how* it
works; they may never have heard of POSIX threads.  Focus more on
what the program is useful for - it's a compression tool, compatible
with normal bzip2, but designed to take advantage of the features of
multi-core CPUs.

Thank you for this good idea, now I'm conviced the package desc needs a better introduction.

Is there any hope of the lbzip2 and pbzip2 projects joining forces?
Or at least synchronising their efforts via mutexes?

I don't think so. They are different multithreaded applications that happen to use the bzip2 library.

Oh; going to the homepage (which I notice redirects me to

Yes, I moved my page from an ad-financed free public provider to my own domain. I figure I still need the old site for a link, for a while. Dangling pointers lead to undefined behavior.

I see it's described there as "a multi-threaded
bzip2/bunzip2 filter that doesn't depend on the lseek() system call
and so isn't restricted to regular files."  It had never occurred to
me that ordinary bzip couldn't compress block devices (etc); if
that's an important difference between lbzip2 and other
implementations it should probably be emphasised.

It is, quoting from the long desc:

  "It isn't restricted to regular files on input, nor output."

And this is not a distinguishing feature in contrast to standard bzip2, it is one in contrast to the other parallel bzip2 implementations. They cannot decompress with multiple threads from a pipe. See [0].


[0] http://lists.debian.org/debian-mentors/2009/02/msg00135.html
[1] http://bzip.org/1.0.5/bzip2-manual-1.0.5.html#limits
[2] http://www.mediawiki.org/wiki/Dbzip2#Feature_comparison

Reply to: