Re: request for review: lbzip2

To: debian-l10n-english@lists.debian.org
Subject: Re: request for review: lbzip2
From: ERSEK Laszlo <lacos@caesar.elte.hu>
Date: Tue, 6 Oct 2009 03:58:31 +0200 (CEST)
Message-id: <Pine.LNX.4.64.0910060307530.27351@login01.caesar.elte.hu>
In-reply-to: <20091005113938.GA9325@xibalba.demon.co.uk>
References: <Pine.LNX.4.64.0910050356210.17102@login01.caesar.elte.hu> <87ws3awjzb.fsf@benfinney.id.au> <20091005113938.GA9325@xibalba.demon.co.uk>

On Mon, 5 Oct 2009, Justin B Rye wrote:

Meanwhile the archives already have pbzip2, "parallel bzip2
implementation".  As far as I can see lbzip2 is just another
pthreads-using bzip2 indistinguishable from pbzip2.

This is factually wrong. If this was the case, I would have never asked for a sponsor, and lbzip2 wouldn't have gotten past my sponsor. See [0].


I did read the "vanity packages" paragraph in the debian-mentors FAQ.

There's nothing in either package description that would help me decide which to install, except that pbzip2 gives me the hint that there's no point having either on my old uniprocessor desktop.

The long description of lbzip2 details the exact performance gap left by pbzip2 that lbzip2 covers.

There's a use for lbzip2 on single-core machines too, because it contains internal buffering for single-worker modes too. Try to tar+bz2 a large source tree, like the kernel tree, and watch the processor load on some desktop applet closely. bzip2 reads (per default) in blocks of around 900K, then goes to work for a long time and doesn't read, doesn't write during that period. Then it emits the compressed data.

Since the pipe buffer size is 4K under Linux (most of the time, I guess) and unchangeable by way of "ulimit -p", this leads to tar (IO) effectively excluding bzip2 (CPU) and vice versa. While bzip2 is working, tar quickly fills the 4K pipe buffer and blocks on writing. When bzip2 finally awakes, it suddenly wants to slurp 900K of data, which tar is unable to produce immediately, since it was blocked on writing, and it doesn't read ahead. Thus bzip2 lollops in an idle read loop while tar hunts together the next 900K from blocks scattered around the disk. The processor load will resemble a square wave.


Thus the full duration is

  tar time + bzip2 time

If you can increase the pipe buffer size or insert a buffering app between tar and bzip2, then you can overlap IO with CPU, and the full duration will be


  max {tar time, bzip2 time}
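
The difference between the two formulas can be checked with a small block-by-block model. This is only an illustrative sketch: the function names and the per-block timings are hypothetical, the buffer is assumed to be unbounded, and nothing here is taken from tar, bzip2 or lbzip2.

```python
def serialized_time(io_times, cpu_times):
    """Total time when reading and compressing never overlap.

    Models the tiny-pipe-buffer case: the reader is stalled while the
    compressor works, and the compressor is stalled while the reader works.
    """
    return sum(io_times) + sum(cpu_times)


def overlapped_time(io_times, cpu_times):
    """Total time when a large buffer lets the reader run ahead.

    Assumption: the buffer is unbounded, so the reader never blocks.
    A block can be compressed once it has been read completely and the
    previous block has been compressed.
    """
    io_done = 0.0   # moment the current block has been read
    cpu_done = 0.0  # moment the current block has been compressed
    for io, cpu in zip(io_times, cpu_times):
        io_done += io
        cpu_done = max(cpu_done, io_done) + cpu
    return cpu_done
```

With four blocks that each take 1 time unit to read and 2 units to compress, serialized_time gives 12 and overlapped_time gives 9: the maximum of the two totals (8) plus the time to read the first block, which cannot be overlapped with anything.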

For your uniprocessor machine, lbzip2 would do just that. I use lbzip2 on a single-core cygwin installation when the need arises.
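
For illustration, a stand-alone "buffering app" of the kind mentioned above could look roughly like the following. It is a minimal sketch and not lbzip2's actual implementation; the name pump and the default sizes are assumptions chosen for the example.

```python
import queue
import threading


def pump(src, dst, chunk_size=64 * 1024, max_chunks=64):
    """Copy src to dst through a bounded in-memory queue.

    A helper thread keeps reading from src while the caller's thread
    writes to dst, so a slow consumer no longer stalls the producer
    until chunk_size * max_chunks bytes are waiting in memory.
    Returns the number of bytes copied.
    """
    chunks = queue.Queue(maxsize=max_chunks)

    def reader():
        while True:
            data = src.read(chunk_size)
            chunks.put(data)      # blocks only when the buffer is full
            if not data:          # b"" marks end of input
                return

    worker = threading.Thread(target=reader, daemon=True)
    worker.start()

    total = 0
    while True:
        data = chunks.get()
        if not data:
            break
        dst.write(data)
        total += len(data)

    worker.join()
    dst.flush()
    return total
```

Called as pump(sys.stdin.buffer, sys.stdout.buffer) from a small script, it could sit between tar and bzip2 in a pipeline. Error handling in the reader thread is deliberately left out to keep the sketch short.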

The ChangeLog entry for lbzip2-0.06, dated 16-Sep-2008, reads, in part,


  "When decompressing with a single worker thread, lbzip2 was previously 45%
  slower than standard bzip2. The new, dedicated single-worker decompressor
  is only 3% slower, and provides input and output buffering, which is
  useful in pipelines and on network file systems. Hence using lbzip2 incurs
  virtually no performance penalty over bzip2 even on a single-core
  machine."

(It only talks about the single-worker *de*compressor because the compressor inherently works okay on single core.)


Notice how your (perfectly valid) remark:

There's nothing in either package description that would help me decide
which to install

could be fixed by copying *more* technical information from the documentation into the long package description, while Ben advises exactly against that.

Of course, there's a root cause for this: the bz2 file format was never designed for multi-threaded usage, the bzip2 program wasn't even designed as a library originally, see [1]. Thus the multi-threaded implementation landscape is fragmented (there are also clustered, multi-node versions), and users have a very hard time choosing. I really recommend reading [0] for comparison, as well as [2]. I was very careful to include the raison d'etre of lbzip2 into the long description.

If I manage to create a patch for tar to check for / use lbzip2/pbzip2, with the help of the GNU Tar maintainer, as Paul Wise suggested, maybe that would ease this burden.

Seconded.  The users installing the binary don't care *how* it
works; they may never have heard of POSIX threads.  Focus more on
what the program is useful for - it's a compression tool, compatible
with normal bzip2, but designed to take advantage of the features of
multi-core CPUs.

Thank you for this good idea, now I'm convinced the package desc needs a better introduction.

Is there any hope of the lbzip2 and pbzip2 projects joining forces?
Or at least synchronising their efforts via mutexes?

I don't think so. They are different multithreaded applications that happen to use the bzip2 library.

Oh; going to the homepage (which I notice redirects me to
http://lacos.hu),

Yes, I moved my page from an ad-financed free public provider to my own domain. I figure I still need the old site for a link, for a while. Dangling pointers lead to undefined behavior.

I see it's described there as "a multi-threaded
bzip2/bunzip2 filter that doesn't depend on the lseek() system call
and so isn't restricted to regular files."  It had never occurred to
me that ordinary bzip couldn't compress block devices (etc); if
that's an important difference between lbzip2 and other
implementations it should probably be emphasised.


It is, quoting from the long desc:

  "It isn't restricted to regular files on input, nor output."

And this is not a distinguishing feature in contrast to standard bzip2, it is one in contrast to the other parallel bzip2 implementations. They cannot decompress with multiple threads from a pipe. See [0].
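
What "depends on the lseek() system call" means in practice can be seen with a quick probe: lseek() works on regular files but fails with ESPIPE on pipes. The snippet below is a small sketch of that check only; the helper name is_seekable is made up for the example, and it does not show how any of the bzip2 implementations use lseek().

```python
import errno
import os


def is_seekable(fd):
    """Return True if lseek() works on the file descriptor fd.

    Pipes, FIFOs and sockets report ESPIPE, which is why a tool that
    needs to jump around in its input cannot read from a pipeline.
    """
    try:
        os.lseek(fd, 0, os.SEEK_CUR)  # query the offset without moving it
        return True
    except OSError as exc:
        if exc.errno == errno.ESPIPE:
            return False
        raise  # any other error (e.g. EBADF) is a real problem
```

For instance, is_seekable(0) tells a script whether its standard input is a regular file or something like the read end of a pipe.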



Cheers,
lacos


[0] http://lists.debian.org/debian-mentors/2009/02/msg00135.html
[1] http://bzip.org/1.0.5/bzip2-manual-1.0.5.html#limits
[2] http://www.mediawiki.org/wiki/Dbzip2#Feature_comparison
