
Bug#677874: lintian: checks/manpages is unreasonably slow on some packages



On Sun, Jun 17, 2012 at 02:15:31PM +0200, Niels Thykier wrote:
> I noticed [1] and decided to check what made Lintian a "lengthy
> invariant".  Processing the changes (and related files) took about a
> minute (according to the shell built-in time).  Running:
> 
>  $ time lintian -d -C manpages allegro4-doc_4.4.2-2_all.deb
> 
> takes about 40 seconds.
> 
> The bottleneck appears to be our calls to "man" in checks/manpages.
> Manually running man on all the manpages takes roughly 30 seconds.  As
> far as I can tell, man is "just slow" (at least with currently
> selected options).

A good deal of this is just death-by-a-thousand-cuts rather than any
single thing being desperately slow; it's not unreasonably slow for
interactive use, but it's being run 823 times here, and it has to spawn
a lot of subprocesses because the full warnings check necessarily
involves invoking nroff, which isn't lightweight.
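
To put rough numbers on that, here is a back-of-the-envelope sketch.
It only uses the two figures quoted above (823 runs, roughly 30 seconds
in man); the function names and the 10 ms saving in the usage note are
made up for illustration, not measurements.

```python
# Figures taken from this thread; everything else is illustrative.
PAGES = 823            # number of man invocations for this package
TOTAL_SECONDS = 30.0   # "roughly 30 seconds" spent running man on all pages


def per_page_ms(total_seconds=TOTAL_SECONDS, pages=PAGES):
    """Average wall-clock cost of one man invocation, in milliseconds."""
    return total_seconds / pages * 1000.0


def total_saving_seconds(saved_ms_per_page, pages=PAGES):
    """Total time saved if every invocation gets cheaper by the given amount."""
    return saved_ms_per_page * pages / 1000.0
```

per_page_ms() comes out at about 36 ms per page, which is perfectly
fine interactively.  A hypothetical saving of 10 ms per page, i.e.
total_saving_seconds(10), would already be worth a little over 8
seconds across the whole package.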

I've never attempted to optimise the manpages check before, though, and
so there's some scope for easy improvements: each subprocess is
expensive when you multiply them up, so let's look at which ones are
obviously unnecessary.  (I can't get any accurate timings just now
because my backups are running.)

Setting MANROFFSEQ to empty in the environment would get rid of a call
to tbl for most pages.  This would make lintian stricter about pages
declaring their preprocessors with '\" lines (i.e. pages that need tbl
would have to say  '\" t  at the top).  As long as we document this in
the info text for the relevant check, I would say that a bit of extra
strictness is perfectly acceptable in the context of lintian, certainly
if it comes with a performance advantage.
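
To illustrate what that stricter rule would mean for a page, here is a
minimal sketch in Python (lintian itself is Perl; this is not lintian
code).  The function names are hypothetical, and the test for "uses
tbl" (a line starting with .TS) is a deliberate simplification.  The
letters accepted on the '\" line are the ones I understand man to
recognise (e, g, p, r, t, v); treat that set as an assumption too.

```python
import re

# Matches a first line such as:   '\" t
# and captures the preprocessor letters that follow the comment marker.
_PREPROC_RE = re.compile(r"""^'\\"\s+([egprtv]+)(?:\s|$)""")


def declared_preprocessors(first_line):
    """Return the declared preprocessor letters, or "" if there are none."""
    match = _PREPROC_RE.match(first_line)
    return match.group(1) if match else ""


def missing_tbl_declaration(page_text):
    """True if the page appears to use tbl (.TS) without declaring 't'."""
    lines = page_text.splitlines()
    declared = declared_preprocessors(lines[0]) if lines else ""
    uses_tbl = any(line.startswith(".TS") for line in lines)
    return uses_tbl and "t" not in declared
```

With MANROFFSEQ empty, a page for which missing_tbl_declaration()
returns True is the kind that would no longer be run through tbl and
so would start producing complaints.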

Adding the '-Tutf8 -Z' options to man would cause it to only run pages
through the parsing half of the groff pipeline, and not bother with
formatting them for display using grotty or processing the output
through col.
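
A sketch of what such an invocation might look like, again in Python
purely for illustration.  The helper names are hypothetical, and I am
assuming the check runs man with --warnings and -l (local file); only
'-Tutf8 -Z' and the empty MANROFFSEQ come from the suggestions in this
mail.

```python
import os
import subprocess


def man_parse_only_command(page_path, base_env=None):
    """Build argv and environment for a warnings-only run of man."""
    env = dict(os.environ if base_env is None else base_env)
    env["MANROFFSEQ"] = ""   # do not run preprocessors the page did not declare
    # -Tutf8 -Z: stop after the parsing half of the groff pipeline
    argv = ["man", "--warnings", "-Tutf8", "-Z", "-l", page_path]
    return argv, env


def collect_warnings(page_path):
    """Run man on one page and return whatever it printed on stderr."""
    argv, env = man_parse_only_command(page_path)
    proc = subprocess.run(argv, env=env, stdout=subprocess.DEVNULL,
                          stderr=subprocess.PIPE, text=True)
    return proc.stderr.splitlines()
```

The intermediate output is discarded here; only stderr matters for the
warnings check.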

On the lintian side, it would be worth taking some steps to avoid
running commands using the shell (e.g. the list forms of open and exec
with some manual redirections).  Each one doesn't take very long but
they add up.  Also, we might as well use 'gzip -cd' directly rather than
running through the zcat wrapper script every time.
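
The same idea, sketched in Python rather than Perl (so this shows the
technique, not a patch for lintian): pass the command as a list so
that no /bin/sh is spawned, do the redirections by hand, and call
gzip -cd directly.  The function names are hypothetical.

```python
import subprocess


def decompress_argv(path):
    """argv that decompresses path to stdout without the zcat wrapper."""
    return ["gzip", "-cd", path]


def read_compressed(path):
    """Return the decompressed bytes of path.

    Passing a list (with the default shell=False) executes gzip directly,
    so no shell is started; stdout and stderr are redirected to pipes
    by hand instead of using shell syntax.
    """
    result = subprocess.run(decompress_argv(path),
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            check=True)
    return result.stdout
```

This saves one shell per command plus the extra process for the
wrapper script, which is small per page but is paid on every page.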

How far does all this get you?  Given the current timings, I'd have
thought that even fractional improvements would be worthwhile.

> Running man in a collection is unlikely to yield any noticeable
> improvement[2].  Even with xargs we are looking at at least 25 seconds
> plus man is unhelpful in this case[3].
[...]
> [3] It emits errors when running with xargs that do not occur when
> running them in serial.

Can you give me an example yielding such a difference?

> The error messages all use "<standard input>" rather than a filename,
> so it will be... difficult to relate them to the original manpage.

Indeed.  This is really groff being unhelpful, not man; convincing groff
to output a more useful file name would appear to require man to write
out a temporary file, which wouldn't be terribly clever for I/O.  I
suppose we could have man postprocess groff's error messages, or write
out a status line at the start of processing each file so that lintian
could know what "<standard input>" following that line means, or
something like that.
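
To make the status-line idea concrete, here is a sketch of the
consumer side in Python.  Everything about the format is hypothetical:
man does not emit any such line today, and the "@@ processing: "
prefix is invented for this example.  The only thing taken from the
report is that groff's messages refer to "<standard input>".

```python
# Hypothetical marker that man would print before processing each file.
STATUS_PREFIX = "@@ processing: "


def attribute_messages(lines):
    """Rewrite "<standard input>" in messages to the most recent file name.

    Status lines are consumed and not returned; messages seen before any
    status line are passed through unchanged.
    """
    current = None
    out = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(STATUS_PREFIX):
            current = line[len(STATUS_PREFIX):]
            continue
        if current is not None and "<standard input>" in line:
            line = line.replace("<standard input>", current, 1)
        out.append(line)
    return out
```

For example, a status line naming a.3 followed by
"<standard input>:5: warning: x" would come back as
"a.3:5: warning: x".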

-- 
Colin Watson                                       [cjwatson@debian.org]


