Bug#778955: lintian: suggest check html <img>s included in package

To: Kevin Ryde <user42_kevin@yahoo.com.au>, 778955@bugs.debian.org
Subject: Bug#778955: lintian: suggest check html <img>s included in package
From: Niels Thykier <niels@thykier.net>
Date: Sun, 22 Feb 2015 13:17:05 +0100
Message-id: <54E9C8C1.1070608@thykier.net>
Reply-to: Niels Thykier <niels@thykier.net>, 778955@bugs.debian.org
In-reply-to: <87fv9yefmm.fsf@blah.blah>
References: <87fv9yefmm.fsf@blah.blah>

Control: tags -1 moreinfo

On 2015-02-22 05:26, Kevin Ryde wrote:
> Package: lintian
> Version: 2.5.30+deb8u3
> Severity: wishlist
> Tags: patch
> 
> If a .html file is in a package then usually its <img> files should be
> in the package too so it displays nicely.  I suggest the few lines below
> to check this.
> 
> Without picking on any particular maintainers, missing images can be
> found in for example
> * whizzytex where /usr/share/doc/whizzytex/whizzytex.html is missing
>   whizzytex001.png (and two others)
> * texlive-pictures-doc (very big) where
>   /usr/share/doc/texlive-doc/latex/mathspic/sourcecode113.html is
>   missing a fig1.jpg deep in its detailed description
> 

Hi Kevin,

Thanks for writing a lintian check for this.  It is indeed an
interesting proposal.

I do have some concerns on the performance front.  On some packages,
this will be the "second slowest" check taking 10s or more.  E.g.
lazarus-doc and php5-doc contain quite a few HTML files[1].

It is possible that /some/ of this will be solved by merging it with the
code in checks/files.pm that already does some checking of HTML files
(which would at least save reading each file twice).

> I'm unsure if my code notices images supplied by dependent packages.
> I put a group bit like the manpages and symlinks checks, but I don't
> really understand when packages are a group.  Eg. per html.pm comments,
> texlive-lang-french uses images from texlive-base and has a correct
> declared dependency, but I couldn't make the right incantation to have
> it recognised :-(.
> 

I suspect your code is correct.  However, the group only covers binary
packages built from the same source package.  Accordingly, it would
never work with texlive-lang-french and texlive-base, as they are built
from different source packages.

> Incidentally HTML::Parser would be a more reliable html parse of course.
> But are lintian dependencies supposed to be kept down?  I see another
> rough html parse in files.pm for privacy breaches.  A good parse might
> help accuracy there against obscure quoting or escaping.
> 

Depends on what we are pulling in.  The libhtml-parser-perl (and
libhtml-tagset-perl) packages seem (at first glance) to increase the
footprint by 0.3MB.  Since it is already in stable, it is likely to be
an acceptable extra dependency.
  To be honest, I am also interested in the performance characteristics
of using HTML::Parser over the current approach.  Especially if it can
be used to improve the performance of our similar checks (e.g. the
privacy breaker one in checks/files.pm).
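
For illustration only, here is a minimal sketch of a parser-based <img>
check.  It is written in Python with the standard html.parser module
rather than Perl and HTML::Parser, so it says nothing about how
HTML::Parser performs, and it is not the code from the submitted patch.
The names ImgCollector and missing_images, and the idea of passing the
package contents as a set of paths, are assumptions made for the
example.

```python
from html.parser import HTMLParser
import posixpath
from urllib.parse import urlsplit, unquote


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names for us.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def missing_images(html_path, html_text, package_files):
    """Return local <img> targets that are not among package_files.

    html_path and package_files use paths relative to the package root;
    this layout is an assumption of the sketch.
    """
    collector = ImgCollector()
    collector.feed(html_text)
    collector.close()

    base = posixpath.dirname(html_path)
    missing = []
    for src in collector.sources:
        parts = urlsplit(src)
        if parts.scheme or parts.netloc or not parts.path:
            continue  # remote, data: or fragment-only reference
        target = posixpath.normpath(posixpath.join(base, unquote(parts.path)))
        if target not in package_files:
            missing.append(target)
    return missing
```

Because the parser handles quoting and entity decoding itself, the
check does not have to guess at obscure attribute syntax.  Images
shipped by another package would still be reported here, since only a
single set of paths is consulted.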

> I thought separate html.pm script to leave room for other checks related
> to html parse (whatever method).  Maybe similar treatment of css or
> javascript (though I don't rate those), even some href checking.  No
> full link checker, but detect document parts apparently missing from a
> package.
> 
> [...]
> 

That could make sense - I am thinking it would also make sense to move
the privacy breaker checks into the html check file.  Currently, that
check scans all files matching:

  $fname =~ m,\.(?:x?html?|js|xht|xml|css)$,i

Which seems fairly compatible with an html check.
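
To make the scope of that pattern concrete, here is a small Python
translation of the same regular expression.  The function name
is_scanned is made up for the example; only the pattern itself comes
from the line quoted above.

```python
import re

# Same pattern as the Perl match above, with the /i flag
# expressed as re.IGNORECASE.
SCANNED = re.compile(r"\.(?:x?html?|js|xht|xml|css)$", re.IGNORECASE)


def is_scanned(fname):
    """Return True if fname has one of the suffixes the pattern covers."""
    return SCANNED.search(fname) is not None
```

So index.html, page.XHTML, script.js and style.css are all covered,
while a compressed file such as manual.html.gz is not.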

Thanks,
~Niels

[1] A slightly longer list of packages to choose from:

  23437 freefoam-dev-doc
  19348 lazarus-doc-1.2.4
  18266 libreoffice-dev-doc
  17346 libgcj-doc
  13280 libboost1.55-doc
  13159 libboost1.54-doc
  12532 php-doc
  12455 vtk6-doc
  12285 fp-docs-2.6.4
  11929 openjdk-8-doc
  10873 liblapack-doc
  10845 openjdk-7-doc
  10288 openjdk-6-doc
  10163 vtk-doc
   9473 pike7.8-reference

Computed by:
   apt-file search '.htm' | grep -E '\.html?$' | cut -f1 -d':' | \
   sort | uniq -c | sort --numeric --reverse
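
The same counting step can be sketched in Python, assuming the
"package: path" line format that apt-file search prints.  The function
name count_html_files is hypothetical, and the sketch only mirrors the
grep/cut/uniq part of the pipeline, not the apt-file lookup itself.

```python
import re
from collections import Counter

# Case-sensitive, like the grep -E '\.html?$' in the pipeline.
HTML_SUFFIX = re.compile(r"\.html?$")


def count_html_files(apt_file_lines):
    """Count .htm/.html paths per package, highest count first."""
    counts = Counter()
    for line in apt_file_lines:
        line = line.rstrip("\n")
        if not HTML_SUFFIX.search(line):
            continue
        # Equivalent of cut -f1 -d':' (everything before the first colon).
        package = line.split(":", 1)[0]
        counts[package] += 1
    return counts.most_common()
```

It could be fed from a saved copy of the apt-file output, for example
by passing an open file object as apt_file_lines.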
