
Bug#778955: lintian: suggest check html <img>s included in package



Control: tags -1 moreinfo

On 2015-02-22 05:26, Kevin Ryde wrote:
> Package: lintian
> Version: 2.5.30+deb8u3
> Severity: wishlist
> Tags: patch
> 
> If a .html file is in a package then usually its <img> files should be
> in the package too so it displays nicely.  I suggest the few lines below
> to check this.
> 
> Without picking on any particular maintainers, missing images can be
> found in for example
> * whizzytex where /usr/share/doc/whizzytex/whizzytex.html is missing
>   whizzytex001.png (and two others)
> * texlive-pictures-doc (very big) where
>   /usr/share/doc/texlive-doc/latex/mathspic/sourcecode113.html is
>   missing a fig1.jpg deep in its detailed description
> 

Hi Kevin,

Thanks for writing a lintian check for this.  It is indeed an
interesting proposal.

I do have some concerns on the performance front.  On some packages,
this will be the "second slowest" check taking 10s or more.  E.g.
lazarus-doc and php5-doc contain quite a few HTML files[1].

It is possible that /some/ of this will be solved by merging it with the
code in checks/files.pm that already does some checking of HTML files
(that would at least save reading each file twice).

> I'm unsure if my code notices images supplied by dependent packages.
> I put a group bit like the manpages and symlinks checks, but I don't
> really understand when packages are a group.  Eg. per html.pm comments,
> texlive-lang-french uses images from texlive-base and has a correct
> declared dependency, but I couldn't make the right incantation to have
> it recognised :-(.
> 

I suspect it is correct.  However, it requires that the binaries are
built from the same source.  Accordingly, it would never work with
texlive-lang-french and texlive-base, as they are built from different
source packages.

> Incidentally HTML::Parser would be a more reliable html parse of course.
> But are lintian dependencies supposed to be kept down?  I see another
> rough html parse in files.pm for privacy breaches.  A good parse might
> help accuracy there against obscure quoting or escaping.
> 

Depends on what we are pulling in.  The libhtml-parser-perl (and
libhtml-tagset-perl) packages seem (at first glance) to increase the
footprint by 0.3MB.  With it already being in stable, it is likely to be
an acceptable extra dependency.
  To be honest, I am also interested in the performance characteristics
of using HTML::Parser over the current approach.  Especially if it can
be used to enhance the performance of our similar checks (e.g. the
privacy breaker one in checks/files.pm).
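
To illustrate what a real parser buys in accuracy, here is a small sketch using Python's standard html.parser module as a stand-in for HTML::Parser.  It is not a statement about HTML::Parser's API or speed; `ImgCollector` and `img_sources` are hypothetical names.  A real parser skips images inside comments, copes with ">" inside a quoted attribute, and decodes entities in attribute values, all of which a rough regex parse tends to get wrong.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> start tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values arrive unescaped.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


def img_sources(html_text):
    """Return the list of <img> src values found in html_text, in order."""
    collector = ImgCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources
```

Whether the parser-based approach is faster or slower than the regex scan on large documentation packages would need to be measured.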

> I thought separate html.pm script to leave room for other checks related
> to html parse (whatever method).  Maybe similar treatment of css or
> javascript (though I don't rate those), even some href checking.  No
> full link checker, but detect document parts apparently missing from a
> package.
> 
> [...]
> 

That could make sense.  I am thinking it would also make sense to move
the privacy breaker checks into the html check file.  Currently, the
privacy check scans all files matching:

  $fname =~ m,\.(?:x?html?|js|xht|xml|css)$,i

which seems fairly compatible with an html check.
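
For anyone unsure which file names that pattern covers, here is the same regular expression transcribed into Python, purely as an illustration (`SCANNED` and `is_scanned` are hypothetical names):

```python
import re

# Same pattern as the Perl match above; re.IGNORECASE corresponds to the /i flag.
SCANNED = re.compile(r'\.(?:x?html?|js|xht|xml|css)$', re.IGNORECASE)


def is_scanned(fname):
    """Return True if fname has one of the extensions the pattern selects."""
    return SCANNED.search(fname) is not None
```

It accepts .htm, .html, .xhtm, .xhtml, .xht, .js, .xml and .css in any letter case, and rejects names such as data.json or logo.png.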

Thanks,
~Niels

[1] A slightly longer list of packages to choose from:

  23437 freefoam-dev-doc
  19348 lazarus-doc-1.2.4
  18266 libreoffice-dev-doc
  17346 libgcj-doc
  13280 libboost1.55-doc
  13159 libboost1.54-doc
  12532 php-doc
  12455 vtk6-doc
  12285 fp-docs-2.6.4
  11929 openjdk-8-doc
  10873 liblapack-doc
  10845 openjdk-7-doc
  10288 openjdk-6-doc
  10163 vtk-doc
   9473 pike7.8-reference

Computed by:
   apt-file search '.htm' | grep -E '\.html?$' | cut -f1 -d':' | \
   sort | uniq -c | sort --numeric --reverse
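
The same count can be sketched in Python, assuming the `apt-file search` output has already been captured as lines of the form "package: /path/to/file" (`count_html_files` is a hypothetical name):

```python
import re
from collections import Counter


def count_html_files(apt_file_lines):
    """Count .htm/.html files per package, highest count first."""
    counts = Counter()
    for line in apt_file_lines:
        line = line.rstrip('\n')
        # Equivalent of: grep -E '\.html?$'
        if re.search(r'\.html?$', line):
            # Equivalent of: cut -f1 -d':'
            counts[line.split(':', 1)[0]] += 1
    # Equivalent of: sort | uniq -c | sort --numeric --reverse
    return counts.most_common()
```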

