Bug#778955: lintian: suggest check html <img>s included in package
Control: tags -1 moreinfo
On 2015-02-22 05:26, Kevin Ryde wrote:
> Package: lintian
> Version: 2.5.30+deb8u3
> Severity: wishlist
> Tags: patch
>
> If a .html file is in a package then usually its <img> files should be
> in the package too so it displays nicely. I suggest the few lines below
> to check this.
>
> Without picking on any particular maintainers, missing images can be
> found in for example
> * whizzytex where /usr/share/doc/whizzytex/whizzytex.html is missing
> whizzytex001.png (and two others)
> * texlive-pictures-doc (very big) where
> /usr/share/doc/texlive-doc/latex/mathspic/sourcecode113.html is
> missing a fig1.jpg deep in its detailed description
>
Hi Kevin,
Thanks for writing a lintian check for this. It is indeed an
interesting proposal.
I do have some concerns on the performance front. On some packages,
this will be the "second slowest" check taking 10s or more. E.g.
lazarus-doc and php5-doc contain quite a few HTML files[1].
It is possible that /some/ of this will be solved by merging it with the
code from checks/files.pm that does some checking of HTML files (which
would at least save reading each file twice).
> I'm unsure if my code notices images supplied by dependent packages.
> I put a group bit like the manpages and symlinks checks, but I don't
> really understand when packages are a group. Eg. per html.pm comments,
> texlive-lang-french uses images from texlive-base and has a correct
> declared dependency, but I couldn't make the right incantation to have
> it recognised :-(.
>
I suspect it is correct. However, it requires that the binaries are
built from the same source. Accordingly, it would never work with
texlive-lang-french and texlive-base, as they are built from different
source packages.
> Incidentally HTML::Parser would be a more reliable html parse of course.
> But are lintian dependencies supposed to be kept down? I see another
> rough html parse in files.pm for privacy breaches. A good parse might
> help accuracy there against obscure quoting or escaping.
>
Depends on what we are pulling in. The libhtml-parser-perl (and
libhtml-tagset-perl) packages seem (at first glance) to increase the
footprint by 0.3MB. With it already being in stable, it is likely to be
an acceptable extra dependency.
To be honest, I am also interested in the performance characteristics
of using HTML::Parser over the current approach, especially if it can
be used to enhance the performance of our similar checks (e.g. the
privacy breach one in checks/files.pm).
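As a rough illustration of what a real parser buys over a regex scan, here
is a minimal sketch. It is written in Python with the standard html.parser
module rather than Perl's HTML::Parser, purely so it is easy to try out;
the names ImgCollector and img_sources are made up for this example and are
not lintian code. It says nothing about speed, only about accuracy with
quoting, case, entities and comments.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attribute values
        # arrive with quotes removed and character references decoded.
        # Self-closing tags (<img .../>) also end up here by default.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def img_sources(html_text):
    """Return the list of <img src=...> values found in html_text."""
    parser = ImgCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources
```

A check could then compare each returned value (after resolving it
relative to the HTML file) against the package's file index; that step is
left out here since it depends on lintian internals.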
> I thought separate html.pm script to leave room for other checks related
> to html parse (whatever method). Maybe similar treatment of css or
> javascript (though I don't rate those), even some href checking. No
> full link checker, but detect document parts apparently missing from a
> package.
>
> [...]
>
That could make sense. I am thinking it would also make sense to move the
privacy breach checks into the html check file. Currently, they scan
all files matching:
$fname =~ m,\.(?:x?html?|js|xht|xml|css)$,i
which seems fairly compatible with an html check.
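To make it concrete which file names that pattern picks up, here is a
small sketch. It is a Python rendering of the Perl pattern quoted above
(re.IGNORECASE standing in for the /i flag); the names SCANNED and
is_scanned are made up for the example.

```python
import re

# Python rendering of the Perl pattern quoted above; re.IGNORECASE
# mirrors the trailing /i modifier.
SCANNED = re.compile(r"\.(?:x?html?|js|xht|xml|css)$", re.IGNORECASE)


def is_scanned(fname):
    """Return True if fname ends in one of the scanned extensions."""
    return SCANNED.search(fname) is not None
```

Note that the pattern covers .js, .xml and .css as well as the HTML
variants, so a check keyed on it would see more than just HTML files.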
Thanks,
~Niels
[1] A slightly longer list of packages to choose from:
23437 freefoam-dev-doc
19348 lazarus-doc-1.2.4
18266 libreoffice-dev-doc
17346 libgcj-doc
13280 libboost1.55-doc
13159 libboost1.54-doc
12532 php-doc
12455 vtk6-doc
12285 fp-docs-2.6.4
11929 openjdk-8-doc
10873 liblapack-doc
10845 openjdk-7-doc
10288 openjdk-6-doc
10163 vtk-doc
9473 pike7.8-reference
Computed by:
apt-file search '.htm' | grep -E '\.html?$' | cut -f1 -d':' | \
sort | uniq -c | sort --numeric --reverse