Bug#778955: lintian: suggest check html <img>s included in package

To: 778955@bugs.debian.org
Subject: Bug#778955: lintian: suggest check html <img>s included in package
From: Kevin Ryde <user42_kevin@yahoo.com.au>
Date: Fri, 27 Feb 2015 17:39:34 +1100
Message-id: <[🔎] 87wq33n9i1.fsf@blah.blah>
Reply-to: Kevin Ryde <user42_kevin@yahoo.com.au>, 778955@bugs.debian.org
References: <[🔎] 87fv9yefmm.fsf@blah.blah> <[🔎] 54E9C8C1.1070608@thykier.net> <[🔎] 87h9ucuo2y.fsf@blah.blah> <[🔎] 54EC2123.50006@thykier.net> <[🔎] 87wq3737mf.fsf@blah.blah> <[🔎] 8761apad3s.fsf@blah.blah> <[🔎] 54EE4971.9000703@thykier.net>

I got to the few lines below so far.  speed.pl compares HTML::Parser
against some regexp action.  Amend @filenames for likely .html to
measure.  The parse includes callbacks for 'a' which are not in the real
code yet.

New html.pm uses HTML::Parser and adds css file check.  Some packages
tickle it.  Eg. xterm 312-1 file /usr/share/doc/xterm/xterm.faq.html

Maybe the "$target_exists" bit could become a utility func, since
manpages.pm does the same, and could be a place to later decide how,
when and which contents of other packages can or should be enquired
into.

Some of the privacy breach checks would go in the $start_handler sub but
I haven't tried that.

(I'm not greatly in love with the local subrs.  If it wasn't for needing
$info,$proc,$group in the existence check then could be ordinary subs.)

#!/usr/bin/perl -w
use strict;
use Time::HiRes;
use File::Slurp 'slurp';
use HTML::Parser;
# use File::Map;

my $show = 0;

my $parser = HTML::Parser->new(api_version => 3,
                               start_h => [ \&start_handler, 'tagname,attr' ]);
$parser->report_tags('img',
                     'audio',
                     'video',
                     'link',
                     'a',
                    );
sub start_handler {
  my ($tagname, $attr) = @_;
  if ($show) {
    print "parse $tagname  $attr->{src}\n";
  }
}

my @filenames;
# @filenames = ('/usr/share/doc/vtk6-doc/html/classvtkCurvatures.html');
# @filenames = glob "/usr/share/doc/pike7.8-doc/html/reference/ex/predef_3A_3A/GTK2/*.html";
@filenames = glob "/usr/share/doc/vtk6-doc/html/classvtkC*.html";

$show = (@filenames < 10);
{
  print "pre-slurp\n";
  my $size = 0;
  foreach my $filename (@filenames) {
    $size += length(slurp $filename);
  }
  $size /= 1e6;
  print "done  $size mbytes\n";
}

sub tim {
  return Time::HiRes::clock_gettime(Time::HiRes::CLOCK_PROCESS_CPUTIME_ID());
}

{
  my $t = tim();
  foreach my $filename (@filenames) {
    $parser->parse_file($filename);
  }
  print "HTML::Parser took ",tim()-$t,"\n";
}

{
  my $t = tim();
  foreach my $filename (@filenames) {
    # my $str = slurp $filename;
    # File::Map::map_file(my $str, $filename, '<');

    open my $fh, '<', $filename or die;
    my $str = do { local $/; <$fh> };

    while ($str =~ /<([^>]+)>/g) {
      my $body = $1;
      $body =~ /^(img|video)\b/i or next;
      #                     $1   $2       $3
      $body =~ /\bsrc\s*=\s*(['"]([^"']+)|([^ \t\r\n>]+))/ or next;
      my $target = $2 // $3;
      # <img src="foo.png"> results in $target="foo.png"

      if ($show) {
        print "slurp $target\n";
      }
    }
  }
  print "slurp took ",tim()-$t,"\n";

}

exit 0;

# html -- lintian check script

# Copyright 2015 Kevin Ryde
#
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free
# Software Foundation; either version 2 of the License, or (at your option)
# any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
# for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program.  If not go to <http://www.gnu.org/licenses/>.


# Performance:
#
# For reference, HTML::Parser seems to run faster than a rough regexp parse,
# and does a very much better job of distinguishing tags from strange text,
# and ignoring <!-- comments -->.

# Other ideas:
#
# * CSS files and inline CSS can @import other css, could check those files
#   exist.  What css parser is good to extract that?
#
# * Could consider either preferring or demanding css to be in local copies,
#   not an external fetch.  Eg.  /usr/share/doc/gcc-4.9-base/NEWS.html uses
#   external http://gcc.gnu.org/gcc.css
#
# * Could link check <a href="foo.html"> to see foo.html exists.  No full
#   link check, just look for things probably meant to be in the package but
#   are apparently missing.
#
#   To allow cross-package links perhaps only links to files in the current
#   directory or below or under /usr/share/doc/PACKAGENAME/.  Might like
#   target files to be mentioned in a "Suggests:" (or higher) dependency,
#   but checking that requires examining arbitrary other packages.

# Bugs:
#
# * Cross-package images and css result in false positives for packages from
#   different sources.
#
#   An example cross-package image is texlive-lang-french (version
#   2014.20141024-1) where
#   /usr/share/doc/texlive-doc/texlive/texlive-fr/texlive-fr.html has
#   src="../texlive-common/install-lnx-main.png" which is in its declared
#   dependency texlive-base.
#
#   An example cross-package css is imagemagick-doc (version 8:6.8.9.9-5)
#   where /usr/share/doc/imagemagick-doc/index.html uses
#   /usr/share/javascript/jquery-fancybox/jquery.fancybox.css from
#   libjs-jquery-fancybox.
#
#   Same-source packages are in $info->group->direct_dependencies($proc) if
#   checked together, but not different source packages.
#


package Lintian::html;
use 5.010;
use strict;
use warnings;

use File::Basename qw(fileparse);
use HTML::Parser;
use URI::Escape qw(uri_unescape);

use Lintian::Tags qw(tag);
use Lintian::Util qw(slurp_entire_file normalize_pkg_path);


# $target is a link string or undef.
# If it's a link to a local file etc then return a normalized pathname.
# If it's undef or some external http:/ etc then return undef.
#
sub target_local_fullname {
    my ($target, $dirname) = @_;
    if (! defined $target) { return undef; }

    # Strip leading and trailing whitespace.
    # HTML4 spec "6.2 SGML basic types" says user agents may ignore
    # leading and trailing whitespace, and for example iceape does that.
    # Do the same here.  Example trailing whitespace is in snd-doc
    # (version 11.7-3)
    # /usr/share/doc/snd-doc/HTML/tutorial/2_custom_snd.html which has
    # <img src="images/jpg/2_03-snd_horizontal.jpg ">
    $target =~ s/^\s+//;
    $target =~ s/\s+$//;

    if ($target =~ /[a-z]+:/) {
        # Have a schema:

        # Strip "file:" so that file:/foo.png becomes /foo.png.
        # This occurs in various href="", but usually not src="".  In
        # any case it's a local file to check.
        unless ($target =~ s{^file:}{}) {
            # Any other schema is something like "http:" external or "data:"
            # inline, or "resource:" netsurf specific, all of which are not
            # local files.
            return undef;
        }
    }

    # decode percent escapes %20 etc (unlikely, but possible)
    $target = uri_unescape($target);

    return normalize_pkg_path($dirname, $target);
}

sub run {
    my (undef, undef, $info, $proc, $group) = @_;

    # Return true if filename $target_fullname exists in current package or
    # its dependents.
    my $target_exists = sub {
        my ($target_fullname) = @_;
        if ($info->index_resolved_path($target_fullname)) {
            return 1;
        }
        # Check our dependencies:
        my $deps = $group->info->direct_dependencies($proc);
        foreach my $depproc (@$deps) {
            my $info = $depproc->info;
            my $f = $info->index_resolved_path($target_fullname);
            if ($f && $f->is_file) {
                return 1;
            }
        }
        return 0;
    };

    # $file is the Lintian::Path of the html file being checked.
    # $dirname is its directory (and $basename its foo.html).
    #
    my ($file, $dirname, $basename);

    # $target is an attribute target src="" or href="" etc, or possibly undef.
    # Return $target if it's a local file but missing from the package.
    # Return undef if it's ok (including external, undef, etc).
    my $target_missing = sub {
        my ($target) = @_;
        if (defined(my $target_fullname
                    = target_local_fullname($target, $dirname))) {
            if (! $target_exists->($target_fullname)) {
                return $target;
            }
        }
        return undef;
    };

    my $start_handler = sub {
        my ($tagname, $attr) = @_;

        # <link rel="stylesheet" href="foo.css"> should have foo.css
        if ($tagname eq 'link'
            && ($attr->{'rel'}//'') eq 'stylesheet'
            && defined(my $target = $target_missing->($attr->{'href'}))) {
            tag 'html-missing-css-file', $file, $target;
        }

        # <img src="foo.png"> should have foo.png
        # <audio src="foo.ogg"> should have foo.ogg
        # <video src="foo.ogv"> should have foo.ogv
        if (defined(my $target = $target_missing->($attr->{'src'}))) {
            tag 'html-missing-image-file', $file, $target;
        }
    };

    my $parser = HTML::Parser->new
      (api_version => 3,
       start_h => [ $start_handler, 'tagname,attr' ]);
    # only call $start_handler for these tags
    $parser->report_tags('img',
                         'audio',  # HTML5
                         'video',  # HTML5
                         'link',
                        );

    foreach my $ifile ($info->sorted_index) {
        # Parse each HTML file in the package.
        # .html.gz is unusual, but for example in lynx-cur 2.8.9dev1-2+b1.
        # .xhtml is even more unusual, and might be no more than example files.
        if ($ifile =~ /\.x?html?(\.gz)?$/i && $ifile->is_file) {
            $file = $ifile;
            ($basename, $dirname) = fileparse($file);
            my $fh = ($file =~ /\.gz$/ ? $file->open_gz : $file->open);

            if (defined(my $non_bom = fh_possible_bom($fh))) {
                $parser->utf8_mode(1); # callback raw utf8 bytes for entities
                # give the parser the $non_bom bytes then the rest of the file
                $parser->parse($non_bom);
            } else {
                $parser->utf8_mode(0);
            }
            $parser->parse_file($fh);
        }
    }
    return;
}


my %bom_to_layer = ("\xFE\xFF" => ':encoding(UTF-16BE)',
                    "\xFF\xFE" => ':encoding(UTF-16LE)');

# Read the first two bytes of $fh looking for a unicode BOM.
# If found then push an ":encoding()" layer and return undef.
# If not found then return the 2 bytes read.
#
# BOM occurs for example libxml-libxml-perl 2.0116+dfsg-1+b1
# /usr/share/doc/libxml-libxml-perl/examples/utf-16-1.html
#
# HTML::Parser doesn't by itself look for bom or charset="".  We can get
# away with not decoding 8-bit, but for bigger must ensure HTML::Parser sees
# characters.  Failing to do so provokes warnings from HTML::Parser.
#
sub fh_possible_bom {
    my ($fh) = @_;

    my $bytes;
    read($fh,$bytes,2);
    $bytes //= '';
    if (my $layer = $bom_to_layer{$bytes}) {
        binmode $fh, $layer;
        return undef;
    } else {
        return $bytes;  # not a BOM
    }
}

1;

# Local Variables:
# indent-tabs-mode: nil
# cperl-indent-level: 4
# End:
# vim: syntax=perl sw=4 sts=4 sr et

Check-Script: html
Type: binary
Needs-Info: unpacked, file-info
Info: This script checks HTML file content.

Tag: html-missing-image-file
Severity: normal
Certainty: possible
Info: HTML file missing an &lt;img&gt; or similar file.
 Generally a HTML file in a package should have its image files
 packaged too, and in the right place.
 .
 If an image is only some candy then missing it doesn't matter very
 much, but the aim would still be to have the packaged page look good.
 If an image is something important like a technical diagram then
 missing it might make the HTML almost useless.
 .
 If a logo or similar is not freely redistributable then it will be
 deliberately omitted.  Lintian can't distinguish that from mistaken
 omission.  If changing to an external image then usually a link &lt;a
 href=""&gt; is preferred over &lt;img src=""&gt;, for users' privacy.
 .
 If some HTML is a template then its links might not exist yet.
 Lintian can't distinguish that from links which ought to have been
 filled in but are not.  The suggestion would be to ignore reports on
 templates or add lintian overrides.
 .
 Beware absolute paths like src="/foo.png".  This is common in HTML
 written for a web site but fails when copied elsewhere like a Debian
 package.  Relative links are more helpful so that a document is
 displayable under a different mount point etc.
 .
 Images supplied by a dependent package might not be noticed, giving
 false reports.  Packages from the same source should work if checked
 as a group.

Tag: html-missing-css-file
Severity: normal
Certainty: possible
Info: HTML file missing a CSS stylesheet file.
 Generally a HTML file in a package should have its &lt;link
 rel="stylesheet"> CSS files packaged too, and in the right place.
 .
 A missing CSS usually leaves the html still readable, but not in the
 author's intended display style.
 .
 If some HTML is a template then its CSS might not exist yet.  Lintian
 can't distinguish that from things which ought to have been filled in
 but are not.  The suggestion would be to ignore reports on templates
 or add lintian overrides.
 .
 Beware absolute paths like href="/foo.css".  This is common in HTML
 written for a web site but fails when copied elsewhere like a Debian
 package.  Relative links are more helpful so that a document is
 displayable under a different mount point etc.
 .
 CSS supplied by a dependent package might not be noticed, giving
 false reports.  Packages from the same source should work if checked
 as a group.

Reply to:

References:
- Bug#778955: lintian: suggest check html <img>s included in package
  - From: Kevin Ryde <user42_kevin@yahoo.com.au>
- Bug#778955: lintian: suggest check html <img>s included in package
  - From: Niels Thykier <niels@thykier.net>
- Bug#778955: lintian: suggest check html <img>s included in package
  - From: Kevin Ryde <user42_kevin@yahoo.com.au>
- Bug#778955: lintian: suggest check html <img>s included in package
  - From: Niels Thykier <niels@thykier.net>
- Bug#778955: lintian: suggest check html <img>s included in package
  - From: Kevin Ryde <user42_kevin@yahoo.com.au>
- Bug#778955: lintian: suggest check html <img>s included in package
  - From: Kevin Ryde <user42_kevin@yahoo.com.au>
- Bug#778955: lintian: suggest check html <img>s included in package
  - From: Niels Thykier <niels@thykier.net>

Prev by Date: Bug#778955: lintian: suggest check html <img>s included in package
Next by Date: Bug#779436: lintian: /me/ missing on https://lintian.debian.org/maintainers.html
Previous by thread: Bug#778955: lintian: suggest check html <img>s included in package
Next by thread: Bug#778955: lintian: suggest check html <img>s included in package
Index(es):
- Date
- Thread