Bug#778955: lintian: suggest check html <img>s included in package



I went a bit further with the few lines below.  It adds some css,
favicon and link checking.

I reduced the check to targets under /usr/share/doc/PACKAGENAME/ since
they ought to exist in the package or a package from the same source.
This helps avoid false positives of cross-package links.  Don't want a
new check to start its life with lots of false reports. :-)
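
As a rough illustration of that restriction, here is a minimal sketch. The helper name in_checked_scope is hypothetical, paths are assumed to be in normalized form without a leading "/", and the two regexps follow the ones used in the attached check.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a normalized target path (no leading "/")
# is one the check would examine for package $pkg_name.
sub in_checked_scope {
    my ($pkg_name, $path) = @_;

    # Under the package's own doc directory.
    return 1 if $path =~ m{^usr/share/doc/\Q$pkg_name\E/};

    # "foo" or "foo/bar" at the root: likely a leftover absolute path
    # from pages written for a web server.
    return 1 if $path =~ m{^[^/]+(/[^/]*)?$};

    # Anything else is assumed to be a cross-package link and skipped.
    return 0;
}

print in_checked_scope('lintian', 'usr/share/doc/lintian/README.html'), "\n";
print in_checked_scope('lintian', 'usr/share/javascript/jquery/jquery.js'), "\n";
```

Skipped targets are treated as fine rather than reported, which is what keeps cross-package links from producing false reports.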

I looked at various bits tickled in packages I have and they seem real.
It even picked up link typos in lintian itself:

    W: lintian: html-missing-href-file
    /usr/share/doc/lintian/api.html/Lintian/Processable.html
    ../Lintain/ProcessableGroup.html

("Lintain" instead of "Lintian" in some of the POD.)
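
For illustration, the missing file in that report is found by resolving the relative href against the directory of the HTML file. A minimal sketch follows, where resolve_link is a hypothetical stand-in for Lintian's own path normalization, whose exact behaviour is not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Lintian's path normalization: resolve $target
# relative to directory $dirname, collapsing "." and ".." components.
# Returns undef if the path would climb above the package root.
sub resolve_link {
    my ($dirname, $target) = @_;
    my $path = ($target =~ m{^/}) ? $target : "$dirname/$target";
    my @out;
    foreach my $part (split m{/+}, $path) {
        next if $part eq '' || $part eq '.';
        if ($part eq '..') {
            return undef unless @out;
            pop @out;
        } else {
            push @out, $part;
        }
    }
    return join '/', @out;
}

print resolve_link('usr/share/doc/lintian/api.html/Lintian',
                   '../Lintain/ProcessableGroup.html'), "\n";
```

The resolved path has a "Lintain" directory component, which is not in the package, so the link is reported.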

Other reports for lintian itself are about missing IPC/Run.html etc,
since there isn't a full set of module docs there.  I put a note in
html.desc on this as a general problem.  The suggestion to amend such
links might be a chore, but they go nowhere as they stand, and fixing
can maximize usefulness.  If a few packages do similar then maybe a
shared mangling script could make life easier.
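
One possible shape for such a mangling script is sketched below. It is only an assumption about what might be useful: the helper name mangle_links, the table %redirect and its single entry are all hypothetical, and a regexp is a rough tool for this where a real script would probably want a proper HTML parser.

```perl
use strict;
use warnings;

# Hypothetical table of dead link targets and where to point them instead.
# The entry here is illustrative only.
my %redirect = (
    '../IPC/Run.html' => 'https://metacpan.org/pod/IPC::Run',
);

# Rewrite double-quoted href="" values that appear in %redirect and leave
# everything else alone.
sub mangle_links {
    my ($html) = @_;
    $html =~ s{(\bhref\s*=\s*")([^"]*)(")}
              { $1 . ($redirect{$2} // $2) . $3 }gie;
    return $html;
}

print mangle_links(qq{<a href="../IPC/Run.html">IPC::Run</a>\n});
```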

# html -- lintian check script

# Copyright 2015 Kevin Ryde
#
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free
# Software Foundation; either version 2 of the License, or (at your option)
# any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
# for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program.  If not go to <http://www.gnu.org/licenses/>.


# For reference, HTML::Parser measures faster than a rough regexp parse, and
# does a very much better job of distinguishing tags from strange text and
# ignoring <!-- comments -->.
#
# There are a few <link rev=""> types (conceptual reverses of rel="").  The
# only one found in practice is rev="made" giving an author email.  Don't
# think there's anything in them to be checked.
#
# Other Ideas:
#
# * CSS files and inline CSS can @import other css, could check that those
#   files exist.  What css parser is good to extract such imports?
#
# * Could consider preferring or demanding css to be in local copies, not an
#   external fetch.  Eg. /usr/share/doc/gcc-4.9-base/NEWS.html in
#   gcc-4.9-base 4.9.2-10 uses external http://gcc.gnu.org/gcc.css
#
# * Would like to check all link targets (not just those under the package's
#   own /usr/share/doc/PACKAGENAME/ as currently done).  That means checking
#   for files in other packages.  Packages from the same sources are in
#   $group->info->direct_dependencies($proc) if they're checked together,
#   but packages from different sources are not.
#
#   An example cross-package image is texlive-lang-french (version
#   2014.20141024-1) where
#   /usr/share/doc/texlive-doc/texlive/texlive-fr/texlive-fr.html has
#   src="../texlive-common/install-lnx-main.png" which is from texlive-base
#   (a declared dependency).
#
#   An example cross-package css is imagemagick-doc (version 8:6.8.9.9-5)
#   where /usr/share/doc/imagemagick-doc/index.html uses
#   /usr/share/javascript/jquery-fancybox/jquery.fancybox.css from
#   libjs-jquery-fancybox (a declared dependency).
#

package Lintian::html;
use 5.010;
use strict;
use warnings;

use File::Basename qw(fileparse);
use HTML::Parser;
use URI::Escape qw(uri_unescape);

use Lintian::Tags qw(tag);
use Lintian::Util qw(slurp_entire_file normalize_pkg_path);


# $target is the contents of an href="" link, or undef.
# If it's a link to a local file etc then return a Lintian normalized pathname.
# If it's undef or some external http:/ etc then return undef.
#
sub target_local_fullname {
    my ($target, $dirname) = @_;
    if (! defined $target) { return undef; }

    # Strip leading and trailing whitespace.
    # HTML4 spec "6.2 SGML basic types" says user agents may ignore
    # leading and trailing whitespace, and for example iceape does that.
    # Do the same here.
    # Example trailing whitespace is in snd-doc 11.7-3 where
    # /usr/share/doc/snd-doc/HTML/tutorial/2_custom_snd.html has
    # <img src="images/jpg/2_03-snd_horizontal.jpg ">
    $target =~ s/^\s+//;
    $target =~ s/\s+$//;

    # Usually schemas are lower case "ftp:" etc.
    # Lynx has some upper case private specials like "LYNXKEYMAP:".
    if ($target =~ /^[a-z]+:/i) {
        # Have a "schema:" on the link.

        # Strip "file:" so that file:/foo.png becomes /foo.png.
        # This occurs in various href="", though usually not src="".  In
        # any case it's a local file to check.
        # The file: schema ought to be file:///foo.png, but is often just
        # file:/foo.png single slash.  All slashes are crunched by
        # normalize_pkg_path() below.
        #
        unless ($target =~ s{^file:}{}) {

            # Anything except file: is "http:" external or "data:" inline,
            # or "resource:" netsurf specific, etc, all of which are not
            # local files.
            return undef;
        }
    }

    # Strip anchor fragment part, so "foo.html#section" becomes "foo.html"
    $target =~ s/#.*$//;

    # Strip CGI query part so "foo.html?some=thing" becomes "foo.html".  Not
    # sure how many browsers accept this sort of thing in files (rather than
    # a server).  Iceape can do some things with it.
    $target =~ s/\?.*$//;

    # decode escapes %20 etc
    $target = uri_unescape($target);

    return normalize_pkg_path($dirname, $target);
}

my %tag_is_image = (img => 1,
                    audio => 1,
                    video => 1);

my %rel_is_stylesheet = (stylesheet => 1,
                         'alternate stylesheet' => 1);
my %rel_is_favicon = (icon => 1,
                      'shortcut icon' => 1);

my %target_is_makeinfo = ('dir.html#Top'      => 1, # single file
                          '../dir/index.html' => 1, # multi-file
                          '../DIR/index.html' => 1, # multi-file
                          '../index.html#dir' => 1, # makeinfo 4.13
                          'DIR.html#Top'      => 1,
                         );

# Regexp matching "foo" or "foo/bar", with maximum one "/", which is the
# Lintian normalized form for a filename in the root directory or one level
# below the root.
my $toplevel_filename_re = qr{^[^/]+(/[^/]*)?$};

sub run {
    my (undef, undef, $info, $proc, $group) = @_;

    my $pkg_name = $proc->pkg_name;
    my $usr_share_doc_pkg_name_re = qr{^usr/share/doc/\Q$pkg_name/};

    # $file is the Lintian::Path of the html file being checked.
    # $dirname is its directory (and $basename its foo.html).
    # $dirname includes a trailing "/".
    #
    my ($file, $dirname, $basename);

    # Return true if filename $target_fullname exists in current package or
    # its dependencies.
    my $target_ok = sub {
        my ($target) = @_;

        my $target_fullname = target_local_fullname($target, $dirname)
          // return 1;   # external or undef (no such attribute) are ok

        # "makeinfo" generates links up from a document to a "dir.html"
        # which is supposed to be a directory of all documents, the
        # equivalent of the info "/usr/share/info/dir" file.  dir.html
        # normally doesn't exist when a single document is formatted.  As a
        # special case ignore links to "dir.html".
        # (Believe there's nowhere very helpful a dir link could go.  Maybe
        # one of the doc-base document lists.  Maybe an option on "makeinfo"
        # not to generate such toplevel "up" link would be better.)
        if ($target_is_makeinfo{$target}) {
            return 1;
        }

        # Targets under /usr/share/doc/PACKAGENAME/ are checked.  Targets
        # elsewhere are too often cross-package links and tend to make a lot
        # of false positives (currently) since arbitrary other package
        # contents are not available to check.
        #
        # Targets /foo.html or /foo/bar.html are checked, since they're
        # likely to be leftover leading "/" from pages meant for a web
        # server.
        #
        unless ($target_fullname =~ $usr_share_doc_pkg_name_re
                || $target_fullname =~ $toplevel_filename_re) {
            return 1;
        }

        # Check target in our package.
        if ($info->index_resolved_path($target_fullname)) {
            return 1;
        }
        # Check target in the packages we depend on:
        my $deps = $group->info->direct_dependencies($proc);
        foreach my $depproc (@$deps) {
            my $info = $depproc->info;
            if ($info->index_resolved_path($target_fullname)) {
                return 1;
            }
        }
        return 0;
    };

    my $start_handler = sub {
        my ($tagname, $attr) = @_;

        my $rel = lc($attr->{'rel'} // '');

        # <a href="foo.html"> should have foo.html existing
        if ($tagname eq 'a') {
            my $target = $attr->{'href'};
            if (! $target_ok->($target)) {
                tag 'html-missing-href-file', $file, $target;
            }
        }

        # <img src="foo.png"> should have foo.png
        # <audio src="foo.ogg"> should have foo.ogg
        # <video src="foo.ogv"> should have foo.ogv
        if ($tag_is_image{$tagname}) {
            my $target = $attr->{'src'};
            if (! $target_ok->($target)) {
                tag 'html-missing-image-file', $file, $target;
            }
        }

        # <link rel="stylesheet" href="foo.css"> should have foo.css
        if ($tagname eq 'link' && $rel_is_stylesheet{$rel}) {
            my $target = $attr->{'href'};
            if (! $target_ok->($target)) {
                tag 'html-missing-css-file', $file, $target;
            }
        }

        # <link rel="icon"          href="foo.ico"> should have foo.ico
        # <link rel="shortcut icon" href="foo.ico"> should have foo.ico
        if ($tagname eq 'link' && $rel_is_favicon{$rel}) {
            my $target = $attr->{'href'};
            if (! $target_ok->($target)) {
                tag 'html-missing-favicon-file', $file, $target;
            }
        }
    };

    my $parser = HTML::Parser->new
      (api_version => 3,
       start_h => [ $start_handler, 'tagname,attr' ]);
    # only call $start_handler for these tags
    $parser->report_tags('a',
                         'img',
                         'audio',  # HTML5
                         'video',  # HTML5
                         'link',
                        );

    foreach my $ifile ($info->sorted_index) {
        # Parse each HTML file in the package.
        # .html.gz is unusual, but for example in lynx-cur 2.8.9dev1-2+b1.
        # .xhtml is unusual, but for example in libapt-pkg-doc 1.0.9.7.
        if ($ifile =~ /\.x?html?(\.gz)?$/i && $ifile->is_file) {
            $file = $ifile;
            ($basename, $dirname) = fileparse($file);
            my $fh = ($file =~ /\.gz$/ ? $file->open_gz : $file->open);

            if (defined(my $non_bom = fh_possible_bom($fh))) {
                $parser->utf8_mode(1); # callback raw utf8 bytes for entities
                # give the parser the $non_bom bytes then the rest of the file
                $parser->parse($non_bom);
            } else {
                $parser->utf8_mode(0);
            }
            $parser->parse_file($fh);
        }
    }
    return;
}


my %bom_to_layer = ("\xFE\xFF" => ':encoding(UTF-16BE)',
                    "\xFF\xFE" => ':encoding(UTF-16LE)');

# Read the first two bytes of $fh looking for a unicode BOM.
# If found then push an ":encoding()" layer and return undef.
# If not found then return the 2 bytes read.
#
# BOM occurs for example in libxml-libxml-perl (version 2.0116+dfsg-1+b1)
# /usr/share/doc/libxml-libxml-perl/examples/utf-16-1.html
#
# HTML::Parser doesn't by itself look for bom or charset="".  We can get
# away with not decoding 8-bit supersets of ASCII, but for bigger must
# ensure HTML::Parser sees characters.  Failing to do so provokes warnings
# from HTML::Parser.
#
# ENHANCE-ME: Is there a PerlIO auto-BOM detection layer?  Something which
# noticed BOM and either replaced itself by :encoding or popped itself if no
# BOM.
#
sub fh_possible_bom {
    my ($fh) = @_;

    my $bytes;
    read($fh,$bytes,2);
    $bytes //= '';
    if (my $layer = $bom_to_layer{$bytes}) {
        binmode $fh, $layer;
        return undef;
    } else {
        return $bytes;  # not a BOM
    }
}

1;

# Local Variables:
# indent-tabs-mode: nil
# cperl-indent-level: 4
# End:
# vim: syntax=perl sw=4 sts=4 sr et
Check-Script: html
Type: binary
Needs-Info: unpacked, file-info
Info: This script checks HTML file content.

Tag: html-missing-image-file
Severity: normal
Certainty: possible
Info: HTML file missing an &lt;img&gt; or similar file.
 Generally an HTML file in a package should have its image files
 packaged too, and in the right place.
 .
 If an image is only some candy then missing it doesn't matter very
 much, but the aim would still be to have the packaged page look good.
 If an image is something important like a technical diagram then
 missing it might make the HTML almost useless.
 .
 If a logo or similar is not freely redistributable then it will be
 deliberately omitted.  Lintian can't distinguish that from mistaken
 omission.  If changing to an external image then usually a link &lt;a
 href=""&gt; is preferred over &lt;img src=""&gt;, to protect users'
 privacy.
 .
 If some HTML is a template then its links might not exist yet.
 Lintian can't distinguish that from links which ought to have been
 filled in but are not.  The suggestion would be to ignore reports on
 templates or add lintian overrides.
 .
 Beware absolute paths like src="/foo.png".  This is common in HTML
 written for a web site but fails when copied elsewhere like a Debian
 package.  Relative links are more helpful so that a document is
 displayable under a different mount point etc.
 .
 Currently only images under /usr/share/doc/PACKAGENAME/ are checked,
 to reduce false positives for cross-package targets.  Images supplied
 by a dependent package from the same source should work if checked as
 a group.

Tag: html-missing-css-file
Severity: normal
Certainty: possible
Info: HTML file missing a CSS stylesheet file.
 Generally an HTML file in a package should have its &lt;link
 rel="stylesheet"&gt; CSS files packaged too, in the right places.
 .
 A missing CSS usually leaves the html still readable, but not in the
 author's intended display style.
 .
 If some HTML is a template then its CSS might not exist yet.  Lintian
 can't distinguish that from things which ought to have been filled in
 but are not.  The suggestion would be to ignore reports on templates
 or add lintian overrides.
 .
 Beware absolute paths like href="/foo.css".  This is common in HTML
 written for a web site but fails when copied elsewhere like a Debian
 package.  Relative links are more helpful so that a document is
 displayable under a different mount point etc.
 .
 Currently only CSS files under /usr/share/doc/PACKAGENAME/ are
 checked, to reduce false positives for cross-package targets.  CSS
 supplied by a dependent package from the same source should work if
 checked as a group.

Tag: html-missing-favicon-file
Severity: normal
Certainty: possible
Info: HTML file missing a favicon file
 in &lt;link rel="icon"&gt; or &lt;link rel="shortcut icon"&gt;.
 .
 A favicon is a small visual cue to the web page origin which a
 browser might display, and might record in bookmarks.  Missing the
 favicon does no harm, but if it can be included then it looks nice.
 .
 If the icon is not freely redistributable then it will be
 deliberately omitted.  Lintian can't distinguish that from mistaken
 omission.  If deliberately omitted then the suggestion is to remove
 its &lt;link&gt; uses too.  Changing to an external image is usually
 undesirable since an external fetch potentially compromises the
 user's privacy.
 .
 If some HTML is a template then its favicon might not exist yet.
 Lintian can't distinguish that from things which ought to have been
 filled in but are not.  The suggestion would be to ignore reports on
 templates or add lintian overrides.
 .
 Beware absolute paths like href="/foo.ico".  This is common in HTML
 written for a web site but fails when copied elsewhere like a Debian
 package.  Relative links are more helpful so that a document is
 displayable under a different mount point etc.
 .
 Currently only favicons under /usr/share/doc/PACKAGENAME/ are
 checked, to reduce false positives for cross-package targets.
 Favicons supplied by a dependent package from the same source should
 work if checked as a group.

Tag: html-missing-href-file
Severity: normal
Certainty: possible
Info: HTML file missing an &lt;a href=""&gt; link target.
 .
 If part of a document is not freely redistributable then it will be
 deliberately omitted.  Lintian can't distinguish that from mistaken
 omission.  If deliberately omitted then the suggestion is to change
 links to an external document at the project home page or similar.
 .
 If some HTML is a template then its link targets might not exist yet.
 Lintian can't distinguish that from things which ought to have been
 filled in but are not.  The suggestion would be to ignore reports on
 templates or add lintian overrides.
 .
 Generated documentation from makeinfo or pod2html often has links
 which presume a full set of manuals or classes is present, which may
 not be so in a single package.  The suggestion is if targets are in
 other packages then amend to point to the right places there,
 otherwise perhaps some suitable external link.  If the doc tools
 can't make such redirections themselves then "sed" or similar
 post-processing may be necessary.
 .
 Beware absolute paths like href="/foo.html".  This is common in HTML
 written for a web site but fails when copied elsewhere like a Debian
 package.  Relative links are more helpful since they can work under a
 different mount point etc.
 .
 Currently only targets under /usr/share/doc/PACKAGENAME/ are checked,
 to avoid false positives for cross-package links.  Targets supplied
 by a dependent package from the same source should work if checked as
 a group.
 .
 For reference, in the current code a target like "foo.html#section"
 checks only that foo.html exists, not also whether anchor "section"
 exists within foo.html.  Also a CGI style query part like
 "foo.html?some=thing" is taken to be target "foo.html".  Those query
 parts are meant for a server but iceape and perhaps other browsers
 may allow it locally.
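
The fragment and query handling described in that last paragraph can be sketched as follows. This is a minimal illustration, and the helper name file_part is hypothetical rather than something in the check itself.

```perl
use strict;
use warnings;

# Hypothetical helper: drop any "#fragment" and then any "?query" so that
# only the file part of a link target is left to check for existence.
sub file_part {
    my ($target) = @_;
    $target =~ s/#.*$//s;
    $target =~ s/\?.*$//s;
    return $target;
}

print file_part('foo.html#section'), "\n";
print file_part('foo.html?some=thing'), "\n";
```

Both calls leave just foo.html, so a link with a wrong anchor name but an existing file would not be reported.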
