
Bug#778955: lintian: suggest check html <img>s included in package



I added some CSS checking.  I couldn't find a full Perl CSS parser I
liked, but I believe a few regexps capture enough of the lexical structure.

There seem to be only a few CSS problems on the whole, perhaps because
it's mercifully little used.  As an example

    W: bzip2-doc: css-missing-resource-file usr/share/doc/bzip2/manual.html /images/hr_blue.png

which is the inline css in manual.html where the background image has an
absolute path.  Dunno if its absence affects what's shown on screen. :-)
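
To illustrate why an absolute path like that one ends up pointing outside
the package's doc directory, here is a minimal sketch of resolving a target
against the directory of the file containing it.  resolve_target() is a
hypothetical stand-alone helper for illustration only, not the check's own
code (which goes through Lintian's normalize_pkg_path), and it simplifies by
treating every scheme, including file:, as external.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not part of the
# check): resolve a link target against the directory of the file that
# contains it.  Returns a package-relative path with no leading "/",
# or undef for anything carrying a "scheme:" prefix.
sub resolve_target {
    my ($dirname, $target) = @_;
    return undef if $target =~ /^[a-z][a-z0-9+.-]*:/i;  # http:, data:, ...
    $target =~ s/[#?].*$//s;                 # drop fragment and query parts
    # An absolute target ignores the containing directory entirely.
    my $path = ($target =~ m{^/} ? $target : "$dirname/$target");
    my @parts;
    foreach my $part (split m{/+}, $path) {
        next if $part eq '' || $part eq '.';
        if ($part eq '..') { pop @parts; next; }
        push @parts, $part;
    }
    return join '/', @parts;
}

# The absolute form lands at the filesystem root, the relative form
# stays under the document's own directory.
print resolve_target('usr/share/doc/bzip2', '/images/hr_blue.png'), "\n";
print resolve_target('usr/share/doc/bzip2', 'images/hr_blue.png'), "\n";
```

The first call gives a path directly under the root, which is why the
check looks for (and fails to find) images/hr_blue.png rather than anything
under usr/share/doc/bzip2/.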

Even if there aren't many CSS faults, the parse picks out the URLs for the
privacy breach checking noted earlier.  (E.g. if the maintainer comments out
the offending bits then they're skipped.)
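
The regexp approach described above can be sketched roughly as follows.
This is a simplified stand-alone illustration, not the code in the check
itself: css_targets() is a hypothetical name, it handles only /* */
comments, quoted strings, url() and @import, and it leaves out the
@namespace exception.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): return the url()
# and @import targets found in a CSS string, skipping anything inside
# /* ... */ comments.
sub css_targets {
    my ($css) = @_;
    my @targets;
    my $prev_word = '';
    while ($css =~ m{ /\* .*? (?: \*/ | \z )                   # comment
                    | url\( \s* (?: "(?<url>[^"]*)"             # url("foo")
                                  | '(?<url>[^']*)'             # url('foo')
                                  | (?<url>[^)\s]*) ) \s* \)?   # url(foo)
                    | "(?<string>[^"]*)"                        # "string"
                    | '(?<string>[^']*)'                        # 'string'
                    | (?<word>[^\s'"/:;\{\}()]+)                # word or at-rule
                    }xsg) {
        if (defined $+{url}) {
            push @targets, $+{url};
        } elsif (defined $+{string} && $prev_word eq '@import') {
            push @targets, $+{string};      # @import "foo.css"
        }
        # Remember the previous word so a string can be tied to @import.
        $prev_word = $+{word} // '';
    }
    return @targets;
}

my $css = q{/* url(http://example.org/old.png) */
@import "print.css";
body { background: url("images/bg.png") }
h1 { background: url( /images/hr_blue.png ) }};

# Targets with an http/https scheme would be the privacy candidates.
my @all      = css_targets($css);
my @external = grep { m{^https?:}i } @all;
say "targets:  @all";
say "external: ", scalar(@external);
```

Because the comment alternative is tried first, the commented-out
external URL is consumed as a comment and never reaches the target list.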


# html -- lintian check script

# Copyright 2015 Kevin Ryde
#
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free
# Software Foundation; either version 2 of the License, or (at your option)
# any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
# for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program.  If not go to <http://www.gnu.org/licenses/>.


# HTML::Parser measures faster than a rough regexp parse, and does a very
# much better job of distinguishing tags from strange text and ignoring
# <!-- comments -->.
#
# <link rev=""> is unchecked.  It's the conceptual reverse of rel="" but the
# only one found in practice is rev="made" giving an author email.  Don't
# think there's anything in them to be checked.
#
# Other Ideas:
#
# * Would like to check all link targets (not just those under the package's
#   own /usr/share/doc/PACKAGENAME/ as currently done).  That means checking
#   for files in other packages.  Packages from the same source dist are in
#   $group->info->direct_dependencies($proc) if they're checked together,
#   but packages from different source dists are not.
#
#   An example cross-package image is texlive-lang-french (version
#   2014.20141024-1) where
#   /usr/share/doc/texlive-doc/texlive/texlive-fr/texlive-fr.html has
#   src="../texlive-common/install-lnx-main.png" which is from texlive-base
#   (a declared dependency).
#
#   An example cross-package css is imagemagick-doc (version 8:6.8.9.9-5)
#   where /usr/share/doc/imagemagick-doc/index.html uses
#   /usr/share/javascript/jquery-fancybox/jquery.fancybox.css from
#   libjs-jquery-fancybox (a declared dependency).
#
# * External images and css (but not href links) would be candidates for the
#   privacy checks currently in files.pm.  Could call to something there to
#   decide the flavour of badness and its solution.  Or maybe all URL
#   privacy matters could come here.  The HTML parse here does a good job
#   distinguishing things like rel="generator-home" which are informational,
#   not fetched, so not privacy breaches.
#
# * The parse here could notice <img width=1 height=1> web bugs as tracking
#   or privacy breaches.  But the only one of that sort of nonsense seen is
#   libspreadsheet-writeexcel-perl 2.40-1
#   /usr/share/doc/libspreadsheet-writeexcel-perl/html/index.html and it
#   already tickles the privacy check by its src="".
#
# * Could consider preferring or demanding images and/or css to be in local
#   copies, not an external fetch.  If the HTML can be packaged then
#   presumably its contained images and used css can be too.
#   Eg. /usr/share/doc/gcc-4.9-base/NEWS.html in gcc-4.9-base 4.9.2-10 uses
#   external http://gcc.gnu.org/gcc.css
#

package Lintian::html;
use 5.010;
use strict;
use warnings;

use File::Basename qw(fileparse);
use File::BOM;
use HTML::Parser;
use URI::Escape qw(uri_unescape);

use Lintian::Tags qw(tag);
use Lintian::Util qw(normalize_pkg_path);


# $target is the contents of an href="$target" link etc, or undef.
# If it's a link to a local file etc then return a full Lintian normalized
# pathname like "usr/share/doc/foo/something.html".
# If $target is undef or some external http:/ etc then return undef.
#
sub target_local_fullname {
    my ($target, $dirname) = @_;
    if (! defined $target) { return undef; }

    # Strip leading and trailing whitespace.
    # HTML4 spec "6.2 SGML basic types" says user agents may ignore
    # leading and trailing whitespace, and for example iceape does that.
    # Do the same here.
    # Example trailing whitespace is in snd-doc 11.7-3 where
    # /usr/share/doc/snd-doc/HTML/tutorial/2_custom_snd.html has
    # <img src="images/jpg/2_03-snd_horizontal.jpg ">
    $target =~ s/^\s+//;
    $target =~ s/\s+$//;

    # Usually schemes are lower case "ftp:" etc.
    # Lynx has some upper case private specials like "LYNXKEYMAP:".
    # The match is anchored at the start so a ":" later in a relative path
    # or "#fragment" is not mistaken for a scheme.
    if ($target =~ /^[a-z][a-z0-9+.-]*:/i) {
        # Have a "scheme:" on the link.

        # Strip "file:" so that file:/foo.png becomes /foo.png.
        # This occurs in various href="" and is a local file to check.
        # The file: scheme ought to be file:///foo.png, but is often just
        # file:/foo.png single slash.  All slashes are crunched by
        # normalize_pkg_path() below.
        #
        unless ($target =~ s{^file:}{}i) {
            # Anything except file: is "http:" external or "data:" inline,
            # or "resource:" netsurf specific, etc, all of which are not
            # local files.
            return undef;
        }
    }

    # Strip anchor fragment part, so "foo.html#section" becomes "foo.html"
    $target =~ s/#.*$//;

    # Strip CGI query part so "foo.html?some=thing" becomes "foo.html".  Not
    # sure how many browsers accept this sort of thing in files (rather than
    # a server).  Iceape can do some things with it.
    $target =~ s/\?.*$//;

    # decode escapes %20 etc
    $target = uri_unescape($target);

    return normalize_pkg_path($dirname, $target);
}

my %tag_is_image = (img => 1,
                    audio => 1,
                    video => 1);

my %rel_is_stylesheet = (stylesheet => 1,
                         'alternate stylesheet' => 1);
my %rel_is_favicon = (icon => 1,
                      'shortcut icon' => 1);

my %target_is_makeinfo = ('dir.html#Top'      => 1, # single file
                          '../dir/index.html' => 1, # multi-file
                          '../DIR/index.html' => 1, # multi-file
                          '../index.html#dir' => 1, # makeinfo 4.13
                          'DIR.html#Top'      => 1,
                         );

# type="text/css" is standard.
# type="text/stylesheet" in python-pygame 1.9.1release+dfsg-10
# /usr/lib/python2.7/dist-packages/pygame/docs/tut/tom/games2.html
#
my %type_is_css = ('text/css' => 1,
                   'text/stylesheet' => 1);

sub run {
    my (undef, undef, $info, $proc, $group) = @_;

    my $pkg_name = $proc->pkg_name;
    my $own_package_filename_re = qr{^usr/share/doc/\Q$pkg_name/};

    # $file is the Lintian::Path of the html file being checked.
    # $dirname is its directory (and $basename its foo.html).
    # $dirname includes a trailing "/".
    #
    my ($file, $dirname, $basename);

    # Return true if $target is ok, which means either it exists in the
    # current package or its dependencies, or is something external or
    # unchecked.
    my $target_ok = sub {
        my ($target) = @_;

        my $target_fullname = target_local_fullname($target, $dirname)
          // return 1;   # external or undef (no such attribute) are ok

        # "makeinfo" generates links up from a document to a "dir.html"
        # which is supposed to be a directory of all documents, the
        # equivalent of the info "/usr/share/info/dir" file.  dir.html
        # normally doesn't exist when a single document is formatted.  As a
        # special case ignore links to "dir.html".
        # (Believe there's nowhere very helpful a dir link could go.  Maybe
        # one of the doc-base document lists.  Maybe an option on "makeinfo"
        # not to generate such toplevel "up" link would be better.)
        if ($target_is_makeinfo{$target}) {
            return 1;
        }

        # To demand a target we want to be confident it ought to exist in
        # the present package or its same-source dependencies.  For now this
        # means any target under /usr/share/doc/PACKAGENAME/.
        #
        # Targets elsewhere are too often cross-package links which are
        # perfectly valid but can't be checked (currently) since lintian
        # doesn't have arbitrary other package contents available.
        #
        # Targets like /foo.html or /foo/bar.html not under /usr nor under
        # /etc are checked (and likely fail) since they're usually leftover
        # leading "/" from pages meant for a web server and which don't work
        # from a file.
        #
        # (Have considered checking targets under /etc/PACKAGENAME/ as ought
        # to be in the package, but unsure if they're more likely to be
        # templates.  The only failing such found is asciidoc 8.6.9-3 where
        # /etc/asciidoc/stylesheets/slidy.css refers to
        # ../graphics/nofold-dim.gif which does not exist there in /etc.)
        #
        if ($target_fullname =~ m{^(usr|etc)/}
            && ! ($target_fullname =~ $own_package_filename_re)) {
            # under /usr or /etc but not under /usr/share/doc/PACKAGENAME, skip
            return 1;
        }

        # Check target in our package.
        if ($info->index_resolved_path($target_fullname)) {
            return 1;
        }
        # Check target in our dependent packages:
        my $deps = $group->info->direct_dependencies($proc);
        foreach my $depproc (@$deps) {
            my $info = $depproc->info;
            if ($info->index_resolved_path($target_fullname)) {
                return 1;
            }
        }
        return 0;
    };

    # $str is CSS, either slurped from a .css file or HTML <style>.
    # Parse it for target URLs.
    my $css_check = sub {
        my ($str) = @_;

        my $prev_identifier = '';
        while ($str =~ m{ /\*  .*? (\*/|$)            # /* comment */
                         |<!-- .*? (-->|$)            # <!-- comment -->
                         |"(?<string> [^"]*) ("|$)    # "" string
                         |'(?<string> [^']*) ('|$)    # '' string
                         |url\("(?<url> [^"]*) ("|$)  # url("foo")
                         |url\('(?<url> [^']*) ('|$)  # url('foo')
                         |url\( (?<url> [^)]*)        # url(foo)
                         |(?<identifier> [^ \t\r\n'"/:]+)  # word foo or @foo
                      }xsgo) {
            if (defined $+{url}                           # any url()
                && $prev_identifier ne '@namespace') {    # except @namespace
                my $target = $+{url};
                if (! $target_ok->($target)) {
                    tag 'css-missing-resource-file', $file, $target;
                }
            } elsif (defined $+{string}
                     && $prev_identifier eq '@import') {   # @import "foo.css"
                my $target = $+{string};
                if (! $target_ok->($target)) {
                    tag 'css-missing-resource-file', $file, $target;
                }
            }
            $prev_identifier = $+{identifier} // '';
        }
    };

    my $text_handler = sub {
        my ($parser, $text) = @_;
        $css_check->($text);
        # stop further text callbacks, only <style> text wanted
        $parser->handler(text => undef);
    };
    my $start_handler = sub {
        my ($parser, $tagname, $attr) = @_;

        my $rel = lc($attr->{'rel'} // '');

        # <a href="foo.html"> should have foo.html exist
        if ($tagname eq 'a') {
            my $target = $attr->{'href'};
            if (! $target_ok->($target)) {
                tag 'html-missing-href-file', $file, $target;
            }
        }

        # <img src="foo.png"> should have foo.png
        # <audio src="foo.ogg"> should have foo.ogg
        # <video src="foo.ogv"> should have foo.ogv
        if ($tag_is_image{$tagname}) {
            my $target = $attr->{'src'};
            if (! $target_ok->($target)) {
                tag 'html-missing-image-file', $file, $target;
            }
        }

        # <link rel="stylesheet" href="foo.css"> should have foo.css
        if ($tagname eq 'link' && $rel_is_stylesheet{$rel}) {
            my $target = $attr->{'href'};
            if (! $target_ok->($target)) {
                tag 'html-missing-css-file', $file, $target;
            }
        }

        # <link rel="icon"          href="foo.ico"> should have foo.ico
        # <link rel="shortcut icon" href="foo.ico"> should have foo.ico
        if ($tagname eq 'link' && $rel_is_favicon{$rel}) {
            my $target = $attr->{'href'};
            if (! $target_ok->($target)) {
                tag 'html-missing-favicon-file', $file, $target;
            }
        }

        # <style type="text/css"> is inline CSS text
        my $type = lc($attr->{'type'} // '');
        if ($tagname eq 'style' && $type_is_css{$type}) {
            $parser->handler(text => $text_handler, 'self,dtext');
        }
    };

    my $parser = HTML::Parser->new
      (api_version => 3,
       start_h => [ $start_handler, 'self,tagname,attr' ]);
    # call $start_handler for the following tags
    $parser->report_tags('a',
                         'img',
                         'audio',  # HTML5
                         'video',  # HTML5
                         'link',
                         'style',  # CSS
                        );

    foreach my $ifile ($info->sorted_index) {
        # Parse each HTML file in the package.
        # .html.gz is unusual, but for example lynx-cur 2.8.9dev1-2+b1
        # or python-pmw 1.3.2-6.
        # .xhtml is unusual, but for example in libapt-pkg-doc 1.0.9.7.
        # BOM is unusual but for example libxml-libxml-perl 2.0116+dfsg-1+b1
        # /usr/share/doc/libxml-libxml-perl/examples/utf-16-1.html
        if ($ifile =~ /\.x?html?(\.gz)?$/i && $ifile->is_file) {
            $file = $ifile;
            ($basename, $dirname) = fileparse($file);
            my $fh = ($file =~ /\.gz$/i ? $file->open_gz : $file->open);

            my ($encoding, $spillage) = File::BOM::get_encoding_from_stream($fh);
            if ($encoding) {
                binmode $fh, ":encoding($encoding)";
                $parser->utf8_mode(0);
            } else {
                $parser->utf8_mode(1); # callback raw utf8 bytes for expanded entities
            }
            $parser->parse($spillage);
            $parser->parse_file($fh);
        }

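        # Parse each CSS file in the package.
        # .css.gz is unusual, but for example xemacs21-basesupport 2009.02.17.dfsg.2-2
        # The pattern requires the full ".css" so ".cs" files are not
        # matched, and gzipped files are decompressed before parsing.
        if ($ifile =~ /\.css(\.gz)?$/i && $ifile->is_file) {
            $file = $ifile;
            ($basename, $dirname) = fileparse($file);
            my $fh = ($file =~ /\.gz$/i ? $file->open_gz : $file->open);
            my $str = do { local $/; <$fh> };
            close $fh;
            $css_check->($str // '');
        }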
    }
    return;
}

1;

# Local Variables:
# indent-tabs-mode: nil
# cperl-indent-level: 4
# End:
# vim: syntax=perl sw=4 sts=4 sr et


Check-Script: html
Type: binary
Needs-Info: unpacked, file-info
Info: This script checks HTML file content.

Tag: html-missing-image-file
Severity: normal
Certainty: possible
Info: HTML file missing an &lt;img&gt; or similar file.
 Generally an HTML file in a package should have its image files
 packaged too, and in the right place.
 .
 If an image is only some candy then missing it doesn't matter very
 much, but the aim would still be to have the packaged page look good.
 If an image is something important like a technical diagram then
 missing it might make the HTML almost useless.
 .
 If a logo or similar is not freely redistributable then it will be
 deliberately omitted from the deb.  Lintian can't distinguish that
 from mistaken omission.  If changing to an external image then
 usually a link &lt;a href=""&gt; is preferred over &lt;img src=""&gt;
 so that users' privacy is not compromised by always fetching an
 external resource.
 .
 If some HTML is a template then its links might not exist yet.
 Lintian can't distinguish that from links which ought to have been
 filled in but are not.  The suggestion would be to ignore reports on
 templates or add lintian overrides.
 .
 Beware absolute paths like src="/foo.png".  This is common in HTML
 written for a web site but fails when copied elsewhere like a Debian
 package.  Relative links are more helpful so that a document is
 displayable under a different mount point etc.
 .
 Currently only images under /usr/share/doc/PACKAGENAME/ are checked,
 to reduce false positives for cross-package targets.  Images supplied
 by a dependent package from the same source should work if checked as
 a group.

Tag: html-missing-css-file
Severity: normal
Certainty: possible
Info: HTML file missing a CSS stylesheet file.
 Generally an HTML file in a package should have its &lt;link
 rel="stylesheet"&gt; CSS files packaged too, in the right places.
 .
 A missing CSS file usually leaves the HTML still readable, but not in
 the author's intended display style.
 .
 If some HTML is a template then its CSS might not exist yet.  Lintian
 can't distinguish that from things which ought to have been filled in
 but are not.  The suggestion would be to ignore reports on templates
 or add lintian overrides.
 .
 Beware absolute paths like href="/foo.css".  This is common in HTML
 written for a web site but fails when copied elsewhere like a Debian
 package.  Relative links are more helpful so that a document is
 displayable under a different mount point etc.
 .
 Currently only CSS files under /usr/share/doc/PACKAGENAME/ are
 checked, to reduce false positives for cross-package targets.  CSS
 supplied by a dependent package from the same source should work if
 checked as a group.

Tag: html-missing-favicon-file
Severity: normal
Certainty: possible
Info: HTML file missing a favicon file
 in &lt;link rel="icon"&gt; or &lt;link rel="shortcut icon"&gt;.
 .
 A favicon is a small visual cue to the web page origin which a
 browser might display, and might record in bookmarks.  Missing the
 favicon does no harm, but if it can be included then it looks nice.
 .
 If the icon is not freely redistributable then it will be
 deliberately omitted from the deb.  Lintian can't distinguish that
 from mistaken omission.  If deliberately omitted then the suggestion
 is to remove its &lt;link&gt; uses too.  Changing to an external
 image is usually undesirable since an external fetch potentially
 compromises the user's privacy.
 .
 If some HTML is a template then its favicon might not exist yet.
 Lintian can't distinguish that from things which ought to have been
 filled in but are not.  The suggestion would be to ignore reports on
 templates or add lintian overrides.
 .
 Beware absolute paths like href="/foo.ico".  This is common in HTML
 written for a web site but fails when copied elsewhere like a Debian
 package.  Relative links are more helpful so that a document is
 displayable under a different mount point etc.
 .
 Currently only favicons under /usr/share/doc/PACKAGENAME/ are
 checked, to reduce false positives for cross-package targets.
 Favicons supplied by a dependent package from the same source should
 work if checked as a group.

Tag: html-missing-href-file
Severity: normal
Certainty: possible
Info: HTML file missing an &lt;a href=""&gt; link target.
 .
 If part of a document is not freely redistributable then it will be
 deliberately omitted from the deb.  Lintian can't distinguish that
 from mistaken omission.  If deliberately omitted then the suggestion
 is to change links to an external document at the project home page
 or similar.
 .
 If some HTML is a template then its link targets might not exist yet.
 Lintian can't distinguish that from things which ought to have been
 filled in but are not.  The suggestion would be to ignore reports on
 templates or add lintian overrides.
 .
 Generated documentation from makeinfo or pod2html often has links
 which presume a full set of manuals or classes is present, which may
 not be so in a single package.  The suggestion is that if targets are
 in other packages then amend the links to point to the right places
 there, otherwise perhaps to some suitable external link.  If the doc
 tools can't make such redirections themselves then "sed" or similar
 post-processing may be necessary.
 .
 Beware absolute paths like href="/foo.html".  This is common in HTML
 written for a web site but fails when copied elsewhere like a Debian
 package.  Relative links are more helpful since they can work under a
 different mount point etc.
 .
 Currently only targets under /usr/share/doc/PACKAGENAME/ are checked,
 to avoid false positives for cross-package links.  Targets supplied
 by a dependent package from the same source should work if checked as
 a group.
 .
 For reference, in the current code a target like "foo.html#section"
 checks only that foo.html exists, not also whether anchor "section"
 exists within foo.html.  Also a CGI style query part like
 "foo.html?some=thing" is taken to be target "foo.html".  Those query
 parts are meant for a server but iceape and perhaps other browsers
 may allow it locally.

Tag: css-missing-resource-file
Severity: normal
Certainty: possible
Info: CSS file or inline CSS missing target resource, such as a
 background:url() image, @import of further stylesheet, etc.
 .
 Usually stylesheet elements are only eye candy and a document should
 be readable without, but with some loss of quality.  The aim is to
 have a packaged document look the way the author intended.
 .
 If some sub-part of the CSS is not freely redistributable then it
 will be deliberately omitted from the deb.  Lintian can't distinguish
 that from mistaken omission.  If deliberately omitted then the
 suggestion is to remove its uses too.  Changing to an external
 resource is usually undesirable since fetching potentially
 compromises the user's privacy.
 .
 If some HTML or CSS is a template then its stylesheet targets might
 not exist yet.  Lintian can't distinguish that from things which
 ought to have been filled in but are not.  The suggestion would be to
 ignore reports on templates or add lintian overrides.
 .
 Beware absolute paths like background:url(/foo.png).  This is
 common in CSS written for a web site but fails when copied elsewhere
 like a Debian package.  Relative links are more helpful since they
 can work under a different mount point etc.
 .
 Note in a .css file that relative paths such as
 background:url(foo.png) are relative to the .css file, not to the
 document etc which might use the css.
 .
 Currently only targets under /usr/share/doc/PACKAGENAME/ are checked,
 to avoid false positives for cross-package resources.  Targets
 supplied by a dependent package from the same source should work if
 checked as a group.
 .
 Currently CSS is parsed by some regexps, which are believed to handle
 comments and strings adequately.  Report a bug if something unusual
 is mis-parsed.
