
Bug#448783: lintian: more doc-base checks



Package: lintian
Version: 1.23.36
Severity: wishlist
Tags: patch

Hi,

I have prepared a patch which makes lintian more robust about the
contents of doc-base control files. The following obvious checks are
included:
- missing required fields,
- unrecognised fields,
- spelling errors,
- duplicated fields or formats.
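
A minimal sketch of how the required, unknown and duplicated field
checks listed above can be driven from one table. The table mirrors the
one in the patch (1 = required, 0 = optional); the helper name
check_section_fields and its return format are only illustrative
assumptions and are not part of lintian:

```perl
use strict;
use warnings;

# Field table in the style of the patch: 1 = required, 0 = optional.
my %known_main_fields = (
    document => 1,
    title    => 1,
    section  => 1,
    abstract => 0,
    author   => 0,
);

# Hypothetical helper (not a lintian function): takes a field table and
# the field names of one section in file order, and returns a list of
# problem descriptions.
sub check_section_fields {
    my ($known, @fields) = @_;
    my (%seen, @problems);

    for my $field (map { lc } @fields) {
        push @problems, "unknown field: $field"
            unless exists $known->{$field};
        push @problems, "duplicated field: $field"
            if $seen{$field}++;
    }
    for my $field (sort keys %$known) {
        push @problems, "missing required field: $field"
            if $known->{$field} && !$seen{$field};
    }
    return @problems;
}

print "$_\n"
    for check_section_fields(\%known_main_fields,
                             qw(Document Title Titel Title));
```

For this input the sketch reports the unknown field "titel", the
duplicated field "title" and the missing required field "section".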

The patch also adds a check for possibly incorrect continuation
lines in the Abstract field. Many packages put extra spaces at
the front of these lines, causing the field to be incorrectly
displayed verbatim by dwww or dhelp. The check is based on a
heuristic, but it seems to be correct (i.e. there are no false
positives).
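
To illustrate the idea, here is a simplified sketch of such a heuristic.
It is not the exact logic of the patch: it flags a field when all
continuation lines share the same indentation of more than one space and
do not all begin with the same character (which would suggest an
intentionally verbatim bullet list). The function name
abstract_looks_overindented is a hypothetical one chosen for this
example:

```perl
use strict;
use warnings;

# Hypothetical helper: takes the continuation lines of an Abstract field
# (everything after the first line) and returns 1 if they look
# accidentally over-indented, 0 otherwise.
sub abstract_looks_overindented {
    # Ignore " ." paragraph separator lines.
    my @cont = grep { !/^\s+\.\s*$/ } @_;
    my (%indent, %first);

    for my $l (@cont) {
        next unless $l =~ /^(\s+)(\S)/;
        $indent{$1}++;    # leading whitespace of this line
        $first{$2}++;     # first non-space character of this line
    }
    return 0 unless %indent;

    my ($sp) = keys %indent;
    # Mixed indentation or a single leading space is considered fine.
    return 0 if keys(%indent) > 1 || length($sp) < 2;
    # Several lines starting with the same character look like a list.
    return 0 if @cont > 1 && keys(%first) == 1;
    return 1;
}

my @boost = (
    "  C++ source libraries.  The emphasis is on libraries which work",
    "  well with the C++ Standard Library.  One goal is to establish",
);
my @ding = (
    "  * a dictionary lookup program for Unix,",
    "  * DIctionary Nice Grep,",
);

print "boost: ", abstract_looks_overindented(@boost), "\n";
print "ding:  ", abstract_looks_overindented(@ding),  "\n";
```

The two inputs are the examples quoted in the comments of the patch: the
first should be flagged, the second should not.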

Additionally I implemented checks for invalid characters in
the Document field and unnecessary spaces in separator lines.
The checks might be rather controversial since many control
files fail on them.  (Especially the "invalid characters" check
makes me think about allowing uppercase letters in the Document 
field.)
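
The two stricter checks boil down to simple pattern matches. The sketch
below uses the same character class as the patch for the Document field
and treats any whitespace-only, non-empty line as a separator with extra
whitespace. Both function names are made up for this example, and the
input lines are assumed to be already chomped:

```perl
use strict;
use warnings;

# Hypothetical helper: is the Document field value made only of
# lowercase letters, digits, plus, minus and dot?
sub document_id_is_valid {
    my ($value) = @_;
    return $value =~ /^[a-z0-9+.-]+$/ ? 1 : 0;
}

# Hypothetical helper: does a section separator line contain
# whitespace instead of being completely empty?
sub separator_has_extra_whitespace {
    my ($line) = @_;
    return $line =~ /^\s+$/ ? 1 : 0;
}

for my $id ('foo-doc', 'libfoo1.2+bar', 'Foo-Doc', 'foo doc') {
    printf "%-15s %s\n", $id,
        document_id_is_valid($id) ? 'valid' : 'invalid';
}
```

Relaxing the first check to accept uppercase letters would only require
adding A-Z to the character class.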

Please find the patch attached to this mail. The spelling_common.pm
module was split out from the lintian/spelling check; the main
difference is the additional argument to the spelling_check() routine.
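
As a rough illustration of the calling convention (tag name first, then
the text, then the location to report), here is a self-contained sketch.
It uses a tiny corrections table and returns its findings instead of
calling lintian's tag() function, so the name spelling_check_sketch and
the return format are assumptions made for this example only:

```perl
use strict;
use warnings;

# Two entries taken from the corrections list in the patch.
my %corrections = (
    seperate => 'separate',
    usefull  => 'useful',
);

# Hypothetical stand-in for spelling_check(): returns one array ref
# [tag, location, misspelling, correction] per misspelled word.
sub spelling_check_sketch {
    my ($tag, $text, $filename) = @_;
    my @hits;

    for my $raw (split /\s+/, $text) {
        my $word = lc $raw;
        # Strip punctuation; keep apostrophes inside the word.
        $word =~ s/^'|[^\w']+|'$//g;
        push @hits, [$tag, $filename, $word, $corrections{$word}]
            if exists $corrections{$word};
    }
    return @hits;
}

my @hits = spelling_check_sketch(
    'spelling-error-in-doc-base-title-field',
    'A usefull tool, kept seperate.',
    'foo:3',
);
print join(' ', @$_), "\n" for @hits;
```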

To check the changes I ran both unpatched and patched `lintian -C menus'
on almost all packages containing at least one doc-base file, and didn't
find any errors. I put the logs at
http://people.debian.org/~robert/lintian-doc-base-logs.tar.bz2 .

I would be grateful if you could apply the patch for lintian.

Best Regards,
robert



-- System Information:
Debian Release: lenny/sid
  APT prefers unstable
  APT policy: (990, 'unstable')
Architecture: i386 (i686)

Kernel: Linux 2.6.22
Locale: LANG=pl_PL, LC_CTYPE=pl_PL (charmap=ISO-8859-2)
Shell: /bin/sh linked to /bin/pdksh

Versions of packages lintian depends on:
ii  binutils            2.18.1~cvs20071027-1 The GNU assembler, linker and bina
ii  diffstat            1.45-2               produces graph of changes introduc
ii  dpkg-dev            1.14.7               package building tools for Debian
ii  file                4.21-3               Determines file type using "magic"
ii  gettext             0.16.1-2             GNU Internationalization utilities
ii  intltool-debian     0.35.0+20060710.1    Help i18n of RFC822 compliant conf
ii  libparse-debianchan 1.1.1-1              parse Debian changelogs and output
ii  man-db              2.5.0-3              on-line manual pager
ii  perl [libdigest-md5 5.8.8-11.1           Larry Wall's Practical Extraction 

lintian recommends no packages.

-- no debconf information
diff -Nur checks.old/menus checks/menus
--- checks.old/menus	2007-10-16 05:41:15.000000000 +0200
+++ checks/menus	2007-10-30 19:54:28.000000000 +0100
@@ -24,6 +24,7 @@
 use strict;
 use lib "$ENV{'LINTIAN_ROOT'}/checks/";
 use common_data;
+use spelling_common;
 use Tags;
 use Util;
 
@@ -31,6 +32,21 @@
 my %all_files = ();
 my %all_links = ();
 
+
+my %known_docbase_main_fields = ( 
+	'document' => 1,
+	'title'    => 1,
+	'section'  => 1,
+	'abstract' => 0,
+	'author'   => 0
+);
+my %known_docbase_format_fields = (
+	'format'  => 1,
+	'files'   => 1,
+	'index'   => 0
+);	
+
+
 sub run {
 
 $pkg = shift;
@@ -163,94 +179,7 @@
     while (my $dbfile = readdir DOCBASEDIR) {
 	# don't try to parse executables, plus we already warned about it
 	next if -x "doc-base/$dbfile";
-	open (IN, '<', "doc-base/$dbfile") or
-	    fail("cannot open doc-base file $dbfile for reading.");
-
-	# Check if files referenced by doc-base are included in the package.
-	# The Index field should refer to only one file without wildcards.
-	# The Files field is a whitespace-separated list of files and may
-	# contain wildcards.  We skip without validating wildcard patterns
-	# containing character classes since otherwise we'd need to deal with
-	# wildcards inside character classes and aren't there yet.
-	#
-	# Defer checking files until we've read all possible continuation
-	# lines for the field.	As a result, all tags will be reported on the
-	# last continuation line of the field, rather than possibly where the
-	# offending file name is.
-	my (@files, $field, $sawindex, $sawdocument, $format, $insection);
-	while (1) {
-	    $_ = <IN>;
-	    if ((!defined ($_) || /^\S/ || /^$/) && $field) {
-		# Figure out the right line number.  It's actually the
-		# previous line, since we read ahead for continuation lines,
-		# unless we're at the end of the file.
-		my $line = $. - 1 + (defined ($_) ? 0 : 1);
-		if ($field eq 'index' && @files > 1) {
-		    tag "doc-base-index-references-multiple-files", "$dbfile:$line";
-		}
-		for my $file (@files) {
-		    if ($file =~ m%^/usr/doc%) {
-			tag "doc-base-file-references-usr-doc", "$dbfile:$line";
-		    }
-		    my $realfile = delink ($file);
-
-		    # openoffice.org-dev-doc has thousands of files listed so
-		    # try to use the hash if possible.
-		    my $found;
-		    if ($realfile =~ /[*?]/) {
-			my $regex = quotemeta ($realfile);
-			unless ($field eq 'index') {
-			    next if $regex =~ /\[/;
-			    $regex =~ s%\\\*%[^/]*%g;
-			    $regex =~ s%\\\?%[^/]%g;
-			    $regex .= '/?';
-			}
-			$found = grep { /^$regex\z/ } keys %all_files;
-		    } else {
-			$found = $all_files{$realfile} || $all_files{"$realfile/"};
-		    }
-		    unless ($found) {
-			tag "doc-base-file-references-missing-file", "$dbfile:$line", $file;
-		    }
-		}
-		undef @files;
-		undef $field;
-	    }
-	    if (defined ($_) && /^(Index|Files)\s*:\s*(.*?)\s*$/i) {
-		$field = lc $1;
-		@files = split (' ', $2);
-		if ($field eq 'index') {
-		    $sawindex = 1;
-		}
-	    } elsif (defined ($_) && /^Format\s*:\s*(.*?)\s*$/i) {
-		$format = lc $1;
-		tag "doc-base-file-unknown-format", "$dbfile:$.", $format
-		    unless $known_doc_base_formats{$format};
-	    } elsif (defined ($_) && /^Document\s*:/i) {
-		$sawdocument = 1;
-                tag "doc-base-document-field-ends-in-whitespace", "$dbfile:$."
-                    if /[ \t]$/;
-	    } elsif (defined ($_) && /^\s/ && $field) {
-		push (@files, split ' ');
-	    }
-	    if (defined ($_) && /^\s*\S/) {
-		$insection = 1;
-	    }
-	    if (!defined ($_) || /^$/) {
-		tag "doc-base-file-no-format", "$dbfile:$."
-		    if ($insection && !($format || $sawdocument));
-		if ($format && ($format eq 'html' || $format eq 'info')) {
-		    tag "doc-base-file-no-index", "$dbfile:$."
-			unless $sawindex;
-		}
-		last unless defined $_;
-		undef $format;
-		undef $sawdocument;
-		undef $sawindex;
-		undef $insection;
-	    }
-	}
-	close IN;
+	check_doc_base_file($dbfile);
     }
     closedir DOCBASEDIR;
 } else {
@@ -285,6 +214,233 @@
 }
 
 # -----------------------------------
+#
+
+
+sub check_doc_base_file {
+  my $dbfile = shift;
+
+  open (IN, '<', "doc-base/$dbfile") or
+    fail("cannot open doc-base file $dbfile for reading.");
+
+  my (@files, $field, @vals, %sawfields, %sawformats);
+  my $knownfields=\%known_docbase_main_fields;
+  my $line    = 0;  # global
+  %sawfields  = (); # local for each section of control file
+  %sawformats = (); # global for control file
+
+  while (<IN>) {
+    chomp();
+
+    if (/^(\S+)\s*:\s*(.*)$/) { # new field
+      # check previous field, if we have any
+      check_doc_base_field($dbfile, $line, $field, \@vals, \%sawfields, \%sawformats, $knownfields)
+        if $field;
+
+      $field  = lc $1;
+      @vals   = ($2);
+      $line   = $.;
+
+    } elsif ($field && /^\s+\S/) { # continuation of previously defined field
+      push (@vals, $_);
+      $line  = $.;    # all tags will be reported on the last continuation line
+                      # of doc-base field
+
+
+    } elsif (/^(\s*)$/) { # sections' separator
+      tag "doc-base-file-separator-extra-whitespaces", "$dbfile:$." if $1;
+
+      next unless $field; # skip successive empty lines
+
+      # check previously defined field & section
+      check_doc_base_field($dbfile, $line, $field, \@vals, \%sawfields, \%sawformats, $knownfields);
+      check_doc_base_file_section($dbfile, $line+1, \%sawfields, \%sawformats, $knownfields);
+
+      # initialise variables for new section
+      undef $field;
+      undef $line;
+      @vals       = ();
+      %sawfields  = ();
+      $knownfields=\%known_docbase_format_fields; # each section except the first one is format section
+
+    } else {  # everything else is a syntax error
+      tag "doc-base-file-syntax-error", "$dbfile:$.";
+    }
+  }
+
+  # check the last field/section of the control file
+  if ($field) {
+    check_doc_base_field($dbfile, $line, $field, \@vals, \%sawfields, \%sawformats, $knownfields);
+    check_doc_base_file_section($dbfile, $line, \%sawfields, \%sawformats, $knownfields);
+  }
+
+  tag "doc-base-file-no-format-section", "$dbfile:$." unless %sawformats;
+
+  close IN;
+}
+
+
+# Checks one field of doc-base control file
+# $vals is array ref containing all lines of the field
+# Modifies $sawfields and $sawformats
+sub check_doc_base_field {
+  my ($dbfile, $line, $field, $vals, $sawfields, $sawformats, $knownfields) = @_;
+
+
+
+  tag "doc-base-file-unknown-field", "$dbfile:$line", "$field"
+    unless defined $knownfields->{$field};
+  tag "doc-base-file-duplicated-field", "$dbfile:$line", "$field"
+    if $sawfields->{$field};
+  $sawfields->{$field} = 1;
+
+# Index/Files field
+  if ($field eq 'index' or $field eq 'files') {
+    # Check if files referenced by doc-base are included in the package.
+    # The Index field should refer to only one file without wildcards.
+    # The Files field is a whitespace-separated list of files and may
+    # contain wildcards.  We skip without validating wildcard patterns
+    # containing character classes since otherwise we'd need to deal with
+    # wildcards inside character classes and aren't there yet.
+
+    my @files = map { split ('\s+', $_) } @$vals;
+
+    if ($field eq 'index' && @files > 1) {
+      tag "doc-base-index-references-multiple-files", "$dbfile:$line";
+    }
+    for my $file (@files) {
+      if ($file =~ m%^/usr/doc%) {
+        tag "doc-base-file-references-usr-doc", "$dbfile:$line";
+      }
+      my $realfile = delink ($file);
+
+      # openoffice.org-dev-doc has thousands of files listed so
+      # try to use the hash if possible.
+      my $found;
+      if ($realfile =~ /[*?]/) {
+        my $regex = quotemeta ($realfile);
+        unless ($field eq 'index') {
+          next if $regex =~ /\[/;
+          $regex =~ s%\\\*%[^/]*%g;
+          $regex =~ s%\\\?%[^/]%g;
+          $regex .= '/?';
+        }
+        $found = grep { /^$regex\z/ } keys %all_files;
+      } else {
+        $found = $all_files{$realfile} || $all_files{"$realfile/"};
+      }
+      unless ($found) {
+        tag "doc-base-file-references-missing-file", "$dbfile:$line", $file;
+      }
+    }
+   undef @files;
+
+# Format field
+  } elsif ($field eq 'format') {
+    my $format = join (' ', @$vals);
+    $format =~ s/^\s+//o;
+    $format =~ s/\s+$//o;
+    $format = lc $format;
+
+    tag "doc-base-file-unknown-format", "$dbfile:$line", $format
+      unless $known_doc_base_formats{$format};
+    tag "doc-base-file-duplicated-format", "$dbfile:$line", $format
+      if $sawformats->{$format};
+    $sawformats->{$format} = 1;
+    # save the current format for the later section check
+    $sawformats->{' *current* '} = $format;
+
+# Document field
+  } elsif ($field eq 'document') {
+    $_ = join (' ', @$vals);
+
+    tag "doc-base-invalid-document-field", "$dbfile:$line", "$_"
+      unless /^[a-z0-9+.-]+$/;
+    tag "doc-base-document-field-ends-in-whitespace", "$dbfile:$line"
+      if /[ \t]$/;
+    tag "doc-base-document-field-not-in-first-line", "$dbfile:$line"
+      unless $line == 1;
+
+# Title field
+  } elsif ($field eq 'title') {
+
+    spelling_check("spelling-error-in-doc-base-title-field", join (' ', @$vals), "$dbfile:$line")
+      if @$vals;
+
+# Abstract field
+  } elsif ($field eq 'abstract') {
+
+
+    # The three following variables are used for checking if the field is correctly phrased.
+    # We detect if each line (except for the first line and lines containing single dot)
+    # of the field starts with the same number of spaces, not followed by the same non-space
+    # character, and the number of spaces is > 1.
+    #
+    # We try to match fields like this:
+    #  ||Abstract: The Boost web site provides free peer-reviewed portable
+    #  ||  C++ source libraries.  The emphasis is on libraries which work
+    #  ||  well with the C++ Standard Library.  One goal is to establish
+    # but not like this:
+    #  ||Abstract:  This is "Ding"
+    #  ||  * a dictionary lookup program for Unix,
+    #  ||  * DIctionary Nice Grep,
+    my $leadsp           = undef; # string with leading spaces from the second line
+    my $charafter        = undef; # first non-whitespace char of the second line
+    my $leadsp_different = 1;     # are spaces OK?
+
+    for my $idx (1 .. $#$vals) { # intentionally skipping the first line
+      $_ = $vals->[$idx];
+      if (/manage\s+online\s+manuals\s.*Debian/o) {
+        tag "doc-base-abstract-field-is-template", "$dbfile:$line" unless $pkg eq "doc-base";
+
+      } elsif (/^(\s+)\.(\s*)$/o) {
+        tag "doc-base-abstract-field-separator-extra-whitespaces", "$dbfile:" . ($line - $#$vals + $idx)
+          if $1 ne " " || $2;
+
+      } elsif (!$leadsp && /^(\s+)(\S)/o) { # the regexp should always match
+        ($leadsp, $charafter) = ($1, $2);
+        $leadsp_different     = $leadsp eq " ";
+
+      } elsif (!$leadsp_different && /^(\s+)(\S)/o) { # the regexp should always match
+      	undef $charafter if $charafter && $charafter ne $2;
+        $leadsp_different     = 1 if ($1 ne $leadsp)
+                                    or ($1 eq $leadsp  && $charafter);
+      }
+    }
+    tag "doc-base-abstract-might-contain-extra-leading-whitespaces", "$dbfile:$line"
+      unless $leadsp_different;
+
+    spelling_check("spelling-error-in-doc-base-abstract-field", join (' ', @$vals), "$dbfile:$line")
+      if @$vals;
+
+ }
+}
+
+
+# Checks section of doc-base control file
+# Tries to find required fields missing in the section
+sub check_doc_base_file_section {
+  my ($dbfile, $line, $sawfields, $sawformats, $knownfields) = @_;
+
+  tag "doc-base-file-no-format", "$dbfile:$line"
+    if (defined $sawfields->{'files'} || defined $sawfields->{'index'})
+      && ! (defined $sawfields->{'format'});
+
+  if ($sawfields->{'format'}) {
+    my $format =  $sawformats->{' *current* '}; # set by check_doc_base_field
+
+    tag "doc-base-file-no-index", "$dbfile:$line"
+      if $format && ($format eq 'html' || $format eq 'info')
+         && !$sawfields->{'index'};
+  }
+
+  map { tag "doc-base-file-lacks-required-field", "$dbfile:$line", "$_"
+      if $knownfields->{$_} == 1 && !$sawfields->{$_}
+    } sort (keys %$knownfields);
+}
+
+
+
 
 # Add file and link to %all_files and %all_links.  Note that both files and
 # links have to include a leading /.
diff -Nur checks.old/menus.desc checks/menus.desc
--- checks.old/menus.desc	2007-10-15 06:14:24.000000000 +0200
+++ checks/menus.desc	2007-10-30 19:52:30.000000000 +0100
@@ -162,7 +162,7 @@
 Info: The Index field in a doc-base file should reference the single index
  file for that document.  Any other files belonging to the same document
  should be listed in the Files field.
-Ref: Debian doc-base Manual section 2.3
+Ref: Debian doc-base Manual section 2.3.2.2
 
 Tag: doc-base-file-references-missing-file
 Type: error
@@ -176,7 +176,7 @@
 Info: The Format field in this doc-base control file declares a format
  that is not supported.  Recognized formats are "HTML", "Text", "PDF",
  "PostScript", "Info", "DVI", and "DebianDoc-SGML" (case-insensitive).
-Ref: Debian doc-base Manual section 2.3
+Ref: Debian doc-base Manual section 2.3.2.2
 
 Tag: doc-base-file-no-format
 Type: error
@@ -184,6 +184,12 @@
  format.  Each section after the first must specify a format.
 Ref: Debian doc-base Manual section 2.3.2.2
 
+Tag: doc-base-file-no-format-section
+Type: error
+Info: This doc-base control file didn't specify any format
+ section.
+Ref: Debian doc-base Manual section 2.3.2.2
+
 Tag: doc-base-file-no-index
 Type: error
 Info: Format sections in doc-base control files for HTML or Info documents
@@ -198,3 +204,85 @@
  doc-base (at least as of 0.8.5) cannot cope with such fields and
  debhelper 5.0.57 or earlier may create files ending in whitespace when
  installing such files.
+
+Tag: doc-base-document-field-not-in-first-line
+Type: error
+Info: The Document field in a doc-base control file must be located on
+ the first line of the file.  While unregistering documents, doc-base 0.8
+ and later parses only the first line of the control file for performance
+ reasons.
+Ref: Debian doc-base Manual section 2.3.2.1
+
+Tag: doc-base-file-unknown-field
+Type: error
+Info: The doc-base control file contains a field which is either unknown
+ or not valid for the section where it was found.  Possible reasons for this
+ error are: a typo in field name, missing empty line between control file
+ sections, or an extra empty line separating sections.
+Ref: Debian doc-base Manual sections 2.3.2.1 and 2.3.2.2
+
+Tag: doc-base-file-duplicated-field
+Type: error
+Info: The doc-base control file contains a duplicated field.
+
+Tag: doc-base-file-duplicated-format
+Type: error
+Info: The doc-base control file contains a duplicated format.
+ Doc-base files must not register different documents in
+ one control file.
+Ref: Debian doc-base Manual section 2.3.2.2
+
+Tag: doc-base-file-lacks-required-field
+Type: error
+Info: The doc-base control file does not contain a required field for
+ the appropriate section.
+Ref: Debian doc-base Manual sections 2.3.2.1 and 2.3.2.2
+
+Tag: doc-base-invalid-document-field
+Type: error
+Info: The Document field should consist only of lowercase letters (a-z),
+ digits (0-9), plus (+) or minus (-) signs, and dots (.).
+Ref: Debian doc-base Manual section 2.2
+
+Tag: doc-base-abstract-field-is-template
+Type: warning
+Info: The Abstract field of doc-base contains a "manage online manuals" phrase,
+ which was copied verbatim from an example control file found in section 2.3.1
+ of the Debian doc-base Manual.
+
+Tag: doc-base-abstract-might-contain-extra-leading-whitespaces
+Type: warning
+Info: Continuation lines of the Abstract field of a doc-base control file
+ should start with only one space, unless they are meant to be displayed
+ verbatim by frontends.
+Ref: Debian doc-base Manual section 2.3.2
+
+Tag: doc-base-abstract-field-separator-extra-whitespaces
+Type: warning
+Info: Unnecessary spaces were found in the paragraph separator line
+ of the doc-base Abstract field.  The separator line should consist
+ of a single space followed by a single dot.
+Ref: Debian doc-base Manual section 2.3.2
+
+Tag: spelling-error-in-doc-base-title-field
+Type: error
+Info: Lintian found a spelling error in the Title field of a doc-base
+ control file.  Lintian has a list of common misspellings that it
+ looks for; it does not have a dictionary like a spelling checker does.
+
+Tag: spelling-error-in-doc-base-abstract-field
+Type: error
+Info: Lintian found a spelling error in the Abstract field of a doc-base
+ control file.  Lintian has a list of common misspellings that it
+ looks for; it does not have a dictionary like a spelling checker does.
+
+Tag: doc-base-file-syntax-error
+Type: error
+Info: Lintian found a syntax error in the doc-base control file.
+Ref: Debian doc-base Manual section 2.3.2.2
+
+Tag: doc-base-file-separator-extra-whitespaces
+Type: warning
+Info: Unnecessary spaces were found in the doc-base file sections'
+ separator. The section separator is an empty line.
+Ref: Debian doc-base Manual section 2.3.2
diff -Nur checks.old/spelling_common.pm checks/spelling_common.pm
--- checks.old/spelling_common.pm	1970-01-01 01:00:00.000000000 +0100
+++ checks/spelling_common.pm	2007-10-29 20:26:06.000000000 +0100
@@ -0,0 +1,356 @@
+# spelling -- lintian check script -*- perl -*-
+
+# Look for common spelling errors in the package description and the
+# copyright file.
+
+# Copyright (C) 1998 Richard Braakman
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, you can find it on the World Wide
+# Web at http://www.gnu.org/copyleft/gpl.html, or write to the Free
+# Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston,
+# MA 02110-1301, USA.
+
+package spelling_common;
+use strict;
+use Tags;
+
+use base qw(Exporter);
+our @EXPORT = qw(spelling_check);     
+
+# All spelling errors that have been observed "in the wild" in package
+# descriptions are added here, on the grounds that if they occurred
+# once they are more likely to occur again.
+
+# Misspellings of "compatibility", "separate", and "similar" are 
+# particularly common.
+
+# Be careful with corrections that involve punctuation, since the check
+# is a bit rough with punctuation.  For example, I had to delete the
+# correction of "builtin" to "built-in".
+
+my %corrections = qw(
+		     accesnt accent
+		     accelleration acceleration
+		     accessable accessible
+		     accomodate accommodate
+		     acess access
+		     acording according
+		     additionaly additionally
+		     adress address
+		     adresses addresses
+		     adviced advised
+		     albumns albums
+		     alegorical allegorical
+		     algorith algorithm
+		     allpication application
+		     altough although
+		     alows allows
+		     amoung among
+		     amout amount
+		     analysator analyzer
+		     ang and
+		     appropiate appropriate
+		     arraival arrival
+		     artifical artificial
+		     artillary artillery
+		     attemps attempts
+		     authentification authentication
+		     automaticly automatically
+		     automatize automate
+		     automatized automated
+		     automatizes automates
+		     auxilliary auxiliary
+		     availavility availability
+		     availble available
+		     avaliable available
+		     availiable available
+		     backgroud background
+		     baloons balloons
+		     becomming becoming
+		     becuase because
+		     calender calendar
+		     cariage carriage
+		     challanges challenges
+		     changable changeable
+		     charachters characters
+		     charcter character
+		     choosen chosen
+		     colorfull colorful
+		     comand command
+		     commerical commercial
+		     comminucation communication
+		     commoditiy commodity
+		     compability compatibility
+		     compatability compatibility
+		     compatable compatible
+		     compatibiliy compatibility
+		     compatibilty compatibility
+		     compleatly completely
+		     complient compliant
+		     compres compress
+		     containes contains
+		     containts contains
+		     contence contents
+		     continous continuous
+		     contraints constraints
+		     convertor converter
+		     convinient convenient
+		     cryptocraphic cryptographic
+		     deamon daemon
+		     debain Debian
+		     debians Debian\'s
+		     decompres decompress
+		     definate definite
+		     definately definitely
+		     dependancies dependencies
+		     dependancy dependency
+		     dependant dependent
+		     developement development
+		     developped developed
+		     deveolpment development
+		     devided divided
+		     dictionnary dictionary
+		     diplay display
+		     disapeared disappeared
+		     dissapears disappears
+		     documentaion documentation
+		     docuentation documentation
+		     documantation documentation
+		     dont don\'t
+		     easilly easily
+		     ecspecially especially
+		     edditable editable
+		     editting editing
+		     eletronic electronic
+		     enchanced enhanced
+		     encorporating incorporating
+		     enlightnment enlightenment
+		     enterily entirely
+		     enviroiment environment
+		     environement environment
+		     excellant excellent
+		     exlcude exclude
+		     exprimental experimental
+		     extention extension
+		     failuer failure
+		     familar familiar
+		     fatser faster
+		     fetaures features
+		     forse force
+		     fortan fortran
+		     framwork framework
+		     fuction function
+		     fuctions functions
+		     functionnality functionality
+		     functonality functionality
+		     functionaly functionally
+		     futhermore furthermore
+		     generiously generously
+		     grahical graphical
+		     grahpical graphical
+		     grapic graphic
+		     guage gauge
+		     halfs halves
+		     heirarchically hierarchically
+		     helpfull helpful
+		     hierachy hierarchy
+		     hierarchie hierarchy
+		     howver however
+		     implemantation implementation
+		     incomming incoming
+		     incompatabilities incompatibilities
+		     indended intended
+		     indendation indentation
+		     independant independent
+		     informatiom information
+		     initalize initialize
+		     inofficial unofficial
+		     integreated integrated
+		     integrety integrity
+		     integrey integrity
+		     intendet intended
+		     interchangable interchangeable
+		     intermittant intermittent
+		     jave java
+		     langage language
+		     langauage language
+		     langugage language
+		     lauch launch
+		     lesstiff lesstif
+		     libaries libraries
+		     libary library
+		     licenceing licencing
+		     loggin login
+		     logile logfile
+		     loggging logging
+		     maintainance maintenance
+		     maintainence maintenance
+		     makeing making
+		     managable manageable
+		     manoeuvering maneuvering
+		     mathimatic mathematic
+		     mathimatics mathematics
+		     mathimatical mathematical
+		     ment meant
+		     modulues modules
+		     monochromo monochrome
+		     multidimensionnal multidimensional
+		     navagating navigating
+		     nead need
+		     neccesary necessary
+		     neccessary necessary
+		     necesary necessary
+		     nescessary necessary
+		     noticable noticeable
+		     optionnal optional
+		     orientatied orientated
+		     orientied oriented
+		     pacakge package
+		     pachage package
+		     packacge package
+		     packege package
+		     packge package
+		     pakage package
+		     particularily particularly
+		     persistant persistent
+		     plattform platform
+		     ploting plotting
+		     protable portable
+		     posible possible
+		     powerfull powerful
+		     prefered preferred
+		     prefferably preferably
+		     prepaired prepared
+		     princliple principle
+		     priorty priority
+		     proccesors processors
+		     proces process
+		     processsing processing
+		     processessing processing
+		     progams programs
+		     programers programmers
+		     programm program
+		     programms programs
+		     promps prompts
+		     pronnounced pronounced
+		     prononciation pronunciation
+		     pronouce pronounce
+		     protcol protocol
+		     protocoll protocol
+		     recieve receive
+		     recieved received
+		     redircet redirect
+		     regulamentations regulations
+		     remoote remote
+		     repectively respectively
+		     replacments replacements
+		     requiere require
+		     runnning running
+		     safly safely
+		     savable saveable
+		     searchs searches
+		     separatly separately
+		     seperate separate
+		     seperated separated
+		     seperately separately
+		     seperatly separately
+		     serveral several
+		     setts sets
+		     similiar similar
+		     simliar similar
+		     speach speech
+		     splitted split
+		     standart standard
+		     staically statically
+		     staticly statically
+		     succesful successful
+		     succesfully successfully
+		     suplied supplied
+		     suport support
+		     suppport support
+		     supportin supporting
+		     synchonized synchronized
+		     syncronize synchronize
+		     syncronizing synchronizing
+		     syncronus synchronous
+		     syste system
+		     sythesis synthesis
+		     taht that
+		     throught through
+		     useable usable
+		     usefull useful
+		     usera users
+		     usetnet Usenet
+		     utilites utilities
+		     utillities utilities
+		     utilties utilities
+		     utiltity utility
+		     utitlty utility
+		     variantions variations
+		     varient variant
+		     verson version
+		     vicefersa vice-versa
+		     yur your
+		     wheter whether
+		     wierd weird
+		     xwindows X
+		    );
+# The format above doesn't allow spaces
+$corrections{'alot'} = 'a lot';
+
+my %corrections_language_names = qw(
+				    english English
+				    french French
+				    german German
+				    russian Russian
+				   );
+
+# -----------------------------------
+
+sub _tag {
+    my @args = grep { defined($_)} @_;
+    tag(@args);
+}
+
+sub spelling_check {
+    my $tag = shift;
+    my $file = shift;
+    my $filename = shift;
+
+
+    foreach my $word (split(/\s+/, $file)) {
+	# before lowercasing the word, check if it's a non-uppercased
+	# language name
+	if (exists $corrections_language_names{$word}) {
+	    _tag($tag, $filename, $word, $corrections_language_names{$word});
+        }
+	$word = lc $word;
+	# try deleting the non-alphabetic parts from the word.
+	# Treat apostrophes specially: only delete them if they occur
+	# at the beginning or end of the word.
+	$word =~ s/(^\')|[^\w\xc0-\xd6\xd8-\xf6\xf8-\xff\']+|(\'$)//g;
+	if (exists $corrections{$word}) {
+	    _tag($tag, $filename, $word, $corrections{$word});
+        }
+    }
+    # special case for correcting a multi-word string
+    # $corrections{'Debian/GNU Linux'} = 'Debian GNU/Linux';
+    if ($file =~ m,Debian/GNU Linux,) {
+	_tag($tag, $filename, "Debian/GNU Linux", "Debian GNU/Linux");
+    }
+}
+
+1;
+
+# vim: syntax=perl
