Bug#796170: lintian: [new check] warn on non-UTF8 text files

To: debian-lint-maint@lists.debian.org
Cc: 796170@bugs.debian.org, Axel Beckert <abe@debian.org>
Subject: Bug#796170: lintian: [new check] warn on non-UTF8 text files
From: Adam Borowski <kilobyte@angband.pl>
Date: Sat, 22 Aug 2015 18:04:39 +0200
Message-id: <[🔎] 20150822160439.GA13484@angband.pl>
Reply-to: Adam Borowski <kilobyte@angband.pl>, 796170@bugs.debian.org
In-reply-to: <[🔎] 55D588A1.5090401@thykier.net>
References: <[🔎] 20150819214359.9385.53342.reportbug@umbar.angband.pl> <[🔎] 55D588A1.5090401@thykier.net>

On Thu, Aug 20, 2015 at 09:58:25AM +0200, Niels Thykier wrote:
> On 2015-08-19 23:43, Adam Borowski wrote:
> Thanks for the suggestion and the prototype patch.
> 
> I think it is an interesting proposal and I think we could try it as an
> experimental check to see if it is a feasible check.

I did some runs on the whole Debian archive, but having it in Lintian as an
experimental tag means it's available to other people in an usable/linkable
way.

> I have some minor technical remarks to the patch interleaved to below.
> It will also need a test case in the t/tests suite.

I'll try to learn how to write lintian test cases.

> > 0001-New-experimental-tag-text-file-uses-obsolete-encodin.patch

> > +            # ---------------- encoding
> > +            if (   $fname =~ m{^(?:usr/)?s?bin/}
> 
> Is this already guarded by a "$file->is_regular_file" (or
> $file->is_file)?  Otherwise, this will break horribly with lintian
> trying to open directories, named pipes, etc.

Yes, it's inside a $file->is_file block.

> > +                or $fname =~ m{\.(?:pm|py|pl|txt)$}
> 
> Minor nit:  Perhaps \.(?:p[myl]|txt(?:\.gz)?)$
>  * Would also test gzip-compressed text files
>  * p[myl] is slightly more optimial than pm|py|pl

Done.  I also added php, rb, tcl and sh to the list of extensions, probably
more should be added too.

> > +    while (<$fd>) {
> 
> I would like to see this use a named variable here (to keep the uses of
> "$_" down).

Done.

> > +        return close($fd), 0 if (!m/^[\t\n\f\r -\x{ff}]+$/);
> 
> The $fd might be a pipe to a subprocess (in the open_gz case).  It would
> need a "drain_pipe" call to avoid spurious errors caused by gzip dying
> by "broken pipe".

Ok, I added drain_pipe($fd) if /...gzip/.

On Thu, Aug 20, 2015 at 12:09:26PM +0200, Jakub Wilk wrote:
] >+Tag: text-file-uses-obsolete-encoding
] 
] We have already a few tags named *-uses-obsolete-national-encoding, so it
] would be nice if this tag used the same scheme.

I wanted to keep the tag short, but you're probably right.  Done.

] Also, please make sure that we don't complain twice about the same file.

This one is harder.  There's several other checks on specific files (like
debian/changelog.Debian.gz) that differ by severity.  Should I hunt down all
such checks and add exclusions?

] >+Severity: normal
] 
] I'd say wishlist here.

Ok, it can always be jacked back up once I achieve world domination^W^W^Wget
it included as a part of a release goal.

] >+ characters (often called "mojibake").  You should convert it to UTF8 using
] >+ iconv or a similar tool.
] 
] iconv should be good for plain text files in /usr/share/doc; but if applied
] blindly to code, or HTML documents or similar, it can cause more harm than
] good.

"Blindly" can break when the code in question deals with obsolete encodings,
but otherwise, I disagree.  Using non-UTF-8 in HTML is always damage, which
leads to data loss once someone gives a form/etc input that's
unrepresentable in the encoding in question.  I've got enough of ń in my
town name mangled when ordering stuff, etc.  Unicode is the only charset
that's universal.

] It would be good to emphasise that the conversion should be done upstream,
] not by "you", the Debian maintainer.

I intend to submit a patch to debhelper that makes the whole process a
matter of writing a single value to a file under debian/ which would
massively reduce the amount of work in ~3000 affected packages.  This should
make it easy enough.  You're right that fixing encodings is best done by the
upstream, but if converting during package build is easy enough, there's no
reason to discourage that.

] >+Info: The given file is text but uses non-UTF8 encoding.
] 
] s/UTF8/UTF-8/, here and elsewhere.

Done.

] >+            # ---------------- encoding
] >+            if (   $fname =~ m{^(?:usr/)?s?bin/}
] 
] Also /usr/games?

/usr/games added.  I initially skipped it for simplicity (there's exactly
1 package failing), but you're right, it's better to be complete.

] Though I'm not sure what's the rationale for choosing these directories...

Text files in /*bin are all scripts rather than random data files, and thus
comments and messages included need all to be in UTF-8.

On Thu, Aug 20, 2015 at 10:05:06AM -0700, Russ Allbery wrote:
} The last time I looked at this in a policy context, the distribution
} included a few documentation files that were intentionally provided
} upstream in multiple different encodings.  In other words, there would be
} a README.sjis and a README.utf8, etc., side-by-side.  In those cases, it
} feels bad to have Lintian tag the README.sjis file and have maintainers
} possibly just not install it, when it might still be a convenience to some
} users.

We default to UTF-8 for a decade, there are already plans to drop support
for running with an obsolete locale.  Shipping these files is a waste of
space.

In any case, in the whole archive we have 3 packages with such files with
sjis inside the file name, 16 with koi8.

I wonder how to gauge if there are actually any users who run obsolete
encodings left.  I'll try something like downloading X thousands of recent
bug reports and gathering Locale: lines reportbug includes.

-- 
⢎⣉⠂⠠⠤⡀⣄⠤⡀⠠⡅⠀⠤⡧⠄⡄⠀⡄⠀⠀⠀⠠⡅⠀⡠⠤⠄⠀⠀⠀⢴⠍⠀⡠⠤⡀⣄⠤⡀⠀⠀⠀⠤⡧⠄⣇⠤⡀⡠⠤⡀⠀⠀⠀⡄⠀⡄⡠⠤⡀⠠⠤⡀⡇⡠⠄⠀⠀⠀
⠢⠤⠃⠪⠭⠇⠇⠀⠇⠀⠣⠀⠀⠣⠄⠨⠭⠃⠀⠀⠀⠀⠣⠀⠬⠭⠂⠀⠀⠀⠸⠀⠀⠣⠤⠃⠇⠀⠀⠀⠀⠀⠀⠣⠄⠇⠀⠇⠫⠭⠁⠀⠀⠀⠣⠣⠃⠫⠭⠁⠪⠭⠇⠏⠢⠄⠀⠄⠀
(https://github.com/kilobyte/braillefont for this hack)

>From d8eccaf039be53a61d5c2f0db995fa9fb875d317 Mon Sep 17 00:00:00 2001
From: Adam Borowski <kilobyte@angband.pl>
Date: Sat, 22 Aug 2015 16:30:11 +0200
Subject: [PATCH] New experimental tag: text-file-uses-obsolete-encoding

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
---
 checks/files.desc   | 11 +++++++++++
 checks/files.pm     | 11 ++++++++++-
 lib/Lintian/Util.pm | 26 ++++++++++++++++++++++++++
 3 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/checks/files.desc b/checks/files.desc
index 1deb2cc..b4c90c1 100644
--- a/checks/files.desc
+++ b/checks/files.desc
@@ -1631,6 +1631,17 @@ Info: The given file is in PATH but consists of non-ASCII characters.
  Note that Lintian may be unable to display the filename accurately.
  Unprintable characters may have been replaced.
 
+Tag: text-file-uses-obsolete-national-encoding
+Severity: wishlist
+Certainty: possible
+Experimental: yes
+Info: The given file is text but uses non-UTF-8 encoding.
+ .
+ Debian defaults to UTF-8 for a long time, and support for obsolete encodings
+ is being phased out.  Users trying to read this file will see mangled
+ characters (often called "mojibake").  You should convert it to UTF8 using
+ iconv or a similar tool.
+
 Tag: incorrect-naming-of-pkcs11-module
 Severity: important
 Certainty: certain
diff --git a/checks/files.pm b/checks/files.pm
index b816ed8..1966a74 100644
--- a/checks/files.pm
+++ b/checks/files.pm
@@ -27,7 +27,7 @@ use Lintian::Data;
 use Lintian::Output qw(warning);
 use Lintian::Tags qw(tag);
 use Lintian::Util qw(drain_pipe fail is_string_utf8_encoded open_gz
-  signal_number2name strip normalize_pkg_path);
+  signal_number2name strip normalize_pkg_path file_is_non_utf8_text);
 use Lintian::SlidingWindow;
 
 use constant BLOCKSIZE => 16_384;
@@ -1514,6 +1514,15 @@ sub run {
                   if $info->index($1);
             }
 
+            # ---------------- encoding
+            if (   $fname =~ m{^(?:usr/)?s?bin/}
+                or $fname =~ m{^usr/games/}
+                or $fname =~ m{\.(?:p[myl]|php|rb|tcl|sh|txt)(?:\.gz)?$}
+                or $fname =~ m{^usr/share/doc}) {
+                tag 'text-file-uses-obsolete-national-encoding', $file
+                  if file_is_non_utf8_text($file);
+            }
+
             # ---------------- general: setuid/setgid files!
             if ($operm & 06000) {
                 my ($setuid, $setgid) = ('','');
diff --git a/lib/Lintian/Util.pm b/lib/Lintian/Util.pm
index 0b8fa5a..d25e70f 100644
--- a/lib/Lintian/Util.pm
+++ b/lib/Lintian/Util.pm
@@ -62,6 +62,7 @@ BEGIN {
           slurp_entire_file
           file_is_encoded_in_non_utf8
           is_string_utf8_encoded
+          file_is_non_utf8_text
           fail
           strip
           lstrip
@@ -859,6 +860,31 @@ sub file_is_encoded_in_non_utf8 {
     return $line;
 }
 
+=item file_is_non_utf8_text (...)
+
+Both binary files and text files encoded in proper UTF8 give a negative
+answer.
+
+=cut
+
+sub file_is_non_utf8_text {
+    my ($file) = @_;
+
+    my $fd = ($file->file_info =~ m/gzip compressed/) ? $file->open_gz : $file->open;
+    my $bad=0;
+    while (my $line = <$fd>) {
+        if ($line =~ m/[^\t\n\f\r -\x{ff}]/) {
+            $bad=0;
+            last;
+        }
+        $bad=1 if (!is_string_utf8_encoded($line));
+    }
+    drain_pipe($fd) if ($file->file_info =~ m/gzip compressed/);
+    close($fd);
+
+    return $bad;
+}
+
 =item system_env (CMD)
 
 Behaves like system (CMD) except that the environment of CMD is
-- 
2.1.4

Reply to:

Follow-Ups:
- Bug#796170: lintian: [new check] warn on non-UTF8 text files
  - From: Jakub Wilk <jwilk@debian.org>

References:
- Bug#796170: lintian: [new check] warn on non-UTF8 text files
  - From: Adam Borowski <kilobyte@angband.pl>
- Bug#796170: lintian: [new check] warn on non-UTF8 text files
  - From: Niels Thykier <niels@thykier.net>

Prev by Date: Bug#796562: lintian: Please identify lack of sanitation compiler/linker flags
Next by Date: Bug#796170: lintian: [new check] warn on non-UTF8 text files
Previous by thread: Bug#796170: lintian: [new check] warn on non-UTF8 text files
Next by thread: Bug#796170: lintian: [new check] warn on non-UTF8 text files
Index(es):
- Date
- Thread