Bug#796170: lintian: [new check] warn on non-UTF8 text files



Package: lintian
Version: 2.5.36.1
Severity: wishlist
Tags: patch


Here's an experimental tag, a step towards eliminating mojibake
system-wide.  It checks all text files in *bin/ and /usr/share/doc/, as well
as files that look like scripts.  "Text" is defined as not having any bytes
in the 0..31 range other than tabs, newlines (including Windows ones) or form
feeds.  In practice, this definition appears to work pretty well, although
the list of files that should be skipped despite being text still needs work.
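
The definition of "text" above can be illustrated with a small standalone
sketch.  This is not the code from the patch: classify_bytes is a
hypothetical helper, and using the core Encode module for strict UTF-8
validation is an assumption made for this example (the patch itself relies
on Lintian's own is_string_utf8_encoded).

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of Lintian: classify a byte string as
# 'binary', 'utf8' or 'non-utf8' using the definition of "text" above.
sub classify_bytes {
    my ($bytes) = @_;

    # Any byte in 0..31 other than tab, LF, form feed or CR means the
    # data is not treated as text.
    return 'binary' if $bytes =~ m/[^\t\n\f\r\x20-\xff]/;

    # Strict UTF-8 validation on a copy; FB_CROAK dies on malformed input.
    my $copy = $bytes;
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return $ok ? 'utf8' : 'non-utf8';
}

print classify_bytes("caf\xc3\xa9\n"), "\n";      # "café" in UTF-8
print classify_bytes("caf\xe9\n"), "\n";          # "café" in Latin-1
print classify_bytes("\x7fELF\x01\x00"), "\n";    # contains control bytes
```

Only the 'non-utf8' case corresponds to what the proposed tag would report;
binary data and valid UTF-8 (including plain ASCII) are left alone.  To try
this on a real file, read it with the ":raw" layer so that Perl does not
decode the bytes first.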

It's a part of the "UTF-8 everywhere" release goal that I intend to
re-propose for Stretch.

This is only a preliminary version; let me know what you think.  If you're
at DebConf, you can talk to me in person.
From 902283f122c71c88b968abfc3c778686200c9361 Mon Sep 17 00:00:00 2001
From: Adam Borowski <kilobyte@angband.pl>
Date: Wed, 19 Aug 2015 23:32:39 +0200
Subject: [PATCH] New experimental tag: text-file-uses-obsolete-encoding

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
---
 checks/files.desc   | 11 +++++++++++
 checks/files.pm     | 10 +++++++++-
 lib/Lintian/Util.pm | 22 ++++++++++++++++++++++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/checks/files.desc b/checks/files.desc
index 1deb2cc..2c07021 100644
--- a/checks/files.desc
+++ b/checks/files.desc
@@ -1631,6 +1631,17 @@ Info: The given file is in PATH but consists of non-ASCII characters.
  Note that Lintian may be unable to display the filename accurately.
  Unprintable characters may have been replaced.
 
+Tag: text-file-uses-obsolete-encoding
+Severity: normal
+Certainty: possible
+Experimental: yes
+Info: The given file is text but uses a non-UTF8 encoding.
+ .
+ Debian has defaulted to UTF8 for a long time, and support for obsolete
+ encodings is being phased out.  Users trying to read this file will see
+ mangled characters (often called "mojibake").  You should convert it to
+ UTF8 using iconv or a similar tool.
+
 Tag: incorrect-naming-of-pkcs11-module
 Severity: important
 Certainty: certain
diff --git a/checks/files.pm b/checks/files.pm
index b816ed8..0940bbf 100644
--- a/checks/files.pm
+++ b/checks/files.pm
@@ -27,7 +27,7 @@ use Lintian::Data;
 use Lintian::Output qw(warning);
 use Lintian::Tags qw(tag);
 use Lintian::Util qw(drain_pipe fail is_string_utf8_encoded open_gz
-  signal_number2name strip normalize_pkg_path);
+  signal_number2name strip normalize_pkg_path file_is_non_utf8_text);
 use Lintian::SlidingWindow;
 
 use constant BLOCKSIZE => 16_384;
@@ -1514,6 +1514,14 @@ sub run {
                   if $info->index($1);
             }
 
+            # ---------------- encoding
+            if (   $fname =~ m{^(?:usr/)?s?bin/}
+                or $fname =~ m{\.(?:pm|py|pl|txt)$}
+                or $fname =~ m{^usr/share/doc}) {
+                tag 'text-file-uses-obsolete-encoding', $file
+                  if file_is_non_utf8_text($file);
+            }
+
             # ---------------- general: setuid/setgid files!
             if ($operm & 06000) {
                 my ($setuid, $setgid) = ('','');
diff --git a/lib/Lintian/Util.pm b/lib/Lintian/Util.pm
index 0b8fa5a..f68ea71 100644
--- a/lib/Lintian/Util.pm
+++ b/lib/Lintian/Util.pm
@@ -62,6 +62,7 @@ BEGIN {
           slurp_entire_file
           file_is_encoded_in_non_utf8
           is_string_utf8_encoded
+          file_is_non_utf8_text
           fail
           strip
           lstrip
@@ -859,6 +860,27 @@ sub file_is_encoded_in_non_utf8 {
     return $line;
 }
 
+=item file_is_non_utf8_text (FILE)
+
+Returns a true value if FILE is a text file that is not valid UTF8.  Both
+binary files and text files encoded in proper UTF8 give a negative answer.
+
+=cut
+
+sub file_is_non_utf8_text {
+    my ($file) = @_;
+
+    my $fd = ($file->file_info =~ m/gzip compressed/) ? $file->open_gz : $file->open;
+    my $bad=0;
+    while (<$fd>) {
+        return close($fd), 0 if (!m/^[\t\n\f\r -\x{ff}]+$/);
+        $bad=1 if (!is_string_utf8_encoded($_));
+    }
+    close($fd);
+
+    return $bad;
+}
+
 =item system_env (CMD)
 
 Behaves like system (CMD) except that the environment of CMD is
-- 
2.1.4

