
Bug#796170: marked as done (lintian: warn on non-UTF8 text files)



Your message dated Wed, 03 Jun 2020 09:49:11 +0000
with message-id <E1jgQ1H-0007OH-ID@fasolo.debian.org>
and subject line Bug#796170: fixed in lintian 2.80.0
has caused the Debian Bug report #796170,
regarding lintian: warn on non-UTF8 text files
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)


-- 
796170: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=796170
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems
--- Begin Message ---
Package: lintian
Version: 2.5.36.1
Severity: wishlist
Tags: patch


Here's an experimental tag, a step towards eliminating mojibake
system-wide.  It checks all text files in *bin/ and /usr/share/doc/, plus
files that look like scripts.  "Text" is defined as containing no bytes in
the 0..31 range other than tabs, newlines (including Windows ones) and form
feeds.  In practice, this definition appears to work pretty well, although
the list of files that should be skipped despite being text needs work.
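
To make the definition above concrete, here is a minimal sketch in Perl.
It is not the code from the attached patch: classify_bytes is a
hypothetical helper, and using Encode's strict 'UTF-8' decoder for the
validity check is an assumption made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not part of Lintian): classify a byte string as
# 'binary', 'utf8' or 'non-utf8-text' using the definition of "text"
# given above.
sub classify_bytes {
    my ($bytes) = @_;

    # Any byte in 0..31 other than tab (0x09), LF (0x0a), form feed
    # (0x0c) or CR (0x0d) means the data is not treated as text.
    return 'binary' if $bytes =~ /[\x00-\x08\x0b\x0e-\x1f]/;

    # Strict decoding croaks on malformed UTF-8; LEAVE_SRC keeps the
    # input string unmodified.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 'utf8' : 'non-utf8-text';
}
```

With this sketch, plain ASCII and well-formed UTF-8 are both reported as
'utf8', a Latin-1 string such as "caf\xe9\n" as 'non-utf8-text', and
anything containing, say, a NUL byte as 'binary'.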

It's a part of the "UTF-8 everywhere" release goal that I intend to
re-propose for Stretch.

This is only a preliminary version; let me know what you think.  If you're
at DebConf, you can contact me in person.
From 902283f122c71c88b968abfc3c778686200c9361 Mon Sep 17 00:00:00 2001
From: Adam Borowski <kilobyte@angband.pl>
Date: Wed, 19 Aug 2015 23:32:39 +0200
Subject: [PATCH] New experimental tag: text-file-uses-obsolete-encoding

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
---
 checks/files.desc   | 11 +++++++++++
 checks/files.pm     | 10 +++++++++-
 lib/Lintian/Util.pm | 22 ++++++++++++++++++++++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/checks/files.desc b/checks/files.desc
index 1deb2cc..2c07021 100644
--- a/checks/files.desc
+++ b/checks/files.desc
@@ -1631,6 +1631,17 @@ Info: The given file is in PATH but consists of non-ASCII characters.
  Note that Lintian may be unable to display the filename accurately.
  Unprintable characters may have been replaced.
 
+Tag: text-file-uses-obsolete-encoding
+Severity: normal
+Certainty: possible
+Experimental: yes
+Info: The given file is text but uses a non-UTF-8 encoding.
+ .
+ Debian has defaulted to UTF-8 for a long time, and support for obsolete
+ encodings is being phased out.  Users trying to read this file will see
+ mangled characters (often called "mojibake").  You should convert it to
+ UTF-8 using iconv or a similar tool.
+
 Tag: incorrect-naming-of-pkcs11-module
 Severity: important
 Certainty: certain
diff --git a/checks/files.pm b/checks/files.pm
index b816ed8..0940bbf 100644
--- a/checks/files.pm
+++ b/checks/files.pm
@@ -27,7 +27,7 @@ use Lintian::Data;
 use Lintian::Output qw(warning);
 use Lintian::Tags qw(tag);
 use Lintian::Util qw(drain_pipe fail is_string_utf8_encoded open_gz
-  signal_number2name strip normalize_pkg_path);
+  signal_number2name strip normalize_pkg_path file_is_non_utf8_text);
 use Lintian::SlidingWindow;
 
 use constant BLOCKSIZE => 16_384;
@@ -1514,6 +1514,14 @@ sub run {
                   if $info->index($1);
             }
 
+            # ---------------- encoding
+            if (   $fname =~ m{^(?:usr/)?s?bin/}
+                or $fname =~ m{\.(?:pm|py|pl|txt)$}
+                or $fname =~ m{^usr/share/doc}) {
+                tag 'text-file-uses-obsolete-encoding', $file
+                  if file_is_non_utf8_text($file);
+            }
+
             # ---------------- general: setuid/setgid files!
             if ($operm & 06000) {
                 my ($setuid, $setgid) = ('','');
diff --git a/lib/Lintian/Util.pm b/lib/Lintian/Util.pm
index 0b8fa5a..f68ea71 100644
--- a/lib/Lintian/Util.pm
+++ b/lib/Lintian/Util.pm
@@ -62,6 +62,7 @@ BEGIN {
           slurp_entire_file
           file_is_encoded_in_non_utf8
           is_string_utf8_encoded
+          file_is_non_utf8_text
           fail
           strip
           lstrip
@@ -859,6 +860,27 @@ sub file_is_encoded_in_non_utf8 {
     return $line;
 }
 
+=item file_is_non_utf8_text (FILE)
+
+Returns true only if FILE is text that is not valid UTF-8.  Both binary
+files and text files encoded in proper UTF-8 give a negative answer.
+
+=cut
+
+sub file_is_non_utf8_text {
+    my ($file) = @_;
+
+    my $fd = ($file->file_info =~ m/gzip compressed/) ? $file->open_gz : $file->open;
+    my $bad=0;
+    while (<$fd>) {
+        if (!m/^[\t\n\f\r -\x{ff}]+$/) { close($fd); return 0; }
+        $bad=1 if (!is_string_utf8_encoded($_));
+    }
+    close($fd);
+
+    return $bad;
+}
+
 =item system_env (CMD)
 
 Behaves like system (CMD) except that the environment of CMD is
-- 
2.1.4


--- End Message ---
--- Begin Message ---
Source: lintian
Source-Version: 2.80.0
Done: Chris Lamb <lamby@debian.org>

We believe that the bug you reported is fixed in the latest version of
lintian, which is due to be installed in the Debian FTP archive.

A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed.  If you
have further comments please address them to 796170@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Chris Lamb <lamby@debian.org> (supplier of updated lintian package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@ftp-master.debian.org)


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Format: 1.8
Date: Wed, 03 Jun 2020 09:30:24 +0000
Source: lintian
Built-For-Profiles: nocheck
Architecture: source
Version: 2.80.0
Distribution: unstable
Urgency: medium
Maintainer: Debian Lintian Maintainers <lintian-maint@debian.org>
Changed-By: Chris Lamb <lamby@debian.org>
Closes: 368792 796170 961961
Changes:
 lintian (2.80.0) unstable; urgency=medium
 .
   * Summary of tag changes:
     + Added:
       - national-encoding-in-text-file
 .
   [ Chris Lamb ]
   * Check for execute_after/execute_before spelling mistakes, etc., just
     like for override_.
 .
   [ Felix Lechner ]
   * Add check for nationally encoded text files in installation packages.
     (Closes: #796170)
   * Mention discussion about allowing some paths for the
     script-not-executable tag. (Closes: #368792)
   * Fix regex for Guile bytecode with respect to ELF-related tags.
     (Closes: #961961)
 .
   [ Paul Wise ]
   * Add several spelling corrections.
Checksums-Sha1:
 d4e510be0a65b57cc63bf868a2a391322e7cc5a4 4233 lintian_2.80.0.dsc
 d6872a75e0f604557cc85f09c57bc5efb20627f7 1936144 lintian_2.80.0.tar.xz
 db011898ec52cae4af6c3571e65a58cb6db9a683 5924 lintian_2.80.0_amd64.buildinfo
Checksums-Sha256:
 59141cb4ef98a35ac9c78b9e7fcd79c315fbec5d784b448cc75e3f94f642e9ab 4233 lintian_2.80.0.dsc
 3b65f40f0f98c21f7065b4f45bfd3baa12ac79a5fcbc97a5331b9850e9010bb0 1936144 lintian_2.80.0.tar.xz
 2462a31daf93c97f21a33c1a11a063615c8c465d6690c8067abb9cd2dcf7c36c 5924 lintian_2.80.0_amd64.buildinfo
Files:
 2abb08710136564c0c4ccaadf0222e4c 4233 devel optional lintian_2.80.0.dsc
 ea01cc0f5307a09be8f5084670533457 1936144 devel optional lintian_2.80.0.tar.xz
 0e4f2aff86edf7fa2503d8edd90102c3 5924 devel optional lintian_2.80.0_amd64.buildinfo

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEwv5L0nHBObhsUz5GHpU+J9QxHlgFAl7XbqYACgkQHpU+J9Qx
HlhpZQ/+OPHzP3CJ5eKgRGkenhbVai0qb9R6MAo/nBS61NoJ2XHuP3Ip37WGaZJU
3MThCSsTEdJ7I627yXrNRS9OW2t8lb3u16J8JyV2JD7XiWSnROf5shHDP8PgfrJk
0vNtKniRIMC9aZrqfiacNFwU+1gHMD7LmNEoxwooZ226tZlDSsisKNrbGnJDAwOm
w0C/nm4EMZNGbWboS08fzTcZFw3IFQSYVebh4guaIHLDsqPig1mThfbkRNkdTThe
NPBPXPyh3maQfNGagtThLBdsYYMWY6vuMrsea9zih0PpxgyMxzwb988bzwpmwXhi
nYCRgjXwyjMSYlrRS9kZ9T7pekwil8sj6fPegFBkfxNLsQqGQx0wkoo3PWcDOygD
ZgNmzspKXn9ntxNzhJT6/aOkM1VpPfy0ioEvmOfI0yS4YB0KkBzul+RpBk7HtG7P
uqCnwRYB4erfdJlWJweh3dlaPaB3mYNoPKDCHJ/SurKwNmYKf8vNC10YpTHIoTv/
FJO1VKDgCj2qGWUQb98ACrX0h4dWRLXUDskHpYXBRPWe8F2BJWZcXuoTooX7m4s+
MdHRtpWSXfGvSWElafmiMjcbCtMOXhbQbPA425rONUAFE8eq8/JaYbV8y6F6RDiu
U8s5D59WOZ7ZQcPOS4tig5+B5+4pwpK0jWZN/ET5NHaXriTKQOA=
=TlJR
-----END PGP SIGNATURE-----

--- End Message ---
