
Status of UTF-8 support in Lintian



Hi,

Earlier this week, I removed the PerlIO layers for automatic UTF-8
conversion from Lintian and the test suite. The changes are in this
commit and the one following it:

    https://salsa.debian.org/lintian/lintian/-/commit/a5584fcc6e4c14723006d2f552834d29c2ed314d

The commits are comprehensive in the sense that they make most
conversions to and from UTF-8 explicit. The rationale is outlined in
Bug#972878. The underlying Perl bugs have been unresolved since 2007
and 2011.
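
To illustrate what "explicit" means here, the following is a minimal
sketch in Python rather than Perl, and it is not taken from the
commits. The helper names read_text and write_text are hypothetical.
The idea is that files are opened without any decoding layer and the
conversion happens in a separate, visible step that fails loudly on
ill-formed input:

```python
import os
import tempfile


def read_text(path):
    """Read raw octets, then decode them as UTF-8 in a separate step."""
    with open(path, "rb") as handle:  # binary mode: no implicit decoding
        octets = handle.read()
    # Strict decoding raises UnicodeDecodeError on ill-formed sequences.
    return octets.decode("utf-8", errors="strict")


def write_text(path, text):
    """Encode text to UTF-8 explicitly, then write the raw octets."""
    with open(path, "wb") as handle:
        handle.write(text.encode("utf-8"))


# Usage: round-trip a non-ASCII string through a temporary file.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "sample.txt")
    write_text(sample, "szótár")
    round_trip = read_text(sample)
```

The cost is more code at each I/O boundary. The benefit is that every
place where octets become text (or the reverse) is visible in the
source.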

The remedy we ultimately adopted was suggested by the good folks on
#perl-help. Due to the broad impact of the changes, which took over a
thousand edits, there was concern about introducing new bugs.

We do not yet have the ability to explore differences between Lintian
versions in the archive (which may be possible with the new website at
some point) so for now, I only looked at program errors caused by
ill-formed UTF-8 octet sequences. They arose in three instances.

In two cases, upstream sources shipped scripts whose hashbang (#!)
lines name interpreters in non-UTF-8 encodings. Since Debian uses
UTF-8 for file names, these scripts cannot run in Debian (and, if the
files were shipped in an installable package, the non-UTF-8 encoding
would also be flagged by Lintian), but there is nothing a Debian
maintainer can do. The conversion was disabled here:

    https://salsa.debian.org/lintian/lintian/-/commit/86997a883d101662fff8e49e844d7b496e0b39e4

It affected the following two sources: the file
'szotar/szoszablya/ragozatlan.2' in magyarispell_1.6.1-2.dsc and the
file 'tests/d2/dmd-testsuite/compilable/test13512.d' (a test file) in
ldc_1.24.0-1.dsc. Those errors are now gone.
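
For readers who want a picture of the failure mode, here is a hedged
sketch in Python (not the Perl code from the commit above). The
function hashbang_interpreter is a hypothetical name. It treats the
first line of a script as raw octets and only converts the
interpreter to text when the octets are well-formed UTF-8, so that
ill-formed input yields "no usable interpreter" instead of a program
error:

```python
def hashbang_interpreter(first_line: bytes):
    """Return the interpreter named on a '#!' line as text.

    Returns None when the line is not a hashbang line, names no
    interpreter, or names one that is not well-formed UTF-8.
    """
    if not first_line.startswith(b"#!"):
        return None
    fields = first_line[2:].split()  # split on ASCII whitespace
    if not fields:
        return None
    try:
        return fields[0].decode("utf-8")
    except UnicodeDecodeError:
        # Ill-formed octets: leave the value undecoded and report nothing.
        return None


ordinary = hashbang_interpreter(b"#!/bin/sh -e\n")
ill_formed = hashbang_interpreter(b"#!\xe1\xfb bogus\n")
```

Here ordinary is the string "/bin/sh", while ill_formed is None
because the octets E1 FB do not form a valid UTF-8 sequence.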

A more difficult (and still unresolved) issue arose in the installable
debug package libc6-dbg_2.31-5_amd64.deb. Debug packages are generated
by Debian, and should therefore be clean. The file
'usr/lib/debug/.build-id/a2/78dac1d4a7d4aaf37f8c21dba517e3b68663c5.debug'
produces readelf output that is not clean. It can be reproduced with
this command:

    readelf --wide --segments --dynamic --section-details --symbols --version-info

Readelf by itself does not guarantee output in UTF-8, but it should
produce nothing else as a result of other restrictions in Debian.
Perhaps most significantly, this file (a set of generated debug
symbols) is the only file in our archive that triggers this error.
According to readelf, the file requests an unintelligible
interpreter:

  INTERP         0x001000 0x0000000000193f00 0x0000000000193f00 0x000000 0x00001c R   0x10
      [Requesting program interpreter:
���Gb�U3��T��Aopx��a�F?T�e��6�UE?�,y;���?X��A�?�߮��k��?����]

Because the error is unique to this file, and because readelf is
potentially being fed garbage, the presumption is that the debug file
was created incorrectly as a result of a bug elsewhere. The file
currently produces several program errors like this:

    Warning in group glibc/2.31-5: Can't decode ill-formed UTF-8 octet sequence <FF> in position 10050 at ./lib/Lintian/Index/Objdump.pm line 84.
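
A message of this shape can be imitated with any strict UTF-8
decoder. The sketch below is in Python and is only an illustration,
not Lintian's code. The function names are hypothetical, and the
wording of the message is modeled on the warning above. It reports
the first offending octet and its offset in captured tool output:

```python
def first_ill_formed_octet(octets: bytes):
    """Return (position, octet) for the first ill-formed UTF-8 octet.

    Returns None when the input is well-formed UTF-8.
    """
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start, octets[error.start]
    return None


def describe(octets: bytes) -> str:
    """Summarize the decoding result in a single line."""
    found = first_ill_formed_octet(octets)
    if found is None:
        return "well-formed UTF-8"
    position, octet = found
    return f"ill-formed UTF-8 octet <{octet:02X}> in position {position}"


report = describe(b"ok \xff")
```

With this input, report reads "ill-formed UTF-8 octet <FF> in
position 3". The octet FF can never appear in well-formed UTF-8,
which fits the theory that readelf is printing raw binary data
rather than text in some legacy encoding.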

I plan to follow up with the maintainers of gcc and objcopy, which
created the debug file, once I figure out whom to approach.

On a positive note, the UTF-8 changes discussed here are expected to
help greatly with the resolution of the open bug for UTF-8 file names,
Bug#956233.

Kind regards
Felix Lechner

