Re: Maintainer: field in .changes

To: debian-mentors@lists.debian.org
Subject: Re: Maintainer: field in .changes
From: Florent Rougon <f.rougon@free.fr>
Date: Tue, 20 Jan 2004 20:09:36 +0100
Message-id: <871xpuh1xr.fsf@florent.maison>
Mail-followup-to: debian-mentors@lists.debian.org
In-reply-to: <20040120155654.GD392@tennyson.netexpress.net> (Steve Langasek's message of "Tue, 20 Jan 2004 09:56:54 -0600")
References: <1074540568.11187.101.camel@localhost> <1074543577.1585.103.camel@descent.netsplit.com> <877jznvvch.fsf@alhambra.bioz.unibas.ch> <20040120115405.GA3831@deprecation.cyrius.com> <878yk2he9w.fsf@florent.maison> <20040120155654.GD392@tennyson.netexpress.net>

Steve Langasek <vorlon@netexpress.net> wrote:

> The best heuristic is to first check whether it's valid UTF-8, and if it
> isn't, convert it from latin-1 to UTF-8.  This correctly detects the
> vast majority of texts; but if what you want is UTF-8, it's always
> better to use that in the first place.

OK, then my response to Martin's question is in the attached script
(x2utf8). Basically, it boils down to:

    try:
        u = input.decode("utf_8")
    except UnicodeError:
        u = input.decode("iso8859_1")

to get a Unicode object from the input using your heuristic and then:

    u.encode("utf_8")

to encode it in UTF-8. The rest is mostly error handling, to be sure to
catch every possible conversion error. [Note that Python 2.3 allows
finer-grained error handling in this area.]
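
For readers on Python 3, where byte strings and text are distinct types,
the same heuristic can be sketched roughly as follows. This is only a
minimal sketch, not the attached x2utf8 script: the function name to_utf8
is a hypothetical choice, and the script's extra error handling is left
out.

```python
def to_utf8(data: bytes) -> bytes:
    """Return data re-encoded as UTF-8, guessing the input encoding.

    Hypothetical helper: try strict UTF-8 first, fall back to Latin-1.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid sequence.
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # ISO 8859-1 maps every byte value to a character, so this
        # fallback cannot fail.
        text = data.decode("iso8859-1")
    return text.encode("utf-8")
```

Since the fallback always succeeds, the function returns a result for any
input; whether that result is meaningful depends on the input really being
UTF-8 or Latin-1.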

Smoke test
----------

Notes:
  - If your mailer thinks it sees UTF-8 in this mail, it is broken.
  - For the following session transcript, my terminal produces
    ISO 8859-15-encoded strings (I have LC_CTYPE set to fr_FR@euro).
    All the characters used in this session have the same encoding in
    ISO 8859-1, so you can consider that they were passed to the
    various programs as ISO 8859-1 (Latin-1).

# Feed the script with a Latin-1-encoded string
% echo "Ouééé, c môrtèèèl" | x2utf8 | od -tx1
0000000 4f 75 c3 a9 c3 a9 c3 a9 2c 20 63 20 6d c3 b4 72
0000020 74 c3 a8 c3 a8 c3 a8 6c 0a
0000031

# Feed it with the same string, but UTF-8-encoded. Same result.
% echo "Ouééé, c môrtèèèl" | recode latin1..utf8 | x2utf8 | od -tx1
0000000 4f 75 c3 a9 c3 a9 c3 a9 2c 20 63 20 6d c3 b4 72
0000020 74 c3 a8 c3 a8 c3 a8 6c 0a
0000031

# What does the UTF-8 encoding of "é" (LATIN SMALL LETTER E WITH ACUTE)
# look like if interpreted as Latin-1?
% echo "é" | recode latin1..utf8
Ã©
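
The same misreading can be reproduced without recode or a Latin-1
terminal. The following is a small Python 3 sketch; the variable names
are only illustrative.

```python
# UTF-8 encodes U+00E9 (LATIN SMALL LETTER E WITH ACUTE) as two bytes.
utf8_bytes = "é".encode("utf-8")

# Reading those two bytes as Latin-1 turns each byte into one character.
misread = utf8_bytes.decode("iso8859-1")

print(utf8_bytes.hex())  # the two bytes: c3a9
print(misread)           # the two characters a Latin-1 terminal displays
```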

# Feed this character (LATIN SMALL LETTER E WITH ACUTE) as UTF-8 to the
# script. We get its UTF-8 encoding. Right.
% echo "é" | recode latin1..utf8 | x2utf8 | od -tx1
0000000 c3 a9 0a
0000003

# Now, modify the second byte of the 2-byte sequence that represents
# LATIN SMALL LETTER E WITH ACUTE in UTF-8 so that the resulting
# sequence is not a valid UTF-8-encoded string anymore. The string gets
# interpreted as Latin-1 by the script and it outputs its UTF-8 encoding.
#   [ c3 83 is the UTF-8 encoding of "Ã" and 61 is the ASCII code of
#     "a", so it is also its UTF-8 representation. ]
% echo "Ãa" | x2utf8 | od -tx1
0000000 c3 83 61 0a
0000004
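
For anyone who wants to repeat these checks without a Latin-1 terminal,
here is a rough Python 3 equivalent of the transcript. It is a sketch
under assumptions: to_utf8 and hexdump are hypothetical helpers that
stand in for x2utf8 and od -tx1, and the input bytes are built explicitly
instead of coming from the shell.

```python
def to_utf8(data: bytes) -> bytes:
    # Hypothetical stand-in for x2utf8: UTF-8 first, Latin-1 as fallback.
    try:
        return data.decode("utf-8").encode("utf-8")
    except UnicodeDecodeError:
        return data.decode("iso8859-1").encode("utf-8")


def hexdump(data: bytes) -> str:
    # Space-separated hex bytes, similar to the byte columns of od -tx1.
    return " ".join(f"{b:02x}" for b in data)


sample = "Ouééé, c môrtèèèl\n"
latin1_input = sample.encode("iso8859-1")
utf8_input = sample.encode("utf-8")
# "Ãa" plus newline as Latin-1 bytes: c3 followed by 61 is invalid UTF-8.
broken_input = bytes.fromhex("c3 61 0a")

# Both encodings of the sample give the same UTF-8 output.
print(hexdump(to_utf8(latin1_input)))
print(hexdump(to_utf8(utf8_input)))

# The invalid sequence is treated as Latin-1 and re-encoded.
print(hexdump(to_utf8(broken_input)))  # c3 83 61 0a
```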

Attachment: x2utf8
Description: Unknown to UTF-8 conversion

The script works with Python 2.2 or greater. I think it could be made to
work relatively easily with 2.1, but I didn't bother.

-- 
Florent
