
Re: Maintainer: field in .changes



Steve Langasek <vorlon@netexpress.net> wrote:

> The best heuristic is to first check whether it's valid UTF-8, and if it
> isn't, convert it from latin-1 to UTF-8.  This correctly detects the
> vast majority of texts; but if what you want is UTF-8, it's always
> better to use that in the first place.

OK, then my response to Martin's question is in the attached script
(x2utf8). Basically, it boils down to:

    try:
        u = input.decode("utf_8")
    except UnicodeError:
        u = input.decode("iso8859_1")

to get a Unicode object from the input using your heuristic and then:

    u.encode("utf_8")

to encode it in UTF-8. The rest is mostly error handling to be sure to
catch every possible conversion error. [note that Python 2.3 enables
finer error handling in this area]
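
For readers on a current interpreter, here is a minimal sketch of the same heuristic in Python 3 syntax. It is not the attached script: the function name to_utf8 and the bytes-in, bytes-out interface are assumptions made for illustration, and the extra error handling mentioned above is left out.

```python
def to_utf8(raw: bytes) -> bytes:
    """Re-encode raw as UTF-8, guessing Latin-1 when it is not valid UTF-8.

    Hypothetical helper for illustration; not the attached x2utf8 script.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid sequence.
        text = raw.decode("utf_8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte value to a code point, so this cannot fail.
        text = raw.decode("iso8859_1")
    return text.encode("utf_8")
```

Because the Latin-1 fallback accepts any byte sequence, the function always returns something; the cost is that input in some third encoding is silently treated as Latin-1.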


Smoke test
----------

Notes:
  - If your mailer thinks it sees UTF-8 in this mail, it is broken.
  - My terminal encodes strings in ISO 8859-15 for the following session
    transcript (I have LC_CTYPE set to fr_FR@euro), but all the
    characters used in this session have the same encoding in ISO 8859-1,
    so you can consider that they were passed to the various programs as
    ISO 8859-1 (Latin-1).

# Feed the script with a Latin-1-encoded string
% echo "Ouééé, c môrtèèèl" | x2utf8 | od -tx1
0000000 4f 75 c3 a9 c3 a9 c3 a9 2c 20 63 20 6d c3 b4 72
0000020 74 c3 a8 c3 a8 c3 a8 6c 0a
0000031

# Feed it with the same string, but UTF-8-encoded. Same result.
% echo "Ouééé, c môrtèèèl" | recode latin1..utf8 | x2utf8 | od -tx1
0000000 4f 75 c3 a9 c3 a9 c3 a9 2c 20 63 20 6d c3 b4 72
0000020 74 c3 a8 c3 a8 c3 a8 6c 0a
0000031

# What does the UTF-8 encoding of "é" (LATIN SMALL LETTER E WITH ACUTE)
# look like if interpreted as Latin-1?
% echo "é" | recode latin1..utf8
Ã©

# Feed this character (LATIN SMALL LETTER E WITH ACUTE) as UTF-8 to the
# script. We get its UTF-8 encoding. Right.
% echo "é" | recode latin1..utf8 | x2utf8 | od -tx1
0000000 c3 a9 0a
0000003

# Now, modify the second byte of the 2-byte sequence that represents
# LATIN SMALL LETTER E WITH ACUTE in UTF-8 so that the resulting
# sequence is no longer a valid UTF-8-encoded string. The script then
# interprets the string as Latin-1 and outputs its UTF-8 encoding.
#   [ c3 83 is the UTF-8 encoding of "Ã" and 61 is the ASCII code of
#     "a", so it is also its UTF-8 representation. ]
% echo "Ãa" | x2utf8 | od -tx1
0000000 c3 83 61 0a
0000004
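
The last case can also be checked directly from Python. The following is a sketch in Python 3 syntax, with variable names chosen only for illustration; it shows where a strict UTF-8 decoder rejects the modified sequence and which bytes the Latin-1 fallback produces.

```python
# 0xC3 announces a 2-byte UTF-8 sequence, but 0x61 ("a") is not a
# continuation byte, so the input is not valid UTF-8.
bad = b"\xc3a\n"

error_start = None
try:
    bad.decode("utf_8")
except UnicodeDecodeError as exc:
    # exc.start is the index of the first offending byte.
    error_start = exc.start

# Fallback: read the bytes as Latin-1, then encode the result as UTF-8.
fallback = bad.decode("iso8859_1").encode("utf_8")
```

Here the decoder flags the first byte, and fallback holds the same four bytes as the od output above (c3 83 61 0a).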

Attachment: x2utf8
Description: Unknown to UTF-8 conversion

The script works with Python 2.2 or greater. I think it could be made to
work relatively easily with 2.1, but I didn't bother.

-- 
Florent
