Steve Langasek <vorlon@netexpress.net> wrote:

> The best heuristic is to first check whether it's valid UTF-8, and if it
> isn't, convert it from latin-1 to UTF-8. This correctly detects the
> vast majority of texts; but if what you want is UTF-8, it's always
> better to use that in the first place.

OK, then my response to Martin's question is in the attached script
(x2utf8). Basically, it boils down to:

    try:
        u = input.decode("utf_8")
    except UnicodeError:
        u = input.decode("iso8859_1")

to get a Unicode object from the input using your heuristic and then:

    u.encode("utf_8")

to encode it in UTF-8. The rest is mostly error handling to be sure to
catch every possible conversion error.

[note that Python 2.3 enables finer error handling in this area]

Smoke test
----------

Notes:
  - If your mailer thinks it sees UTF-8 in this mail, it is broken.
  - My terminal for the following session transcript yields ISO 8859-15
    (I have LC_CTYPE set to fr_FR@euro) encoded strings, and all the
    characters used for this session are exactly the same in ISO 8859-1,
    so you can consider they were passed to the various programs as
    ISO 8859-1 (Latin-1).

# Feed the script with a Latin-1-encoded string
% echo "Ouééé, c môrtèèèl" | x2utf8 | od -tx1
0000000 4f 75 c3 a9 c3 a9 c3 a9 2c 20 63 20 6d c3 b4 72
0000020 74 c3 a8 c3 a8 c3 a8 6c 0a
0000031

# Feed it with the same string, but UTF-8-encoded. Same result.
% echo "Ouééé, c môrtèèèl" | recode latin1..utf8 | x2utf8 | od -tx1
0000000 4f 75 c3 a9 c3 a9 c3 a9 2c 20 63 20 6d c3 b4 72
0000020 74 c3 a8 c3 a8 c3 a8 6c 0a
0000031

# What does the UTF-8 encoding of "é" (LATIN SMALL LETTER E WITH ACUTE)
# look like when interpreted as Latin-1?
% echo "é" | recode latin1..utf8
Ã©

# Feed this character (LATIN SMALL LETTER E WITH ACUTE) as UTF-8 to the
# script. We get its UTF-8 encoding. Right.
% echo "é" | recode latin1..utf8 | x2utf8 | od -tx1
0000000 c3 a9 0a
0000003

# Now, modify the second byte of the 2-byte sequence that represents
# LATIN SMALL LETTER E WITH ACUTE in UTF-8 so that the resulting
# sequence is no longer a valid UTF-8-encoded string. The string gets
# interpreted as Latin-1 by the script, which outputs its UTF-8 encoding.
# [ c3 83 is the UTF-8 encoding of "Ã" and 61 is the ASCII code of
#   "a", so it is also its UTF-8 representation. ]
% echo "Ãa" | x2utf8 | od -tx1
0000000 c3 83 61 0a
0000004
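
For readers on Python 3, the same heuristic could be sketched roughly as below. This is not the attached x2utf8 script, only an illustration: the function name to_utf8 is hypothetical, and the code works on bytes objects because Python 3 separates bytes from str (an assumption that does not apply to the Python 2.2 code quoted above).

```python
def to_utf8(raw: bytes) -> bytes:
    """Re-encode raw as UTF-8, guessing UTF-8 first and Latin-1 second.

    Hypothetical helper for illustration; not the attached x2utf8 script.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        text = raw.decode("utf_8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte value to a code point, so this cannot fail.
        text = raw.decode("iso8859_1")
    return text.encode("utf_8")


sample = "Ouééé, c môrtèèèl\n"

# Latin-1 input and UTF-8 input should both come out as the same UTF-8 bytes.
assert to_utf8(sample.encode("iso8859_1")) == sample.encode("utf_8")
assert to_utf8(sample.encode("utf_8")) == sample.encode("utf_8")

# c3 61 is not valid UTF-8, so it is read as Latin-1 "Ãa" and re-encoded.
print(to_utf8(b"\xc3\x61\n").hex())  # prints c383610a
```

As in the smoke test, Latin-1 text whose bytes happen to form valid UTF-8 (for example c3 a9) is passed through unchanged; that is the known limit of this heuristic.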
Attachment:
x2utf8
Description: Unknown to UTF-8 conversion
The script works with Python 2.2 or greater. I think it could be made to
work relatively easily with 2.1, but I didn't bother.

-- 
Florent