On Tue, Jan 20, 2004 at 03:43:07PM +0100, Florent Rougon wrote:
> Martin Michlmayr <tbm@cyrius.com> wrote:
>
> > (If anyone knows how to convert a string to UTF-8 in Python regardless
> > of whether it's UTF-8 or Latin or ASCII, and to convert a string to
> > ASCII/Latin regardless to whether it's UTF-8 or Latin, speak up now...)
>
> I would say it is impossible, be it in Python or anything else: there
> are many byte sequences that are at the same time a valid
> Latin-1-encoded string and a valid UTF-8-encoded string (with both
> strings being different, but encoding one in Latin-1 and the other in
> UTF-8 happens to produce the same byte sequence).
>
> Example:
>
>   /tmp % od -tx1 test-file
>   0000000 c3 a9 0a
>   0000003
>
> If you read the c3 a9 string as Latin-1, you get (from iso-8859-1(7)):
>
>   LATIN CAPITAL LETTER A WITH TILDE
>   COPYRIGHT SIGN
>
> But if you consider this very same byte sequence as a UTF-8-encoded
> string, you read it as the single character:
>
>   LATIN SMALL LETTER E WITH ACUTE (U+00E9)
>
> As a general rule, if you want to convert reliably between two
> charsets/encodings, you'd better know precisely how the input is
> encoded.

The best heuristic is to first check whether it's valid UTF-8, and if it
isn't, convert it from latin-1 to UTF-8.  This correctly detects the vast
majority of texts; but if what you want is UTF-8, it's always better to use
that in the first place.

-- 
Steve Langasek
postmodern programmer
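
For illustration, the heuristic described above (try strict UTF-8 first,
fall back to Latin-1) might look roughly like the sketch below. It assumes
Python 3 and input held as a bytes object; the function name to_utf8 is
made up for this example and is not from the thread. As Florent's c3 a9
example shows, this is a guess: bytes that are valid in both encodings are
always taken to be UTF-8.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw as UTF-8 bytes, guessing between UTF-8 and Latin-1.

    Hypothetical helper: a sketch of the heuristic, not a reliable detector.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid sequence.
        raw.decode("utf-8")
        return raw  # already valid UTF-8 (plain ASCII also ends up here)
    except UnicodeDecodeError:
        # Every byte value maps to a code point in Latin-1, so this
        # fallback cannot fail; it may still be the wrong guess.
        return raw.decode("latin-1").encode("utf-8")


# Valid UTF-8 passes through unchanged.
print(to_utf8(b"\xc3\xa9"))   # b'\xc3\xa9'
# A lone e9 byte is not valid UTF-8, so it is treated as Latin-1.
print(to_utf8(b"\xe9"))       # b'\xc3\xa9'
# The ambiguity from the example: the same two bytes read as Latin-1.
print(b"\xc3\xa9".decode("latin-1") == "\u00c3\u00a9")   # True
```

Real Latin-1 text whose bytes happen to form valid UTF-8 is misread by this
approach, which is why knowing the input encoding up front remains the
better option.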