
Re: Maintainer: field in .changes



Martin Michlmayr <tbm@cyrius.com> wrote:

> (If anyone knows how to convert a string to UTF-8 in Python regardless
> of whether it's UTF-8 or Latin or ASCII, and to convert a string to
> ASCII/Latin regardless to whether it's UTF-8 or Latin, speak up now...)

I would say it is impossible, in Python or anything else: many byte
sequences are at the same time a valid Latin-1-encoded string and a
valid UTF-8-encoded string. The two strings are different, but
encoding one in Latin-1 and the other in UTF-8 happens to produce the
same byte sequence.

Example:

/tmp % od -tx1 test-file
0000000 c3 a9 0a
0000003

Leaving aside the trailing newline (0a), if you read the c3 a9 bytes
as Latin-1, you get two characters (from iso-8859-1(7)):

  LATIN CAPITAL LETTER A WITH TILDE
  COPYRIGHT SIGN

But if you consider this very same byte sequence as a UTF-8-encoded
string, you read it as the single character:

  LATIN SMALL LETTER E WITH ACUTE (U+00E9)
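
The same ambiguity can be reproduced in Python. This is only a minimal
sketch, assuming a present-day Python 3 interpreter (where bytes and
text are separate types); the variable names are made up for the
illustration:

```python
import unicodedata

# The two bytes from test-file, without the trailing newline (0a).
raw = bytes([0xC3, 0xA9])

# Both decodings succeed; nothing in the bytes says which one was meant.
as_latin1 = raw.decode("latin-1")
as_utf8 = raw.decode("utf-8")

latin1_names = [unicodedata.name(ch) for ch in as_latin1]
utf8_names = [unicodedata.name(ch) for ch in as_utf8]

print(latin1_names)
print(utf8_names)
```

The first list holds two character names and the second holds one,
although both were decoded from the same two bytes.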

As a general rule, if you want to convert reliably between two
charsets/encodings, you'd better know precisely how the input is
encoded.
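
Once the input encoding is known, the conversion itself is easy:
decode with the source encoding, then encode with the target one. A
minimal sketch, assuming Python 3; the helper name transcode() is
hypothetical, not something from an existing library:

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from a known source encoding to a target encoding.

    Hypothetical helper for illustration. Errors are strict, so input
    that does not match the declared source encoding raises
    UnicodeDecodeError, and characters the target cannot represent
    raise UnicodeEncodeError, instead of being silently mangled.
    """
    return data.decode(source).encode(target)


# e-acute stored as Latin-1 (one byte) becomes two bytes in UTF-8 ...
latin1_to_utf8 = transcode(b"\xe9", "latin-1", "utf-8")

# ... and the reverse conversion gives the single byte back.
utf8_to_latin1 = transcode(b"\xc3\xa9", "utf-8", "latin-1")

# Declaring the wrong source encoding is sometimes detected:
# a lone e9 byte is not valid UTF-8.
try:
    transcode(b"\xe9", "utf-8", "latin-1")
    wrong_source_detected = False
except UnicodeDecodeError:
    wrong_source_detected = True
```

Note that the error is only raised when the bytes happen to be invalid
in the declared encoding. Declaring Latin-1 for input that is really
UTF-8 never fails, because every byte sequence is valid Latin-1; you
just get the wrong characters.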

-- 
Florent


