Re: Maintainer: field in .changes

To: debian-mentors@lists.debian.org
Subject: Re: Maintainer: field in .changes
From: Steve Langasek <vorlon@netexpress.net>
Date: Tue, 20 Jan 2004 09:56:54 -0600
Message-id: <20040120155654.GD392@tennyson.netexpress.net>
Mail-followup-to: debian-mentors@lists.debian.org
In-reply-to: <878yk2he9w.fsf@florent.maison>
References: <1074540568.11187.101.camel@localhost> <1074543577.1585.103.camel@descent.netsplit.com> <877jznvvch.fsf@alhambra.bioz.unibas.ch> <20040120115405.GA3831@deprecation.cyrius.com> <878yk2he9w.fsf@florent.maison>

On Tue, Jan 20, 2004 at 03:43:07PM +0100, Florent Rougon wrote:
> Martin Michlmayr <tbm@cyrius.com> wrote:

> > (If anyone knows how to convert a string to UTF-8 in Python regardless
> > of whether it's UTF-8 or Latin or ASCII, and to convert a string to
> > ASCII/Latin regardless of whether it's UTF-8 or Latin, speak up now...)

> I would say it is impossible, be it in Python or anything else: there
> are many byte sequences that are at the same time a valid
> Latin-1-encoded string and a valid UTF-8-encoded string (with both
> strings being different, but encoding one in Latin-1 and the other in
> UTF-8 happens to produce the same byte sequence).

> Example:

> /tmp % od -tx1 test-file
> 0000000 c3 a9 0a
> 0000003

> If you read the c3 a9 string as Latin-1, you get (from iso-8859-1(7)):

>   LATIN CAPITAL LETTER A WITH TILDE
>   COPYRIGHT SIGN

> But if you consider this very same byte sequence as a UTF-8-encoded
> string, you read it as the single character:

>   LATIN SMALL LETTER E WITH ACUTE (U+00E9)

> As a general rule, if you want to convert reliably between two
> charsets/encodings, you'd better know precisely how the input is
> encoded.
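
The ambiguity in the quoted example can be reproduced in a few lines of
Python. This is only a sketch: it assumes Python 3 (where bytes and text
are distinct types), and the variable names are made up for
illustration. It decodes the same two bytes from the od dump both ways
and looks up the Unicode names of the resulting characters.

```python
import unicodedata

# The two bytes from the od dump above, without the trailing newline.
raw = bytes([0xC3, 0xA9])

# Latin-1 maps every byte to exactly one character, so this never fails.
as_latin1 = raw.decode("latin-1")

# UTF-8 reads c3 a9 as one two-byte sequence, i.e. a single character.
as_utf8 = raw.decode("utf-8")

latin1_names = [unicodedata.name(ch) for ch in as_latin1]
utf8_names = [unicodedata.name(ch) for ch in as_utf8]

print(latin1_names)
print(utf8_names)
```

Both decodes succeed without error, which is why the bytes alone cannot
tell you which reading was intended.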

The best heuristic is to first check whether the input is valid UTF-8
and, if it isn't, to convert it from Latin-1 to UTF-8.  This correctly
classifies the vast majority of real-world text, because Latin-1 text
containing non-ASCII characters is rarely also valid UTF-8; but if what
you want is UTF-8, it's always better to use that in the first place.
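
A minimal sketch of that heuristic, assuming Python 3 and byte-string
input. The function names guess_decode and to_utf8 are hypothetical,
chosen here for illustration, not taken from any existing tool.

```python
def guess_decode(data: bytes) -> str:
    """Decode bytes as UTF-8 if they are valid UTF-8, else as Latin-1."""
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid sequence.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 assigns a character to every byte value, so this
        # fallback cannot fail (though it may still be the wrong guess).
        return data.decode("latin-1")


def to_utf8(data: bytes) -> bytes:
    """Return the input re-encoded as UTF-8, using the heuristic above."""
    return guess_decode(data).encode("utf-8")


print(to_utf8(b"caf\xe9"))      # Latin-1 input
print(to_utf8(b"caf\xc3\xa9"))  # already UTF-8, returned unchanged
```

Pure ASCII input passes through untouched, since ASCII is valid UTF-8.
The heuristic still guesses wrong for Latin-1 text that happens to form
valid UTF-8, such as the c3 a9 example quoted above.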

-- 
Steve Langasek
postmodern programmer

Attachment: pgp0TGwbfjV17.pgp
Description: PGP signature

Follow-Ups:
- Re: Maintainer: field in .changes
  - From: Florent Rougon <f.rougon@free.fr>

References:
- Maintainer: field in .changes
  - From: Adriaan Peeters <apeeters@lashout.net>
- Re: Maintainer: field in .changes
  - From: Scott James Remnant <scott@netsplit.com>
- Re: Maintainer: field in .changes
  - From: frank@kuesterei.ch (Frank Küster)
- Re: Maintainer: field in .changes
  - From: Martin Michlmayr <tbm@cyrius.com>
- Re: Maintainer: field in .changes
  - From: Florent Rougon <f.rougon@free.fr>
