
Re: Questions regarding utf-8



Bob Hilliard wrote:
     1.  How can I determine what character encoding is used in a
         document without manually scanning the entire file?

In general, you can't do that automatically. If you know what text you
expect, and you know what bytes are in the file, you can try a number
of encodings and see which of them gives the characters you expect. As
a manual procedure, this is best done with the help of the charmap
files in /usr/share/i18n/charmaps. Each of them lists the Unicode code
point, the encoding-specific byte sequence, and the character name.
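For illustration, here is a rough Python sketch of that procedure; the
candidate encodings and the sample text are made up for the example,
not an authoritative list:

    # Decode the raw bytes with several candidate encodings and report
    # which ones produce the text we expect to see.
    raw = b"fa\xe7ade"            # bytes as read from the file
    expected = "fa\u00e7ade"      # the text we believe it contains ("façade")
    candidates = ["utf-8", "iso-8859-1", "iso-8859-2", "cp1252"]

    for enc in candidates:
        try:
            decoded = raw.decode(enc)
        except UnicodeDecodeError:
            continue              # bytes are not even valid in this encoding
        if decoded == expected:
            print(enc, "matches the expected text")

Note that several encodings may match; the bytes alone do not single
out one of them.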

So if you know you have \xe7, and you know it is c-cedilla, it could
be iso-8859-1. It could also be iso-8859-{2,3,9,14,15,16}, cp125{0,2,4,6}, DEC-MCS, SAMI-WS2, etc.
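The same byte can stand for different characters in different
encodings. A quick, purely illustrative way to see what 0xE7 maps to
under a handful of encodings (the list is arbitrary):

    import unicodedata

    for enc in ["iso-8859-1", "iso-8859-2", "iso-8859-15",
                "cp1252", "koi8-r"]:
        ch = b"\xe7".decode(enc)   # in koi8-r this byte is a Cyrillic letter
        print(f"{enc:12} 0xE7 -> U+{ord(ch):04X} {unicodedata.name(ch)}")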

     2.  What is the best available filter to convert from encoding X
         to 7 bit ASCII?

cat(1). It can't get much better than that. If you have a file that
contains non-ASCII characters in some encoding, you can't convert it
to ASCII without loss. Your choice is to
a) lose some information, e.g. transliterate non-representable
   characters, or replace them with a replacement character ('?'), or
b) break the target encoding, i.e. emit bytes that are not valid in
   the target encoding (ASCII).
Neither option is good, so I wouldn't claim that any particular filter
is best. cat(1) implements option b); both options are sketched below.
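Tools like GNU iconv with a //TRANSLIT target take route a); cat takes
route b). A rough Python sketch of both, with a made-up sample string:

    import unicodedata

    text = "fa\u00e7ade"                              # "façade"

    # Option a): lose information - replace the character with '?',
    # or decompose it and drop the combining cedilla (a crude form
    # of transliteration).
    print(text.encode("ascii", "replace"))            # b'fa?ade'
    print(unicodedata.normalize("NFKD", text)
              .encode("ascii", "ignore"))             # b'facade'

    # Option b): break the target encoding - pass the original byte
    # through even though 0xE7 is not valid ASCII (in effect what
    # cat(1) does, since it never touches the bytes at all).
    print(text.encode("iso-8859-1"))                  # b'fa\xe7ade'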

     3.  What is the difference between utf-8 and en_US.utf8?

The former is an encoding, the latter a locale. It is like
apples and oranges: both are fruit.
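To make the distinction concrete, a small Python sketch; the locale
must actually be installed on the system, and its name may be spelled
en_US.utf8 or en_US.UTF-8 depending on the platform:

    import locale

    # UTF-8 is an encoding: a rule for turning characters into bytes.
    print("fa\u00e7ade".encode("utf-8"))        # b'fa\xc3\xa7ade'

    # en_US.utf8 is a locale: language + territory + encoding.  Besides
    # the character encoding it selects collation order, date and
    # number formats, message language, and so on.
    locale.setlocale(locale.LC_ALL, "en_US.UTF-8")  # locale.Error if missing
    print(locale.getlocale())                       # e.g. ('en_US', 'UTF-8')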

Regards,
Martin


