
Re: Questions regarding utf-8



On Fri, May 09, 2003 at 02:51:23AM +0200, Andreas Bombe wrote:
> On Thu, May 08, 2003 at 07:50:50PM -0400, Bob Hilliard wrote:
> >      Some third-party dictionaries, such as foldoc and The Jargon File
> > occasionally include 8 bit characters, such as 0xe7 for c-cedilla.  In
> > order to fix these easily, I would like to know:
> > 
> >      1.  How can I determine what character encoding is used in a
> >          document without manually scanning the entire file?
> 
> Hm, I can't think of an easy way.  Maybe someone knows available tools
> to do that.

It's not really possible in general. Arbitrary 8-bit data could be in
any encoding; without analysis based on a bunch of dictionary files for
every language likely to occur in the file, I can't see how you could
guess the encoding of pre-UTF-8 text. ISO-8859-1 is the most likely
candidate for predominantly English texts like foldoc and the Jargon
File, though (0xe7 is indeed c-cedilla in ISO-8859-1).
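
What you can do cheaply is rule UTF-8 in or out, since random 8-bit
text almost never forms valid UTF-8 sequences. Here is a rough sketch
in Python, not a real detector: the function names are made up for
illustration, and the ISO-8859-1 fallback is only the assumption from
the paragraph above, not something the code can verify.

```python
from collections import Counter


def guess_encoding(data: bytes) -> str:
    """Return a best-guess encoding name for raw bytes (heuristic only)."""
    # Pure 7-bit data is valid in ASCII, UTF-8 and every ISO-8859 variant.
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        # A strict decode fails on any malformed UTF-8 sequence.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # ASSUMPTION: mostly-English text, so fall back to ISO-8859-1.
        # Nothing here distinguishes it from other 8-bit encodings.
        return "iso-8859-1"


def high_bytes(data: bytes) -> Counter:
    """Count each byte >= 0x80 so the suspicious ones can be eyeballed."""
    return Counter(b for b in data if b >= 0x80)


if __name__ == "__main__":
    sample = b"fa\xe7ade"  # 0xe7 is c-cedilla in ISO-8859-1
    print(guess_encoding(sample))
    for byte, count in sorted(high_bytes(sample).items()):
        print(f"0x{byte:02x}: {count}")
```

Listing the distinct high bytes at least saves scanning the whole file
by hand; a dictionary with only a handful of them can then be checked
against the ISO-8859-1 table directly.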

Cheers,

-- 
Colin Watson                                  [cjwatson@flatline.org.uk]


