Re: Questions regarding utf-8

To: martin@v.loewis.de, hilliard@debian.org
Cc: debian-devel@lists.debian.org
Subject: Re: Questions regarding utf-8
From: John Darrington <john@cellform.com.au>
Date: Fri, 16 May 2003 07:38:30 +0800
Message-id: <3EC424F6.2070600@cellform.com>

I have a neural net program ( http://www.nongnu.org/libann/doc/libann_6.html#SEC26 ) which does something similar:

Given a text file, it will attempt to guess the natural language in which it was written.

I'm sure it would be fairly simple to modify it to guess the charset. If you point me to a reasonably large set of example files, I'll see what I can do. It would never be 100% accurate, but it would probably make a good guess at the problem.

Bob Hilliard wrote:


    1.  How can I determine what character encoding is used in a
        document without manually scanning the entire file?

> You can't do that automatically, in general. If you know what text
> you expect, and you know the bytes you have in the file, you can
> try a number of encodings, and see which of the encodings gives the
> characters you expect.
>
> As a manual procedure, this is best done with the help of
> /usr/share/i18n/charmaps. This lists the Unicode character position,
> the encoding-specific byte [sequence], and the character name.
>
> So if you know you have \xe7, and you know it is c-cedilla, it could
> be iso-8859-1. It could also be iso-8859-{2,3,9,14,15,16},
> cp125{0,2,4,6}, DEC-MCS, SAMI-WS2, etc.
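
The "try a number of encodings" procedure quoted above can be sketched in a few lines of Python. This is only an illustration, not anything from libann or from the charmaps files: the function name encodings_matching and the CANDIDATES list are made up for the example, and the list is a small arbitrary sample of codecs that Python ships with. It decodes the known bytes with each candidate and keeps the encodings that produce the expected text.

```python
# Hypothetical candidate list for illustration; extend it as needed.
CANDIDATES = [
    "iso-8859-1", "iso-8859-2", "iso-8859-3", "iso-8859-5",
    "iso-8859-7", "iso-8859-9", "iso-8859-15",
    "cp1250", "cp1251", "cp1252", "cp1254",
    "koi8-r", "utf-8",
]


def encodings_matching(raw, expected, candidates=CANDIDATES):
    """Return the candidate encodings under which `raw` decodes to `expected`."""
    matches = []
    for name in candidates:
        try:
            decoded = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Bytes are invalid in this encoding, or the codec is unknown.
            continue
        if decoded == expected:
            matches.append(name)
    return matches


# The byte \xe7, which we know should be c-cedilla (U+00E7).
print(encodings_matching(b"\xe7", "\u00e7"))
```

As in the quoted example, a single byte leaves several candidates standing, so a longer sample of known text is needed to narrow the list down.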
