
Re: Questions regarding utf-8



On Fri, 09 May 2003 02:31:43 +0200, Martin v. Löwis wrote:
 > Bob Hilliard wrote:
 > >  1.  How can I determine what character encoding is used in a
 > >         document without manually scanning the entire file?

First off, for the examples you mentioned (foldoc and the jargon file)
the iso-8859-1 hypothesis is likely to be correct more than 99% of the
time, and you should ask yourself whether the residue matters much for
the tasks you want to accomplish. Also the amount of 8-bit data is
likely so small that you can simply review it all, case by case.

The above paragraph was added as an afterthought, +after+ I had
written the rest of this message, and now I've spent too much time on
writing this to simply discard the rest of the text. But you can stop
reading here if you like. :-)

 > You can't do that automatically, in general. If you know what text
 > you expect, and you know the bytes you have in the file, you can
 > try a number of encodings, and see which of the encodings gives the 
 > characters you expect. As a manual procedure, this is best done with the 
 > help of /usr/share/i18n/charmaps. This lists the Unicode character 
 > position, the encoding-specific byte [sequence], and the character name.
 > So if you know you have \xe7, and you know it is c-cedilla, it could
 > be iso-8859-1. It could also be iso-8859-{2,3,9,14,15,16}, 
 > cp125{0,2,4,6}, DEC-MCS, SAMI-WS2, etc.
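The manual charmap lookup described above is easy to automate. A minimal Python 3 sketch (the candidate list is illustrative, not exhaustive) that decodes the example byte \xe7 under a few encodings and prints the resulting character name:

```python
import unicodedata

# Illustrative candidates only; add or remove encodings as needed.
candidates = ["iso-8859-1", "iso-8859-2", "iso-8859-3",
              "iso-8859-9", "iso-8859-15", "cp1250", "cp1252"]

for enc in candidates:
    char = b"\xe7".decode(enc)
    print(enc, repr(char), unicodedata.name(char, "<unnamed>"))
```

For \xe7 all of these happen to agree on c-cedilla, which is exactly the problem: a single byte rarely narrows the field much, so you need the expected text, not just the bytes.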

Actually, language guessers like <http://packages.debian.org/mguesser>
are by nature also coding system guessers. If you write Italian in
UTF-8, it's going to look different from Italian in ISO-646-it (if
such a thing exists) or Italian in ISO-8859-1. (The difference between
8859-1 and 8859-15 is of course so minor that it is decidable only in
special circumstances, regardless of whether you are a human or a
computer).
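For the record, the two tables differ in only eight byte positions (euro sign and a handful of letters), which a quick Python 3 sketch can confirm:

```python
# Compare the ISO-8859-1 and ISO-8859-15 tables byte by byte;
# only eight positions decode to different characters.
diff = [b for b in range(0xA0, 0x100)
        if bytes([b]).decode("iso-8859-1") != bytes([b]).decode("iso-8859-15")]
print([hex(b) for b in diff])
# ['0xa4', '0xa6', '0xa8', '0xb4', '0xb8', '0xbc', '0xbd', '0xbe']
```

Unless your sample happens to contain one of those eight bytes, the two encodings are literally identical on the wire.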

So in practice, the guesser program has to guess at the language and
the coding system at the same time. This is often quite doable,
although probably some very sparsely populated coding systems need a
lot of input before they can be (learned or) categorized. (Dunno what
this would be -- I'm guessing some Far East codings might be
problematic.)

As ever, there are language pairs which can't be decided in all
circumstances, either. Danish and Norwegian Bokmål are so closely
related as to be indistinguishable, especially in small samples,
occasionally even to native speakers.

Language categorization is typically based on n-gram analysis; you
break up the stream into overlapping fixed-length str, tri, rin, ing,
ngs, gs , s o,  of, of , f c,  ch, cha, har, ara, rac, act, cte, ter
samples, and the frequency distribution of these (in this example,
3-grams, aka trigrams) is often sufficient to make a good guess,
provided you have solid training data to compare against.
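The extraction step above is a couple of lines of Python; a real guesser would then compare the resulting frequency profile against per-language training profiles (TextCat, for instance, uses a rank-order distance), which I'm not sketching here:

```python
from collections import Counter

def ngrams(text, n=3):
    # Overlapping fixed-length samples, as described above.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

profile = Counter(ngrams("strings of characters"))
print(profile.most_common(5))
```

Note that the trigrams straddling word boundaries (gs , s o,  of) carry a lot of the signal; that's why you n-gram the raw stream rather than individual words.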

<http://odur.let.rug.nl/~vannoord/TextCat/> has some more background
and an on-line demo. The list of supported languages also has samples
of each -- quite instructive to look at. (Write me off-list for more
pointers.)

I have not seen any academic treatment of the coding system aspect of
this problem, but the systems I've tried would generally cope with it,
more or less.

Given some fair assumptions about the coherence of a file's contents,
you would only need to submit the first couple of lines -- if even
that -- to the guesser in order to get pretty accurate results most of
the time. The one big issue which remains to be solved is to gather a
representative amount of training material for each language/encoding
pair you want to be able to recognize.

Oh, and of course, don't expect human-produced text to be anything
like coherent in practice. |^5d k001 |-|4x0r d00dz are only the tip of
a very nonlinear iceberg. Even publishing-quality material is often
not really "quality" when you start to look into it.

Hope this helps,

/* era */

(Sorry, not on the list -- if you have a reply for me personally,
please mail or at least Cc: me.)

-- 
Join the civilized world -- ban spam like we did! <http://www.euro.cauce.org/>
   tee -a $HOME/.signature <$HOME/.plan >http://www.iki.fi/era/index.html


