On Wednesday 19 January 2005 22:20, Petter Reinholdtsen wrote:
> Is there some charset problem? I looked at the
> unknown words for nb, and "går" and "når" are definitely not unknown
> words in the dictionary.
I see the same kind of problem with Dutch.
The unknown wordlist shows 'BraziliÃ«', which is the UTF-8 encoding of
'Brazilië' (Dutch for Brazil) displayed as ISO-8859-1.
I've just checked the aspell Dutch wordlist and Brazilië _is_ included.
$ aspell dump master /usr/lib/aspell/dutch | grep "Brazil"
Braziliaanse
Braziliaans
Braziliaan
Brazilianen
Brazilië
It looks like the dump prints an ISO-8859-1 encoded list.
I think the manpage for aspell gives the answer:
<quote>
--encoding=string
The encoding the input text is in. Valid values are ``utf-8'',
``iso8859-*'', ``koi8-r'', ``viscii'', ``cp1252'', ``machine
!! unsigned 16'', ``machine unsigned 32''. However, the Aspell
!! utility will currently only function correctly with 8-bit encod-
!! ings. utf-8 support is planned for the future. The two ``machine
unsigned'' encodings are intended to be used by other programs
using the Aspell library and it is unlikely the Aspell utility
will ever support these encodings.
</quote>
So it looks as if you may have to iconv the files before you test them
(or, even better, patch aspell so it supports utf-8 ;-)
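
A rough sketch of the iconv step, in case it helps. The file names are
made up for illustration, and the commented-out aspell call assumes a
Dutch dictionary is installed under the language code "nl", which I have
not checked against the setup discussed here.

```shell
# Write the sample word as UTF-8: "ë" is the two bytes 0xC3 0xAB (octal 303 253).
printf 'Brazili\303\253\n' > words.utf8.txt

# Convert to an 8-bit encoding that the aspell utility can handle.
iconv -f UTF-8 -t ISO-8859-1 words.utf8.txt > words.latin1.txt

# The converted file is one byte shorter, since "ë" is a single byte in ISO-8859-1.
wc -c < words.utf8.txt
wc -c < words.latin1.txt

# Then feed the converted file to aspell; "list" prints the words it does not know.
# (Assumption: dictionary available as "nl"; adjust to the local dictionary name.)
# aspell --lang=nl --encoding=iso8859-1 list < words.latin1.txt
```

If the converted list no longer reports Brazilië as unknown, the mismatch
was indeed only the encoding.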