Re: Postgres - Unicode - Problem



To say it first: I was able to work around the problem by simply prepending
LC_CTYPE=UTF-8 to the psql call in the application, i.e. the following
diff did the trick:


-       cmd = 'psql -q -d "%s" -f "%s"' % (aDB, SQL_file)
+       cmd = 'LC_CTYPE=UTF-8 psql -q -d "%s" -f "%s"' % (aDB, SQL_file)

        result = os.system(cmd)

even though it's not really clear why (to me or to the upstream author).
The file that psql reads (-f parameter) is in fact

       ~> file german-gmclinical.sql
       german-gmclinical.sql: ISO-8859 English text
       ~> grep -i encod german-gmclinical.sql
       SET CLIENT_ENCODING TO 'LATIN1';

and so it is quite strange that this LC_CTYPE=UTF-8 helps ...
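
The same workaround can be expressed without pasting the variable into a
shell string. This is only a minimal sketch, not the upstream code: the
helper names build_psql_call and run_psql are hypothetical, and it assumes
that passing LC_CTYPE through the child's environment has the same effect
as the prefix in the diff above.

```python
import os
import subprocess


def build_psql_call(db_name, sql_file, ctype="UTF-8"):
    """Return (argv, env) for a psql run with LC_CTYPE overridden.

    Hypothetical helper: only the psql options -q, -d and -f from the
    diff are used here.
    """
    env = dict(os.environ)      # copy, so the parent's environment is untouched
    env["LC_CTYPE"] = ctype     # same value the workaround prepends
    argv = ["psql", "-q", "-d", db_name, "-f", sql_file]
    return argv, env


def run_psql(db_name, sql_file):
    """Run psql and return its exit status."""
    argv, env = build_psql_call(db_name, sql_file)
    # An argument list avoids the manual "%s" quoting needed for os.system.
    return subprocess.call(argv, env=env)
```

Passing an argument list also sidesteps problems with database or file
names that contain quotes or spaces.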


On Fri, 13 Jun 2003, Ulrich Eckhardt wrote:

> I seem to remember that pg also offered something like UTF8. The point is that
> 'Unicode' is in most places just a buzzword. Especially in this case, the
> exact encoding would be much better as Unicode can be represented with
> several encodings.
UTF8 seems to be involved in some form ...

> > INSERT INTO i18n_translations(lang, orig, trans) values
> > 	('de_DE', 'public', 'öffentlich');
> >
> > ERROR:  Unicode >= 0x10000 is not supported
>
> So, this looks like it can only take UCS2 or UTF16. However, the question is
> in what way did it interpret the command to get to a character with a
> codepoint >= 0x10000 ?
I guess the error message is just wrong.  There are no codes >= 0x10000.
The file is just misinterpreted.
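
One way such a misinterpretation could produce exactly this message: in
LATIN1 the 'ö' is the single byte 0xF6, and a reader expecting UTF-8 would
see that bit pattern (11110xxx) as the lead byte of a four-byte sequence,
which is the form used for code points from 0x10000 upwards. The sketch
below only illustrates the byte arithmetic; whether the server really took
this path is an assumption, and utf8_sequence_length is a hypothetical
helper.

```python
def utf8_sequence_length(lead_byte):
    """Length of the UTF-8 sequence announced by a lead byte (0 if none)."""
    if lead_byte < 0x80:
        return 1    # 0xxxxxxx: plain ASCII
    if lead_byte >> 5 == 0b110:
        return 2    # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3    # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4    # 11110xxx: form used for code points >= 0x10000
    return 0        # continuation byte, or not usable as a lead byte


# "\u00f6ffentlich" is 'öffentlich', stored as in the LATIN1 SQL file
latin1_bytes = "\u00f6ffentlich".encode("latin-1")
first = latin1_bytes[0]
print(hex(first), utf8_sequence_length(first))

# A strict UTF-8 decoder rejects the same bytes outright.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```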

> Possible ways:
> - UCS4: here, one char uses four bytes, but that should already have failed
> for the commands before then
> - UCS2/UTF16: two bytes per char (plus sequences for UTF16), else the same as
> above
The file obviously stores one byte per character.

> - UTF8: one byte per char but multibyte-chars being rather common. I'm not
> sure how it could interpret this, but try saving it as UTF8 (and _not_
> ISO8859-1, which many editors[1] silently do).
I would love it if Emacs kept ISO8859-1 when I paste in some text with
umlauts. :-( But this is another topic ...

> - ASCII: using a 'signed char', they might end up with a negative codepoint
> for the umlaut, resulting in an underflow and the above error.
Well, this might be a possible interpretation of the problem.
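
For reference, the arithmetic behind the 'signed char' idea looks like
this. It is only a sketch of how the LATIN1 byte for 'ö' reads as a
negative number when treated as a signed 8-bit value; it does not show
that psql or the server actually does this.

```python
import struct

# 'ö' as a single LATIN1 byte
unsigned_value = "\u00f6".encode("latin-1")[0]

# Reinterpret the same byte as a C 'signed char' (struct format "b")
signed_value = struct.unpack("b", bytes([unsigned_value]))[0]

print(unsigned_value, signed_value)
```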

> [1] apt-get install yudit
> That is a rather capable editor that understands several encodings.
I might give it a try ...

Kind regards

        Andreas.


