Re: Postgres - Unicode - Problem
First of all: I was able to work around the problem by simply prepending
LC_CTYPE=UTF-8 to the psql call in the application, i.e. the following
diff did the trick:
- cmd = 'psql -q -d "%s" -f "%s"' % (aDB, SQL_file)
+ cmd = 'LC_CTYPE=UTF-8 psql -q -d "%s" -f "%s"' % (aDB, SQL_file)
result = os.system(cmd)
even though it is not really clear why (neither to me nor to the upstream author).
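For what it's worth, the same workaround can be written without going through the shell. This is only a minimal sketch, assuming a Python version that provides subprocess.run (3.5 or later); the helper names build_psql_call and run_sql_file are made up for illustration and are not part of the application.

```python
import os
import subprocess


def build_psql_call(db_name, sql_file, base_env=None):
    """Return (argv, env) for the psql call with LC_CTYPE overridden.

    Hypothetical helper: it only prepares the call, it does not run psql.
    """
    # Copy the environment so the calling process itself is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["LC_CTYPE"] = "UTF-8"  # same value as in the diff above
    # Passing a list avoids the manual quoting of the '%s' version.
    argv = ["psql", "-q", "-d", db_name, "-f", sql_file]
    return argv, env


def run_sql_file(db_name, sql_file):
    """Run psql on sql_file and return its exit status."""
    argv, env = build_psql_call(db_name, sql_file)
    return subprocess.run(argv, env=env).returncode
```

The effect should be the same as the prefixed shell command, with the small advantage that database and file names containing quotes or spaces need no escaping.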
The file read by psql (-f parameter) is in fact ISO-8859 encoded and says so itself:
~> file german-gmclinical.sql
german-gmclinical.sql: ISO-8859 English text
~> grep -i encod german-gmclinical.sql
SET CLIENT_ENCODING TO 'LATIN1';
and so it is quite strange that this LC_CTYPE=UTF-8 helps ...
On Fri, 13 Jun 2003, Ulrich Eckhardt wrote:
> I seem to remember that pg also offered something like UTF8. The point is that
> 'Unicode' is in most places just a buzzword. Especially in this case, the
> exact encoding would be much better as Unicode can be represented with
> several encodings.
UTF8 seems to be involved in some form ...
> > INSERT INTO i18n_translations(lang, orig, trans) values
> > ('de_DE', 'public', 'öffentlich');
> >
> > ERROR: Unicode >= 0x10000 is not supported
>
> So, this looks like it can only take UCS2 or UTF16. However, the question is
> in what way did it interpret the command to get to a character with a
> codepoint >= 0x10000 ?
I guess the error message is just wrong. There are no codes >= 0x10000.
The file is just misinterpreted.
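One plausible reading of how such a misinterpretation could produce exactly this message (an assumption on my side, not a confirmed diagnosis of what PostgreSQL does internally): in Latin-1 the 'ö' is the single byte 0xF6, and its bit pattern 11110110 looks like the lead byte of a four-byte UTF-8 sequence, which would stand for a code point >= 0x10000. The sketch below only illustrates the byte arithmetic; utf8_sequence_length is a hypothetical helper.

```python
def utf8_sequence_length(lead_byte):
    """Length of the UTF-8 sequence announced by the bit pattern of a lead byte.

    Returns 0 for continuation bytes and for patterns that announce
    no sequence of one to four bytes.
    """
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if lead_byte & 0xE0 == 0xC0:
        return 2  # 110xxxxx: code points up to 0x7FF
    if lead_byte & 0xF0 == 0xE0:
        return 3  # 1110xxxx: code points up to 0xFFFF
    if lead_byte & 0xF8 == 0xF0:
        return 4  # 11110xxx: code points >= 0x10000
    return 0


# "\u00f6" is the o-umlaut; in Latin-1 it is the single byte 0xF6.
latin1_bytes = "\u00f6ffentlich".encode("latin-1")
lead = latin1_bytes[0]
announced_length = utf8_sequence_length(lead)

# A strict modern decoder rejects these bytes outright.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

Here lead is 0xF6 and announced_length is 4, so a decoder that trusts the lead byte would expect a character outside the 16-bit range, even though the file contains none.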
> Possible ways:
> - UCS4: here, one char uses four bytes, but that should already have failed
> for the commands before then
> - UCS2/UTF16: two bytes per char (plus sequences for UTF16), else the same as
> above
The file obviously stores one byte per character.
> - UTF8: one byte per char but multibyte-chars being rather common. I'm not
> sure how it could interpret this, but try saving it as UTF8 (and _not_
> ISO8859-1, which many editors[1] silently do).
I would love it if Emacs kept ISO8859-1 when I insert some cut-and-paste
buffer with umlauts. :-( But this is another topic ...
> - ASCII: using a 'signed char', they might end up with a negative codepoint
> for the umlaut, resulting in an underflow and the above error.
Well, this might be a possible interpretation of the problem.
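To put a number on the 'signed char' idea: the Latin-1 byte for 'ö' is 0xF6, which is 246 when read as unsigned and -10 when read as a signed 8-bit value. The snippet is only a small illustration of that arithmetic and says nothing about whether psql or the backend actually does this.

```python
# "\u00f6" is the o-umlaut; Latin-1 stores it as the single byte 0xF6.
latin1_byte = "\u00f6".encode("latin-1")

# Interpreted as an unsigned 8-bit value.
unsigned_value = latin1_byte[0]

# Interpreted as a signed 8-bit value (two's complement), like a C 'signed char'.
signed_value = int.from_bytes(latin1_byte, byteorder="big", signed=True)

print(unsigned_value, signed_value)
```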
> [1] apt-get install yudit
> That is a rather capable editor that understands several encodings.
I might give it a try ...
Kind regards
Andreas.