Re: Postgres - Unicode - Problem

To: Debian Developers <debian-devel@lists.debian.org>
Cc: Debian PostgreSQL Liste <debian-postgresql@mailman.atnet.at>, Karsten Hilbert <Karsten.Hilbert@gmx.net>
Subject: Re: Postgres - Unicode - Problem
From: Andreas Tille <tillea@rki.de>
Date: Fri, 13 Jun 2003 08:59:35 +0200 (CEST)
Message-id: <Pine.LNX.4.44.0306130846430.17529-100000@wr-linux02.rki.ivbb.bund.de>
In-reply-to: <200306130819.29992.uli@doommachine.dyndns.org>

First of all: I was able to work around the problem by simply prepending
LC_CTYPE=UTF-8 to the psql call in the application, i.e. the following
diff did the trick:


-       cmd = 'psql -q -d "%s" -f "%s"' % (aDB, SQL_file)
+       cmd = 'LC_CTYPE=UTF-8 psql -q -d "%s" -f "%s"' % (aDB, SQL_file)

        result = os.system(cmd)

even though it is not really clear why (to me or to the upstream author).
The file read by psql (the -f parameter) is in fact ISO-8859 encoded and
declares LATIN1 itself:

       ~> file german-gmclinical.sql
       german-gmclinical.sql: ISO-8859 English text
       ~> grep -i encod german-gmclinical.sql
       SET CLIENT_ENCODING TO 'LATIN1';

and so it is quite strange that this LC_CTYPE=UTF-8 helps ...
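
For what it's worth, the same workaround can be written without pasting the
variable into the shell string. This is only a sketch, assuming Python 3 and
the subprocess module: build_psql_call and run_psql are made-up helper names,
aDB and SQL_file are the variables from the diff above, and nothing here
explains why LC_CTYPE helps.

```python
import os
import subprocess


def build_psql_call(aDB, SQL_file, ctype="UTF-8"):
    """Return (argv, env) for the psql call from the diff above.

    build_psql_call is a hypothetical helper, not part of the application.
    """
    # Copy the current environment and override only LC_CTYPE,
    # mirroring the 'LC_CTYPE=UTF-8 psql ...' prefix.
    env = dict(os.environ)
    env["LC_CTYPE"] = ctype
    # A list of arguments avoids shell quoting problems in file names.
    argv = ["psql", "-q", "-d", aDB, "-f", SQL_file]
    return argv, env


def run_psql(aDB, SQL_file):
    argv, env = build_psql_call(aDB, SQL_file)
    # Returns psql's exit code (os.system() returned the raw wait status).
    return subprocess.call(argv, env=env)
```

Keeping the command construction separate from the actual call makes it easy
to check what would be run without a database at hand.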


On Fri, 13 Jun 2003, Ulrich Eckhardt wrote:

> I seem to remember that pg also offered something like UTF8. The point is that
> 'Unicode' is in most places just a buzzword. Especially in this case, the
> exact encoding would be much better as Unicode can be represented with
> several encodings.
UTF8 seems to be involved in some form ...

> > INSERT INTO i18n_translations(lang, orig, trans) values
> > 	('de_DE', 'public', 'öffentlich');
> >
> > ERROR:  Unicode >= 0x10000 is not supported
>
> So, this looks like it can only take UCS2 or UTF16. However, the question is
> in what way did it interpret the command to get to a character with a
> codepoint >= 0x10000 ?
I guess the error message is just wrong.  There are no code points >= 0x10000
in the file.  The file is just misinterpreted.
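
One reading that would fit the message, offered as a guess and not as a
statement about what psql actually does: in LATIN1 the 'ö' is the single byte
0xF6, and a decoder that takes the input for UTF-8 sees a byte of the form
11110xxx, which nominally announces a four-byte sequence, i.e. a code point
>= 0x10000. A small Python 3 sketch (utf8_sequence_length is a made-up helper):

```python
def utf8_sequence_length(lead_byte):
    """Length a UTF-8 decoder expects from this lead byte (0 if not a lead)."""
    if lead_byte & 0x80 == 0x00:
        return 1   # 0xxxxxxx: plain ASCII
    if lead_byte & 0xE0 == 0xC0:
        return 2   # 110xxxxx: code points below 0x800
    if lead_byte & 0xF0 == 0xE0:
        return 3   # 1110xxxx: code points below 0x10000
    if lead_byte & 0xF8 == 0xF0:
        return 4   # 11110xxx: code points from 0x10000 upwards
    return 0       # continuation byte or invalid lead byte


# The translated string from the failing INSERT, encoded as in the file.
latin1_bytes = "öffentlich".encode("latin-1")
first = latin1_bytes[0]
print(hex(first), utf8_sequence_length(first))
```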

> Possible ways:
> - UCS4: here, one char uses four bytes, but that should already have failed
> for the commands before then
> - UCS2/UTF16: two bytes per char (plus sequences for UTF16), else the same as
> above
The file obviously stores one byte per character.

> - UTF8: one byte per char but multibyte-chars being rather common. I'm not
> sure how it could interpret this, but try saving it as UTF8 (and _not_
> ISO8859-1, which many editors[1] silently do).
I would love it if Emacs kept ISO8859-1 when I insert a cut-and-paste
buffer with umlauts. :-( But that is another topic ...
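
If someone wants to try the "save it as UTF8" suggestion without an editor, a
re-encoding along these lines should do. It is a minimal Python 3 sketch that
assumes the input really is ISO-8859-1; latin1_to_utf8 and convert_file are
made-up names.

```python
def latin1_to_utf8(data):
    """Re-encode ISO-8859-1 bytes as UTF-8 bytes."""
    # Every byte value is a valid ISO-8859-1 character, so decoding
    # cannot fail; non-ASCII characters become two bytes in UTF-8.
    return data.decode("iso-8859-1").encode("utf-8")


def convert_file(src_path, dst_path):
    # Read and write in binary mode so Python does no implicit conversion.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(latin1_to_utf8(data))
```

Note that a SET CLIENT_ENCODING TO 'LATIN1' line inside such a file would then
no longer describe its contents and would have to be adjusted as well.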

> - ASCII: using a 'signed char', they might end up with a negative codepoint
> for the umlaut, resulting in an underflow and the above error.
Well, that is one possible explanation of the problem.
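
To illustrate the 'signed char' idea only (this says nothing about what the
PostgreSQL code really does): the LATIN1 byte for 'ö' is negative when read as
a signed char, and widening that value carelessly to an unsigned 32-bit number
gives something far above 0x10000. A Python 3 sketch using the struct module:

```python
import struct

raw = "ö".encode("latin-1")               # one byte: 0xF6
(as_unsigned,) = struct.unpack("B", raw)  # read as unsigned char
(as_signed,) = struct.unpack("b", raw)    # read as signed char
# Reinterpret the negative value as an unsigned 32-bit integer.
as_u32 = as_signed & 0xFFFFFFFF
print(as_unsigned, as_signed, hex(as_u32))
```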

> [1] apt-get install yudit
> That is a rather capable editor that understands several encodings.
I might give it a try ...

Kind regards

        Andreas.
