Re: [UDD] Encoding problems with unicode strings

To: debian-qa@lists.debian.org
Subject: Re: [UDD] Encoding problems with unicode strings
From: Adeodato Simó <dato@net.com.org.es>
Date: Fri, 22 May 2009 18:36:39 +0200
Message-id: <20090522163639.GA6104@chistera.yi.org>
Mail-followup-to: debian-qa@lists.debian.org
In-reply-to: <20090522140048.GA6571@an3as.eu>
References: <20090522140048.GA6571@an3as.eu>

+ Andreas Tille (Fri, 22 May 2009 16:00:48 +0200):

> Hi,

> I observed encoding problems when reading descriptions from
> UDD if they do contain non-ASCII characters and I wonder
> what I might do wrong.  Here is a little test program which
> queries for some descriptions I found to be problematic:

UDD just has the descriptions from Packages.gz, which are supposed to be
in UTF-8. If your destination (a file, a terminal, whatever) expects
UTF-8, you can just pass them through unmodified, e.g.:

    for row in curs.fetchall():
        print "%s: %s (%s)\n%s\n" % (pkg, row[0], row[2], row[1])

That works for me.

If, for some reason, you need unicode() and not str() objects, then you
should specify that the string is in UTF-8; otherwise unicode() will
default to ASCII and fail on the first non-ASCII byte:

    for row in curs.fetchall():
        string = unicode(row[1], 'utf-8') 
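
To make the difference concrete, here is a minimal sketch that decodes the same bytes both ways. The sample value is hypothetical (it stands in for row[1]), and .decode() is used instead of unicode() only so that the snippet also runs on Python 3; on Python 2, raw.decode('utf-8') gives the same result as unicode(raw, 'utf-8').

```python
# Hypothetical sample: the UTF-8 bytes for "café", standing in for row[1].
raw = b'caf\xc3\xa9'

# Decoding as UTF-8 turns the two bytes 0xc3 0xa9 into one character.
text = raw.decode('utf-8')

# Decoding as ASCII fails on the first byte above 0x7f.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Here text ends up four characters long, and ascii_ok is False because the ASCII decode raised.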

So your test program is not of much help. If you're still stuck, you
should probably say what you are really trying to do, with details. But
I don't think it's going to be a problem in UDD.

P.S.: If `unicode(row[1], 'utf-8')` raises an exception, that would be
because a package contains non-UTF-8 bytes in its description. Your
program should be robust against that, and you can do:

    try:
        string = unicode(row[1], 'utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to Latin-1, which accepts any byte.
        string = unicode(row[1], 'latin1')

[And file a bug against the package as well.]
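
The same fallback can be wrapped in a small helper. This is only a sketch: decode_description is a hypothetical name, the two inputs are made-up samples, and .decode() is used in place of unicode() so the snippet runs on both Python 2.6+ and Python 3.

```python
def decode_description(raw):
    """Return raw bytes as text, preferring UTF-8 and falling back to Latin-1."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail,
        # but the result is a guess if the real encoding was something else.
        return raw.decode('latin1')


good = decode_description(b'caf\xc3\xa9')  # valid UTF-8
bad = decode_description(b'caf\xe9')       # Latin-1 byte, invalid as UTF-8
```

Both calls return the same four-character string. Note that the Latin-1 branch never raises, so a description in some third encoding would come out garbled rather than be rejected.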

HTH,

-- 
- Are you sure we're good?
- Always.
        -- Rory and Lorelai
