
[UDD] Encoding problems with unicode strings



Hi,

I see encoding problems when reading descriptions from UDD
that contain non-ASCII characters, and I wonder what I am
doing wrong.  Here is a small test program that queries
some descriptions I found to be problematic:

########################################################
#!/usr/bin/python
PORT=5441
import psycopg2
from sys import stderr

conn = psycopg2.connect(host="localhost",port=PORT,user="guest",database="udd")
curs = conn.cursor()

query = """PREPARE query_desc (text) AS SELECT description, long_description, version FROM packages
                WHERE package = $1 AND architecture = 'i386' and release = 'sid'"""
curs.execute(query)

for pkg in ['mafft', 'melting', 'rnahybrid', 't-coffee']:
    # let psycopg2 quote the package name instead of formatting it by hand
    curs.execute("EXECUTE query_desc (%s)", (pkg,))
    for row in curs.fetchall():
        try:
            string = unicode(row[1]) 
            print "%s: %s (%s)\n%s\n" % (pkg, row[0], row[2], row[1])
        except UnicodeDecodeError, errtxt:
            print >> stderr, "----> %s UnicodeDecodeError: '%s'; ErrTxt: %s" % \
                                    (pkg, row[1], errtxt)

########################################################

This results in:

----> mafft UnicodeDecodeError: ' MAFFT is a multiple sequence alignment program which offers three
 accuracy-oriented methods:
  * L-INS-i (probably most accurate; recommended for <200 sequences;
    iterative refinement method incorporating local pairwise alignment
    information),
  * G-INS-i (suitable for sequences of similar lengths; recommended for
    <200 sequences; iterative refinement method incorporating global
    pairwise alignment information),
  * E-INS-i (suitable for sequences containing large unalignable regions;
    recommended for <200 sequences),
 and five speed-oriented methods:
  * FFT-NS-i (iterative refinement method; two cycles only),
  * FFT-NS-i (iterative refinement method; max. 1000 iterations),
  * FFT-NS-2 (fast; progressive method),
  * FFT-NS-1 (very fast; recommended for >2000 sequences; progressive
    method with a rough guide tree),
  * NW-NS-PartTree-1 (recommended for <E2><88><BC>50,000 sequences; progressive
    method with the PartTree algorithm).'; ErrTxt: 'ascii' codec can't decode byte 0xe2 in position 889: ordinal not in range(128)
----> melting UnicodeDecodeError: ' This program computes, for a nucleic acid duplex, the enthalpy, the
 entropy and the melting temperature of the helix-coil
 transitions. Three types of hybridisation are possible: DNA/DNA,
 ...
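
The failure seems to be reproducible without UDD at all.  Below is a
minimal sketch, assuming the description comes back as a UTF-8 encoded
byte string (my guess from the <E2><88><BC> bytes in the output above).
The sample string and the helper try_decode are made up for
illustration, they are not part of UDD or psycopg2.  As far as I
understand, unicode() without an explicit encoding uses the default
'ascii' codec, which is what the first call imitates:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: the part of the mafft description around the
# offending bytes, written out as a UTF-8 encoded byte string.
raw = b"recommended for \xe2\x88\xbc50,000 sequences"


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error message) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as err:
        return None, str(err)


# Decoding with the 'ascii' codec fails on the first byte >= 0x80 ...
ascii_text, ascii_error = try_decode(raw, "ascii")
# ... while naming the assumed encoding explicitly succeeds for this sample.
utf8_text, utf8_error = try_decode(raw, "utf-8")

print(ascii_error)
print(repr(utf8_text))
```

For this isolated sample the explicit UTF-8 decode gives a unicode
string containing U+223C, but I have not confirmed that this carries
over to the complete query results.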


I tried several encode / decode combinations, but without success.
Is there a simple way to handle these non-ASCII strings
appropriately?

Kind regards

        Andreas.

-- 
http://fam-tille.de

