[UDD] Encoding problems with unicode strings
Hi,
I see encoding problems when reading descriptions from
UDD that contain non-ASCII characters, and I wonder
what I am doing wrong. Here is a small test program that
queries some descriptions I found to be problematic:
########################################################
#!/usr/bin/python
PORT=5441
import psycopg2
from sys import stderr, exit
conn = psycopg2.connect(host="localhost",port=PORT,user="guest",database="udd")
curs = conn.cursor()
query = """PREPARE query_desc (text) AS SELECT description, long_description, version FROM packages
WHERE package = $1 AND architecture = 'i386' and release = 'sid'"""
curs.execute(query)
for pkg in ['mafft', 'melting', 'rnahybrid', 't-coffee']:
    query = "EXECUTE query_desc ('%s')" % pkg
    curs.execute(query)
    for row in curs.fetchall():
        try:
            string = unicode(row[1])
            print "%s: %s (%s)\n%s\n" % (pkg, row[0], row[2], row[1])
        except UnicodeDecodeError, errtxt:
            print >> stderr, "----> %s UnicodeDecodeError: '%s'; ErrTxt: %s" % \
                  (pkg, row[1], errtxt)
########################################################
This results in:
----> mafft UnicodeDecodeError: ' MAFFT is a multiple sequence alignment program which offers three
accuracy-oriented methods:
* L-INS-i (probably most accurate; recommended for <200 sequences;
iterative refinement method incorporating local pairwise alignment
information),
* G-INS-i (suitable for sequences of similar lengths; recommended for
<200 sequences; iterative refinement method incorporating global
pairwise alignment information),
* E-INS-i (suitable for sequences containing large unalignable regions;
recommended for <200 sequences),
and five speed-oriented methods:
* FFT-NS-i (iterative refinement method; two cycles only),
* FFT-NS-i (iterative refinement method; max. 1000 iterations),
* FFT-NS-2 (fast; progressive method),
* FFT-NS-1 (very fast; recommended for >2000 sequences; progressive
method with a rough guide tree),
* NW-NS-PartTree-1 (recommended for <E2><88><BC>50,000 sequences; progressive
method with the PartTree algorithm).'; ErrTxt: 'ascii' codec can't decode byte 0xe2 in position 889: ordinal not in range(128)
----> melting UnicodeDecodeError: ' This program computes, for a nucleic acid duplex, the enthalpy, the
entropy and the melting temperature of the helix-coil
transitions. Three types of hybridisation are possible: DNA/DNA,
...
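The error message itself can be reproduced without the database. The
following is only a minimal sketch, assuming the three bytes shown as
<E2><88><BC> in the mafft output are UTF-8 encoded text (they would then
be U+223C, TILDE OPERATOR). The helper try_decode is hypothetical and
not part of psycopg2 or UDD:

```python
# -*- coding: utf-8 -*-
# Sketch only: a fragment of the mafft description, written as raw bytes.
# Assumption: the bytes 0xe2 0x88 0xbc are the UTF-8 encoding of U+223C.
raw = b"recommended for \xe2\x88\xbc50,000 sequences"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# ASCII only covers bytes below 0x80, so 0xe2 makes the decode fail.
ascii_result = try_decode(raw, "ascii")

# With an explicit UTF-8 codec the three bytes collapse into one character.
utf8_result = try_decode(raw, "utf-8")

print(repr(ascii_result))
print(repr(utf8_result))
```

This only isolates where the message comes from: in Python 2, unicode()
called without an encoding argument falls back to the ASCII codec.
Whether psycopg2 really hands back the column as UTF-8 bytes in this
setup remains an assumption.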
I tried several encode/decode combinations, but without success.
Is there a simple way to handle these non-ASCII strings
appropriately?
Kind regards
Andreas.
--
http://fam-tille.de