Bug#656288: python3-apt: difficulties with non-UTF-8-encoded TagFiles
Package: python3-apt
Version: 0.8.3
Severity: normal
In Python 3, I can find no way to get apt_pkg.TagFile to read a file
that isn't encoded in UTF-8:
>>> import sys
>>> import apt_pkg
>>> sys.version
'3.2.2+ (default, Jan 8 2012, 07:26:18) \n[GCC 4.6.2]'
>>> with open("test", "w", encoding="iso-8859-1") as test:
...     print("Package: test", file=test)
...     print("Maintainer: M\xe4intainer <test@example.org>", file=test)
...     print(file=test)
...
>>> tagfile = apt_pkg.TagFile(open("test", "rb"))
>>> next(tagfile)["Maintainer"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 1: invalid continuation byte
>>> tagfile = apt_pkg.TagFile(open("test", encoding="iso-8859-1"))
>>> next(tagfile)["Maintainer"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 1: invalid continuation byte
Whereas in Python 2:
>>> import sys
>>> import apt_pkg
>>> sys.version
'2.7.2+ (default, Jan 13 2012, 23:15:17) \n[GCC 4.6.2]'
>>> tagfile = apt_pkg.TagFile(open("test", "rb"))
>>> tagfile.next()["Maintainer"]
'M\xe4intainer <test@example.org>'
This breaks part of the python-debian test suite (I'm currently trying
to port python-debian to Python 3), which is interested in such things
as making sure that it's possible to parse old Sources files from before
Debian switched to UTF-8.
A fix is tricky. We can't do anything actually nice using Python 3's
I/O facilities, because python-apt just pokes around to find the file
descriptor and passes that directly to apt. However, one idea that
comes to mind is that if you open a file with the 'encoding' parameter
then python-apt could spot that in the file object, remember it, and
decode bytes using that encoding any time it wants to return a Unicode
string.
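To make that idea concrete, here is a rough pure-Python sketch of the lookup. The names encoding_of and decode_field are hypothetical and are not part of python-apt; the real change would have to live in the C++ extension, and this only illustrates the intended behaviour, assuming UTF-8 stays the default when the file object carries no encoding.

```python
import io


def encoding_of(fileobj, default="utf-8"):
    """Return the encoding a file object was opened with, or a default.

    Text-mode files (io.TextIOWrapper) expose an 'encoding' attribute;
    binary-mode files do not, so the default is used for them.
    """
    encoding = getattr(fileobj, "encoding", None)
    return encoding if isinstance(encoding, str) else default


def decode_field(raw, fileobj):
    """Decode raw field bytes with the encoding remembered from fileobj."""
    return raw.decode(encoding_of(fileobj))


# Stand-ins for open("test", encoding="iso-8859-1") and open("test", "rb").
text_file = io.TextIOWrapper(io.BytesIO(b""), encoding="iso-8859-1")
binary_file = io.BytesIO(b"")

maintainer = decode_field(b"M\xe4intainer <test@example.org>", text_file)
```

A file opened in binary mode would still be decoded as UTF-8 under this scheme, so the first failing case above would keep failing unless bytes could be returned as well.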
python-debian's test suite also tests that it's possible to parse old
Sources files in *mixed* encodings. This is going to be harder because
it basically means having apt_pkg.TagSection return bytes, which I don't
think is desirable in general. Maybe this could be optional somehow?
Thanks,
--
Colin Watson [cjwatson@debian.org]