
Bug#656288: python3-apt: difficulties with non-UTF-8-encoded TagFiles



Package: python3-apt
Version: 0.8.3
Severity: normal

In Python 3, I can find no way to get apt_pkg.TagFile to read a file
that isn't encoded in UTF-8:

  >>> import sys
  >>> import apt_pkg
  >>> sys.version
  '3.2.2+ (default, Jan  8 2012, 07:26:18) \n[GCC 4.6.2]'
  >>> with open("test", "w", encoding="iso-8859-1") as test:
  ...     print("Package: test", file=test)
  ...     print("Maintainer: M\xe4intainer <test@example.org>", file=test)
  ...     print(file=test)
  ...
  >>> tagfile = apt_pkg.TagFile(open("test", "rb"))
  >>> next(tagfile)["Maintainer"]
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 1: invalid continuation byte
  >>> tagfile = apt_pkg.TagFile(open("test", encoding="iso-8859-1"))
  >>> next(tagfile)["Maintainer"]
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 1: invalid continuation byte

Whereas in Python 2:

  >>> import sys
  >>> import apt_pkg
  >>> sys.version
  '2.7.2+ (default, Jan 13 2012, 23:15:17) \n[GCC 4.6.2]'
  >>> tagfile = apt_pkg.TagFile(open("test", "rb"))
  >>> tagfile.next()["Maintainer"]
  'M\xe4intainer <test@example.org>'

This breaks part of the python-debian test suite, which I'm currently
porting to Python 3.  Among other things, that suite checks that it's
still possible to parse old Sources files from before Debian switched
to UTF-8.

A fix is tricky.  We can't do anything actually nice using Python 3's
I/O facilities, because python-apt just pokes around to find the file
descriptor and passes that directly to apt.  However, one idea that
comes to mind is that if you open a file with the 'encoding' parameter
then python-apt could spot that in the file object, remember it, and
decode bytes using that encoding any time it wants to return a Unicode
string.
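
To make that idea concrete, here is a rough sketch in plain Python.  It
is not python-apt code: remembered_encoding and decode_field are
hypothetical names, and the real change would have to live in
python-apt's C++ layer.  It only shows that a text-mode file object
carries an 'encoding' attribute that could be remembered and applied to
raw bytes later, while a binary-mode one does not:

```python
import io


def remembered_encoding(fileobj, default="utf-8"):
    """Return the encoding a file object was opened with, else default.

    Hypothetical helper: text-mode files (io.TextIOWrapper) expose an
    'encoding' attribute; binary-mode file objects do not.
    """
    return getattr(fileobj, "encoding", None) or default


def decode_field(raw, fileobj):
    """Decode raw field bytes using the encoding remembered from fileobj."""
    return raw.decode(remembered_encoding(fileobj))


# The Maintainer value from the test file above, as ISO-8859-1 bytes.
raw = "M\xe4intainer <test@example.org>".encode("iso-8859-1")

# Stand-ins for open("test", encoding="iso-8859-1") and open("test", "rb").
text_file = io.TextIOWrapper(io.BytesIO(raw), encoding="iso-8859-1")
binary_file = io.BytesIO(raw)

maintainer = decode_field(raw, text_file)
```

With binary_file the helper falls back to UTF-8, so decoding these
particular bytes would still raise UnicodeDecodeError, as it does today.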

python-debian's test suite also tests that it's possible to parse old
Sources files in *mixed* encodings.  This is going to be harder because
it basically means having apt_pkg.TagSection return bytes, which I don't
think is desirable in general.  Maybe this could be optional somehow?
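
If a bytes-returning mode did exist, the caller could decode each field
separately.  The following is only a sketch of what that caller-side
code might look like; decode_with_fallback and the 'fields' dict are
made up for illustration and are not part of apt_pkg or python-debian:

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Try each encoding in turn; return the bytes unchanged if all fail.

    Note that ISO-8859-1 accepts any byte sequence, so with the default
    list the bytes fallback is never reached.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw


# Hypothetical raw field values from a file with mixed encodings.
fields = {
    "Maintainer": b"M\xe4intainer <test@example.org>",  # ISO-8859-1
    "Uploaders": "M\xe4x <max@example.org>".encode("utf-8"),  # UTF-8
}

decoded = {key: decode_with_fallback(value) for key, value in fields.items()}
```

Trying UTF-8 first is a deliberate choice here: valid UTF-8 is rarely
produced by accident, whereas ISO-8859-1 never rejects anything.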

Thanks,

-- 
Colin Watson                                       [cjwatson@debian.org]


