[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Bug#750247: python-debian: deb822 wrong result when space in newline after paragraph



Hi apt and devscripts maintainers,

Tools parsing deb822-style documents currently disagree about where paragraphs 
should be split within the document. The result is that wrap-and-sort from 
devscripts currently eats control files if there are separator lines between 
paragraphs that contain whitespace, while the rest of our build system accepts 
whitespace-only lines as separators.

wrap-and-sort uses python-debian's deb822 module to work out what's in 
debian/control so bugs about wrap-and-sort's behaviour here have ended up with 
python-debian. I've sought to address this problem and attached is a patch for 
python-debian that fixes it by explicitly including whitespace in python-
debian's test for new lines between paragraphs.

The problem with this patch is that python-debian's deb822 parser isn't 
necessarily used by devscripts... python-debian will use python-apt's 
apt_pkg.TagFile if it is available and it also has this same behaviour. If I 
were to apply this patch to python-debian, there would be different behaviour 
depending on whether python-apt is installed and whether the text being 
interpreted is a file or a sequence/string:

* if python-apt is not installed or the text is not a file or the call to
  iter_paragraphs includes "use_apt_pkg=False", then the code path in
  deb822.py is used and the split is done correctly. Using the same test
  case as in #655988 where the correct answer should be '3':

  $ python -c 'import deb822; print len([p for p in
deb822.Deb822.iter_paragraphs(open("control-wspace"), use_apt_pkg=False)])'
  3

  $ python -c 'import deb822; print len([p for p in
deb822.Deb822.iter_paragraphs(open("control-wspace").readlines())])'
  3

* if python-apt is installed and the text is in a file and "use_apt_pkg=False"
  is not given, then libapt will *not* split the paragraphs as desired.

  $ python -c 'import deb822; print len([p for p in
deb822.Deb822.iter_paragraphs(open("control-wspace"))])'
  2

:(


So... we could change python-debian with the attached patch and *also* change 
devscripts/control.py to pass "use_apt_pkg=False" and that would fix wrap-and-
sort. 

However, making those two changes alone would mean that the use of 
apt_pkg.TagFile is not just a performance boost for iter_paragraphs as 
advertised in its documentation, it will also have different behaviour. I don't 
think that's a particularly good idea. It's creating a horrible interface and 
potentially making for very difficult debugging for other users of deb822.py.

python-apt maintainers: do you think it's reasonable to change apt_pkg.TagFile 
(presumably by changing libapt-pkg) to split paragraphs not only on blank 
lines but also on whitespace-only lines? For reference, policy §5.1 permits 
such control files with pretty rubbery language:

  The paragraphs are separated by empty lines. Parsers may accept lines
  consisting solely of spaces and tabs as paragraph separators, but control
  files should use empty lines. 

I tend to err on the side of the parser being lax and the generator being 
strict, which makes me think that both deb822.iter_paragraphs and 
apt_pkg.TagFile should split on these whitespace-only lines.


cheers
Stuart

-- 
Stuart Prescott    http://www.nanonanonano.net/   stuart@nanonanonano.net
Debian Developer   http://www.debian.org/         stuart@debian.org
GPG fingerprint    90E2 D2C1 AD14 6A1B 7EBB 891D BBC1 7EBB 1396 F2F7
From c8bbd5a9776fa03811ef11f73aa598059f64e002 Mon Sep 17 00:00:00 2001
From: Stuart Prescott <stuart@debian.org>
Date: Wed, 4 Jun 2014 00:49:01 +1000
Subject: [PATCH] Allow whitespace-only lines to separate paragraphs

iter_paragraphs should split paragraphs on whitespace-only lines.

Paragraphs should be separated by a blank line but policy permits parsers to
permit whitespace only lines too.

Closes #715558.
---
 lib/debian/deb822.py |  4 +++-
 tests/test_deb822.py | 24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/lib/debian/deb822.py b/lib/debian/deb822.py
index 98bf9f3..ed933e4 100644
--- a/lib/debian/deb822.py
+++ b/lib/debian/deb822.py
@@ -627,7 +627,9 @@ class Deb822(Deb822Dict):
         gpg_post_lines = []
         state = b'SAFE'
         gpgre = re.compile(br'^-----(?P<action>BEGIN|END) PGP (?P<what>[^-]+)-----$')
-        blank_line = re.compile(b'^$')
+        # Include whitespace-only lines in blank lines to split paragraphs.
+        # (see #715558)
+        blank_line = re.compile(b'^\s*$')
         first_line = True
 
         for line in sequence:
diff --git a/tests/test_deb822.py b/tests/test_deb822.py
index a989a32..614bd5a 100755
--- a/tests/test_deb822.py
+++ b/tests/test_deb822.py
@@ -445,6 +445,30 @@ class TestDeb822(unittest.TestCase):
                 self.assertWellParsed(d, PARSED_PACKAGE)
             self.assertEqual(count, 2)
 
+    def test_iter_paragraphs_with_extra_whitespace(self):
+        """ Paragraphs not elided when stray whitespace is between
+
+        From policy §5.1:
+
+            The paragraphs are separated by empty lines. Parsers may accept
+            lines consisting solely of spaces and tabs as paragraph separators,
+            but control files should use empty lines.
+
+        On the principle of "be strict in what you send; be generous in
+        what you receive", deb822 should permit such extra whitespace between
+        deb822 stanzas.
+
+        See #715558 for further details.
+        """
+        for extra_space in (" ", "  ", "\t"):
+            text = (UNPARSED_PACKAGE + '%s\n' % extra_space
+                        + UNPARSED_PACKAGE).splitlines()
+            count = 0
+            for d in deb822.Deb822.iter_paragraphs(text):
+                count += 1
+            self.assertEqual(count, 2,
+                        "Wrong number paragraphs were found: %d != 2" % count)
+
     def _test_iter_paragraphs(self, filename, cls, **kwargs):
         """Ensure iter_paragraphs consistency"""
         
-- 
1.9.1

Attachment: signature.asc
Description: This is a digitally signed message part.


Reply to: