DEP-5: an example parser, choice of syntax for Files:

To: debian-devel@lists.debian.org
Subject: DEP-5: an example parser, choice of syntax for Files:
From: Jon Dowland <jon+debian-devel@alcopop.org>
Date: Sun, 13 Sep 2009 23:58:46 +0100
Message-id: <20090913225846.GB16109@tchicaya.lan>

Given that DEP-5 is supposed to be about machine-
readability, I thought it would be worthwhile trying to
write something to parse the proposed format.  Please find
attached a short Python script that I have written based on
the current text of DEP-5 at dep.debian.net[1].

It's designed to be run from an unpacked and patched source
package (or at least a source tree containing
debian/copyright, which it attempts to parse). It will
print out a list of each Files: stanza found in the
copyright, followed by the list of files which it believes
are matched by the stanza.

It has proven useful to me: I found several bugs in a
copyright file I'd written for a real live package, based
on my misinterpretation of the current wording.

Whilst writing this, I found the syntax chosen for the
Files: field to be very awkward. Indeed my crude parser
only handles a subset of the syntax so far (no escapes, no
handling of quoted strings).
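
To illustrate the part my parser skips, here is a minimal sketch of
splitting a Files: value on commas while honouring escapes and quoted
strings. The exact rules are an assumption on my part (backslash
escapes the next character, double quotes protect commas); the DEP-5
text is authoritative, and split_files is a hypothetical helper that
is not part of the attached script.

```python
def split_files(value):
    """Split a Files: value on commas that are neither escaped nor quoted.

    Assumed rules (not taken verbatim from DEP-5): a backslash makes the
    next character literal, and commas inside double quotes do not split.
    """
    items, buf = [], []
    quoted = escaped = False
    for ch in value:
        if escaped:
            buf.append(ch)        # keep the escaped character literally
            escaped = False
        elif ch == "\\":
            escaped = True        # drop the backslash, protect the next char
        elif ch == '"':
            quoted = not quoted   # toggle quoting, drop the quote itself
        elif ch == "," and not quoted:
            items.append("".join(buf).strip())
            buf = []
        else:
            buf.append(ch)
    items.append("".join(buf).strip())
    return [item for item in items if item]


print(split_files('a.c, b\\,c.txt, "d,e.h"'))
```

Note that this drops the backslashes and quotes, so a pattern that
needs them preserved for find(1) would require a different treatment.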

Most of the examples given in DEP-5 that contain the path
separator will not work either, e.g.

    Files: debian/*

Assuming they are passed into a find(1) invocation like so

    find . -path 'debian/*'

(note the presence of the path separator and the wording
about that in the text)

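they need to be prefixed with './', because find matches
-path against the whole path as it would print it, e.g.
'./debian/rules'. This holds even if you omit '.' in the
find execution (which itself is a GNUism iirc).  Patch
attached.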
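
The effect can be reproduced without calling find(1). The sketch
below uses Python's fnmatch module as a rough stand-in for -path
matching (an approximation, not a claim about find's exact
behaviour on every platform); the file list and the matches helper
are made up for illustration.

```python
from fnmatch import fnmatchcase

# Paths as printed by `find . -type f`: each one starts with "./".
found = ["./debian/copyright", "./debian/rules", "./src/main.c"]


def matches(pattern, paths):
    """Return the paths matched by a whole-path glob, roughly like -path."""
    return [p for p in paths if fnmatchcase(p, pattern)]


# Without the "./" prefix the pattern cannot match any printed path.
print(matches("debian/*", found))
# With the prefix, the files under ./debian are matched.
print(matches("./debian/*", found))
```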

I think I would much prefer using regular expressions here.
For one thing I'm worried about variations in find(1)
behaviours across platforms. For another, unless a parser
calls find(1) (as I have, and it's expensive), trying to
match its behaviour will imho be a lot more error-prone
than using your language's built-in regular expression
library or pcre or whatever. I will try to cook up a patch
for comment.
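
As a rough idea of what regex-based matching could look like, here
is a small sketch. It is not the promised patch and not anything
DEP-5 specifies: the stanza list, the licenses and the license_for
helper are hypothetical, and I assume patterns are matched against
the whole path relative to the source root, with the last matching
stanza winning as DEP-5 currently says.

```python
import re

# Hypothetical stanzas in file order: (regular expression, license).
stanzas = [
    (r".*", "GPL-2+"),
    (r"debian/.*", "MIT"),
    (r"debian/patches/.*\.patch", "GPL-2"),
]


def license_for(path, stanzas):
    """Return the license of the last stanza whose pattern matches path."""
    # Walk the stanzas backwards so the first hit is the last match.
    for pattern, lic in reversed(stanzas):
        if re.fullmatch(pattern, path):
            return lic
    return None


for path in ["src/main.c", "debian/rules", "debian/patches/fix.patch"]:
    print(path, "->", license_for(path, stanzas))
```

No external process is needed, and the behaviour depends only on the
regular expression library rather than on the local find(1).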

[1] (I need to re-read the older DEP-5 messages to
    understand the current maintainership situation: I see
    that Steve removed the other drivers in that version,
    and that Charles did the same in his git repo...)


-- 
Jon Dowland

#!/usr/bin/python
# a crude DEP-5 parser

# Copyright (c) 2009 Jon Dowland <jmtd@debian.org>
# Copying and distribution of this file, with or without modification, are
# permitted in any medium without royalty provided the copyright notice and this
# notice are preserved.

# usage: run the script from within an unpacked source tarball with the debian
# diff.gz applied on top (or at least, a DEP-5-syntax debian/copyright file
# available)

from email import parser
from sys import exit
from os import popen

##############################################################################
## step 1: handle/parse RFC822 superset

# remove blank lines so the parser treats it all as an email header
copyright = parser.Parser().parsestr(
    ''.join(
        filter(lambda x: "\n" != x,
            open("debian/copyright").readlines()
)))
if len(copyright.keys()) < 1:
    print "parser didn't get any headers from the copyright file"
    exit(1)

##############################################################################
## step 2: interpret the headers and build a list of tuples
##      (files, license, copyright)

# DEP5 header. Format-Specification is required. Others are optional.
valid = "Format-Specification Name Maintainer Source Disclaimer".split()
header = dict([ [x,''] for x in valid])
files = "Files Copyright License".split()

# first loop: handle the header
for i in range(0,len(copyright.items())):
    key = copyright.keys()[i]

    # skip over x-Arbitrary: headers
    if key[0] == 'x':
        continue

    if key in valid:
        if header[key]:
            print "error: redefinition of '%s'." % key
            exit(1)
        header[key] = copyright.values()[i]
        continue

    # this marks the transition from the header onwards
    if key in files:
        if not header['Format-Specification']:
            print "error: Format-Specification must be defined " +\
                  "before the Files section"
            exit(1)
        break

    print "unrecognised key '%s'" % key
    exit(1)

# second loop: looping through the main parts
current = dict([ [x,''] for x in files])
tuples = []

# take a hash of Files/Copyright/License and split it up
# into multiple ones based on the Files key
#   first rule: multiple items separated by commas
#   XXX: unhandled: escaped commas; quoted-strings
#        containing commas
def append(tuples, current):
    for t in current['Files'].split(","):
        c = current.copy()
        c['Files'] = t.strip()
        tuples.append(c)

for i in range(i,len(copyright.items())):
    key = copyright.keys()[i]

    # skip over x-Arbitrary: headers
    if key[0] == 'x':
        continue

    if key in files:
        # handle implicit 'Files: *'
        if 'Files' != key and not current['Files']:
            current['Files'] = '*'
        # new Files: stanza ends the last one
        elif 'Files' == key and current['Files']:
            for defn in ['License', 'Copyright']:
                if not current[defn]:
                    print "error: missing %s line for Files: %s" \
                        % (defn, current['Files'])
                    exit(1)
            append(tuples,current)
            current = dict([ [x,''] for x in files])
        # new License or Copyright for existing Files:
        if current[key]:
            print "error: redefinition of '%s'. Missing 'Files' item?" % key
            print "line is %d, value is '%s'" % (i,copyright.values()[i])
            exit(1)
        current[key] = copyright.values()[i]
        continue

    print "unrecognised key '%s'" % key
    exit(1)

# the final stanza must also be split on commas, like all the others
append(tuples, current)

# DEP-5 states "If multiple Files declarations match the same file, then only
# the last match counts.". This suggests no inheritance is possible between
# stanzas. Thus, reversing the list means we can look for the *first* matching
# stanza.
tuples.reverse()

##############################################################################
## step 3: indicate mapping of stanzas to source files
## we run find(1) for each tuple to build up a list of files which match
## the Files: definition. We then run find(1) again on the source directory
## to obtain a list of all files, then compare results.

# a list of [ (Files:, [matching files]) ] for each Files
# populated with the list of files which match each Files: key
matching = []
for t in tuples:
    nameorpath = 'name'
    if t['Files'].count('/') > 0:
        nameorpath = 'path'
    runme = "find . -type f -%s \"%s\" 2>/dev/null" % (nameorpath, t['Files'])
    matching.append( (t['Files'], [ x.strip() for x in popen(runme).readlines() ]) )

# { Files: => [matching files] }, this time populated by
# comparing every file against each stanza in turn
results = dict([ [x['Files'],[]] for x in tuples ])
results['no match'] = []
for fname in [x.strip() for x in popen('find . -type f').readlines()]:
    res = 'no match'
    for pair in matching:
        if fname in pair[1]:
            res = pair[0]
            break
    results[res].append(fname)
            
for hash in tuples:
    print "%s:" % hash['Files']
    for value in results[hash['Files']]:
        print "\tmatches %s" % value

Index: dep5.mdwn
===================================================================
--- dep5.mdwn	(revision 105)
+++ dep5.mdwn	(working copy)
@@ -144,7 +144,7 @@
 
 Example 1 (tri-licensed files).
 
-	Files: src/js/editline/*
+	Files: ./src/js/editline/*
 	Copyright: 1993, John Doe
 	           1993, Joe Average
 	License: MPL-1.1 or GPL-2 or LGPL-2.1
@@ -161,12 +161,12 @@
 
 Example 2 (recurrent license).
 
-	Files: src/js/editline/*
+	Files: ./src/js/editline/*
 	Copyright: 1993, John Doe
                    1993, Joe Average
 	License: MPL-1.1
 
-	Files: src/js/fdlibm/*
+	Files: ./src/js/fdlibm/*
 	Copyright: 1993, J-Random Corporation
 	License: MPL-1.1
 
@@ -365,7 +365,7 @@
 		 License can be found in the `/usr/share/common-licenses/GPL-2'
 		 file.
 
-		Files: debian/*
+		Files: ./debian/*
 		Copyright: 1998, Jane Smith <jsmith@example.net>
 		License:
 		 [LICENSE TEXT]
@@ -384,7 +384,7 @@
 		License: PSF-2
 		 [LICENSE TEXT]
 
-		Files: debian/*
+		Files: ./debian/*
 		Copyright: 2008, Dan Developer <dan@debian.example.com>
 		License:
 		 Copying and distribution of this package, with or without
@@ -392,27 +392,27 @@
 		 provided the copyright notice and this notice are
 		 preserved.
 
-		Files: debian/patches/theme-diveintomark.patch
+		Files: ./debian/patches/theme-diveintomark.patch
 		Copyright: 2008, Joe Hacker <hack@example.org>
 		License: GPL-2+
 		 [LICENSE TEXT]
 
-		Files: planet/vendor/compat_logging/*
+		Files: ./planet/vendor/compat_logging/*
 		Copyright: 2002, Mark Smith <msmith@example.org>
 		License: MIT
 		 [LICENSE TEXT]
 
-		Files: planet/vendor/httplib2/*
+		Files: ./planet/vendor/httplib2/*
 		Copyright: 2006, John Brown <brown@example.org>
 		License:
 		 Unspecified MIT style license.
 
-		Files: planet/vendor/feedparser.py
+		Files: ./planet/vendor/feedparser.py
 		Copyright: 2007, Mike Smith <mike@example.org>
 		License: PSF-2
 		 [LICENSE TEXT]
 
-		Files: planet/vendor/htmltmpl.py
+		Files: ./planet/vendor/htmltmpl.py
 		Copyright: 2004, Thomas Brown <coder@example.org>
 		License: GPL-2+
 		 On Debian systems the full text of the GNU General Public
