git-copyright-scan: find authors missing from DEP-5 debian/copyright
Hi,
while updating debian/copyright of a package to match a new upstream
version I noticed that upstream is already generating AUTHORS and
license headers from git log. It felt odd to parse these headers back to
debian/copyright completely manually.
Could we compare git log against debian/copyright? git-copyright-scan is
a proof-of-concept that tries to locate authors who are listed in git
log but are missing from DEP-5 debian/copyright. I ran
git-copyright-scan --git-opt --before=2010-01-01 --min-commits 20 \
--min-lines 20
against all source packages that have a Vcs-Git field and use DEP-5, and
got the following list of potentially forgotten authors (BEWARE: it
has a lot of false positives due to issue 3 below):
http://lindi.iki.fi/lindi/dep5/scan6.txt
I ignored authors who have made only minor contributions and also very
new authors (since debian/copyright might be slightly out of date, which
I guess is the trend...)
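To illustrate the basic idea, here is a minimal sketch. It is not the
actual git-copyright-scan code: the function names and the naive
substring check are assumptions made for illustration, and the
--min-lines filter is left out. It counts commits per author from git
log and reports authors above a commit threshold whose name does not
appear anywhere in debian/copyright.

```python
import subprocess
from collections import Counter


def git_authors(repo, before="2010-01-01"):
    """Return one 'Name <email>' line per commit, as printed by git log."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--before=" + before, "--format=%aN <%aE>"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()


def missing_authors(author_lines, copyright_text, min_commits=20):
    """Authors with at least min_commits commits whose name is absent
    from the debian/copyright text (naive, case-insensitive substring check)."""
    counts = Counter(line.strip() for line in author_lines if line.strip())
    haystack = copyright_text.lower()
    missing = []
    for author, n in counts.most_common():
        name = author.split(" <")[0]
        if n >= min_commits and name.lower() not in haystack:
            missing.append((author, n))
    return missing
```

Something like missing_authors(git_authors("."), open("debian/copyright").read())
would then list the candidates for a checked-out package.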
I hit the following issues:
1) Not everyone uses UTF-8 in git log. Fuzzy matching should help here.
2) DEP-5 does not specify the format of Copyright: lines, only that each
copyright holder should be on its own line. It would be nice if at least
the simplest cases used a canonical "Copyright X, Y, Z Foo Bar
<foo@bar.example.com>" format.
3) Vcs-Git often points to a repository that does not contain the
upstream commit history. Could we consider something like
"Vcs-Upstream-Git" in debian/control? The non-"debian/*" hits of the
above scan are meaningful only for packages that have upstream history
in the Vcs-Git repository.
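Issues 1 and 2 could be approached roughly as follows. This is only a
sketch under assumptions: the regular expression is a hypothetical
reading of the "Copyright X, Y, Z Foo Bar <email>" form suggested above
(DEP-5 itself does not define it), and the 0.8 threshold is an arbitrary
choice that would need tuning. Names are normalized (accents stripped,
lower-cased) and then compared with difflib from the Python standard
library, so that a name mangled by a wrong encoding still matches.

```python
import difflib
import re
import unicodedata

# Hypothetical pattern for "Copyright 2008, 2009 Foo Bar <foo@bar.example.com>".
HOLDER_RE = re.compile(
    r"^(?:Copyright:?\s+)?"
    r"(?P<years>\d{4}(?:\s*[-,]\s*\d{4})*),?\s+"
    r"(?P<name>[^<]+?)\s*"
    r"(?:<(?P<email>[^>]+)>)?\s*$"
)


def parse_holder(line):
    """Return (years, name, email) or None if the line does not fit the pattern."""
    m = HOLDER_RE.match(line.strip())
    if m is None:
        return None
    return m.group("years"), m.group("name"), m.group("email")


def normalize(name):
    """Lower-case, strip combining accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.lower().split())


def same_person(a, b, threshold=0.8):
    """Fuzzy comparison of two names; threshold is an assumed cut-off."""
    ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold
```

Short names will produce more accidental matches with this approach, so
comparing e-mail addresses first is probably a better primary key where
they are available.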
Finally, if you are not afraid of hacky Python code, the sources are
available at
http://iki.fi/lindi/git/git-copyright-scan.git/
-Timo