A dak-based data source for contributors.debian.org


A working version of http://contributors.debian.org is now online, and
I'm now trying to get data sources set up. The site is designed so that
each team takes care of its own data mining and sends it to the server.

I'd like to ask you to set up a data source sending maintainer and
uploader data to the site.

While developing the site I played with getting data out of dak. I'm
attaching the current code; while Alioth is down, the whole repository
can be found at http://people.debian.org/~enrico/dc.git.tar.xz

With that code, this command line will query dak and post data to the
site[1]:

  ./dc-tool --source=ftp.debian.org --mine=examples/dak.cfg --auth-token=… --post

You can use that code or roll your own: the format and the protocol are
rather simple. Protocol details are at
https://wiki.debian.org/DebianContributors but it really is just a piece
of JSON posted as a file field in a form over HTTP.
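
To give an idea of what "roll your own" could look like, here is a
minimal sketch in Python that serializes a few records to JSON and wraps
them as a file field in a multipart form body. Everything specific in it
is an assumption made for illustration: the JSON key names, the form
field names ("source", "auth_token", "data"), the file name and the
sample contributor are hypothetical, and the wiki page above remains the
reference for the real format. The sketch only builds the request body;
it does not send anything.

```python
import json
import uuid


def build_submission(records):
    """Serialize (identifiers, contributions) records to JSON text.

    The key names used here are assumptions, not the documented format.
    """
    doc = []
    for identifiers, contributions in records:
        doc.append({
            "id": [{"type": t, "id": i, "desc": d} for t, i, d in identifiers],
            "contributions": [
                {"type": ctype, "begin": begin, "end": end}
                for ctype, begin, end in contributions
            ],
        })
    return json.dumps(doc, indent=1)


def encode_form(fields, file_field, file_name, file_content):
    """Encode plain fields plus one file field as multipart/form-data.

    Returns the Content-Type header value and the encoded body.
    """
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in sorted(fields.items()):
        lines += ["--" + boundary,
                  'Content-Disposition: form-data; name="%s"' % name,
                  "", value]
    lines += ["--" + boundary,
              'Content-Disposition: form-data; name="%s"; filename="%s"'
              % (file_field, file_name),
              "Content-Type: application/json",
              "", file_content,
              "--" + boundary + "--", ""]
    return ("multipart/form-data; boundary=" + boundary,
            "\r\n".join(lines).encode("utf-8"))


# Hypothetical sample data; the token is a placeholder, not a real value.
payload = build_submission([
    ([("login", "jdoe", "Jane Doe")],
     [("uploader", "2013-10-01", "2013-11-20")]),
])
content_type, body = encode_form(
    {"source": "ftp.debian.org", "auth_token": "PLACEHOLDER"},
    "data", "submission.json", payload)
print(content_type)
print(len(body), "bytes ready to be posted")
```

The resulting body and Content-Type header can be handed to any HTTP
client; I kept the sending part out on purpose, so the sketch can be run
without touching the site.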

The general idea is that each data source provides data about one or
more types of contributions. My guess is that dak knows at least about
maintainers (who do packaging work and write their names in changelogs)
and uploaders (who sign an upload, possibly sponsored, and upload it).
It's really up to you what kinds of contributions you can mine, though.

There is no need to go way back with dates if you don't have the data
readily available: I'm more interested in who's a contributor now, and
I'm about to implement a way to hide older dates for data sources that
cannot currently reliably go arbitrarily back in time.
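
In practice this means a data source can reduce whatever rows it mines
to one date range per contributor: the first and the last date on which
they were seen. A minimal sketch of that reduction follows; the function
name collapse_rows and the sample identifiers are hypothetical and not
part of dc-tool.

```python
import datetime


def collapse_rows(rows):
    """Reduce (identifier, date) rows to {identifier: (first, last)}."""
    ranges = {}
    for ident, day in rows:
        if ident not in ranges:
            # First time this identifier is seen: one-day range.
            ranges[ident] = (day, day)
        else:
            # Widen the existing range so that it includes this date.
            first, last = ranges[ident]
            ranges[ident] = (min(first, day), max(last, day))
    return ranges


# Hypothetical mined rows, in no particular order.
rows = [
    ("jdoe", datetime.date(2013, 3, 1)),
    ("jdoe", datetime.date(2013, 11, 5)),
    ("asmith", datetime.date(2012, 7, 9)),
]
for ident, (first, last) in sorted(collapse_rows(rows).items()):
    print("%s %s %s" % (ident, first.isoformat(), last.isoformat()))
```

If only recent data is available, the first date is simply the oldest
one the source happens to know about.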

I'd like to ask you to please set up some periodic mining and posting
on your side. I'm happy to help as I can.



[1] The auth token can be found at https://contributors.debian.org/sources/update/ftp.debian.org/ after
    having logged in with a web password at http://nm.debian.org; the
    login link at contributors.debian.org is currently broken.
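
import email.utils
import logging

import psycopg2

# Identifier and Contribution are not defined in this excerpt; they come
# from the rest of the dc code.
log = logging.getLogger(__name__)


class Dak(object):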
    def __init__(self, ctype, cfg):
        self.db = psycopg2.connect(cfg["db"])
        self.ctype = ctype

    def query_uploaders(self):
        log.debug("Querying uploaders for %s...", self.ctype)
        cur = self.db.cursor()
        # One row per source upload: install date plus the signer's uid/name
        cur.execute("""
        SELECT s.install_date, u.uid, u.name
          FROM source s
          JOIN fingerprint f ON s.sig_fpr = f.id
          JOIN uid u ON f.uid = u.id
        """)
        for dt, uid, name in cur:
            if name is not None:
                name = name.decode("utf8", errors="replace")
            yield Identifier("login", uid, name), dt.date()

    def query_maintainers(self):
        log.debug("Querying maintainers for %s...", self.ctype)
        cur = self.db.cursor()
        # One row per source upload: install date plus the Changed-By name
        cur.execute("""
        SELECT s.install_date, c.name
          FROM source s
          JOIN maintainer c ON s.changedby = c.id
        """)
        for dt, m_name in cur:
            realname, emailaddr = email.utils.parseaddr(m_name)
            realname = realname.decode("utf8", errors="replace")
            yield Identifier("email", emailaddr, realname), dt.date()

    def _query_to_submission(self, generator, submission):
        count_rows = 0
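        # Map each identifier to the [first, last] dates seen for it
        by_ident = {}
        for ident, date in generator:
            count_rows += 1
            span = by_ident.get(ident, None)
            if span is None:
                by_ident[ident] = [date, date]
            else:
                # Widen the range so that it includes this date
                span[0] = min(span[0], date)
                span[1] = max(span[1], date)
        count_contribs = 0
        for ident, (begin, end) in by_ident.iteritems():
            count_contribs += 1
            contrib = Contribution(self.ctype, begin, end)
            submission.add_contribution(ident, contrib)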
        log.debug("%d rows read into %d contributions", count_rows, count_contribs)

class DakUploaders(Dak):
    """
    Query the dak database for upload signers to detect contributions
    """
    def scan(self, submission):
        self._query_to_submission(self.query_uploaders(), submission)

class DakMaintainers(Dak):
    """
    Query the dak database for Changed-By names to detect contributions
    """
    def scan(self, submission):
        self._query_to_submission(self.query_maintainers(), submission)

