
A dak-based data source for contributors.debian.org



Hello,

a working version of http://contributors.debian.org is now online, and
I'm now trying to get data sources set up. The site is designed so that
each team takes care of its own data mining and sends it to the server.

I'd like to ask you to set up a data source sending maintainer and
uploader data to the site.

While developing the site I experimented with getting data out of dak.
I'm attaching the current code; until Alioth is back up, the whole
repository can be found at http://people.debian.org/~enrico/dc.git.tar.xz

With that code, this command line will query dak and post data to the
site[1]:

  ./dc-tool --source=ftp.debian.org --mine=examples/dak.cfg --auth-token=… --post

You can use that code or roll your own: the format and the protocol are
rather simple. Protocol details are at
https://wiki.debian.org/DebianContributors, but it is really just a
piece of JSON posted as a file field in a form over HTTP.
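
To illustrate the "JSON as a file field in a form" idea, here is a
minimal sketch that only builds the request body and does not send
anything. The form field names ("source", "auth_token", "data"), the
file name and the layout of the example records are assumptions made
for this sketch, not taken from the protocol documentation; please
check the wiki page for the real ones.

```python
import json
import uuid

# Hypothetical record layout, for illustration only.
records = [
    {
        "id": [{"type": "login", "id": "example"}],
        "contributions": [
            {"type": "uploader", "begin": "2013-01-01", "end": "2013-11-30"},
        ],
    },
]


def build_submission_form(source, auth_token, records):
    """Encode a submission as a multipart/form-data request body.

    Returns (content_type, body). The field names used here are
    assumptions, not the documented ones.
    """
    boundary = uuid.uuid4().hex
    payload = json.dumps(records)  # ASCII-only JSON text
    parts = []
    # Plain text form fields.
    for name, value in (("source", source), ("auth_token", auth_token)):
        parts.append(
            "--{b}\r\n"
            'Content-Disposition: form-data; name="{n}"\r\n'
            "\r\n"
            "{v}\r\n".format(b=boundary, n=name, v=value))
    # The JSON payload, sent as a file field.
    parts.append(
        "--{b}\r\n"
        'Content-Disposition: form-data; name="data"; '
        'filename="submission.json"\r\n'
        "Content-Type: application/json\r\n"
        "\r\n"
        "{v}\r\n".format(b=boundary, v=payload))
    # Closing boundary.
    parts.append("--{b}--\r\n".format(b=boundary))
    body = "".join(parts).encode("utf-8")
    content_type = "multipart/form-data; boundary=" + boundary
    return content_type, body
```

The returned content type and body can then be handed to whatever HTTP
client you prefer, using the submission URL documented on the wiki.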

The general idea is that each data source provides data about one or
more types of contributions. My guess is that dak knows at least about
maintainers (who do packaging work and write their names in changelogs)
and uploaders (who sign an upload, possibly sponsored, and upload it).
It's really up to you what kinds of contributions you can mine, though.
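
As far as the attached code suggests, what ends up being submitted for
each person and contribution type is just the first and last date on
which they were seen. A minimal sketch of that reduction, independent
of dak; the function name and the tuple layout are made up for this
example and are not part of the attached code:

```python
import datetime


def collapse_date_ranges(rows):
    """Collapse (identifier, date) rows into {identifier: (first, last)}.

    `rows` is any iterable of (identifier, datetime.date) pairs, where
    the identifier is hashable, e.g. a ("login", "someuid") tuple.
    """
    ranges = {}
    for ident, day in rows:
        if ident not in ranges:
            # First time we see this person: a one-day range.
            ranges[ident] = (day, day)
        else:
            # Widen the existing range so that it includes this date.
            begin, end = ranges[ident]
            ranges[ident] = (min(begin, day), max(end, day))
    return ranges


example_rows = [
    (("login", "someuid"), datetime.date(2013, 5, 1)),
    (("login", "someuid"), datetime.date(2013, 1, 1)),
    (("email", "someone@example.org"), datetime.date(2013, 3, 3)),
]
example_ranges = collapse_date_ranges(example_rows)
```

Running one such reduction per contribution type (uploader, maintainer,
...) gives one date range per person and type.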

There is no need to go way back with dates if you don't have the data
readily available: I'm more interested in who's a contributor now, and
I'm about to implement a way to hide older dates for data sources that
cannot currently reliably go arbitrarily back in time.

I'd like to ask you to please set up some periodic mining and posting
on your side. I'm happy to help where I can.


Ciao,

Enrico

[1] The auth token can be found at https://contributors.debian.org/sources/update/ftp.debian.org/ after
    having logged in with a web password at http://nm.debian.org; the
    login link at contributors.debian.org is currently broken.
-- 
GPG key: 4096R/E7AD5568 2009-05-08 Enrico Zini <enrico@enricozini.org>
# coding: utf8
# Debian Contributors data source data mining tools for dak
#
# Copyright (C) 2013  Enrico Zini <enrico@debian.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
from ..core import *
import email.utils
import psycopg2
import logging

log = logging.getLogger(__name__)

__all__ = ["DakUploaders", "DakMaintainers"]

class Dak(object):
    def __init__(self, ctype, cfg):
        self.db = psycopg2.connect(cfg["db"])
        self.ctype = ctype

    def query_uploaders(self):
        log.debug("Querying uploaders for %s...", self.ctype)
        cur = self.db.cursor()
        cur.execute("""
        SELECT s.install_date, u.uid, u.name
          FROM source s
          JOIN fingerprint f ON s.sig_fpr = f.id
          JOIN uid u ON f.uid = u.id
        """)
        for dt, uid, name in cur:
            if name is not None:
                name = name.decode("utf8", errors="replace")
            yield Identifier("login", uid, name), dt.date()

    def query_maintainers(self):
        log.debug("Querying maintainers for %s...", self.ctype)
        cur = self.db.cursor()
        cur.execute("""
        SELECT s.install_date, c.name
          FROM source s
          JOIN maintainer c ON s.changedby = c.id
        """)
        for dt, m_name in cur:
            realname, emailaddr = email.utils.parseaddr(m_name)
            realname = realname.decode("utf8", errors="replace")
            yield Identifier("email", emailaddr, realname), dt.date()

    def _query_to_submission(self, generator, submission):
        count_rows = 0
        by_ident = {}
        for ident, date in generator:
            count_rows += 1
            c = by_ident.get(ident, None)
            if c is None:
                by_ident[ident] = Contribution(self.ctype, date, date)
            else:
                c.extend_by_date(date)
        count_contribs = 0
        for ident, contrib in by_ident.iteritems():
            count_contribs += 1
            submission.add_contribution(ident, contrib)
        log.debug("%d rows read into %d contributions", count_rows, count_contribs)

class DakUploaders(Dak):
    """
    Query dak for the people who signed source uploads
    """
    def scan(self, submission):
        self._query_to_submission(self.query_uploaders(), submission)


class DakMaintainers(Dak):
    """
    Query dak for the people recorded as changedby on source uploads
    """
    def scan(self, submission):
        self._query_to_submission(self.query_maintainers(), submission)
