
Re: Tasks pages (close to) fixed; Bibref does not seem to be updated automatically



Hi Charles,

On Tue, Feb 21, 2012 at 01:00:51AM +0900, Charles Plessy wrote:
> > there are tools which assemble information for Sources.gz files - I guess
> > this could be implemented if, say, 20% of the packages contained such a
> > file.
> 
> In such a model, the packages need to be uploaded so that Sources.gz is
> updated.  This is exactly what I aim to avoid by feeding the UDD with
> Umegaya.

I understood this perfectly, but if you want to have a debian/upstream
file in every Debian package you ultimately need to rely on uploaded
packages.  The assumption that all packages are maintained in some form
of Vcs (= reasonably team-maintained or at least maintainable) is
simply wrong, even for widely used packages like r-base.  So a general
solution cannot be based on the Vcs status of packages.
 
> > I admit I do not trust that a developer will really pay regular visits to
> > http://upstream-metadata.debian.net/foo/YAML-URL or any similar URL.
> 
> Note that anybody can trigger a refresh.

The fact that anybody *can* trigger a refresh does not mean that
anybody actually does it.  I only just learned about this option, and I
consider myself very interested.  So most probably *nobody* is doing it
except you.

Moreover, I guess this is also a weak point of your system.  Any kid
could bring down your server by triggering a whole lot of updates at
once.  So you seem to maintain a server which is, on the one hand, easy
to kill, but on the other hand not used in practice.

> For instance, I ran this command to
> load all the upstream metadata for the packages that are known to
> debcheckout and recommended by one of our tasks.
> 
> for package in $(svn cat svn://svn.debian.org/blends/projects/med/trunk/debian-med/debian/control | grep Recommends | sed -e 's/,//g' -e 's/|//g' -e 's/Recommends://g' ); do curl http://upstream-metadata.debian.net/$package/Name ; done
> 
> I can set up a cron job along these lines, in addition to VCS hooks.

Hmmm, the bibref gatherer just ran into a 404 error - no way for me to
check the success of this operation.
 
> > BTW, it occurred to me that we should also gather
> > fields from debian/copyright if it is DEP 5 compatible.  I specifically
> > consider Upstream-Contact a very valuable field, and at a later stage I
> > would even ask for a lintian check "Upstream-Contact is missing" or
> > something like this.
> 
> I actually opposed - with no success - the inclusion of the Upstream-Contact
> and Upstream-Name fields in DEP 5, as they usually do not contribute to
> respecting the package's redistribution terms, which is the purpose of the
> Debian copyright file.
> 
> The debian/upstream file features Contact and Name fields that can be used
> for the same purpose.

This is what I mean:  even though I fully agree with your opinion that
Upstream-Contact is information which fits better into debian/upstream
than into debian/copyright, you are trying to solve a problem that
practically does not exist.  This information is "traditionally" kept
inside debian/copyright and is not totally wrong there (unlike bibref,
which actually is totally wrong in debian/copyright).  So we have a
solution which worked in the past and will work perfectly well in the
future.  What you are trying to do is convince people to solve a
non-existing problem.  I am not astonished that you failed to convince
people to accept this (even if I would like to repeat that I perfectly
understand your motivation).
 
> >   1. scripts/fetch_bibref.sh
> >      fetches all available debian/upstream files and moves them to
> >      /org/udd.debian.org/mirrors/upstream/package.upstream
> >      I would like to stress the fact that I would fetch these
> >      files *unchanged* as they are edited by the author
> >   2. udd/bibref_gatherer.py
> >      Just parses the upstream files for bibliographic information
> >      and pushes it into the UDD
> >      This is the really cheap part of the job and I volunteer to
> >      do this in one afternoon.
> 
> The problem with this approach is that it can only run on udd.debian.org,
> which is quite loaded, if I understand correctly.

Why do you think so?  If you are thinking of my quick proof of
principle yesterday, which involved ssh-ing into alioth, you are right.
But if we just assume a simple cron job doing something similar, in the
end creating a downloadable tarball and moving it to some
http-accessible place (or even a fake Vcs repository - whatever might
come to mind), there is a perfectly good chance to collect all upstream
files.  In some way you are fetching the files as well - so why should
this only work on udd.debian.org?  None of the steps above will create
a high workload on udd.debian.org.
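
Just to illustrate what I have in mind, such a cron job could be as
simple as the following sketch (the checkout location and the target
directory are of course only placeholders):

   #!/bin/sh
   # Sketch only: collect all debian/upstream files from local Vcs
   # checkouts and publish them as a tarball (all paths are guesses).
   CHECKOUTS=/srv/blends-checkouts
   COLLECT=$(mktemp -d)
   find "$CHECKOUTS" -path '*/debian/upstream' | while read f ; do
       # name each copy after its source package directory
       pkg=$(basename "$(dirname "$(dirname "$f")")")
       cp "$f" "$COLLECT/$pkg.upstream"
   done
   tar -czf "$HOME/public_html/upstream-files.tar.gz" -C "$COLLECT" .
   rm -rf "$COLLECT"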
 
> Regardless of the means, I provide a table that can be downloaded daily and
> that can be loaded into the UDD.  That is how the gatherers work, as far as
> I have seen.  That the data transits through a Berkeley DB is just a detail.
> It is as unimportant as having the data processed with one programming
> language or another.  What matters is the final product, the table to be
> loaded.

I tend to disagree.  The data are usually gathered in as raw a format
as possible, without the intermediate step of a third-party database.
While I could in principle live with such an intermediate step, I
totally fail to see any reason for this extra layer of complexity.
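
To make this concrete: once the raw files are mirrored as in step 1
above, producing something a gatherer can read needs no database at
all.  A sketch, assuming the mirror directory quoted above and flat
key: value upstream files:

   # Sketch: concatenate the mirrored files into one multi-document
   # YAML stream, tagging each record with its package name.
   for f in /org/udd.debian.org/mirrors/upstream/*.upstream ; do
       echo "---"
       echo "Package: $(basename "$f" .upstream)"
       cat "$f"
   done > biblio_raw.yaml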

> > However, regarding practical usage of these data I do not currently see
> > an application.  You first need a problem that needs to be solved before
> > you invent something new.
> 
> The goal of the system is:
> 
>  - Let the maintainer update the data without uploading the package,

ACK, this is the purpose of the debian/upstream file.
 
>  - Gather data for our tasks pages.  In addition to the bibliography,
>    I think that, while rare, the Registration and Donation fields
>    can be very useful to better cooperate with Upstream.
> 
> http://upstream-metadata.debian.net/table/registration
> http://upstream-metadata.debian.net/table/donation

Well, I agree that this information could perfectly well be added to
debian/upstream files.  I still fail to see the need for an additional
aggregation database in between the debian/upstream files and the UDD.
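
For reference, a hypothetical debian/upstream file carrying such fields
(all values invented for illustration, field names taken from this
thread) might look like this:

   Name: foo
   Contact: Jane Doe <jane@example.org>
   Registration: http://example.org/foo/register
   Donation: http://example.org/foo/donate
   Reference-Author: Doe, Jane and Smith, John
   Reference-Title: Foo, a tool for bar analysis
   Reference-Journal: Journal of Examples
   Reference-Year: 2011
   Reference-Pages: 1-10
   DOI: 10.1000/example.2011.1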

> >    dh_bibref
> > 
> > which turns debian/upstream data into a usable BibTeX database on the
> > user's system.  This is technically definitely not hard - it just needs
> > to be *done*.
> 
> The challenge will be to have it run by default by Debhelper.  But
> I think that this is indeed the right direction.  In the meantime, such
> a tool will need to produce a reference that is stored in the directory.

As I said, we should split this topic off into a separate thread -
these mails are long enough.  And as I said, I have a plan for this
which could be discussed.
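
Just as a teaser for that thread: the core of dh_bibref could be a
rather trivial transformation along these lines (a sketch only, with no
quoting or escaping, and the flat Reference-* layout is assumed):

   #!/bin/sh
   # Sketch: turn the flat Reference-* keys of debian/upstream into
   # a single BibTeX entry; illustration only.
   pkg=$(dpkg-parsechangelog | sed -n 's/^Source: //p')
   echo "@article{$pkg,"
   awk -F': ' '/^Reference-/ {
       key = tolower(substr($1, 11))   # strip the "Reference-" prefix
       printf "  %s = {%s},\n", key, $2
   }' debian/upstream
   echo "}"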
 
> >   A. Gather *all* existing debian/upstream files and make sure they
> >      are updated at least every 24h at a place where they can be
> >      fetched for the UDD (I explicitly do not say that we should do
> >      this via the web service, and I would really prefer not to take
> >      the detour through another database)
> 
> Currently I have the following cron job running on debian-med.debian.net:
> 
> @hourly for key in DOI PMID Reference-Author Reference-Eprint Reference-Journal Reference-Number Reference-Pages Reference-Title Reference-URL Reference-Volume Reference-Year References; do curl -s http://upstream-metadata.debian.net/yaml/$key; done > public_html/biblio.yaml
> 
> Therefore, the bibliographic data can now be accessed at the following URL.
> 
> http://upstream-metadata.debian.net/~plessy/biblio.yaml

I just adapted the UDD SVN to this new URL to let the gatherer work
again instead of running into a 404 as it did for the last two days.
However, the result still contains neither cufflinks nor proftmb.  I
guess this is due to your design of querying

   svn://svn.debian.org/blends/projects/med/trunk/debian-med/debian/control

for *Recommends*: cufflinks is non-free :-( and thus can only be listed
in Suggests, and proftmb is not yet rendered into the control file,
because the file was not regenerated after proftmb entered Debian (and
it would not be included even now, because it is not yet in testing =
wheezy, so it also remains in Suggests).  So if you really want to
fetch all our packages you need to parse the tasks files rather than
the resulting control files (and somehow handle the not-yet-uploaded
packages).
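
A variant of your one-liner that reads the tasks files instead could
look roughly like this (a sketch: I am guessing the exact repository
path, and the field list may need tuning):

   # Sketch: extract package names from all dependency fields of the
   # tasks files instead of the rendered control file.
   TASKS=svn://svn.debian.org/blends/projects/med/trunk/debian-med/tasks
   for task in $(svn ls $TASKS) ; do
       svn cat "$TASKS/$task"
   done | grep -E '^(Depends|Recommends|Suggests):' \
        | sed -e 's/^[A-Za-z]*: *//' -e 's/,/ /g' -e 's/|/ /g' \
        | tr ' ' '\n' | sort -u | grep -v '^$'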

Let me repeat that I consider direct access to the Vcs the more
promising way, but I will not question your approach in principle.  For
the moment, 54 packages end up in the UDD, in contrast to the 95
debian/upstream files I detected in SVN+Git.
 
> Let's see how it goes before deciding to redo everything from scratch with a
> new design.

I would like to make clear that we are discussing two different things
here:

  1. The way we gather the actual debian/upstream input files.
     Here I am positive that we can use and enhance your method,
     provided it permits a fully automatic workflow.
  2. Whether an intermediate database should be used or not.
     I am against this approach as long as it adds no value.

Kind regards

       Andreas.


-- 
http://fam-tille.de

