
Re: Tasks pages (close to) fixed; Bibref does not seem to be updated automatically



On Mon, Feb 20, 2012 at 10:01:20AM +0900, Charles Plessy wrote:
> let's imagine that every source package in Debian has a debian/upstream file.
> To refresh the information daily, it would take more than 18,000 requests on
> Alioth.
> ...
> It would take hours to check every package daily, and I worry for the
> load on Alioth.
 
Ahh, OK, this explains your motivation and you probably explained this
before - I just tend to forget such details.  However, in case every
package had such a file I assume there would be more clever means to
gather the data.  For instance there are tools which assemble
information for Sources.gz files - I guess this could be implemented
once, say, 20% of the packages contain such a file.  Currently (and
unfortunately) the enthusiasm to do so does not look that promising and
we have fewer than 100 packages (my research in the Debian Med
repository uncovered 94 packages + Rasmol).

> This is why I designed a push model.  After updating debian/upstream for the
> package 'foo', visit http://upstream-metadata.debian.net/foo/YAML-URL, and
> Umegaya will refresh its information.  (This will work after I transfer the
> service to debian-med.debian.net; I really hope to do it this evening).

I admit I do not trust that a developer will really visit
http://upstream-metadata.debian.net/foo/YAML-URL or any similar URL on a
regular basis.  My experience with updates of the tasks files, where it
would be nice if people either edited the tasks files themselves or at
least dropped me a note, is that this works only for a very small
percentage of the developers (and even those might throw ENOTIME).  So I
would definitely drop the "do something manually after editing
debian/upstream" idea - it will not work.

There might be some chance for such a push service along the lines of:

  - Vcs commit hook triggers signal
  - Debhelper tool visits URL if online / sends mail if SMTP works

or something like this, but there is no chance at all to force people
via something like a lintian check "You included a debian/upstream file,
did you really visit URL ...".
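
Just to make the idea a bit more concrete, here is a minimal sketch of
such a hook helper (only a sketch: it assumes Python on the machine
running the hook and that the hook passes the source package name as
first argument; the URL scheme is the one quoted above):

  #!/usr/bin/python
  # Hypothetical post-commit hook helper: ping Umegaya after
  # debian/upstream was changed so the service refreshes its copy.
  import sys
  import urllib2

  package = sys.argv[1]    # assumption: the hook hands over the package name
  url = "http://upstream-metadata.debian.net/%s/YAML-URL" % package
  try:
      # a plain GET on this URL is what triggers the refresh
      urllib2.urlopen(url, timeout=30)
  except IOError, err:
      # offline or service down: do not fail the commit, just report
      print >> sys.stderr, "Could not ping %s: %s" % (url, err)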

In short:  We need to develop means that do not rely on manual user input.

> Nevertheless, as long as only Debian Med is using Umegaya, we can forcibly
> refresh the information daily.  A better way would be to have Subversion
> and Git commit hooks that do the job.  I will work on this after the transfer
> of upstream-metadata.d.n.

Yes, the commit hooks also came to my mind.  The only reason why I did
not propose them in my previous mail is that we somehow need to trust
the teams to implement those hooks.  But in practical cases (Debian Med,
Debian Science and DebiChem, perhaps a few others in the future) this
will probably work, and specifically for our main interest it will
definitely work.

> > So how exactly will a package be registered in the Umegaya database.
> 
> Currently one needs to log in on upstream-metadata.d.n, and run umegaya-adm
> --register.  Alternatively, a cron job can use a similar script as you posted,
> monitor new additions, and run umegaya-adm --register.  Later, I would like to
> have a possibility to do this over the network; that is what I meant by "HTTP
> interface"; I should have written "URL API".  I want the CGI script to be able
> to receive new URLs to track.  To prevent kiddies from tricking the system and
> making us upload illegal stuff into the UDD, the system would for instance
> decline to track any URL that is not in a "debian.org" domain.  Another
> alternative is to let Umegaya try to search for unknown packages in
> svn.debian.org and git.debian.org.

As I said:  I'm not convinced about this HTTP interface to the database.
I neither see any advantage for handling database administration tasks
that way, nor do I think that fetching the data that way is very
practical.  Considering the fact that you are asking for trouble
security-wise, I'd be even more worried about it.
 
> > (BTW, I keep on cut-n-pasing even the short name - could we call the
> > database the same as the file and name it upstream database? ;-))
> 
> Isn't "upstream database" too generic ?  But within the scope of this
> thread it is not a problem.

I'd be happy if we could at least use it as a "Debian Med slang" word.
On the other hand, if you intend to make the system generally accepted
in Debian, why not make it generic?  We are collecting data about
upstream in a database.  So what's wrong with upstream as the name (as
the file is called as well)?  BTW, it came to my mind that we should
also gather fields from debian/copyright if it is DEP5 compatible.  I
specifically consider Upstream-Contact a very valuable field, and at a
later stage I would even ask for a lintian check "Upstream-Contact is
missing" or something like this.
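
Gathering that field would be really cheap; here is a minimal sketch,
assuming python-debian's deb822 module and a debian/copyright file that
actually follows DEP5:

  #!/usr/bin/python
  # Sketch: extract Upstream-Contact from a DEP5 debian/copyright file.
  from debian import deb822

  def upstream_contact(copyright_path):
      with open(copyright_path) as f:
          # the first paragraph of a DEP5 copyright file is the header
          header = next(deb822.Deb822.iter_paragraphs(f))
      return header.get('Upstream-Contact')   # None if the field is missing

  if __name__ == '__main__':
      contact = upstream_contact('debian/copyright')
      if contact is None:
          print "Upstream-Contact is missing"   # what a lintian check could say
      else:
          print contact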
 
> > I did not dive into PET but as far as I know this is more what I
> > consider an automatic update driven by the data inside the VCS, and I
> > wonder whether we should not rather somehow tweak the debian/upstream
> > files into the PET mechanism.  Did you consider this?
> 
> The PET could also be a good starting point for monitoring the VCS and pinging
> Umegaya.
> 
> 
> > When thinking twice about it:  What is the sense of having this Berkeley
> > DB at all if we have UDD?  Why not import the content of the upstream
> > files straight into UDD?  For me this somehow looks like a detour but as
> > I said I might be a bit narrow-minded about the usage on the tasks pages.
> 
> If I understand well the UDD, it is updated by reloading whole tables.

Well, this just fits the data which are currently in it.  While it has
turned out to be a blocker for PET, this is no written law.  If you
import a Packages.gz file or a Translation-<lang>.bz2 file it makes
perfect sense to first clean up the tables from the data to be expected
from this file and then import the whole file.  That's a reasonable
database technique, but it is no written requirement for data to be
imported into UDD (at least I have never read such a requirement).
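
To make clear what I mean by that technique, a rough sketch (assuming
psycopg2 and a hypothetical table bibref(source, key, value); the real
UDD schema would of course have to be agreed on first):

  #!/usr/bin/python
  # Sketch of the "clean up and reimport" technique: remove everything the
  # fresh files will provide again, then bulk-insert the new snapshot.
  import psycopg2

  def reimport(rows):
      """rows: list of (source, key, value) tuples parsed from the files."""
      conn = psycopg2.connect('service=udd')   # connection details are made up
      cur = conn.cursor()
      cur.execute('DELETE FROM bibref')        # drop the old snapshot completely
      cur.executemany('INSERT INTO bibref (source, key, value) '
                      'VALUES (%s, %s, %s)', rows)
      conn.commit()
      conn.close()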

So if you ask me how I would design a bibref importer, I would do the
following:

  1. scripts/fetch_bibref.sh
     fetches all available debian/upstream files and moves them to
     /org/udd.debian.org/mirrors/upstream/package.upstream
     I would like to stress the fact that I would fetch these
     files *unchanged*, exactly as they are edited by the author
  2. udd/bibref_gatherer.py
     just parses the upstream files for bibliographic information
     and pushes it into UDD (a rough sketch follows below the list)
     This is the really cheap part of the job and I volunteer to
     do it in one afternoon.
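
To show how cheap step 2 really is, here is a rough sketch of the parsing
part (assuming PyYAML, the file layout from step 1, and that the
bibliographic data live in a Reference mapping as in the current Debian
Med files):

  #!/usr/bin/python
  # udd/bibref_gatherer.py (sketch): walk the fetched debian/upstream files
  # and yield (package, key, value) rows ready to be pushed into UDD.
  import glob
  import os
  import yaml

  MIRROR = '/org/udd.debian.org/mirrors/upstream'   # target directory of step 1

  def bibref_rows():
      for path in glob.glob(os.path.join(MIRROR, '*.upstream')):
          package = os.path.basename(path)[:-len('.upstream')]
          try:
              data = yaml.safe_load(open(path))
          except yaml.YAMLError:
              continue                          # skip files with broken YAML
          if not isinstance(data, dict):
              continue
          references = data.get('Reference')
          if isinstance(references, dict):
              references = [references]         # a single reference
          if not isinstance(references, list):
              continue
          for ref in references:
              if isinstance(ref, dict):
                  for key, value in ref.items():
                      yield (package, key, value)

  if __name__ == '__main__':
      for row in bibref_rows():
          print row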

My gut feeling says that I do not like the detour via another database
which processes the data in some way.

Regarding other fields in debian/upstream:  Once step 1. is done (and as
I said this is the hard part of this job) it is easy to design another
table (say "CREATE TABLE upstream") where you can gather additional
data.  You might even merge debian/watch files and DEP5 information from
debian/copyright - and here you can get some training for gathering a
lot of file information.  IMHO this can be done best by unpacking all
source packages and extracting the files.  There are people who have
done such stuff in the past (the i18n data[1] are obtained by this
method).  However, regarding practical usage of these data I do not
currently see an application.  You need a problem first which needs to
be solved before inventing something new.  We have the problem of
bibliographic information, which we can solve using debian/upstream, and
thus I'm very interested.  The other information you are trying to
gather does not solve any practical problem that people consider a
burning issue.

> Umegaya
> is the table producer.  There could be other ways to do it, but since I am
> aiming at a system that can cope with tens of thousands of packages, I
> think that it rules out alternatives such as checking out all Alioth repositories
> every day.

I admit that checking out Alioth repositories does not scale in the long
run.  But as I said, once the debian/upstream file is widely accepted,
tools will be invented (and this will probably not be that hard) which
put the data in parallel to Sources.gz or Translation-<lang>.bz2 etc.
However, to convince people about these files you need to make sure that
the system is attractive, works, and has a real practical use.

From my perception the only visible use is the bibliographic information,
and I (as one of the very few users of this system - I guess I have
edited about 20% of the debian/upstream files, and I think we do not
have more than 15-20 developers who have ever touched debian/upstream)
have learned that the only current use, on the tasks pages, does not
work as I expected (== automatically).

The current view of the common DD on debian/upstream files is either

  I don't know that those files exist.
or
  Why should I put (even duplicate) information into this file?
or
  I do not have bibliographic information, so there is no additional value.

So the circle of "friends of debian/upstream" is currently pretty much
reduced to the scientific field, and we should make sure that we keep
those friends on our side.  We are terribly far away from thousands of
packages featuring a debian/upstream file, and while I agree that it
makes sense to create a system which is prepared for future use cases, I
think for the moment it is even more important to make it as smooth as
possible for those few people who have a real interest.

I also want to add that there are some rumors that people are using a
"competing" system to debian/upstream (IMHO the NeuroDebian people as
well), providing debian/bib files[2] in *competition* to
debian/upstream.  So if we fail to implement debian/upstream in a
convincing way we will lose at least half of the "friends".  As a
convincing way I would see the *implementation* of something like

   dh_bibref

which turns debian/upstream data into a usable BibTeX database on the
user's system.  This is technically definitely not hard - it just needs
to be *done*.  IMHO this has more practical relevance than a web service
which is rarely used because it is unknown and which provides data that
needs to be *changed* to make it usable.
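
Just to illustrate how little code this actually needs, a sketch of such
a converter (assumptions: PyYAML, a Reference mapping with keys like
Author, Title, Journal and Year, and that one flat @article entry per
package is good enough for a start):

  #!/usr/bin/python
  # dh_bibref (sketch): turn the Reference field of debian/upstream into a
  # BibTeX entry, e.g. to be shipped as /usr/share/doc/<pkg>/<pkg>.bib.
  import sys
  import yaml

  def to_bibtex(package, reference):
      # one flat @article entry keyed by the package name (an assumption)
      lines = ['@article{%s,' % package]
      for key, value in reference.items():
          lines.append('  %s = {%s},' % (key.lower(), value))
      lines.append('}')
      return '\n'.join(lines)

  if __name__ == '__main__':
      package = sys.argv[1]
      data = yaml.safe_load(open('debian/upstream'))
      reference = data.get('Reference') if isinstance(data, dict) else None
      if isinstance(reference, dict):
          print to_bibtex(package, reference)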

> I am sorry that I kept http://upstream-metadata.debian.net in a miserable state
> this year.  I have done a lot of ground work that week-end, and the transfer
> to debian-med.debian.net, hopefully today, will be a fresh restart.

There is no need to be sorry, and you did a great job initialising the
debian/upstream system in the first place.  From my perception we now
have two tasks:

  A. Gather *all* existing debian/upstream files and make sure they are
     updated at least every 24h at a place where they can be fetched for
     UDD (I explicitly do not say that we should do this via the web
     service, and I would really prefer not to take the detour of
     another database)
  B. Write a debian/upstream to BibTeX converter and invent some system
     to move this onto users' machines (I have some ideas how to do this
     but would like to discuss them in a different thread / list).

In this thread we should focus on how to solve A effectively and
reliably for the current status of existing debian/upstream files.

Kind regards

      Andreas.
 
[1] http://i18n.debian.net/material/data/ 
[2] http://wiki.debian.org/DebianScience/ProblemsToWorkOn

-- 
http://fam-tille.de

