
Re: Tasks pages (close to) fixed; Bibref does not seem to be updated automatically



Hi Charles,

On Wed, Feb 22, 2012 at 12:03:59AM +0900, Charles Plessy wrote:
> I started to write an answer this morning, but I cannot keep up with the
> rhythm of the discussion!

:-)

> I will then focus on the other points of the discussion, in particular the
> possible integration with Debhelper and the support of multiple bibliographic
> references.

This was intentionally moved to a different thread anyway.
 
> On Mon, Feb 20, 2012 at 10:36:12PM +0100, Andreas Tille wrote:
> > 
> > I did perfectly understand this, but if you want to have a
> > debian/upstream file in every Debian package you finally need to rely
> > on uploaded packages.  The assumption that all packages are maintained
> > in some form of VCS (= reasonably team-maintained or at least
> > maintainable) is simply wrong, even for widely used packages like
> > r-base.  So a general solution cannot be based on the VCS status of
> > packages.
> 
> Yes, there is a contradiction.  But note that already 63% of our source
> packages are managed in a VCS.  If others have the interest, time and
> energy to make use of debian/upstream from the uploaded packages, no problem
> with me.

I accept the "those who do the work decide" argument.  However, my
argument was just that if you want 100% coverage, you can trust that the
means to handle this standard will come along with it.  So it is fine to
stick to the current situation as it is for our current implementation,
where we have less than 1% coverage.

> But I think that going through VCSes is the way.  Therefore, because
> debian/upstream is an optional file, I see nothing wrong in covering only
> packages that are maintained in a VCS.  For the moment it has not been a
> limitation for us.

I perfectly agree with the "fetch from VCS" procedure.  I just wanted to
prove my point that we can also gather debian/upstream files by simply
fetching the complete VCSes of the teams in question (a rough sketch
below).  For the moment this scales perfectly and makes sure we really
get *all* the files we want to fetch.
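
Something like this is what I have in mind; the SVN URL follows the
Debian Med layout and is just an example, other teams would need their
own path:

    # Brute-force variant: check out the complete team repository and
    # collect every debian/upstream file found inside it.
    svn checkout svn://svn.debian.org/svn/debian-med/trunk/packages med-packages
    find med-packages -type f -path '*/debian/upstream'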

> The alternative is to store the data in a file outside the package.  This is
> what we do with our tasks files and the price to pay is that it is very
> difficult to manage the package lists.

No.  Let's just keep the debian/upstream files where we have them.  They
are well placed there and useful as they are.

> We could have an accessory file in
> collab-maint for instance.  Or a repository of debian/upstream files.  Would
> you be interested if I try to set this up ?  Then you can have a cron job that
> does "git pull" every day, and voilà, you can process data as you like where
> you want.

Just to make sure I understood the suggestion correctly:  You want to
create a Git repository keeping *copies* of the debian/upstream files
which are currently stored in the VCSes of the packages?  Yes, this
would very easily solve my problem of gathering the original information
for UDD.  Something like a directory layout

     <packagename>/upstream

or alternatively

     <packagename>.upstream

in a Git repository would be a very cool thing and very easy to fetch
for UDD (see the cron sketch below).  I'd love to see this available
soonish.
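
The daily cron job on the UDD side could then be as trivial as this
sketch; the repository URL and the importer script are placeholders I
made up, nothing like them exists yet:

    # Pull (or initially clone) the collecting repository, then feed
    # each <packagename>/upstream file to a hypothetical UDD importer.
    if [ -d upstream-metadata ]; then
        (cd upstream-metadata && git pull -q)
    else
        git clone -q git://git.debian.org/collab-maint/upstream-metadata.git
    fi
    for f in upstream-metadata/*/upstream; do
        pkg=$(basename "$(dirname "$f")")
        ./import-upstream-into-udd "$pkg" "$f"   # hypothetical importer
    done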
 
> > Moreover I guess this is also a weak point of your system.  Any kid
> > could break your server by pushing it to trigger a whole lot of
> > updates.  So you seem to maintain a server which is on the one hand
> > easy to kill but on the other hand not used in practice.
> 
> This is why there is a "delay" option, which inhibits the refreshing if the
> record is younger than a configurable number of seconds, for the moment 60.
> An attacker who would like to use upstream-metadata.debian.net to indirectly
> load Alioth would need to use a rotating list of existing source package names.
> Moreover, this indirection would not make him more anonymous, as Apache is
> writing its standard logs on every query.  Altogether, I regard this as
> completely unlikely.

I admit that this is unlikely.  However, my problem is that you are
spending a certain amount of time on creating a database / web service
combination for a use case which is totally unclear to me.  I agree that
we need to gather the information in some way, but I just fail to
understand your motivation to do it in this rather complex way.
 
> > > for package in $(svn cat svn://svn.debian.org/blends/projects/med/trunk/debian-med/debian/control | grep Recommends | sed -e 's/,//g' -e 's/|//g' -e 's/Recommends://g' ); do curl http://upstream-metadata.debian.net/$package/Name ; done
> > > 
> > Hmmm, the bibref gatherer just runs into 404 errors - no way for me
> > to check the success of this operation.
> 
> I think that the new DNS record had not finished propagating at that time.
> Now it should work.  I also ran a similar command for the Suggests field
> this morning.

I've just run the gatherer again and now I get at least information
about 66 packages (out of the 95 potential ones which I have counted to
be available in VCSes), including information about cufflinks but
excluding proftmb.  So it is better but not yet solved.  Do you see any
chance of me convincing you to dive straight into the VCSes? ;-)
  
> This misses the discovery of debian/upstream files for draft packages, which
> we need in order to move the upstream metadata out of the Blends task files.

For the moment my suggestion is completely orthogonal to the Blends
tasks files.  The specification of a team VCS is completely sufficient.
Moreover there might be some more clever ways to check out upstream
files from VCSes without checking out the whole repository - I bet some
clever SVN/Git/whatever expert could drastically optimise the simple
proof-of-concept code I provided (see the sketch below).
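
Two obvious candidates for such an optimisation; this is a sketch only,
the paths are examples for the Debian Med layout, and the Git variant
only works if the server has upload-archive enabled (which I am just
assuming for git.debian.org):

    # Subversion can serve a single file without a full checkout:
    svn cat "svn://svn.debian.org/svn/debian-med/trunk/packages/$pkg/trunk/debian/upstream"

    # Git can do something similar via git archive, *if* the remote
    # permits it; the tar stream is unpacked straight to stdout:
    git archive --remote="git://git.debian.org/debian-med/$pkg.git" HEAD debian/upstream | tar -xO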

> We can do
> this for a couple of teams, but not for the whole archive, as it would mean
> scanning thousands of repositories every day.

I guess there is a good chance for optimisation if we consider checking
commit logs for the string debian/upstream, or whatever clever idea
might come to mind.  Unfortunately I'm no expert in this, but I'd regard
the effort as potentially low enough to fetch new upstream files once
per day.  The idea of using commit hooks was floated in this thread as
well (a sketch follows below).
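
For instance, a Git post-commit hook on the packaging repository could
ping the collecting service whenever debian/upstream changes; note that
the refresh interface below is invented for illustration,
upstream-metadata.debian.net has nothing like it as far as I know:

    #!/bin/sh
    # post-commit hook sketch: notify a collector when the last commit
    # touched debian/upstream.  The refresh URL is invented.
    if git diff --name-only HEAD~1 HEAD | grep -qx 'debian/upstream'; then
        pkg=$(basename "$PWD" .git)
        curl -s "http://upstream-metadata.debian.net/refresh?package=$pkg" >/dev/null
    fi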

> We need a way for developers to
> inform us that they created a new debian/upstream file, and a strategy to
> make sure that this information is not too hard to propagate.

My experience is that it is a bad idea to ask people to do more work
than necessary.  They will simply fail, refuse or forget to do it.  We
need an automatic solution to get the data reliably, and I'm pretty
optimistic that this is perfectly possible if we spend some time working
out how this can best be done.

> I will make a repository of debian/upstream files as you propose.

Great!  This will be perfectly helpful.  We can decide later by what
means we will update this repository sanely, and just base all the other
work which needs to be done on it.

> But I still
> think that there is value in processing this data to make a UDD table that
> is limited to bibliographic information.

No problem with this.  If we need to move other upstream information
into UDD we can use a different table.  I will not change this for the
moment.  (I just mentioned in the other thread that I see more potential
use cases.)

> The design of debian/upstream is open,
> and if we were to load all the contents into a single table, every new field,
> and every typo in field names, would create a new column of very sparse data.
> That still does not make a big table from PostgreSQL's point of view, but
> wouldn't that make join queries too heavy?

Using a positive list of keys would make sense anyway to prevent
misspellings.  I actually intend to do this and issue warnings if I find
unknown keys (a rough sketch follows below).  BTW, a lintian check for
those cases comes to mind.
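
Something along these lines would already catch most misspellings; the
key list is purely illustrative, not the agreed field set:

    # Warn about top-level fields that are not in a positive list.
    printf '%s\n' Name Homepage Contact Reference Watch > known-keys
    unknown=$(grep -oE '^[A-Za-z-]+:' debian/upstream | tr -d ':' | grep -vxFf known-keys)
    [ -n "$unknown" ] && echo "WARNING: unknown field(s) in debian/upstream: $unknown"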

Kind regards

       Andreas.

-- 
http://fam-tille.de

