
Re: Tasks pages (close to) fixed; Bibref does not seem to be updated automatically



Hi Andreas and Yaroslav,

I started to write an answer this morning, but I cannot keep up with the rhythm
of the discussion!  Below I wrap up what I drafted, and the summary is that I
will work on gathering the debian/upstream files in a single VCS on
collab-maint as well.

I will then focus on the other points of the discussion, in particular the
possible integration with Debhelper and the support for multiple bibliographic
references.

Cheers,

-- Charles

On Mon, Feb 20, 2012 at 10:36:12PM +0100, Andreas Tille wrote:
> 
> I did perfectly understand this, but if you want to have a
> debian/upstream file in every Debian package you finally need to rely
> on uploaded packages.  The assumption that all packages are maintained
> in any form of Vcs (= reasonably team-maintained or at least
> maintainable) is simply wrong, even for widely used packages like r-base
> for instance.  So a general solution cannot be based on the Vcs status
> of packages.

Yes, there is a contradiction.  But note that 63 % of our source packages are
already managed in a VCS.  If others have the interest, time and energy to make
use of debian/upstream from the uploaded packages, that is no problem for me.
But I still think that going through VCSes is the way forward.  Because
debian/upstream is an optional file, I see nothing wrong in covering only
packages that are maintained in a VCS.  For the moment it has not been a
limitation for us.

The alternative is to store the data in a file outside the package.  This is
what we do with our tasks files, and the price to pay is that the package lists
are very difficult to manage.  We could have an auxiliary file in collab-maint,
for instance, or a repository of debian/upstream files.  Would you be interested
if I try to set this up?  Then you could have a cron job that does "git pull"
every day, and voilà, you can process the data as you like, where you want.
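
As a very rough sketch, the consumer side could be as simple as the cron job
below.  The repository URL and the destination path are placeholders, since
nothing is set up yet:

    #!/bin/sh
    # Hypothetical daily cron job: mirror a (not yet existing) repository
    # of debian/upstream files and let local scripts process it.
    set -e
    REPO=git://git.debian.org/collab-maint/upstream-metadata.git  # placeholder
    DEST=/srv/upstream-metadata                                   # placeholder
    if [ -d "$DEST/.git" ]; then
        cd "$DEST" && git pull --quiet
    else
        git clone --quiet "$REPO" "$DEST"
    fi
    # ...then process the debian/upstream files under $DEST as you like.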


> Moreover I guess this is also a weak point of your system.  Any kid
> could break your server by pushing it to trigger a whole lot of updates
> and killing it.  So you seem to maintain a server which is on the one
> hand easy to kill but on the other not used in practice.

This is why there is a "delay" option, which inhibits the refresh when the
record is younger than a configurable number of seconds, currently 60.  An
attacker who wanted to use upstream-metadata.debian.net to indirectly put load
on Alioth would need to use a rotating list of existing source package names.
Moreover, this indirection would not make him any more anonymous, as Apache
writes its standard logs on every query.  Altogether, I regard this as
completely unlikely.
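
For illustration only, the check behind that option amounts to something like
the sketch below; the variable names are invented and the real implementation
differs in the details:

    # Refuse to refresh a record that was updated less than $DELAY seconds
    # ago.  $record is assumed to be a file holding the cached answer.
    DELAY=60
    now=$(date +%s)
    last=$(stat -c %Y "$record" 2>/dev/null || echo 0)
    if [ $((now - last)) -lt "$DELAY" ]; then
        exit 0   # fresh enough, do not query Alioth again
    fi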

> > for package in $(svn cat svn://svn.debian.org/blends/projects/med/trunk/debian-med/debian/control | grep Recommends | sed -e 's/,//g' -e 's/|//g' -e 's/Recommends://g' ); do curl http://upstream-metadata.debian.net/$package/Name ; done
> > 
> Hmmm, the bibref gatherer just runs into a 404 error - no way for me to
> check the success of this operation.

I think that the new DNS record had not finished propagating at that time.  Now
it should work.  I also ran a similar command for the Suggests field this
morning.
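
That Suggests variant was essentially the same one-liner with the field name
swapped, i.e. something like:

    for package in $(svn cat svn://svn.debian.org/blends/projects/med/trunk/debian-med/debian/control | grep Suggests | sed -e 's/,//g' -e 's/|//g' -e 's/Suggests://g' ); do curl http://upstream-metadata.debian.net/$package/Name ; done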

 
> But if we
> just assume a simple cron job doing something similar, but in the end
> creating a downloadable tarball and moving it to some http-accessible
> place (or even a fake Vcs repository - whatever might come to mind),
> there are perfectly good chances to collect all upstream files.  In
> some way you are fetching the files as well - so why should this only
> work on udd.debian.org?  None of the steps above will create a high
> workload on udd.debian.org.

This misses the discovery of debian/upstream files in draft packages, which we
need in order to move the upstream metadata out of the Blends task files.  We
can do this for a couple of teams, but not for the whole archive, as it would
mean scanning thousands of repositories every day.  We need a way for developers
to announce that they have created a new debian/upstream file, and a strategy to
make sure that this information is not too hard to propagate.
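
That said, for the repositories we already know about, the kind of cron job you
describe could look roughly like the sketch below.  The list file and all paths
are invented for the example, and maintaining that list is exactly the
discovery problem mentioned above:

    #!/bin/sh
    # Hypothetical producer-side cron job: export the known debian/upstream
    # files and publish them as a tarball in an http-accessible place.
    set -e
    LIST=/srv/upstream-metadata/packages.list   # lines of "package vcs-url"
    WORK=$(mktemp -d)
    while read -r pkg url; do
        svn export --quiet "$url/debian/upstream" "$WORK/$pkg.upstream" || true
    done < "$LIST"
    tar -C "$WORK" -czf /var/www/upstream-metadata.tar.gz .
    rm -rf "$WORK"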

 
> > Regardless of the means, I provide a table that can be downloaded daily and
> > that can be loaded into the UDD.  That is how the gatherers work, as far as
> > I have seen.  That the data transits through a Berkeley DB is just a detail.
> > It is as unimportant as having the data processed with one programming
> > language or another.  What matters is the final product, the table to be
> > loaded.
> 
> I tend to disagree.  The data are usually gathered in a possibly raw
> format without any intermediate step of a third party database.  While I
> could in principle live with this intermediate step I totally fail to
> see any reason for this extra step of complexity.

I will make a repository of debian/upstream files as you propose.  But I still
think that there is value in processing this data to make a UDD table that is
limited to bibliographic information.  The design of debian/upstream is open,
and if we loaded all the contents into a single table, every new field, and
every typo in a field name, would create a new column of very sparse data.  That
still does not make a big table from a PostgreSQL point of view, but wouldn't it
make join queries too heavy?
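
To make that concrete, the table I have in mind is a narrow one restricted to
bibliography keys, along the lines of the sketch below; the column names are
tentative, not a final schema:

    # Tentative schema sketch only; 'udd' as the database name is an
    # assumption made for the example.
    psql udd -c "
        CREATE TABLE bibref (
            source text NOT NULL,   -- source package name
            key    text NOT NULL,   -- e.g. 'title', 'year', 'doi'
            value  text NOT NULL,
            PRIMARY KEY (source, key)
        );"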

