
Bug#963887: UDD: 'duck' importer broken since 2020-05-25



On 30/06/20 at 09:19 +0200, Baptiste BEAUPLAT wrote:
> On 6/29/20 11:34 PM, Raphael Hertzog wrote:
> > On Mon, 29 Jun 2020, Baptiste BEAUPLAT wrote:
> >>> Indeed, creating a dedicated service for this does not seem a good idea.
> >>
> >> I would love to have this feature integrated directly with
> >> distro-tracker. However, I'm wondering about the load that would cause
> >> for the service.
> > 
> > Network requests do not generate much "load"; such processes spend the bulk
> > of their time waiting on the network.
> 
> True that.
> 
> >> The duck worker has to process around 460000 urls (only counting
> >> Homepage) in less than 24h.
> > 
> > How do you get to that figure? We don't have that many source packages
> > and even if you consider multiple URLs for each source package due to
> > changes over time (in multiple releases), that makes way too many URLs
> > per source package.
> 
> Err, sorry about that. That figure is the result of:
> 
> $ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz |
> zgrep -v Homepage: | sort -u | wc -l
> 458804
> 
> Which is obviously wrong. Here is the real number:
> 
> $ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz |
> zgrep Homepage: | sort -u | wc -l
> 26250
> 
> >> I'm not sure that can be done properly using
> >> the distro-tracker tasks (parallel workers are needed to work around
> >> timeout). Obviously that can be optimized (different check delay for
> >> different results) but that's still bulk network related tasks.
> > 
> > Nothing forbids parallel workers and in any case, I welcome any
> > improvement to the task mechanism to make that kind of parallelism easier
> > to handle.
> > 
> > There are other tasks that could benefit from this (and in general I want
> > to merge more of such features in distro-tracker to make them available to
> > derivatives too).
> 
> Then, let's add this to distro-tracker :)
> 
> I've created an issue on the project on salsa so we can discuss
> technical details :
> 
> https://salsa.debian.org/qa/distro-tracker/-/issues/51
> 
> As I've said before, I would like to finish up on a couple of other
> projects (namely mentors.d.n and snapshot.d.o) and I will be available
> right after that.

Hi,

I don't really want to push for it (doing it in distro-tracker and
then importing the data into UDD is fine), but another alternative would
be to include this directly in UDD, similarly to what is done for the
'upstream' importer that checks debian/watch using uscan.

It would boil down to:

1) identify the URLs that need to be checked:

select distinct homepage
from (select homepage from sources union select homepage from packages) t;

Or maybe better:
select distinct homepage
from (
   select homepage from sources where release in ('sid', 'experimental')
   union select homepage from packages where release in ('sid','experimental')
) t;

2) populate/update a table with:
(url, last_check_timestamp, status, detailed_status)
(obviously, with whatever policy is needed about retries/refreshes)

3) export the data (for example as a JSON file) so that it can be used
by other services
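
To make steps 2) and 3) a bit more concrete, here is a rough sketch in
Python. Everything in it is an assumption for illustration only: the table
name url_checks, the 'ok'/'error' status labels, the helper names, and the
use of sqlite3 as a stand-in for UDD's PostgreSQL database (INSERT OR REPLACE
is SQLite syntax). The retry/refresh policy is left out entirely. It checks
the URLs with a small pool of parallel workers, since they mostly wait on
the network, then stores and exports the results:

```python
import json
import sqlite3
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone

# Hypothetical table; the columns follow the tuple listed in step 2.
SCHEMA = """
CREATE TABLE IF NOT EXISTS url_checks (
    url TEXT PRIMARY KEY,
    last_check_timestamp TEXT NOT NULL,
    status TEXT NOT NULL,
    detailed_status TEXT
)
"""
COLUMNS = ("url", "last_check_timestamp", "status", "detailed_status")


def http_check(url, timeout=10):
    """Return (status, detailed_status) for one URL.

    The 'ok' / 'error' labels are made up for this sketch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok", f"HTTP {resp.status}"
    except Exception as exc:  # HTTP errors, DNS failures, timeouts, bad URLs
        return "error", str(exc)


def run_checks(conn, urls, check=http_check, workers=8):
    """Check each distinct URL in parallel and upsert one row per URL."""
    conn.execute(SCHEMA)
    unique = sorted(set(urls))
    # Threads are enough here: the workers spend their time waiting on I/O.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check, unique))
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT OR REPLACE INTO url_checks VALUES (?, ?, ?, ?)",
        [(url, now, status, detail)
         for url, (status, detail) in zip(unique, results)],
    )
    conn.commit()


def export_json(conn):
    """Dump the whole table as a JSON list of objects, ordered by URL."""
    rows = conn.execute(
        "SELECT url, last_check_timestamp, status, detailed_status "
        "FROM url_checks ORDER BY url"
    )
    return json.dumps([dict(zip(COLUMNS, row)) for row in rows], indent=2)
```

The list of URLs would come from one of the queries in step 1), and the
check function is passed as a parameter so that a different checker (or
a fake one for testing) can be swapped in.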

Lucas
