Re: Bug#752384: HEAnet sourceforge mirror is outdated

To: debian-qa@lists.debian.org
Subject: Re: Bug#752384: HEAnet sourceforge mirror is outdated
From: Stuart Prescott <stuart@debian.org>
Date: Mon, 21 Jul 2014 21:08:10 +1000
Message-id: <lqisau$53r$1@ger.gmane.org>
References: <53BA815E.8030203@serverb.co.uk> <1404732475.12383.19.camel@chianamo> <53BBB93B.3080406@serverb.co.uk> <CAKTje6H-A0R4zLvETOTG4u+-yz6zgdD_Wre=j0mLhmO2+KdnEw@mail.gmail.com> <53CCE53E.90304__40260.4936193258$1405937187$gmane$org@serverb.co.uk>

Hi Daniel,

many thanks for your work on this!

> It should definitely be possible to add a caching mechanism to the
> new redirector, currently I have a couple of ideas on this but both have
> drawbacks.
> 
> 1. Use a Berkeley DB to store the retrieved data, similar to what is
> currently done.

BDB may not be a particularly good choice either [1] -- there are other DBs 
suggested in that thread that seem to have a better future.

[1] https://lists.debian.org/debian-devel/2014/06/msg00328.html

> Problems I foresee:
> * My intention would be to check at the time the script is requested if
> the update_time > 1 hour ago...
> - if yes... get the new RSS and update the DB
> - if no.... use the information from the DB
> 
> What happens if this happens for multiple requests at the same time?

If the update of the db is atomic (which is easy to arrange for most DBs), 
then I'd be tempted to ignore that you might very occasionally request the 
same RSS feed twice in quick succession. Others may disagree with me here... 
but I'd worry about that later if it is actually a problem.
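
A minimal sketch of the "atomic update, ignore the occasional double fetch"
approach, assuming SQLite as the store (my choice for illustration, not
something decided in this thread). The table layout and the names open_cache,
get_feed and fetch are hypothetical; fetch stands in for whatever retrieves
the RSS.

```python
import sqlite3
import time

REFRESH_SECONDS = 3600  # the "1 hour" from the proposal quoted above


def open_cache(path=":memory:"):
    """Open (or create) the cache database; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feeds ("
        "project TEXT PRIMARY KEY, fetched REAL NOT NULL, body TEXT NOT NULL)"
    )
    return conn


def get_feed(conn, project, fetch, now=None):
    """Return the cached feed, refreshing it once it is REFRESH_SECONDS old."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT fetched, body FROM feeds WHERE project = ?", (project,)
    ).fetchone()
    if row is not None and now - row[0] < REFRESH_SECONDS:
        return row[1]
    # Two simultaneous requests may both reach this point and both fetch.
    # That costs one extra download, but cannot corrupt the cache.
    body = fetch(project)
    with conn:  # single transaction: commit on success, rollback on error
        conn.execute(
            "INSERT OR REPLACE INTO feeds (project, fetched, body) "
            "VALUES (?, ?, ?)",
            (project, now, body),
        )
    return body
```

Since each write is a single transaction, a concurrent reader sees either the
old row or the new one, never a half-written feed.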

An alternative to worrying about locking would be to only update the db from 
cron. This adds some latency to the scan which is annoying for the 
maintainer sometimes but not really an issue from the QA perspective. If the 
RSS feed offers a Last-Modified header for HEAD requests, then the cron job 
can be done easily and often (perhaps that should be investigated anyway?).
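
A rough sketch of the Last-Modified check a cron job could do, using only the
Python standard library. Whether the SourceForge RSS feed actually sends this
header for HEAD requests is exactly the thing that still needs investigating,
so treat that as an assumption; the function names are made up for the
example.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def feed_last_modified(url, timeout=30):
    """Send a HEAD request and return the Last-Modified header, or None."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("Last-Modified")


def needs_refresh(last_modified, cached_at):
    """Decide whether to refetch.

    last_modified is the raw header value (or None); cached_at is the POSIX
    timestamp at which our copy was fetched.
    """
    if last_modified is None:
        return True  # no header, so we cannot tell: refetch to be safe
    try:
        modified = parsedate_to_datetime(last_modified).timestamp()
    except (TypeError, ValueError):
        return True  # unparseable header: refetch to be safe
    return modified > cached_at
```

The cron job would then call needs_refresh(feed_last_modified(url), cached_at)
for each feed and only download the full XML when it returns True.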

> 2. Save the XML file to a cache folder
> 
> Then at request time check the time on that file and its age.
> 
> The only problem I can see this causing is disk space (I don't know how
> much of an issue this is for Debian)

Rather than keeping all of them all the time, you could delete them as soon 
as they are older than the refresh time; that could be done with find from 
cron. If we were to split the QA cron jobs that check for outdated sources 
across the day, it would be easy to keep that number down to 10-20% of the 
total XML.
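
The cleanup itself is a one-liner with find, but if the redirector ends up
doing it in-process, a sketch of the same idea could look like this. The
function name, the flat cache directory layout and the one-hour default are
assumptions for illustration.

```python
import os
import time


def prune_cache(cache_dir, max_age_seconds=3600, now=None):
    """Delete regular files in cache_dir whose mtime is older than the limit.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    with os.scandir(cache_dir) as it:
        entries = list(it)  # snapshot first, then delete
    for entry in entries:
        if not entry.is_file(follow_symlinks=False):
            continue  # leave subdirectories and symlinks alone
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            os.unlink(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```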

> The RSS file for the VPCS project is almost 52KB. Picking a figure out
> of the air (as I've no idea how packages use the redirector) of 10000,
> this is going to create 520MB of cached files.

Just to help with one of the two random numbers:

udd=> select count(*) from upstream where watch_file like '%sf.net/%';
 count 
-------
  2409
(1 row)
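
Plugging that count into the estimate, and assuming (purely for the sake of
the arithmetic) that every feed is about the same 52KB as the VPCS one:

```python
def cache_size_mb(feeds, feed_kb=52):
    """Estimated cache size in MB (1 MB = 1000 KB, as in the 520MB figure)."""
    return feeds * feed_kb / 1000


guess = cache_size_mb(10000)   # the figure picked out of the air
actual = cache_size_mb(2409)   # watch files matching sf.net in UDD
print(round(guess), round(actual))                # 520 125
print(round(actual * 0.10), round(actual * 0.20)) # 13 25
```

So roughly 125MB if everything is kept, or somewhere around 13-25MB if only
10-20% of the XML is on disk at any one time.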


hope that helps!

cheers
Stuart


-- 
Stuart Prescott    http://www.nanonanonano.net/   stuart@nanonanonano.net
Debian Developer   http://www.debian.org/         stuart@debian.org
GPG fingerprint    90E2 D2C1 AD14 6A1B 7EBB 891D BBC1 7EBB 1396 F2F7
