Re: http://ftp-master.debian.org/new/ not accessible any more



Hi Ansgar,

On Thu, May 21, 2015 at 11:59:39AM +0200, Ansgar Burchardt wrote:
> 
> I don't know exactly why this changed (maybe different default in
> Apache?),

Probably. I just wanted to know whether this is an *intentional* change
or an accident that will be reverted.

> but scraping web pages seems a suboptimal way to gather
> information.
>
> There is [1] with machine-readable information about packages in NEW.

I agree with this, but the machine-readable data (unless you count HTML
as machine readable) does not contain sufficient information.
When I wrote the machine-readable gatherer, creating single
<package>-<version>.822 files was discussed, but this never happened.
(The gatherer, on the other hand, does have a never-used feature to
export those single .822 files.)
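
To illustrate what per-package .822 files could look like, here is a
minimal sketch that splits an RFC822-style file into one file per
stanza. It is not the gatherer's export feature. The function name
split_822 is hypothetical, and the sketch assumes that stanzas are
separated by blank lines and that each one carries a "Source:" and a
"Version:" field with a single value.

```shell
#!/bin/sh
# Hypothetical helper, not part of the gatherer.
# Assumption: stanzas are separated by blank lines and contain
# "Source:" and "Version:" fields (field names are assumed here).
split_822() {
    # $1: input file, $2: output directory
    mkdir -p "$2"
    awk -v outdir="$2" '
        BEGIN { RS = ""; FS = "\n" }    # paragraph mode: one record per stanza
        {
            src = ""; ver = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^Source: /)  src = substr($i, 9)
                if ($i ~ /^Version: /) ver = substr($i, 10)
            }
            # Stanzas missing either field are skipped.
            if (src != "" && ver != "") {
                out = outdir "/" src "-" ver ".822"
                print $0 > out
                close(out)
            }
        }
    ' "$1"
}

# Usage (paths are placeholders):
#   split_822 "${TARGETDIR}/new.822" "${TARGETDIR}/single"
```

Stanzas that lack either field are skipped rather than written under a
partial name.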

The patch below copes with the new situation, but before I activate it,
it would be nice to have some confirmation that the latest change is
permanent.  Hmmm, maybe I will commit it anyway, since it serves
basically the same purpose and is safe against similar changes in the
future.

Kind regards

        Andreas.

>   [1] <https://ftp-master.debian.org/new.822>



$ git diff
diff --git a/scripts/fetch_ftpnew.sh b/scripts/fetch_ftpnew.sh
index 3acb421..fd26f1f 100755
--- a/scripts/fetch_ftpnew.sh
+++ b/scripts/fetch_ftpnew.sh
@@ -4,9 +4,6 @@ mkdir -p $TARGETDIR
 rm -rf $TARGETDIR/*
 wget -q http://ftp-master.debian.org/new.822 -O ${TARGETDIR}/new.822
 cd $TARGETDIR
-wget -q -r -N --level=2 --no-parent --no-directories http://ftp-master.debian.org/new/
-# Some large packages do contain e huge list of files which just consumes space in our
-# cache - so simply delete these entries which are of no use here
-#  sed -i '/^[-dlrwx]\+ root\/root/d' ${TARGETDIR}/*.html
-# Finally it might be better to keep originals ...
-rm -f $TARGETDIR/index.html*
+for newhtml in `wget -q -O- http://ftp-master.debian.org/new.html | grep '^<a href="new/.*\.html' | sed 's?^<a href="\(new/.*\.html\).*?http://ftp-master.debian.org/\1?'` ; do
+    wget -q $newhtml
+done
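
For illustration only, the link extraction done by the loop above could
also be split into a small function that can be tried without network
access. This is a sketch, not the committed script: the function names
extract_new_urls and fetch_new_pages are hypothetical, and it assumes
that each per-package link starts a line as <a href="new/NAME.html">,
which is what the grep in the patch matches.

```shell
#!/bin/sh
# Hypothetical helpers, not part of fetch_ftpnew.sh.
# Assumption: per-package links start a line as <a href="new/NAME.html">.

extract_new_urls() {
    # $1: base URL; the index HTML is read from stdin.
    # Prints one absolute URL per matching line and nothing else.
    sed -n 's|^<a href="\(new/[^"]*\.html\)".*|'"$1"'/\1|p'
}

fetch_new_pages() {
    # $1: base URL, e.g. http://ftp-master.debian.org
    wget -q -O- "$1/new.html" | extract_new_urls "$1" |
    while IFS= read -r url; do
        wget -q "$url"
    done
}

# Usage:
#   cd "$TARGETDIR" && fetch_new_pages http://ftp-master.debian.org
```

Reading URLs with `while read` and quoting "$url" avoids word splitting
on unusual file names, and a single sed call replaces the grep and sed
pair.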
 

-- 
http://fam-tille.de

