
WARC-file Incompatibility of the Debian web sites



WARC files originate at the Internet Archive.
They are essentially a persistent hash table of the form

key   --- <URL the way it is in the Wild-Wild-Web>
value --- <thefile>

    http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
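
To make the key/value picture concrete, here is a minimal sketch in Python that builds such a URL-to-content table from a .warc.gz file using only the standard library. It is not how grab-site or warc-proxy work internally. The function names are made up for this illustration, and the parser makes simplifying assumptions: it ignores folded (multi-line) header fields, it keeps only "response" records, and a later record for the same URL overwrites an earlier one.

```python
import gzip


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if line.strip() == b"":
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:40])
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the size of the record body in bytes.
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


def index_by_url(stream):
    """Build the URL -> content table from 'response' records only."""
    table = {}
    for headers, payload in read_warc_records(stream):
        if headers.get("warc-type") == "response":
            # Some writers wrap the URI in angle brackets; strip them.
            url = headers["warc-target-uri"].strip("<>")
            table[url] = payload
    return table


def index_warc_gz(path):
    """Open a .warc.gz file (possibly one gzip member per record) and index it."""
    with gzip.open(path, "rb") as stream:
        return index_by_url(stream)
```

Calling index_warc_gz() on a downloaded .warc.gz and printing the keys of the returned dict lists the archived URLs. If that works but a viewer still fails, the problem is more likely in the viewer than in the file itself.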

The issue with the current Debian sites seems to be
that tools like


    https://github.com/ludios/grab-site

create files like this one (~30 MiB)


http://temporary.softf1.com/2017/bugs/www.debian.org-devel-2016-12-28-ec5f8b13-00000.warc.gz

that cannot be viewed with a tool like

    https://github.com/alard/warc-proxy


With the exception of large files

    https://github.com/alard/warc-proxy/issues/5

the warc-proxy actually works fine, and the WARC
creation and viewing tools that I use can be downloaded from

    (~9MiB)
    http://archive.softf1.com/2016/software/2016_12_xx_WARC_tools.tar.xz

However, some sites, including the Debian web sites,
fail to be "WARC-able". It would be nice if this were fixed,
especially given that one never knows when
something will become censored. Please keep in mind that
there is no limit to the absurdity of censorship.
Some day photos of pigeons might be banned, because
maybe some religious sect or political party finds
them offensive or otherwise a threat to its ability
to keep the dumb ones working as slaves for it, paying taxes, etc.

The warc-proxy works fine with files of ~200 MiB,
so the aforementioned


http://temporary.softf1.com/2017/bugs/www.debian.org-devel-2016-12-28-ec5f8b13-00000.warc.gz

is not "too big".


Regards,
Martin.Vahi@softf1.com

