
Re: About Bulk Package Retrieval



fatih durmaz wrote...

> Is there a way to get bulk data from snapshot debian page? I need it for my
> research and I would really appreciate your help.

Note that I'm not one of the people who run the snapshot service, so this
answer is as unofficial as it gets.


As I understand it, the rate limiting exists because requesting a
particular file from the service is relatively expensive: it is not just
serving static content straight from the request path; a database
operation is involved, and a high number of requests brings the entire
system to its limits. Probably too many people try to scrape the service
all of the time, and deliberate denial-of-service attempts likely exist
as well.


Now if I understand your plan correctly, you want to get a full copy of
the package repository for one given date. I assume you already have a
list of the file paths, possibly by parsing the Packages index file.
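In case that list does not exist yet, here is a minimal sketch of how it
could be extracted from an already downloaded and decompressed Packages
index. The function name parse_packages_index is made up for
illustration; Package, Version, Filename and SHA256 are the field names
used in the index itself.

```python
def parse_packages_index(text):
    """Yield one dict per stanza, keeping only the fields needed here.

    Hypothetical helper: stanzas are separated by blank lines, and each
    field is a "Key: value" line. Continuation lines (starting with
    whitespace, e.g. in long descriptions) are skipped.
    """
    wanted = {"Package", "Version", "Filename", "SHA256"}
    entry = {}
    for line in text.splitlines():
        if not line.strip():
            # Blank line: the current stanza is complete.
            if entry:
                yield entry
                entry = {}
            continue
        if line[0] in " \t":
            continue  # continuation of the previous field
        key, _, value = line.partition(":")
        if key in wanted:
            entry[key] = value.strip()
    if entry:
        yield entry  # last stanza without a trailing blank line
```

Each resulting dict then carries the relative path (Filename) and the
checksum (SHA256) for one package file.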


The first approach I can think of works best if that date is not too far
in the past, or is close to a stable release. Then you could try to fetch
the files from your closest Debian package mirror first. After that, you
would need the snapshot service only for the requests that failed, which
might be only a small fraction of the original request count.
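A rough sketch of that mirror-first idea, not a tested downloader. The
mirror URL is an assumption (use whichever mirror is closest to you), the
snapshot timestamp is just an example, and candidate_urls and fetch_first
are names made up here. Please also add a delay between requests so the
snapshot service is not hammered.

```python
import urllib.error
import urllib.request

# Assumptions: pick your own mirror and the snapshot timestamp you need.
MIRROR = "https://deb.debian.org/debian"
SNAPSHOT = "https://snapshot.debian.org/archive/debian/20240503T205304Z"


def candidate_urls(filename, mirror=MIRROR, snapshot=SNAPSHOT):
    """Return the URLs to try, in order: regular mirror, then snapshot."""
    path = filename.lstrip("/")
    return [f"{base.rstrip('/')}/{path}" for base in (mirror, snapshot)]


def fetch_first(urls, opener=urllib.request.urlopen):
    """Return (url, content) for the first URL that can be fetched."""
    for url in urls:
        try:
            with opener(url) as response:
                return url, response.read()
        except urllib.error.URLError:
            # Also covers HTTPError (e.g. 404 on the mirror): try the next one.
            continue
    raise LookupError(f"none of the URLs could be fetched: {urls}")
```

With this ordering the snapshot service only sees requests for files the
regular mirror no longer carries.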

(Aside: this assumes that identical basenames point to the same content,
for example that

    <debian-mirror>/debian/pool/main/b/base-files/base-files_13.2_amd64.deb

and

    https://snapshot.debian.org/archive/debian/20240503T205304Z/pool/main/b/base-files/base-files_13.2_amd64.deb

yield identical results. You can rely on this, as the rule is rarely
violated. If you have concerns: the Packages index also lists checksums,
so you can check whether your downloaded files are correct.)
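Such a check could look roughly like this, assuming you compare against
the SHA256 field of the Packages index. The function name sha256_matches
is made up for illustration.

```python
import hashlib


def sha256_matches(path, expected_hex, chunk_size=1 << 20):
    """Return True if the file at `path` has the expected SHA256 digest.

    The file is read in chunks so that large .deb files do not have to
    fit into memory at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

A file that fails the check would be the one to re-fetch from the
snapshot service.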


Another approach may or may not be possible. If I remember correctly,
the snapshot service stores the files under their hashsum. So if you
know the hashsum of the file you want to retrieve (again, the Packages
index has it), you can request that file directly. Theoretically.

Problems, however:

* That data directory is not accessible via HTTP/HTTPS. At least I have
  never heard of that.
* The hash algorithm is possibly still SHA-1, while nowadays the Packages
  index only has MD5 and SHA256.

Hope that helps,

    Christoph
