fatih durmaz wrote...

> Is there a way to get bulk data from snapshot debian page? I need it for my
> research and I would really appreciate your help.

Note that I'm not one of those who run the snapshot service, so this answer is as unofficial as it gets.

As I understand it, the rate limiting exists because requesting a particular file from the service is relatively expensive: it's not just serving static content straight from the request path; there is a database operation involved, and a high number of requests brings the entire system to its limits. Possibly too many people want to scrape the service all of the time, and deliberate denial-of-service attempts likely exist as well.

Now if I understand your plan correctly, you want to get a full copy of the package repository for one given date. I assume you already have a list of the file paths, possibly by parsing the Packages index file.

The first approach I can think of works best if that date is not too far in the past, or is around a stable release. In that case you could try to fetch the files from your closest Debian package repository first. After that, you would need the snapshot service only for the failing requests, which might be only a small fraction of the original request count.

(Aside: this assumes that identical basenames point to the same content, for example that

    <debian-mirror>/debian/pool/main/b/base-files/base-files_13.2_amd64.deb

and

    https://snapshot.debian.org/archive/debian/20240503T205304Z/pool/main/b/base-files/base-files_13.2_amd64.deb

yield identical results. That is a safe assumption, as this rule is rarely violated. If you have concerns: the Packages index also has hashsums, so you can check whether your downloaded files are correct.)

Another approach may or may not be possible. If I remember correctly, the snapshot service stores the files using their hashsum.
So if you know the hashsum of the file you want to retrieve (again, the Packages index has it), you can request that file directly. Theoretically. There are problems, however:

* That data directory is not accessible via http/https. At least I have never heard of it being so.
* The hash algorithm is possibly still SHA-1, while nowadays the Packages index only has MD5 and SHA256.

Hope that helps,

    Christoph
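
P.S. In case it helps, here is a rough sketch of the first approach (mirror first, snapshot only as a fallback, with a SHA256 check against the Packages index). It is untested against the real services and rests on assumptions of mine: the mirror URL is a placeholder for whatever mirror is closest to you, the timestamp is simply the one from the example above, and the function names are made up for illustration.

```python
import hashlib
import urllib.request

# Assumptions (not facts about your setup): MIRROR is a placeholder for your
# closest mirror; SNAPSHOT uses the example timestamp from the mail above.
MIRROR = "https://deb.debian.org/debian"
SNAPSHOT = "https://snapshot.debian.org/archive/debian/20240503T205304Z"


def parse_packages(text):
    """Yield (Filename, SHA256) pairs from the text of a Packages index."""
    for stanza in text.split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            # Continuation lines start with whitespace; they are skipped here
            # because Filename and SHA256 are single-line fields.
            if line and not line[0].isspace() and ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "Filename" in fields and "SHA256" in fields:
            yield fields["Filename"], fields["SHA256"]


def http_get(url):
    """Return the response body, or None on an HTTP or network error."""
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            return resp.read()
    except OSError:  # urllib.error.URLError and HTTPError derive from OSError
        return None


def fetch_verified(path, sha256, get=http_get, bases=(MIRROR, SNAPSHOT)):
    """Try each base URL in order; return (data, base) once the hash matches."""
    for base in bases:
        data = get(f"{base}/{path}")
        if data is not None and hashlib.sha256(data).hexdigest() == sha256:
            return data, base
    return None, None
```

You would feed it the decompressed Packages index of the snapshot date, loop over parse_packages(), and write out whatever fetch_verified() returns. Since the snapshot service is rate limited, it is probably wise to pause between the requests that actually fall through to it.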