[moving to the list]
I know you said that 'Public mailinglists are the right point of
contact', so I hope you'll forgive me for contacting you directly, but
I didn't get a reply to my previous email and I would still like
access to the Debian archive data if possible (see original message
below).
debian-snapshot@lists.debian.org is still the best place :)
Thanks for replying.
I'm a security researcher
[..]
I'm creating an application that identifies versions of common
libraries for security purposes (i.e. spotting binaries with
associated CVEs). The aim of the project is to find binaries that need
to be patched or updated, so they can't be exploited.
I would want to make a large number of requests initially to populate
the database -- downloading every package file for roughly the 100
most popular packages, for every version going back about 10-15
years. After that I would want to make minimal requests on a daily
basis to check for new versions or new files for the latest version of
each package.
And yes, I am aware that many people will download the source code and
compile it themselves.
There are two parts to the snapshot thing, each with its own resource
constraints.
(a) One is everything that goes to the database, which is pretty much
    every request except those covered in (b). Things have gotten
    somewhat better since we moved the DB for the secondary snapshot
    instance to a new host, but it's probably still not happy to be
    hammered.

    Things that hit the database are links like
      https://snapshot.debian.org/package/postgrey/
      https://snapshot.debian.org/archive/debian/20160816T043010Z/pool/main/p/postgrey/
      https://snapshot.debian.org/archive/debian/20160816T043010Z/pool/main/p/postgrey/postgrey_1.34-1.1.dsc
      https://snapshot.debian.org/archive/debian/?year=2009&month=11
      https://snapshot.debian.org/mr/...
    etc.
    These requests are bound by database latency, and also by the
    number of concurrent requests the DBMS will accept. Further, since
    the pooling class in use is not exactly great, once a certain
    number of requests are in flight, things just fall over and
    everybody starts getting 503s. Don't overload the DB :)
(b) The only requests that do not hit the DB are requests to
      https://snapshot.debian.org/file/<sha1sum of file>
    Those are cheap(ish). They are static files and apache fetches
    them directly from disk (NFS, but still). I wouldn't worry too
    much about making a lot of them. Maybe not concurrently, but
    fetching them fast and sustained shouldn't cause too many issues.
    If things fail, retry slowly?
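For what it's worth, the "retry slowly" advice for /file/<sha1sum>
downloads might look something like this -- a minimal sketch; the
helper names and backoff values are my own choices, not anything the
snapshot service prescribes:

```python
import time
import urllib.error
import urllib.request

BASE = "https://snapshot.debian.org"

def backoff_delays(retries=5, base=2.0):
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def fetch_by_hash(sha1, dest, retries=5):
    """Download /file/<sha1> to dest, retrying slowly on failure (e.g. a 503).

    Returns True on success, False once all retries are exhausted.
    """
    for delay in backoff_delays(retries):
        try:
            urllib.request.urlretrieve(f"{BASE}/file/{sha1}", dest)
            return True
        except urllib.error.URLError:
            time.sleep(delay)  # back off before the next attempt
    return False
```

Fetching sequentially (not concurrently) with a schedule like this
keeps failures from turning into a hammering loop.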
Looking at
https://salsa.debian.org/snapshot-team/snapshot/raw/master/API,
you'll probably need to make some requests to learn which files to
download. Please do those one at a time, and spread them out? I don't
know what a reasonable rate is that still lets you get what you need.
How many requests do you think it'll need?
I guess there will be requests to /mr/package/<package>/ for "the 100
most popular packages", so that's reasonably small. And then maybe
/mr/package/<package>/<version>/allfiles to learn all about the files?
So that'd be one request per package per version. Any guess how many
that'd be? 10k? 100k? How many versions does the average "popular
package" have? A request every few seconds should still get you what
you need in a reasonable time?
Once you have that info, it should all just be file downloads?
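The metadata phase sketched above could be paced like this -- an
illustrative sketch, assuming the JSON shape described in the API
document linked earlier (the "result"/"version" field names come from
that document; the pause value and helper names are my own):

```python
import json
import time
import urllib.request

BASE = "https://snapshot.debian.org"
PAUSE = 3.0  # seconds between metadata requests, per the "one at a time" advice

def versions_url(pkg):
    """DB-backed endpoint listing all known versions of a package."""
    return f"{BASE}/mr/package/{pkg}/"

def allfiles_url(pkg, ver):
    """DB-backed endpoint listing all files for one package version."""
    return f"{BASE}/mr/package/{pkg}/{ver}/allfiles"

def file_url(sha1):
    """Cheap static-file endpoint; see (b) above."""
    return f"{BASE}/file/{sha1}"

def mr(url):
    """One metadata request against the /mr API, followed by a pause."""
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    time.sleep(PAUSE)  # spread out DB-bound requests
    return data

def list_versions(pkg):
    """All known versions of a package, one paced request."""
    return [r["version"] for r in mr(versions_url(pkg))["result"]]
```

As a rough estimate at that pace: 10k allfiles lookups at one every 3
seconds is about 8.5 hours, and 100k is roughly 3-4 days -- which
seems to fit the "reasonable time" ballpark above. After that it's
just file_url() downloads.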