
Re: FW: Accessing snapshot.debian.org packages



Hello,

On Fri., Jul. 05 2019, Peter Palfrader wrote:

[moving to the list]

I know you said that 'Public mailinglists are the right point of contact.', so I hope you'll forgive me for contacting you directly, but I didn't get a reply to my previous email and I would still like access to the Debian
archive data if possible (see original message below).

debian-snapshot@lists.debian.org is still the best place :)

Thanks for replying.

I'm a security researcher
[..]

I'm creating an application to identify versions of common libraries for security purposes (to identify binaries that have associated CVEs). The aim of this project is to identify binaries that need to be patched / updated,
so they can't be exploited.

I would want to make a large number of requests initially to populate the database -- I would want to download every package file for about the 100 most popular packages for every version going back about 10-15 years. After that I would want to make minimal requests on a daily basis to check for new
versions or new files for the latest version of each package.

And yes, I am aware that many people will download the source code and
compile it themselves.

There are two parts to the snapshot thing, each with its own resource
constraints.

(a) One is everything that goes to the database, which is pretty much every request except those described in (b). Things have gotten somewhat better since we moved the DB for the secondary snapshot instance to a new host, but it's probably still not happy to be hammered.

    Things that hit the database are links like
    https://snapshot.debian.org/package/postgrey/
    https://snapshot.debian.org/archive/debian/20160816T043010Z/pool/main/p/postgrey/
    https://snapshot.debian.org/archive/debian/20160816T043010Z/pool/main/p/postgrey/postgrey_1.34-1.1.dsc
    https://snapshot.debian.org/archive/debian/?year=2009&month=11
    https://snapshot.debian.org/mr/...
    etc.

These requests are bound by database latency and also by the number of concurrent requests to the DBMS. Further, since the pooling class in use is not exactly great, once a certain number of requests are in flight, things just fall over and everybody starts getting 503s.

    Don't overload the DB :)


This is very good to know; in a previous email of mine to this list I mentioned that it should be noted in the documentation for the API, and a link to your email would be very desirable in addition to that.


(b) The only requests that do not hit the DB are requests to
    https://snapshot.debian.org/file/<sha1sum of file>

Those are cheap(ish). They are static files and Apache fetches them directly from disk (NFS, but still). I wouldn't worry too much about making a lot of them. Maybe not concurrently, but fetching them fast and sustained shouldn't cause too many issues. If things fail, retry slowly?
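
To make the "retry slowly" part concrete, here is a minimal sketch (Python with the requests library; the back-off values, destination filename and example hash are only illustrative, the /file/<sha1sum> endpoint is the one named above):

    import time
    import requests

    SNAPSHOT = "https://snapshot.debian.org"

    def fetch_file(sha1, dest, retries=5, pause=30):
        """Download one file by its SHA-1 hash, backing off slowly on failure."""
        url = f"{SNAPSHOT}/file/{sha1}"
        for attempt in range(1, retries + 1):
            try:
                resp = requests.get(url, timeout=60)
                if resp.status_code == 200:
                    with open(dest, "wb") as f:
                        f.write(resp.content)
                    return True
                # e.g. a 503: the service is busy, so wait and try again
            except requests.RequestException:
                pass
            time.sleep(pause * attempt)  # wait longer after every failed attempt
        return False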

Looking at https://salsa.debian.org/snapshot-team/snapshot/raw/master/API, you'll probably need to make some requests to learn which files to download. Please do those one at a time, and spread them out? I don't know what a reasonable rate is that still lets you get what you need.
How many requests do you think it'll need?

I guess there will be requests to /mr/package/<package>/ for "the 100 most popular packages", so that's reasonably small. And then maybe /mr/package/<package>/<version>/allfiles to learn all about the files? So that'd be once per package per version. Any guess how many that'd be? 10k? 100k? How many versions does the average "popular package" have? A request every few seconds should still get you what you need in a reasonable time?

Once you have that info, it should all just be file downloads?
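
Putting that metadata pass together, a rough sketch of the rate-limited crawl described above could look like this (Python; the package list, the pause length and the exact JSON handling are assumptions, only the two /mr/ endpoints come from the API document):

    import time
    import requests

    SNAPSHOT = "https://snapshot.debian.org"
    PAUSE = 5                            # seconds between metadata requests
    PACKAGES = ["postgrey", "openssl"]   # stand-in for "the 100 most popular packages"

    def mr(path):
        """One machine-readable API call, followed by a polite pause."""
        resp = requests.get(f"{SNAPSHOT}{path}", timeout=60)
        resp.raise_for_status()
        time.sleep(PAUSE)
        return resp.json()

    for pkg in PACKAGES:
        # one request per package: every source version snapshot knows about
        versions = [v["version"] for v in mr(f"/mr/package/{pkg}/")["result"]]
        for version in versions:
            # one request per (package, version): the associated files and hashes
            allfiles = mr(f"/mr/package/{pkg}/{version}/allfiles")
            # ... record the hashes here; the downloads themselves then go
            # through /file/<sha1sum> as sketched above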


Mostly; there is also /mr/file/<hash>/info to learn the file name, size and first-seen date. From my tests, IIRC 3-4 requests are needed to have full information to actually download a file or to decide whether it should be downloaded.
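
For that step, a similarly hedged sketch (again Python; the layout of the "result" field and the size-based filter are assumptions on my part):

    import requests

    SNAPSHOT = "https://snapshot.debian.org"

    def file_info(sha1):
        """Name, size and first-seen records for one hash (another DB-backed request)."""
        resp = requests.get(f"{SNAPSHOT}/mr/file/{sha1}/info", timeout=60)
        resp.raise_for_status()
        return resp.json()["result"]

    def worth_downloading(sha1, max_size=50 * 1024 * 1024):
        # made-up policy: only fetch files up to 50 MiB
        return any(rec.get("size", 0) <= max_size for rec in file_info(sha1))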


Since I work for a large company, I do have access to resources, including funding to cover expenses (for hardware, travel, services etc.), if that
helps. I hope that my indirect reference to money doesn't seem
inappropriate, I was just thinking there might be a way to access the data
that would incur some costs.

I'm not sure that throwing money at the problem currently would help much. It's mainly a manpower issue as few (if any) other people ever look after snapshot and I don't really have any time for it either.


Not necessarily wrong either, and roughly along the lines of what I also mentioned in a previous email:

Another option, which may not be feasible, would be to make the db available for download and give people the ability to process that
on their own; is a db dump (without the packages) huge?


If this were practicable, even if the dump only happened *somewhat* often (once a week? once a month? depends on the data), it'd allow people/organisations to, e.g. locally replicate the API service (DB included) and only hit snapshot.debian.org for file downloads if absolutely necessary and not already cached.

That would enable some use-cases and also allow people without access to snapshot.debian.org to contribute to improving the service; e.g. by modifying the software contacting the database without having the whole archive locally.

If this were not practicable and someone needs to be running this kind of analysis, it could be possible to dump the database to hard drives that are sent around by postal service or in person; sadly, this would require some manual intervention and therefore can't happen too often, since all volunteers' time is limited.

Both of these options, and probably others I'm not seeing, can only be done with some resources, so I don't think the mention of resource availability is out of order.

All of this, and what I mentioned in my previous email, is because the data in snapshot.debian.org has huge potential, but it also currently has a very high barrier to entry.
--
Evilham

