
Re: FW: Accessing snapshot.debian.org packages



[moving to the list]

> I know you said that 'Public mailinglists are the right point of contact.',
> so I hope you'll forgive me for contacting you directly, but I didn't get a
> reply to my previous email and I would still like access to the Debian
> archive data if possible (see original message below).

debian-snapshot@lists.debian.org is still the best place :)

> Thanks for replying.
> 
> I'm a security researcher
[..]

> I'm creating an application to identify versions of common libraries for
> security purposes (to identify binaries that have associated CVEs). The aim
> of this project is to identify binaries that need to be patched / updated,
> so they can't be exploited.
> 
> I would want to make a large number of requests initially to populate the
> database -- I would want to download every package file for about the 100
> most popular packages for every version going back about 10-15 years. After
> that I would want to make minimal requests on a daily basis to check for new
> versions or new files for the latest version of each package.
> 
> And yes, I am aware that many people will download the source code and
> compile it themselves.

There are two parts to the snapshot thing, each with its own resource
constraints.

(a) One is everything that goes to the database, which is pretty much
    every request except for those in (b).  Things have gotten somewhat
    better since we moved the DB for the secondary snapshot instance
    to a new host, but it's probably still not happy to be hammered.

    Things that hit the database are links like
    https://snapshot.debian.org/package/postgrey/
    https://snapshot.debian.org/archive/debian/20160816T043010Z/pool/main/p/postgrey/
    https://snapshot.debian.org/archive/debian/20160816T043010Z/pool/main/p/postgrey/postgrey_1.34-1.1.dsc
    https://snapshot.debian.org/archive/debian/?year=2009&month=11
    https://snapshot.debian.org/mr/...
    etc.

    These requests are bound by database latency, and also by the number
    of concurrent requests to the DBMS.  Further, since the pooling class
    in use is not exactly great, once a certain number of requests are
    in flight, things just fall over and everybody starts getting 503s.

    Don't overload the DB :)

(b) The only requests that do not hit the DB are requests to
    https://snapshot.debian.org/file/<sha1sum of file>

    Those are cheap(ish).  They are static files, and Apache fetches them
    directly from disk (NFS, but still).  I wouldn't worry too much
    about making a lot of them.  Maybe not concurrently, but fetching
    them fast and sustained shouldn't cause too many issues.  If things
    fail, retry slowly?
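
A minimal sketch of the "fetch by hash, retry slowly" idea from (b), in
Python.  Only the /file/<sha1sum> path comes from this mail; the function
names, the 30-second starting delay, the doubling, and the five-attempt cap
are assumptions chosen for illustration, not values the snapshot admins
have recommended.

```python
import time
import urllib.request

BASE = "https://snapshot.debian.org"


def file_url(sha1_hex):
    """Build the static-file URL for a SHA-1 given as 40 hex characters."""
    h = sha1_hex.lower()
    if len(h) != 40 or any(c not in "0123456789abcdef" for c in h):
        raise ValueError(f"not a SHA-1 hex digest: {sha1_hex!r}")
    return f"{BASE}/file/{h}"


def _http_get(url):
    # Plain sequential GET; no concurrency on purpose.
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def fetch_file(sha1_hex, get=_http_get, sleep=time.sleep,
               max_tries=5, first_delay=30.0):
    """Fetch one file, backing off slowly when a request fails.

    The delay values are assumptions.  Any OSError (which includes
    urllib's URLError and HTTPError) triggers a retry; the last failure
    is re-raised.
    """
    url = file_url(sha1_hex)
    delay = first_delay
    for attempt in range(1, max_tries + 1):
        try:
            return get(url)
        except OSError:
            if attempt == max_tries:
                raise
            sleep(delay)   # wait before trying again
            delay *= 2     # and wait longer next time
```

The `get` and `sleep` parameters exist only so the retry logic can be
exercised without touching the network.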

Looking at https://salsa.debian.org/snapshot-team/snapshot/raw/master/API,
you'll probably need to make some requests to learn which files to
download.  Please do those one at a time, and spread them out?  I don't
know what a reasonable rate is that still lets you get what you need.
How many requests do you think it'll need?

I guess there will be requests to /mr/package/<package>/ for "the 100
most popular packages", so that's reasonably small.  And then maybe
/mr/package/<package>/<version>/allfiles to learn all about the files?
So that'd be once per package per version.  Any guess how many that'd
be?  10k?  100k?  How many versions does the average "popular package"
have?  A request every few seconds should still get you what you need in
a reasonable time?
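
As a rough illustration of the metadata pass described above, here is a
Python sketch that builds the two /mr/ URLs and issues requests strictly
one at a time with a fixed gap.  The two paths are the ones named in this
mail; the helper names, the 3-second interval (one reading of "every few
seconds"), and the percent-encoding of package and version strings are
assumptions.  The response format is not covered here, so the sketch
leaves parsing to the caller.

```python
import time
from urllib.parse import quote

BASE = "https://snapshot.debian.org"


def versions_url(package):
    """URL listing the known versions of a source package."""
    return f"{BASE}/mr/package/{quote(package, safe='')}/"


def allfiles_url(package, version):
    """URL listing all files of one package version."""
    # Percent-encoding the version (e.g. an epoch colon) is an assumption.
    return (f"{BASE}/mr/package/{quote(package, safe='')}/"
            f"{quote(version, safe='')}/allfiles")


def paced(urls, get, interval=3.0, sleep=time.sleep, clock=time.monotonic):
    """Yield (url, get(url)) sequentially, at least `interval` seconds apart."""
    last = None
    for url in urls:
        if last is not None:
            wait = interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield url, get(url)


def crawl_hours(n_requests, interval=3.0):
    """Lower bound on wall-clock hours for n_requests at the given spacing."""
    return n_requests * interval / 3600
```

At one request every 3 seconds, 10k requests take a bit over 8 hours and
100k take about 83 hours (roughly three and a half days), so even the
larger guess finishes in days rather than weeks.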

Once you have that info, it should all just be file downloads?





> Since I work for a large company, I do have access to resources, including
> funding to cover expenses (for hardware, travel, services etc.), if that
> helps. I hope that my indirect reference to money doesn't seem
> inappropriate, I was just thinking there might be a way to access the data
> that would incur some costs.

I'm not sure that throwing money at the problem currently would help
much.  It's mainly a manpower issue as few (if any) other people ever
look after snapshot and I don't really have any time for it either.

-- 
                            |  .''`.       ** Debian **
      Peter Palfrader       | : :' :      The  universal
 https://www.palfrader.org/ | `. `'      Operating System
                            |   `-    https://www.debian.org/

