
Re: overly short max-age of archive -> file redirect



On Wed, 2024-11-06 at 09:40 +0100, Felix Moessbauer wrote:
> On Wed, 2024-11-06 at 07:19 +0100, Linus Nordberg wrote:
> > "MOESSBAUER, Felix" <felix.moessbauer@siemens.com> wrote
> > Tue, 5 Nov 2024 09:09:21 +0000:
> > 
> > > Further, the rate-limits should be precisely documented so clients /
> > > caching proxies can adapt to this. The limits also need to match the
> > > retry-after header in the 429 responses. Currently s.d.o responds
> > > with retry-after 5 (seconds), which is by far too short to overcome
> > > the limit.
> > 
> > Thanks for reporting this.
> > 
> > I've been planning on documenting the rate limiting somewhere once it
> > seems to behave reasonably well. It was added in a bit of a haste last
> > weekend.
> > 
> > One thing I would hesitate to put in a document somewhere for a human
> > to read once and then base their implementation on is numbers that are
> > likely to change. What do you think about the server setting
> > X-RateLimit-Remaining in responses?
> 
> Hi,
> 
> that would be even better. In [1] there are more headers regarding rate
> limiting which could be considered. Maybe check these as well.
> 
> [1]
> https://www.ietf.org/archive/id/draft-polli-ratelimit-headers-02.html
> 
> > 
> > The Retry-After has been fixed in [this patchset][] which I will try
> > to get merged in a day or two.
> 
> Thanks! Currently apt does not obey this, but I'll try to add support.

Hi,

I'm currently working on supporting this in ...
Feel free to get involved. Especially the needed (random) request
distribution seems to be problematic from apt's perspective, but it is
needed to support rate limiting.

https://salsa.debian.org/apt-team/apt/-/merge_requests/383
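
To illustrate the retry behaviour I have in mind on the client side: the
sketch below honours Retry-After and adds random jitter. This is only an
illustrative Python sketch (apt itself is C++, and the back-off parameters
here are made up), not the actual code from the merge request:

    import random
    import time

    import requests  # illustrative HTTP client, not what apt uses

    def fetch_with_backoff(url, max_retries=5):
        """Fetch url, honouring Retry-After on 429 responses and adding
        random jitter so clients do not all retry at the same instant."""
        for attempt in range(max_retries):
            resp = requests.get(url)
            if resp.status_code != 429:
                return resp
            # Prefer the server-provided Retry-After (seconds); fall back
            # to exponential back-off if the header is missing.
            wait = float(resp.headers.get("Retry-After", 2 ** attempt))
            # Spread retries randomly over the wait window instead of
            # hammering the server exactly when the window reopens.
            time.sleep(wait + random.uniform(0, wait))
        raise RuntimeError("still rate-limited after %d retries: %s"
                           % (max_retries, url))

The random jitter is the part that is hard to retrofit into apt: without
it, all clients that were rejected together retry together and immediately
hit the limit again. A header like X-RateLimit-Remaining would additionally
allow clients to slow down before the first 429.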

> Currently apt just retries with exponential backoff (1, 2, 4, ...) secs
> and most people just recommend a max of 3 retries. Hence, this does not
> help at all to overcome the limit. But let's fix things one by one.
> 
> > 
> > [this patchset]:
> > https://salsa.debian.org/linus/dsa-puppet/-/compare/production...linus%2Fsnapshot-ratelimit?from_project_id=2990
> > 
> > 
> > > If rate-limiting were implemented correctly, downstream caches could
> > > properly cache the results and clients like apt could behave nicely.
> > > I further recommend using WAY higher request limits in combination
> > > with a moving-average limit on the amount of transferred data. By
> > > that, the cheap "is my cache still valid" requests could pass, while
> > > the heavier payload transfers are avoided. Also, clients could hit
> > > s.d.o without reduced transfer rates, hence reducing the number of
> > > open handles on the server.
> > 
> > What do you think would be a reasonable request limit?
> 
> That depends on what your server can handle. It's been 5 years since I
> worked in that field, so take the following with a grain of salt:
> 
> In general, as few requests as possible should reach the flask
> application, as it is not suited for high loads. Running it under
> gunicorn or another executor instead of running it directly improves
> things a bit, as you get multithreading. Anyway, the most important
> thing is to return correct cache headers from the flask app, so Varnish
> can cache them.

After re-visiting this, I noticed that probably the ONLY missing piece in
a performant snapshot.d.o is providing meaningful cache headers for the
redirects to static files on /archive. Currently these redirects are
served with a 302 and a 600s max-age. Serving them with a 301 and either
no max-age or a long one should fix this. The corresponding code is here:
https://salsa.debian.org/snapshot-team/snapshot/-/blob/master/web/app/snapshot/views/archive.py?ref_type=heads#L96
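
For illustration, the change could look roughly like the sketch below.
This is not the actual archive.py code; the route and the resolve_hash()
helper are placeholders for the existing lookup, only the response
handling matters:

    from flask import Flask, redirect

    app = Flask(__name__)

    # The /file/<hash> target is content-addressed, so the redirect for a
    # given (timestamp, path) never changes and can be cached for a long
    # time.
    ONE_YEAR = 365 * 24 * 3600

    @app.route("/archive/<archive>/<timestamp>/<path:path>")
    def archive_file(archive, timestamp, path):
        digest = resolve_hash(archive, timestamp, path)  # placeholder lookup
        resp = redirect("/file/" + digest, code=301)
        resp.headers["Cache-Control"] = "public, max-age=%d" % ONE_YEAR
        return resp

With a 301 and a long max-age, Varnish (and any downstream proxy) can
answer the redirect without ever reaching the flask app.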

As the rate-limit is already only applied to cache misses (from Varnish's
perspective), most clients (and especially repeated builds) are not
expected to run into the limits at all.

> These cache hit rates should be monitored and should be close to 100%
> during normal operation (except for the intentionally direct calls). In
> contrast, large files (e.g. large / all debs) should be served directly
> (bypassing the cache and ideally flask). For details, please also see
> [2].
> Finally, the storage backend needs to be fast - and by that I mean not a
> FUSE mount.

I just noticed that this is already implemented (after finding the
corresponding config file...). That's great news:
https://salsa.debian.org/dsa-team/mirror/dsa-puppet/-/blob/production/modules/roles/templates/snapshot/snapshot.debian.org.conf.erb?ref_type=heads#L82
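
For completeness, if the handoff ever had to happen from the flask side
instead of the web server, flask's send_from_directory [2] would at least
avoid streaming the file through hand-written Python code - roughly like
this (the path is made up, and serving straight from the web server as in
the config above remains the better option):

    from flask import Flask, send_from_directory

    app = Flask(__name__)

    # Assumed path for illustration only; the real setup serves these
    # files directly from the web server / Varnish, bypassing flask.
    FARM_ROOT = "/srv/snapshot.debian.org/farm"

    @app.route("/file/<digest>")
    def file_by_hash(digest):
        # send_from_directory streams the file and guards against path
        # traversal, but every byte still passes through the Python
        # process - hence "ideally bypassing flask".
        return send_from_directory(FARM_ROOT, digest)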

Best regards,
Felix

> 
> Once all that is implemented, the limits can likely be much higher
> without risking DoS.
> 
> [2]
> https://tedboy.github.io/flask/generated/flask.send_from_directory.html
> 
> > 
> > We're currently limiting the number of requests per time unit since
> > that's what killed one of the servers last week. I haven't looked into
> > how to limit the amount of transferred data yet.
> 
> Having this limit applied per IP is unfortunate for big companies (like
> Siemens), as all users are funneled through a couple of egress IPs.
> 
> Best regards,
> Felix Moessbauer 
> 

-- 
Siemens AG, Technology
Linux Expert Center


