
Re: Snapshot behind Fastly; roles and responsibilities



Hi,

On 11/18/24 11:38 AM, MOESSBAUER, Felix wrote:
> On Mon, 2024-11-18 at 10:35 +0100, Linus Nordberg wrote:
>> Hi all,
>>
>> Snapshot has been behind Fastly since Sunday, Nov 17 2024. I think
>> that's bad and would like to change that. It's bad in the short term
>> since we expose user data to a third party. It's bad in the long term
>> since the short-term bad won't go away until we learn how to deal
>> with web traffic.
> 
> That's a trade-off between the advantages of a CDN and privacy.
> For me as a snapshot user who needs it to build reproducible things in
> CI systems, the most important aspects are reliability and performance.

That's also how I see it. We need a way to ban entire ASes from Debian
infrastructure for as long as they keep sending abusive requests from a
very large number of IP addresses.

While I think we should make sure that we can keep up with a high volume
of requests (which probably requires pgbouncer and some other fixes),
serving the ridiculous amount of scraping Tencent sends without
coordination or backoff is not helpful.

I hacked together something to collect prefix data from BGP, and I could
have put the result into an ipset to block on - but Fastly made that
ridiculously easy. And Tencent had shot at snapshot-master (sallinen) the
day before, which was easy to shield off.
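
Roughly the kind of glue I mean - a sketch, not the actual script; the
set name and input file are made up, and it assumes the announced
prefixes for the AS have already been collected from the BGP data:

  #!/usr/bin/env python3
  # Sketch: turn a list of prefixes announced by an AS into an ipset.
  # Assumes prefixes.txt holds one CIDR per line, collected elsewhere
  # from BGP/whois data; the set name is a placeholder.
  import subprocess

  SET_NAME = "abusive-as"

  def load_prefixes(path="prefixes.txt"):
      with open(path) as f:
          return [line.strip() for line in f if line.strip()]

  def block(prefixes):
      # hash:net takes whole CIDR prefixes; -exist keeps re-runs idempotent
      subprocess.run(["ipset", "create", "-exist", SET_NAME, "hash:net"], check=True)
      for prefix in prefixes:
          subprocess.run(["ipset", "add", "-exist", SET_NAME, prefix], check=True)
      # a matching iptables rule still has to exist, e.g.:
      #   iptables -I INPUT -m set --match-set abusive-as src -j DROP

  if __name__ == "__main__":
      block(load_prefixes())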

Note that a lot of traffic to snapshot is plain HTTP - and it traverses
half the world to get to the target host - so the privacy guarantees are
already very low. We are also not serving user data here, only known
public bits.

>> I have not been able to solve the problem of more incoming HTTP
>> traffic than the snapshot setup can comfortably deal with. Partly
>> because I'm not very knowledgeable in this field and partly because I
>> have not been given enough access to the cache layer(s).

My hope is that with Fastly in the path it's easier to open up that log.
Technically we still have a mix of Fastly and whatever goes to
snapshot-mlm-01 directly, but maybe that is fine.

> I also had a look at this topic (mostly based on code-review) and
> identified a couple of problems:
> 
> 1. apt behaves badly on 429 TooManyRequests. Addressed in [1]

I think investments into apt's retry logic are the most important.
Individual failures should be retried sensibly, as we cannot guarantee a
100% success rate.
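
To make "sensibly" concrete, this is the behaviour I have in mind on the
client side - a sketch only, not apt's actual code: honour Retry-After
on 429 and otherwise back off exponentially instead of failing the whole
run:

  # Sketch of sensible client-side retry behaviour on 429 (illustration
  # only, not apt's code): honour Retry-After if the server sends it,
  # otherwise back off exponentially instead of failing immediately.
  import time
  import urllib.error
  import urllib.request

  def fetch_with_retries(url, max_attempts=5):
      for attempt in range(1, max_attempts + 1):
          try:
              with urllib.request.urlopen(url) as resp:
                  return resp.read()
          except urllib.error.HTTPError as e:
              if e.code != 429 or attempt == max_attempts:
                  raise
              retry_after = e.headers.get("Retry-After")
              delay = int(retry_after) if retry_after and retry_after.isdigit() else 2 ** attempt
              time.sleep(delay)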

> 2. Expensive redirects to the farm (DB lookup!) are cached for too
> short a time. Addressed in [2], also affected by [3]

Varnish has been behaving really badly with the "file" storage backend.
We kept shooting down objects continuously. It looks like that backend is
effectively unsupported; everyone is supposed to use Varnish Enterprise
with the corresponding storage engine if they want any notion of
persistence. Varnish Open Source evicts from the cache on the hot path,
so you need to do a lot of manual sizing to ensure that you have free
slots for incoming objects. That became impossible with too many active
elements in the cache - together with us caching both small and large
objects in the same store.

I'm not necessarily convinced that we have a high cache hit rate here,
beyond a few "repositories" that people are using for hermetic builds
(e.g. there are a lot of requests with a bazel User-Agent against pinned,
versioned repositories).
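
If someone with log access wants to check that assumption, this is
roughly the analysis I mean - a sketch; the combined-log regex and field
layout are guesses and would need adjusting - counting how many requests
per User-Agent are repeats of a URL already seen, i.e. could have been
cache hits at all:

  # Sketch: estimate how much traffic is repeat requests for the same URL
  # (i.e. cacheable at all), split by User-Agent. Reads an access log on
  # stdin; the log format assumed here is Apache combined.
  import collections
  import re
  import sys

  LINE = re.compile(r'"(?:GET|HEAD) (?P<url>\S+) HTTP/[^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"')

  per_ua = collections.defaultdict(collections.Counter)
  for line in sys.stdin:
      m = LINE.search(line)
      if m:
          per_ua[m.group("ua").split("/")[0]][m.group("url")] += 1

  for ua, urls in sorted(per_ua.items(), key=lambda kv: -sum(kv[1].values())):
      total = sum(urls.values())
      repeats = total - len(urls)
      print(f"{ua}: {total} requests, {repeats} repeats ({repeats / total:.0%})")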

> 3. Varnish internal redirect to the farm not working [4]; unfortunately
> reverted due to not working properly in the prod setup

I hope to get around to retrying the change today, on !mlm-01. I made
the mistake of testing the change on the production machine and then not
being able to debug much after that.

I was personally annoyed by the browser not downloading the correct
filenames. :)

> [1] https://salsa.debian.org/apt-team/apt/-/merge_requests/383
> [2] https://salsa.debian.org/snapshot-team/snapshot/-/merge_requests/23
> [3] https://salsa.debian.org/dsa-team/mirror/dsa-puppet/-/commit/63f16e08199040871752135df533f0001fe537fb
> [4] https://lists.debian.org/debian-snapshot/2024/11/msg00008.html
> 
>>
>> DSA have legitimate concerns about exposing user data to people who
>> do not need access to it. Would it help if my relation to Debian was
>> formalised further than the current status of Debian Contributor?
> 
> I'm just a DM, but I definitely want to help improve the situation.
> 
>>
>> More generally, I sometimes find it hard to understand the roles and
>> responsibilities wrt the snapshot service. This results in me on the
>> one hand being overly cautious about asking for some things and on the
>> other hand sometimes pestering the wrong people, most probably also in
>> the wrong way. It would be good to minimise unnecessary frustration
>> and lost calendar time.
> 
> Same! It took me quite some time to get an understanding of the overall
> architecture of s.d.o with all its layers. Also, I don't know who is
> responsible for the intermediate infrastructure (basically everything
> between the s.d.o flask app and the DNS entry s.d.o).

It should be simpler. It's a bit of a Rube Goldberg machine when you
have multiple caching/proxying/rate limiting layers.

I jumped in because it looked like some more attention/bandwidth was
needed temporarily. (And I temporarily had some more time on my hands to
help out.)

I cannot speak for DSA just yet - but in general the delineation is that
DSA wires up the web setup to serve things and the remainder is on the
service owner. Of course here we have a ton of components in Puppet
(haproxy/varnish/iptables/apache) that have snapshot-specific
configuration bits. That means that any change to the outer
infrastructure requires time from a DSA member to test and deploy the
change.

IMO the most important concern around granting more access is privilege
escalation - i.e. whether a service configuration change can influence
the machine's configuration. We have delegated the power to change
apache2 configs to service owners, for instance - the config changes are
versioned and then apache2 is reloaded. apache2 also does not crash when
reloaded with an invalid config - although it will then no longer start
cleanly. varnish, as handled by Puppet, is restarted rather than
reloaded - and will thus fail to come up when the VCL is broken.
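
As an illustration of why that difference matters, one could gate the
restart on a VCL compile check, roughly like this sketch (the paths and
service name are assumptions about the local setup, not our actual
deployment):

  # Sketch: refuse to restart varnish unless the new VCL actually compiles.
  # Paths and the service name are assumptions about the local setup.
  import subprocess
  import sys

  def vcl_compiles(vcl_path="/etc/varnish/default.vcl"):
      # varnishd -C compiles the VCL to C and exits non-zero on errors
      result = subprocess.run(["varnishd", "-C", "-f", vcl_path],
                              capture_output=True)
      return result.returncode == 0

  if __name__ == "__main__":
      if not vcl_compiles():
          sys.exit("VCL does not compile, not restarting varnish")
      subprocess.run(["systemctl", "restart", "varnish"], check=True)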

> Further, I can only guess where exactly the bottlenecks are. These
> obviously depend on the usage patterns, which I (for good reasons) do
> not have insight into.

Munin is also not super helpful for visualizing this data. I'd be open
to a setup that allows for more custom introspection (specifically
latency, error codes, and internal service state), e.g. a Grafana
instance.
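
As a sketch of the kind of introspection I mean - if the flask app
exported latency and status-code metrics that Prometheus/Grafana could
scrape (metric names made up, purely illustrative):

  # Sketch: export request latency and status codes from the flask app so
  # Prometheus/Grafana could graph them. Metric names are made up.
  import time
  from flask import Flask, request
  from prometheus_client import Counter, Histogram, make_wsgi_app
  from werkzeug.middleware.dispatcher import DispatcherMiddleware

  app = Flask(__name__)
  LATENCY = Histogram("snapshot_request_seconds", "Request latency", ["endpoint"])
  RESPONSES = Counter("snapshot_responses_total", "Responses by status code", ["code"])

  @app.before_request
  def start_timer():
      request.environ["snapshot.t0"] = time.monotonic()

  @app.after_request
  def record(response):
      LATENCY.labels(request.endpoint or "unknown").observe(
          time.monotonic() - request.environ["snapshot.t0"])
      RESPONSES.labels(str(response.status_code)).inc()
      return response

  # Expose /metrics next to the app for Prometheus to scrape.
  app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {"/metrics": make_wsgi_app()})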

Kind regards
Philipp Kern

