
Re: Possibly moving Debian services to a CDN



On 13/10/13 at 08:44 +0200, Tollef Fog Heen wrote:
> We appreciate feedback while we continue our investigation of CDNs.
 
Hi,

I'm trying to summarize the discussion so far and add my own
understanding/thoughts, in a set of Q & A.

Q: What problem are we trying to solve? What's the current status? 
==================================================================

The Debian project needs to distribute content over HTTP; mainly
packages (on ftp.d.o and security.d.o) and websites (such as www.d.o).
For that, a set of machines running Debian and managed by DSA and
located in various datacenters all over the world are used. 

Additionally, for ftp.d.o, Debian relies on mirrors provided by third
parties.  Those mirrors are not managed by DSA, and might be running
non-free software (that could include primary mirrors, i.e.
ftp.*.debian.org).

Mirroring the Debian archive is tricky, as files need to be copied in
the correct order (to avoid having files in dists/ point to files
not yet copied in pool/).  A script (ftpsync[1]) is provided by the
mirrors team, but is not used on every mirror AFAIK.

[1] http://www.debian.org/mirror/ftpmirror
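
To make the ordering constraint concrete, here is a minimal Python
sketch. It is not how ftpsync is implemented; the function name
sync_order and the sample paths are made up for illustration. It only
shows the idea: copy everything else first and the index files under
dists/ last, so that an index never points to a package that has not
arrived yet.

```python
def sync_order(paths):
    """Return paths ordered so that dists/ entries come after everything else.

    Hypothetical helper: index files in dists/ reference files in pool/,
    so they must be copied last.
    """
    def stage(path):
        # Stage 0: package files and anything else; stage 1: dists/ indices.
        return 1 if path.startswith("dists/") else 0

    # sorted() is stable, so the original order is kept within each stage.
    return sorted(paths, key=stage)


if __name__ == "__main__":
    # Illustrative file names only, not a real archive listing.
    files = [
        "dists/stable/Release",
        "pool/main/h/hello/hello_1.0_amd64.deb",
        "dists/stable/main/binary-amd64/Packages.gz",
        "pool/main/b/bash/bash_1.0_amd64.deb",
    ]
    for f in sync_order(files):
        print(f)
```

A real mirror script also has to handle deletions (old files in pool/
can only be removed after the new dists/ is in place), which this
sketch ignores.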

The performance of our {packages,website} delivery network is an
interesting question. Like many things on the Internet, it's related to
a mix of bandwidth, latency, and application behaviour (e.g. use of HTTP
keep-alive).  More and more, the dominating factor in network
performance is latency (as others are easier to optimize), and the
only way to reduce it is to have servers close (geographically or
network-wise) to end users. Benchmarking mirrors by measuring
bandwidth is generally not very relevant.
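
A back-of-the-envelope model illustrates why latency dominates for
small files. The sketch below is a deliberate simplification (it
ignores TCP slow start, packet loss and keep-alive), and every number
in it is an assumption chosen for illustration, not a measurement of
any Debian mirror.

```python
def fetch_time(size_bytes, bandwidth_bps, rtt_s, round_trips):
    """Estimated time in seconds to fetch one file.

    Simplified model: a fixed number of network round trips (connection
    setup, request/response) plus the time to push the bytes through
    the link.
    """
    latency_cost = round_trips * rtt_s
    bandwidth_cost = (size_bytes * 8) / bandwidth_bps
    return latency_cost + bandwidth_cost


if __name__ == "__main__":
    size = 50_000        # assumed: a 50 kB index or small package
    bandwidth = 100e6    # assumed: 100 Mbit/s available end to end
    trips = 3            # assumed: handshake + request + one extra round trip

    for label, rtt in (("nearby server", 0.020), ("distant server", 0.300)):
        print(label, round(fetch_time(size, bandwidth, rtt, trips), 3), "s")
```

With these assumed numbers the bandwidth term is 4 ms in both cases,
while the latency term grows from 60 ms to 900 ms. Doubling the
bandwidth would barely change the total; moving the server closer
changes it a lot.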

This raises several challenges:
- DSA needs to interact with many datacenters, often for only one
  machine. This is very time-consuming.
- The mirrors team needs to constantly monitor mirrors and notify mirror
  operators in case of problems. Notifications are automated, but DNS
  updates to *.debian.org when a mirror fails are not.
- There are parts of the world that are not so well covered. For
  example, http://deb.li/y8GA is the current map of security.d.o
  mirrors (which are all managed by DSA): we don't have any point of
  presence in Asia, which causes poor performance. There are discussions
  in progress to buy a server and host it somewhere in Asia, and the cost
  for Debian would be between $1500 and $2500 depending on the server's
  specs.

One solution that has been developed is http.d.n. It's a redirector
service that redirects to the closest working mirror (the mirror
checking is automated).  However, the http.d.n machine is still
centralized: round-trip time to it is still a problem, so if the
service were to become official, several geographically-distributed
instances of the service would have to be set up. Also, as each request
goes through a http.d.n redirect, there's a lot of additional latency.
If we want those http.d.n redirector machines to be managed by DSA
(which is probably something we want), it doesn't really improve the
situation in terms of machines DSA has to manage.
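
The core idea of such a redirector can be sketched in a few lines of
Python. This is not the actual http.d.n code: the mirror names,
coordinates and the "up" flag are hypothetical, and geographic
distance is used here only as a stand-in for network proximity.

```python
import math


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points (haversine)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pick_mirror(client, mirrors):
    """Return the name of the closest mirror marked as up, or None."""
    candidates = [m for m in mirrors if m["up"]]
    if not candidates:
        return None
    best = min(candidates, key=lambda m: distance_km(client, m["location"]))
    return best["name"]


# Hypothetical mirror list; "up" would come from automated checks.
MIRRORS = [
    {"name": "mirror-eu.example.org", "location": (48.85, 2.35), "up": True},
    {"name": "mirror-us.example.org", "location": (40.71, -74.00), "up": True},
    {"name": "mirror-asia.example.org", "location": (35.68, 139.69), "up": False},
]

if __name__ == "__main__":
    # A client near Seoul is sent elsewhere while the Asian mirror is down.
    print(pick_mirror((37.57, 126.98), MIRRORS))
```

Note that the client still pays one extra round trip to the redirector
itself before it can contact the chosen mirror, which is the latency
cost described above.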

Q: What are CDNs? How do they compare to our mirrors network?
=============================================================

Content Delivery Networks (Akamai, Fastly, Amazon Cloudfront, etc. [1])
can be seen as giant location-aware caching networks. They provide
"local" points of presence and manage global caching of external data
inside the CDN network.
[1] http://www.cdnplanet.com/cdns/

As a solution based on caching, they work and perform quite differently
from our mirrors (where the Debian archive is fully replicated). It's
not easy to compare their performance, especially if you want to
consider access patterns on the mirrors (file sizes, long tail
distribution, etc.)

Q: Do CDNs raise more security/privacy concerns than our mirrors?
=================================================================

Not easy to answer. I'm inclined to say that they both raise about
the same amount of concerns. There's more discussion about those points
in the subthread starting at
http://lists.debian.org/2A773832-09F2-4ADB-9B10-2A554B6DDC1A@2013.bluespice.org

Q: How does that meet with Debian's Social Contract and Free Software
in general?
=====================================================================

Some CDNs use Free Software. As data points, Fastly[1,2] uses and
contributes to Varnish, and the frontend servers of Amazon Cloudfront
are running Apache.

[1] http://www.fastly.com/about
[2] http://www.fastly.com/about/open-source

Building a CDN is mostly an infrastructure problem: set up PoPs in many
parts of the world, manage those servers, etc. It would be about "Free
Infrastructure" more than "Free Software".

How much do we (Debian) care about Free Infrastructure?

The Social Contract says:
> 1. Debian will remain 100% free
> [..] We promise that the Debian system and all its components will be
> free according to these guidelines. [..] We will never make the system
> require the use of a non-free component.

Where does "the Debian system and all its components" stop? Does it
include our packages / website content delivery network? I'm inclined to
say "no".

The Social Contract also says:
> 2. We will give back to the free software community
> When we write new components of the Debian system, we will license them
> in a manner consistent with the Debian Free Software Guidelines. [...]

However, that doesn't address using "components" developed by
third-parties, and is restricted to "components of the Debian system".

So, I'm inclined to say that the Social Contract doesn't say anything
about the current question.

So the question is rather: where do we draw the line?
- Should we use machines that require non-free firmware in the Debian
  infrastructure? (that's something we currently do)
- Should we have a stricter policy about the use of free software in
  our official mirror network?
- What about network equipment running non-free software?
The line has to be drawn somewhere, and I honestly don't know if CDNs
should be below or above the line.

Another question is whether maintaining our own CDN is really something
we *need* to spend our energy on. I don't think that the delivery of
packages is central to the mission of Debian, nor do I think that
maintaining our own CDN strengthens our message regarding software
freedom. After all, if we could use and point to 3-4 CDNs that are
advocating Free Software, isn't it better to show that such core
Internet services can be run using Free Software?

Q: Where should we go from here?
================================

CDNs raise significant challenges:
- Can we find 3-4 CDNs (to remain independent) that:
  + are willing to provide that service for free to Debian
  + are IPv6-compliant and meet our other technical requirements
  + are publicly free software-friendly
  (none of those are super-strong requirements, but if they are not met,
  that raises additional questions)
- Can we combine those various CDNs under the same *.debian.org name?
- Can we solve the problem of data in dists/, which needs a specific
  caching policy?
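
To illustrate the last point: files under pool/ carry their version in
the file name, so they can be cached for a long time, whereas the
indices under dists/ change with every archive update and must not be
served stale. A minimal Python sketch of such a per-path policy
follows. The function name and the TTL values are assumptions for
illustration, and how a policy like this is actually configured
depends on each CDN.

```python
def cache_control(path):
    """Return a Cache-Control header value for a path in the archive.

    Hypothetical policy; paths are assumed to be relative to the
    archive root (e.g. "dists/stable/Release").
    """
    path = path.lstrip("/")
    if path.startswith("dists/"):
        # Indices change at every archive update: cache briefly and
        # revalidate with the origin once the copy is stale.
        return "public, max-age=60, must-revalidate"
    if path.startswith("pool/"):
        # Versioned package files: safe to cache for a long time (30 days).
        return "public, max-age=2592000"
    # Everything else gets a moderate default (1 hour).
    return "public, max-age=3600"


if __name__ == "__main__":
    for p in ("dists/stable/Release", "pool/main/h/hello/hello_1.0_amd64.deb"):
        print(p, "->", cache_control(p))
```

Even with short TTLs, a cache could briefly serve a new Release file
together with an old Packages file, so TTLs alone may not fully solve
the consistency problem within dists/.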

I would like to encourage more experimentation from DSA on this, either
as an unofficial service or as an official service under a different DNS
name ({ftp,security,www}.cdn.debian.org?).

However, I think that it's too early to make CDN-provided hosts part of
the resolution of "normal" DNS names such as security.d.o or www.d.o,
until we have a better understanding of the pros and cons of CDNs.

Finally, I think that we should continue to provide an easy way for
someone to run their own Debian mirror. (But in the distant future, if
our feedback on CDNs is positive, it means that we could remove some of
Debian's PoPs, since mirrors could synchronize over rsync from more
central locations.)

Lucas
