Mirrorbrain demo instance for debian-cd?
As I noticed again when Jessie was released, the debian-cd [1]
download page is not exactly user-friendly. It contains only links to
the main cd-image download site in Sweden, and a very raw/uncomfortable
list of mirrors. It also contains the hint to use the primary server in
Sweden if in doubt, because some mirrors might be out of date. That is
not very helpful, because it makes quite a difference whether you
download a 4 GB .iso at 100 MB/s from a local mirror or at 100 KB/s
from another continent.
I think it is quite clear that ideally this page should most prominently
contain a link to a download-redirector that automatically redirects the
user to a mirror that
a) is network-wise close to the user
b) is currently up
c) really has the file
For the apt-repositories such a thing now exists as
httpredir.debian.org. However it does not support debian-cd, and
according to Raphael (on IRC), it would be very hard to add support for
that.
There is however existing software that does exactly what is needed for
debian-cd. A very popular one is Mirrorbrain [2].
As it happens, I've been wanting to play around with mirrorbrain for
some time now. No particular reason for that, mainly just curiosity. I
also run many open source mirrors, and it's hard to overlook the fact
that most traffic to our mirrors comes from projects utilizing
mirrorbrain, while mirrors of projects not using mirrorbrain or another
"CDN software" (like debian-cd) are usually underutilized.
Now I did play around with mirrorbrain last weekend, and as a result I
now have a running mirrorbrain instance. If there is interest from the
debian mirrors team, I would be willing to run this as a demo for a
debian-cd CDN. It would allow for evaluating whether and how well this
could work - and if it does, hopefully motivate someone to bring the
mirrorbrain packages into the standard debian repositories, which I've
been told is a prerequisite for it moving to proper DSA-maintained
infrastructure at some point. Until then I would be willing to keep
running it, as I do not expect it to cause significant traffic-usage, or
load on the machine running it.
The instance is currently running at http://debian-cd.poempelfox.de/ and
is heavily cut down: I have currently added only about 15 mirrors to it,
and all cronjobs that would regularly scan the mirrors for available
files and check whether they are alive are disabled, because I will
simply not scan all mirrors at regular intervals without approval from
the debian mirrors team. That however could be changed quickly. Normally
mirrorbrain would check if mirrors are alive every 5 minutes (by doing a
GET on the 'root' URL of the mirror), and I would suggest scanning the
mirrors for available files every 24 hours. Scanning can be done via
HTTP, but only if the server prints parseable directory listings that
include timestamps and filesize. Scanning via rsync is the most
resilient option, and ftp is also possible as an alternative. Thanks to
the scanning, mirrorbrain knows exactly which mirror has which files,
and will never redirect a client to a mirror that doesn't have the
requested file.
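To illustrate the idea (this is only a rough sketch, not how mirrorbrain
stores things internally): assume the scan results are kept as a mapping
from mirror name to the set of files found there, and the alive-check
results as a mapping from mirror name to True/False. The names
eligible_mirrors, file_index and alive are made up for this example.

```python
def eligible_mirrors(file_index, alive, path):
    """Return the mirrors that are up AND were seen to have `path`.

    file_index: dict mirror name -> set of file paths found by the scan
    alive:      dict mirror name -> bool from the last alive check
    Both structures are assumptions made for this sketch.
    """
    return sorted(
        name
        for name, files in file_index.items()
        if alive.get(name, False) and path in files
    )
```

A mirror that is up but was not seen to carry the file is never a
candidate, and neither is a mirror that has the file but failed its last
alive check.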
So what can mirrorbrain do?
Here is an example of the mirrorlist-output for one DVD image when
requested from an IPv4 only host at the university, where it works
almost perfectly: http://www.poempelfox.de/tmp/mb-ipv4-good.html
You can get that info for any file by simply appending ".mirrorlist" or
"?mirrorlist" to the filename.
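As a small sketch of that URL convention (the helper name mirrorlist_url
and the example path in the usage note are made up, only the two
suffixes come from the description above):

```python
def mirrorlist_url(file_url, style="suffix"):
    """Build the mirror list URL for a file served by the redirector.

    style="suffix" appends ".mirrorlist", style="query" appends
    "?mirrorlist". Assumes file_url has no query string of its own.
    """
    if style == "suffix":
        return file_url + ".mirrorlist"
    if style == "query":
        return file_url + "?mirrorlist"
    raise ValueError("style must be 'suffix' or 'query'")
```

For a placeholder path such as
http://debian-cd.poempelfox.de/path/to/image.iso this yields
http://debian-cd.poempelfox.de/path/to/image.iso.mirrorlist.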
As you can see, mirrorbrain has realized that there is 1 mirror on the
exact same network prefix as the user requesting this file. As long as
that mirror is up, the client would be redirected to it. If that mirror
is not up, mirrorbrain would look in the next category: same AS. Again,
there is 1 mirror available, and the client would be redirected to that.
Should both of these mirrors be down, the next category would be "same
country", and after that, "same continent", and finally "anywhere in the
world". If there is more than one mirror within a category, a random
mirror will be selected, with the "prio" value influencing the
selection: Higher prio means that the likelihood of that mirror being
chosen is higher. Prio is simply a per-mirror configuration option. This
makes it possible to direct more users to mirrors with a lot of
available bandwidth. If the geoip-data for the client contains
coordinates, the geographical distance to potential mirrors will also
influence mirror selection within a category.
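The selection order described above could be sketched roughly like this.
It is a simplified illustration and not mirrorbrain's actual code: the
Mirror and Client classes and pick_mirror are made up for the example,
prio is assumed to be a positive number, and the geographical distance
weighting is left out entirely.

```python
import random
from dataclasses import dataclass


@dataclass
class Mirror:
    name: str
    prefix: str       # network prefix the mirror lives in
    asn: int          # autonomous system number
    country: str
    continent: str
    prio: int = 100   # higher prio = more likely to be chosen
    up: bool = True


@dataclass
class Client:
    prefix: str
    asn: int
    country: str
    continent: str


# Categories from closest to farthest; the first non-empty one wins.
CATEGORIES = [
    ("same prefix", lambda m, c: m.prefix == c.prefix),
    ("same AS", lambda m, c: m.asn == c.asn),
    ("same country", lambda m, c: m.country == c.country),
    ("same continent", lambda m, c: m.continent == c.continent),
    ("anywhere in the world", lambda m, c: True),
]


def pick_mirror(mirrors, client, rng=random):
    """Return (category, mirror) for the client, or None if all are down."""
    candidates = [m for m in mirrors if m.up]
    for label, matches in CATEGORIES:
        group = [m for m in candidates if matches(m, client)]
        if group:
            # Weighted random choice: prio acts as the weight.
            weights = [m.prio for m in group]
            return label, rng.choices(group, weights=weights, k=1)[0]
    return None
```

With one mirror per category this is deterministic; with several
mirrors in the winning category, a mirror with prio 200 would be picked
about twice as often as one with prio 100.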
Note that unfortunately it will not always work that perfectly. My tests
up to now have already shown that it will not work as well for IPv6
clients for two reasons: ASN/Subnet-data is not currently available as a
free database for IPv6, so it cannot be used. To make matters worse, the
geoip-data from maxmind that is used is significantly less detailed and
complete for IPv6. For most IPv6 clients, the matching that can be done
is only "same country" - but that still would be a significant
improvement over the current situation.
So what are your opinions on this?
[1] https://www.debian.org/CD/http-ftp/
[2] http://www.mirrorbrain.org