
Re: apt-get improvement idea



Update: one class of solutions postpones mirror selection (among the
valid choices for each package) until download time. One advantage of
this approach is that the transport modules have information that is
not available earlier, e.g. the download speed of each server.
I can think of two algorithms in this class:
1. Greedy approach: always select the first non-busy (not already
downloading) mirror.
2. Per-file concurrency: some protocols, HTTP for one, let you
download different parts of a file concurrently from different mirrors.
For example, if you have 3 HTTP mirrors, you can set up a connection to
each of them and request one third of the file from each. I personally
don't like this approach. Things can get complicated when your choices
use different protocols. Also, the file is only complete once its
slowest connection finishes, so the slowest mirror limits the download.
There are workarounds, but the whole approach is not simple or clean.
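
To make the greedy approach concrete, here is a minimal sketch in Python
(for brevity only; APT itself is C++). The function name pick_mirror and
the mirror names are hypothetical and not part of APT. It assumes the
caller keeps a set of mirrors that are currently downloading.

```python
from typing import Iterable, Optional, Set


def pick_mirror(candidates: Iterable[str], busy: Set[str]) -> Optional[str]:
    """Return the first valid mirror that is not already downloading.

    `candidates` is the list of mirrors offering this package version,
    in sources.list order. `busy` is the set of mirrors with a download
    in progress. Both are assumptions of this sketch.
    """
    for mirror in candidates:
        if mirror not in busy:
            return mirror
    # Every valid mirror is busy: the caller has to wait and ask again.
    return None


# Hypothetical mirrors: the first one is busy, so the second is chosen.
busy = {"mirror-a.example"}
choice = pick_mirror(["mirror-a.example", "mirror-b.example"], busy)
```

Because the choice is made per file at download time, a mirror that
finishes quickly becomes non-busy sooner and so ends up serving more
files.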

On Fri, Jun 25, 2010 at 7:10 PM, Mohammad Ebrahim Mohammadi Panah
<ebrahim@mohammadi.ir> wrote:
> Hi,
>
> Let me restart. (I will also try to address David's concerns.)
>
> Problem description: If some package version is downloadable from
> various sources, APT always chooses the first source listed in
> sources.list out of all available choices. So it downloads all these
> packages sequentially, and from one server. This is slow. The speed
> could be improved without imposing noticeable overhead on mirrors by
> connecting to more servers simultaneously and downloading files
> concurrently instead of only one connection to a server and one file
> at a time. This problem is most clear when you have a network (on the
> client side) that gives you more bandwidth if you have more TCP
> connections.
>
> Please note that I DON'T propose having more than one connection to a mirror.
> Also note that the mechanism I described is already implemented in
> APT, but it is only used when some package is only available in a
> lower source in sources.list. For example packages of ftp.debian.org
> and debian-multimedia.org are usually downloaded concurrently using
> two connections: one to ftp.debian.org and one to
> debian-multimedia.org.
> I think it is clear why this is not DoS, nor DDoS. It even encourages
> the user to distribute his load between multiple servers.
>
> Semi-ideal solution = Knapsack(?): Give each mirror a capacity score.
> (Your connection bandwidth to that server? Take the server's power into
> account?) Also give each package a cost. (Its download size?) Find the
> best way to fit the package costs into the server capacities.
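
A rough sketch of this capacity idea, in Python for brevity. It is not an
exact knapsack solver: it swaps in a simple greedy heuristic (largest
package first, each one placed on the valid mirror that would finish it
soonest). The function name assign_packages and the inputs sizes,
sources and capacity are all hypothetical.

```python
from typing import Dict, List


def assign_packages(sizes: Dict[str, int],
                    sources: Dict[str, List[str]],
                    capacity: Dict[str, float]) -> Dict[str, str]:
    """Map each package to one of its valid mirrors.

    sizes:    package -> download size (the "cost")
    sources:  package -> mirrors that offer this version
    capacity: mirror  -> relative speed score (the "capacity")
    """
    load = {mirror: 0 for mirror in capacity}
    plan: Dict[str, str] = {}
    # Place big packages first; small ones fill the gaps afterwards.
    for pkg in sorted(sizes, key=lambda p: sizes[p], reverse=True):
        # Estimated finish time if this package goes to mirror m.
        best = min(sources[pkg],
                   key=lambda m: (load[m] + sizes[pkg]) / capacity[m])
        plan[pkg] = best
        load[best] += sizes[pkg]
    return plan


# Three packages, two equally fast hypothetical mirrors.
plan = assign_packages(
    sizes={"a": 100, "b": 60, "c": 40},
    sources={"a": ["m1", "m2"], "b": ["m1", "m2"], "c": ["m1", "m2"]},
    capacity={"m1": 1.0, "m2": 1.0},
)
```

In this example both mirrors end up with 100 units of download each,
instead of all 200 going to the first one.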
>
> Simpler solution = Round-Robin: Select the first choice, but move
> that mirror to the end of the list to prevent starvation of other
> possibly-available mirrors.
>
> Simplest solution = Random: Choose randomly between the available sources
> for the package, and pray for your random generator! After all, your
> random generator doesn't always return 0, so it is still an
> improvement over the current situation. (I doubt the round-robin solution
> is much better than this random solution.)
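
The two simple policies can be sketched side by side, in Python for
brevity. This is not the attached patch. The class RoundRobin and the
function random_order are hypothetical names. Both return an ordering of
the available mirrors rather than a single pick, on the assumption that
the caller tries the next entry when a download fails.

```python
import random
from collections import deque
from typing import Iterable, List, Optional


class RoundRobin:
    """Rotate through mirrors so that no mirror is starved."""

    def __init__(self, mirrors: Iterable[str]) -> None:
        self._queue = deque(mirrors)

    def order_for(self, available: Iterable[str]) -> List[str]:
        """Return the available mirrors in rotation order, then move
        the first choice to the end of the list."""
        allowed = set(available)
        order = [m for m in self._queue if m in allowed]
        if order:
            self._queue.remove(order[0])
            self._queue.append(order[0])
        return order


def random_order(available: Iterable[str],
                 rng: Optional[random.Random] = None) -> List[str]:
    """Return the available mirrors in a random order.

    Shuffling once means a retry never lands on the same mirror twice.
    """
    if rng is None:
        rng = random.Random()
    order = list(available)
    rng.shuffle(order)
    return order


rr = RoundRobin(["m1", "m2", "m3"])
first = rr.order_for(["m1", "m2"])   # m1 is tried first, then rotated
second = rr.order_for(["m1", "m2"])  # now m2 comes first
```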
>
> Patch: I've got a tiny patch for apt-pkg/acquire-item.cc that
> implements the "Random" solution. See attachment to find the patch in
> "bzr send" format. To show how tiny the patch is: diff -b contains 3
> removed lines and 13 added lines.
>
> If backward-compatibility is an issue with my patch, we can add an
> option and default it to off. Or something like what mrvn suggested on
> IRC.
>
> -- Mohammad Ebrahim Mohammadi Panah
>
> On Fri, Jun 25, 2010 at 4:46 PM, David Kalnischkies
> <kalnischkies+debian@gmail.com> wrote:
>> Hi,
>>
>> 2010/6/24 Mohammad Ebrahim Mohammadi Panah <ebrahim@mohammadi.ir>:
>>> I've got an idea for apt-get, which I discussed in #debian-devel. I want to
>>> know what you think about it. Also I need your guidance to implement that.
>>> This is the IRC discussion log:
>>
>> strange, i can be online at whatever time, but it seems i am never
>> around if something is discussed which is at least relevant for me…
>>
>> Anyway: (reordered quotes and answers)
>>
>>> [16:56] <ol> does it currently always select the first?
>>
>> Yes it does - or rather, it starts with the one listed first in the sources.list.
>> If the download of the package fails it will use the next one (if available).
>> If that doesn't work it is a bug - it used to work in the past.
>>
>>
>>> [16:55] <ebrahim> Feasibility: It is currently possible to add some
>>> similar repos to sources.list. Also APT knows how to download concurrently
>>> from different servers. I just need to tell APT not to always select the first
>>> source in case of more than one source for that version.
>>
>> Is it really that simple? How do you know the next mirror to try if
>> the last one failed for example?
>> Your current implementation has the lovely effect that if Mr. Random
>> chooses to choose x times the same mirror APT will give up on x+1…
>> Also, rand() is not a round-robin implementation…
>>
>>
>>> [17:03] <ebrahim> I chose to connect to more servers rather than having
>>> more connections to the same server, for the sake of Debian mirror servers! :)
>>
>> You have only one connection open to one mirror server at a given time.
>> APT doesn't open a connection for each single package - it sends
>> a request for each package over this connection (see pipelining).
>> So what you "save" here is "only" time - for now. (see the next one)
>>
>>
>>> [17:02] <ebrahim> ol, download acceleration through more TCP
>>> connections is a well-known technique. It is not just me.
>>
>> And most of the time a stupid one, as your downloader hammers
>> the server with multiple requests to have more chances of being served
>> in the round-robin process. If all users did that you would gain
>> nothing except a time penalty… and maybe fewer mirrors, as not every
>> mirror hosting free software has hosting it as its sole purpose; some
>> serve a different purpose in general but have some free resources…
>>
>> Splitting across multiple servers can have good effects (e.g. bittorrent)
>> but also increases the overall flow of data which needs to be transferred.
>> It doesn't help at all if you query 5 fast servers for different packages
>> if your connection can only handle the data flowing in from one…
>> (bittorrent is different, as the nodes generally don't have the same
>> good uplink as a "normal" server)
>>
>> I guess the overall speed could be better improved by choosing a
>> (maybe local and maybe less known) mirror for the user automatically
>> based on some intelligent heuristic rather than executing an
>>  "apt-get ddos" command on a few well known…
>>
>>
>>> [17:31] <mrvn> ebrahim: If you patch it then please add a priority option
>>> (as in   deb [pri=<N>] url suite component). Make it round-robin only
>>> between sources with equal priority and default to the line number (or
>>> something) so the old behaviour remains.
>>
>> Use case? Why would someone want to prefer downloading the SAME
>> version from one trusted mirror instead of another trusted mirror?
>> If the versions were different it would obviously be different, but in that
>> case the option to choose between the two for downloading doesn't exist
>> in the first place…
>>
>>
>> Not your fault as it is currently a bit confusing, but have a look at the
>> experimental repository as the current experimental releases are based
>> on that one. It includes also a draft implementation of the mirror-protocol
>> which you might find interesting…
>> http://bzr.debian.org/apt/apt/debian-experimental-ma/
>> And maybe have a look at "bzr send".
>>
>>
>> Oh and btw, i don't want to sound like a babbitt but i don't see in the log
>> that you asked the participants for their permission to publish the log.
>> It is questionable whether an IRC channel like #d-d isn't already public enough,
>> but in general the content of an IRC channel is volatile and limited to the
>> audience in the channel at that time - while a mailing list archive is open
>> for everyone to read even twenty years from now.
>>
>> And as a second btw: A more precise title would be fabulous next time…
>>
>>
>> Best regards,
>>
>> David Kalnischkies
>>
>

