
Re: apt-get improvement idea



Hi,

Let me restart. (I will also try to address David's concerns.)

Problem description: If a package version can be downloaded from
several sources, APT always chooses the first of those sources listed
in sources.list. As a result, all such packages are downloaded
sequentially from a single server, which is slow. Downloads could be
sped up, without imposing noticeable overhead on the mirrors, by
connecting to several servers simultaneously and fetching files
concurrently, instead of using one connection to one server and
fetching one file at a time. The problem is most visible when the
client-side network gives you more bandwidth the more TCP connections
you open.

Please note that I DON'T propose opening more than one connection to any
single mirror. Also note that the mechanism I described is already
implemented in APT, but it is only used when some package is available
solely from a source further down in sources.list. For example, packages
from ftp.debian.org and debian-multimedia.org are usually downloaded
concurrently using two connections: one to ftp.debian.org and one to
debian-multimedia.org.
I think it is clear why this is neither a DoS nor a DDoS. It even
encourages the user to distribute the load across multiple servers.

Semi-ideal solution = Knapsack(?): Give each mirror a capacity score.
(Your connection bandwidth to that server? Take the server's power into
account?) Also give each package a cost. (Its download size?) Then find
the best way to fit the package costs into the server capacities.
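
As a rough illustration of the idea (not an exact knapsack solver, and
not part of the attached patch), a greedy heuristic could place the
largest packages first and give each one to the mirror that would
finish it soonest. The function name, the data layout and the
interpretation of "capacity" as bytes per second are all assumptions
made for this sketch:

```python
def assign_packages(packages, mirrors):
    """Greedy sketch: assign each package to one mirror.

    packages: dict mapping package name -> (size_in_bytes, [mirrors offering it])
    mirrors:  dict mapping mirror name  -> capacity score (assumed: bytes/second)
    Returns a dict mapping package name -> chosen mirror.
    """
    load = {m: 0 for m in mirrors}  # bytes already assigned to each mirror
    plan = {}
    # Largest packages first, so small ones can fill the remaining gaps.
    by_size = sorted(packages.items(), key=lambda kv: -kv[1][0])
    for name, (size, sources) in by_size:
        # Pick the mirror with the lowest projected finish time.
        best = min(sources, key=lambda m: (load[m] + size) / mirrors[m])
        plan[name] = best
        load[best] += size
    return plan


if __name__ == "__main__":
    mirrors = {"a": 2.0, "b": 1.0}
    packages = {
        "big": (100, ["a", "b"]),
        "mid": (60, ["a", "b"]),
        "small": (10, ["b"]),
    }
    print(assign_packages(packages, mirrors))
```

This is only a heuristic; it does not guarantee the best possible
assignment, and real capacity scores would have to be measured somehow.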

Simpler solution = Round-Robin: Select the first choice, but then move
that mirror to the end of the list to prevent starvation of other
possibly-available mirrors.
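
A minimal sketch of that rotation, assuming the mirrors are kept in one
shared list in sources.list order (the function name and the data
structures are made up for illustration and do not come from APT):

```python
def round_robin_select(order, sources):
    """Pick a mirror for one package and rotate the preference list.

    order:   mutable list of all mirrors, most preferred first
    sources: set of mirrors that offer this package version
    Returns the chosen mirror, or None if no listed mirror offers it.
    """
    for i, mirror in enumerate(order):
        if mirror in sources:
            # Move the chosen mirror to the end so others get a turn next.
            order.append(order.pop(i))
            return mirror
    return None


if __name__ == "__main__":
    order = ["m1", "m2", "m3"]
    for offered in ({"m1", "m2"}, {"m1", "m2"}, {"m1"}):
        print(round_robin_select(order, offered), order)
```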

Simplest solution = Random: Choose randomly among the available sources
for the package, and pray for your random generator! After all, your
random generator doesn't always return 0, so it is still an
improvement over the current situation. (I doubt the round-robin
solution is much better than this random solution.)
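
One possible way to keep the random choice while still trying every
source at most once on failure (the concern David raises in the quoted
mail below) is to shuffle the candidate list once and walk through it.
This is a sketch only and is not how the attached patch works; the
function names and the try_fetch callback are hypothetical:

```python
import random


def download_order(sources, rng=random):
    """Return the sources in a random order, each exactly once."""
    order = list(sources)
    rng.shuffle(order)
    return order


def fetch_with_fallback(sources, try_fetch, rng=random):
    """Try the sources in random order until one succeeds.

    try_fetch: callable taking a mirror name and returning True on success
               (hypothetical stand-in for the real download step).
    Returns the mirror that worked, or None if all of them failed.
    """
    for mirror in download_order(sources, rng):
        if try_fetch(mirror):
            return mirror
    return None


if __name__ == "__main__":
    # Pretend only mirror "c" is reachable.
    print(fetch_with_fallback(["a", "b", "c"], lambda m: m == "c"))
```

Because the list is shuffled rather than sampled repeatedly, the same
mirror is never picked twice for one package.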

Patch: I've got a tiny patch for apt-pkg/acquire-item.cc that
implements the "Random" solution. The patch is attached in
"bzr send" format. To show how tiny the patch is: diff -b shows 3
removed lines and 13 added lines.

If backward compatibility is an issue with my patch, we can add an
option that defaults to off, or do something like what mrvn suggested
on IRC.
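
To make mrvn's suggestion (quoted below) a bit more concrete, here is a
small sketch of round-robin restricted to sources of equal priority.
The "pri" value is his proposed option, not something APT has today;
the class name, the rule that a lower number wins, and the default of
priority = line number are assumptions made for this sketch:

```python
from collections import defaultdict


class PrioritySelector:
    """Round-robin only among the sources sharing the best priority."""

    def __init__(self):
        # Per-group counter remembering whose turn it is.
        self._next = defaultdict(int)

    def select(self, candidates):
        """candidates: non-empty list of (priority, mirror) pairs.

        Assumption: a lower priority number is preferred, so defaulting
        the priority to the sources.list line number keeps the old
        first-listed-wins behaviour.
        """
        best = min(pri for pri, _ in candidates)
        group = tuple(m for pri, m in candidates if pri == best)
        idx = self._next[group] % len(group)
        self._next[group] += 1
        return group[idx]


if __name__ == "__main__":
    sel = PrioritySelector()
    # Default priorities (line numbers): always the first source.
    print([sel.select([(1, "a"), (2, "b")]) for _ in range(3)])
    # Equal priorities: the two sources alternate.
    print([sel.select([(1, "a"), (1, "b")]) for _ in range(3)])
```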

-- Mohammad Ebrahim Mohammadi Panah

On Fri, Jun 25, 2010 at 4:46 PM, David Kalnischkies
<kalnischkies+debian@gmail.com> wrote:
> Hi,
>
> 2010/6/24 Mohammad Ebrahim Mohammadi Panah <ebrahim@mohammadi.ir>:
>> I've got an idea for apt-get, which I discussed in #debian-devel. I want to
>> know what you think about it. Also I need your guidance to implement that.
>> This is the IRC discussion log:
>
> strange, i can be online at whatever time, but it seems i am never
> around if something is discussed which is at least relevant for me…
>
> Anyway: (reordered quotes and answers)
>
>> [16:56] <ol> does it currently always select the first?
>
> Yes it does - or, it starts with the one listed at first in the sources.list.
> If the download of the package fails it will use the next one (if available).
> If that didn't work it would be a bug - it used to be working in the past.
>
>
>> [16:55] <ebrahim> Feasibility: It is currently possible to add some
>> similar repo's to sources.list. Also APT knows how to download concurrently
>> from different servers. I just need to tell APT not to always select the first
>> source in case of more than one source for that version.
>
> Is it really that simple? How do you know the next mirror to try if
> the last one failed for example?
> Your current implementation has the lovely effect that if Mr. Random
> chooses to choose x times the same mirror APT will give up on x+1…
> Also, rand() is not a round-robin implementation…
>
>
>> [17:03] <ebrahim> I chose to connect to more servers rather than having
>> more connections to the same server, for the sake of Debian mirror servers! :)
>
> You have only one connection open to one mirror server at a given time.
> APT doesn't open a connection for each single package - it sends
> a request for each package over this connection (see pipelining).
> So what you "save" here is "only" time - for now. (see the next one)
>
>
>> [17:02] <ebrahim> ol, download acceleration through more TCP
>> connections is a well-known technique. It is not just me.
>
> And most of the time a stupid one as your downloader hammers
> the server with multiple requests to have more chances to be served
> in the round robin process. If all users would do that you would gain
> nothing except a time penalty… and maybe fewer mirrors as not every
> mirror hosting free software has the sole purpose of hosting it, but
> serves a different purpose in general but has some free resources…
>
> Splitting across multiple servers can have good effects (e.g. bittorrent)
> but also increases the overall flow of data which need to be transferred.
> It does help nothing if you query 5 fast servers for different packages
> if your connection can only handle the data flowing in from one…
> (bittorrent is different as the nodes have in general not the same
> good uplink as a "normal" server normally)
>
> I guess the overall speed could be better improved by choosing a
> (maybe local and maybe less known) mirror for the user automatically
> based on some intelligent heuristic rather than executing an
>  "apt-get ddos" command on a few well known…
>
>
>> [17:31] <mrvn> ebrahim: If you patch it then please add a priority option
>> (as in   deb [pri=<N>] url suite component). Make it round-robin only
>> between sources with equal priority and default to the line number (or
>> something) so the old behaviour remains.
>
> Use case? Why someone should want to prefer the download of the SAME
> version from a trusted mirror instead of another trusted mirror.
> If the versions were different it is different obviously, but in this case the
> option to choose between the two for downloading doesn't exist
> in the first place…
>
>
> Not your fault as it is currently a bit confusing, but have a look at the
> experimental repository as the current experimental releases are based
> on that one. It includes also a draft implementation of the mirror-protocol
> which you might find interesting…
> http://bzr.debian.org/apt/apt/debian-experimental-ma/
> And maybe have a look at "bzr send".
>
>
> Oh and btw, i don't want to sound like a babbitt but i don't see in the log
> that you asked the participators for their permission to publish the log.
> It is questionable if an IRC channel like #d-d isn't already public enough,
> but in general the content of an IRC channel is volatile and limited to the
> audience in the channel at that time - while a mailinglist archive is open
> for everyone to read also in twenty years from now on.
>
> And as a second btw: A more precise title would be fabulous next time…
>
>
> Best regards,
>
> David Kalnischkies
>

Attachment: random-source-selection.bzr-send.gz
Description: GNU Zip compressed data

