
Re: ftpsync README rewrite



On Thu, Sep 17, 2009 at 4:20 AM, Joerg Jaspert <joerg@debian.org> wrote:
> On 11874 March 1977, Lee Winter wrote:
>
>> So in the mean time I have accomplished two things.  I analyzed the
>> debian subsections with an eye to writing up a tuning article on
>> mirror admin.
>
> This you will never see in ftpsync. It isnt meant for such broken
> setups.

Partial mirrors are not broken.  They are just partial.

> (As soon as you have more than yourself using the mirror the
> assumption of which section applies is almost always wrong).
> You have to use debmirror for this.

There is a hierarchy of mirrors, and different layers within the
hierarchy need different tools.
  -- By definition the internal mirrors have to handle the entire repository.
  -- The external official mirrors already have subsets based on
distribution, section, and architecture.  They should also have the
ability to subset based on localization.
  -- The unofficial public mirrors should have the capability to
provide additional constraints, certainly at least localization, and
arguably by topic, whether topics are subsections or labels (AKA tags).
  -- Private mirrors, such as might be operated on a campus,
definitely need the capability to exercise fine-grained control based
on topics, usually on the basis of exclusion.  For example, the games
and video sections might be reasonable to exclude.
  -- Local mirrors (local usually means on the same network segment)
are often dedicated to a single purpose such as a lab, classroom, etc.
They definitely need fine-grained control, usually on the basis of
inclusion -- i.e., only what is known to be useful.  So they need
something even finer than subsections.

The debmirror script is almost usable for the last two layers, pending
the current round of bug fixes.  But it would be best to have the
unofficial and private mirrors using the same tools as the
higher-level mirrors.  So fine-grained control is appropriate for
those tools -- it is not a defect.
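
To make "fine-grained control" concrete, here is a minimal sketch of
the client side, assuming the usual stanza layout of a binary Packages
index.  The paths, section names, and output file are illustrative,
not an existing tool:

    # sketch: turn a section whitelist into a file list for rsync
    # (paths and section names are illustrative; assumes the usual
    # one-stanza-per-package layout of a binary Packages index)
    WANTED = set(["admin", "net", "utils"])     # hypothetical whitelist

    def stanzas(path):
        """Yield one field dict per package stanza."""
        fields = {}
        for line in open(path):
            line = line.rstrip("\n")
            if not line:                        # blank line ends a stanza
                if fields:
                    yield fields
                fields = {}
            elif ":" in line and not line.startswith(" "):
                key, _, val = line.partition(":")
                fields[key] = val.strip()
        if fields:
            yield fields

    out = open("include.list", "w")
    for pkg in stanzas("dists/lenny/main/binary-i386/Packages"):
        fn = pkg.get("Filename")
        if fn and pkg.get("Section", "").split("/")[-1] in WANTED:
            out.write(fn + "\n")
    out.close()

The resulting include.list could then be handed to rsync via
--files-from, so the subsetting decision never costs the upstream
anything.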

>> E.g., if your uplink is a 56Kbps link with a 30% duty cycle and you
>> live under a triple canopy forest teaching elementary school you may
>> not have much use for the devel or electronics subsections.  But there
>> is not enough information available about what including or excluding
>> a particular subsection costs in space and bandwidth.  For 5.03 the
>> space numbers are as follows (a/o 2009-09-12):
>
> It will be very disappointing for you (or your users) if you do it. A
> package might not be in the section you think it would, simply cos it
> would fit multiple and the maintainer or ftpmaster selected another.

Could be.  I've not had a problem with that in the last few years.
Probably because my user population is small and homogeneous.

>
> Also, sections are doomed to go away, so building a mirror based on them
> is already walking on a blind alley.

No, it will not be a blind alley.  Migrating from subsections to
labels is a step up in generality, and thus trivial.  It does not mean
one has to start over at the beginning.

And even the worst case is to stop using subsections and temporarily
take everything, at the cost of considerable bandwidth, until pruning
and optimization based on whatever replaces subsections become
practical.  After all, the default set of labels will probably contain
the existing subsections, which any admin could implement even if the
release team does not.

As for your claim that subsections are doomed to go away, I expect
debtags to arrive around the same time as the year of the Linux
desktop.  And I won't hold my breath waiting for either.

>
>> However, I see no simple way to compute that over a reasonable sample
>> period, like a month, without a serious investment in package tools.
>> Do they aready exist?
>
> http://ftp-master.debian.org/size-quarter.png

No, that is by architecture.  Are similar stats or stats-tools
available for subsections?
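
In the meantime the raw material is already sitting in the Packages
indices.  A minimal sketch of the tally I mean, assuming the usual
field layout (path illustrative; this counts .deb bytes only, not
sources or index files):

    # sketch: tally package bytes per section from a Packages index
    from collections import defaultdict

    totals = defaultdict(int)
    section, size = None, 0
    for line in open("dists/lenny/main/binary-i386/Packages"):
        if line.startswith("Section:"):
            section = line.split(":", 1)[1].strip()
        elif line.startswith("Size:"):
            size = int(line.split(":", 1)[1])
        elif line == "\n" and section:          # blank line ends a stanza
            totals[section] += size
            section, size = None, 0
    if section:                                 # last stanza may lack one
        totals[section] += size

    for sec, byt in sorted(totals.items(), key=lambda kv: -kv[1]):
        print("%10.1f MiB  %s" % (byt / 2.0**20, sec))

Run once per update, that would give space per subsection over time;
bandwidth per subsection falls out of diffing successive runs.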

>
> Thats complete Debian, so we have updates between 2 and 6GB, 4 times a
> day.

If you could update as often as you liked -- assuming little or no CPU
load on the source mirror, little or no IO load on the source mirror,
and a very narrow (<1 sec) update window -- how often would it be
useful to do a push?  I.e., what is a reasonable ceiling on the update
frequency?
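
For scale, the quoted figures already imply a fair sustained rate,
roughly 0.7 to 2.2 Mbit/s around the clock:

    # the quoted figures: 2-6 GB per push, 4 pushes a day
    for gb in (2, 6):
        mbps = gb * 1e9 * 4 * 8 / 86400 / 1e6
        print("%d GB x 4/day ~= %.1f Mbit/s sustained" % (gb, mbps))

Which is exactly why the 56Kbps case earlier in the thread cannot
carry everything.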

>
>> The other thing I accomplished is to find a way to eliminate the load
>> that rsync imposes on upstream mirrors.  IMO it would not take much
>> work to tweak rsync and adjust the calling scripts so that the
>> upstream mirrors could be completely passive.
>
> If you find a way that actually works nicely I am MOST interested to
> learn it!

I found a way to eliminate the CPU load and a way to narrow the update
window to a very short period.  I am working on eliminating the IO
load, but that is going to take some actual work.

It would be best to keep the external UI/APIs of rsync so that all of
the advanced file-system heuristics are left in the laps of the rsync
maintainers.  So I am trying to do the work without causing an
earthquake that would result in the maintainers bouncing the changes.
That means eliminating the IO load has to be implemented very
carefully.

We need three things from rsync, and I have located one of them.  The
other two will take some more R&D.  Once we have the hooks in rsync we
will only need some simple scripts to manage the overall process, so
that updates become extremely efficient.
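
To show the general shape -- and this is only an illustration of the
"passive upstream" idea, not the actual rsync hooks; the URL and
manifest format are made up -- the upstream would publish a
precomputed manifest once per push, and the clients would do all of
the comparing:

    # sketch: client side of a passive-upstream update.  The upstream
    # publishes, once per push, a manifest of "path size mtime" lines
    # (hypothetical format) plus the files themselves over plain HTTP.
    import os, urllib.request

    BASE = "http://mirror.example.org/debian/"   # hypothetical upstream

    def load_manifest(text):
        entries = {}
        for line in text.splitlines():
            path, size, mtime = line.rsplit(" ", 2)
            entries[path] = (int(size), int(mtime))
        return entries

    remote = load_manifest(
        urllib.request.urlopen(BASE + ".manifest").read().decode())

    # fetch only what is new or changed; no per-client work upstream
    for path, (size, mtime) in remote.items():
        local = os.path.join("/srv/mirror", path)
        st = os.stat(local) if os.path.exists(local) else None
        if st is None or st.st_size != size or int(st.st_mtime) != mtime:
            os.makedirs(os.path.dirname(local), exist_ok=True)
            urllib.request.urlretrieve(BASE + path, local)

All the per-client CPU and IO that rsync now spends on the upstream
collapses into one manifest generation per push, which is the point.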

> Note that the rsync batch thingie it offers does not work for us.

Agreed.  It is aimed in a different direction.

> We need to keep in mind that our mirrors:
>
>  - simply update everything. That can easily be done with current rsync
>   features, by us writing a batch file for it.
>   Combined with a detection if they missed an update or not, it would
>   be something doable.

But I think it would still hammer the source mirror.
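
Unless, that is, the batches were generated once and served as static
files, with each downstream replaying them in order.  Roughly, on the
downstream side (the URL, names, and serial scheme are my invention):

    # sketch: replay statically served rsync batch files in serial order
    import os, subprocess, urllib.request, urllib.error

    BASE = "http://mirror.example.org/batches/"  # hypothetical
    SERIAL = "/srv/mirror/.serial"

    serial = int(open(SERIAL).read()) if os.path.exists(SERIAL) else 0
    while True:
        name = "batch.%d" % (serial + 1)
        try:
            urllib.request.urlretrieve(BASE + name, "/tmp/" + name)
        except urllib.error.HTTPError:
            break       # caught up -- or the batch expired upstream,
                        # in which case the fallback is one full rsync
        # --read-batch replays the recorded deltas against the local tree
        subprocess.run(["rsync", "--read-batch=/tmp/" + name,
                        "/srv/mirror/"], check=True)
        serial += 1
        open(SERIAL, "w").write(str(serial))

Note this stays a whole-mirror trick: a batch applies cleanly only to
a destination tree identical to the one it was recorded against.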

>  - update a subset only. And that subset can either be the "typical" foo
>   we offer, or any combination of architectures you can imagine.
>
> The latter is what kills.

Why does the subsetting hurt so much?  If the killer is the initial
file-system scan, I can reduce that substantially almost immediately.
I think -- it is not tested yet.
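
For flavor only -- this is a generic illustration, not necessarily the
untested change I have in mind -- rsync's existing --files-from can
already skip the full walk if the upstream publishes the list of paths
touched by the last push (list name and paths below are hypothetical):

    # sketch: sync only the paths the upstream says were touched by
    # the last push, instead of walking the whole tree
    import subprocess, urllib.request

    urllib.request.urlretrieve(
        "http://mirror.example.org/debian/.changed", "/tmp/changed.list")

    # --files-from limits the scan to the listed paths
    subprocess.run(["rsync", "-a", "--files-from=/tmp/changed.list",
                    "rsync://mirror.example.org/debian/", "/srv/mirror/"],
                   check=True)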

Lee Winter
NP Engineering
Nashua, New Hampshire

