
Re: Squeeze: sometimes, bind times out (backgrounded) at boot time



Joao Roscoe wrote:
> > Seems reasonable.  I still use the broadcast protocol instead.  But
> > what you are doing is supposed to work okay and I can only assume that
> > it does.
> 
> Tried the broadcast protocol. Unfortunately, no deal :-(

Don't know.  Works for me.  I like it since that way any of the
servers may be down/up and the client will bind to any of them.  That
combination gives a nice bit of failover redundancy.  (Shrug.)
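
In case it helps, this is roughly what the broadcast setup looks
like in /etc/yp.conf (the domain name "example.com" here is just a
placeholder for your actual NIS domain):

  # /etc/yp.conf -- let ypbind find a server by broadcasting on the
  # local subnet instead of naming servers explicitly.
  domain example.com broadcast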

> I have around 20 boxes here. All of them were built as images from a
> reference machine, which received a clean squeeze install.
> For each machine, the image was dumped (with partimage), the hostname
> was changed, and the file /etc/udev/rules.d/70-persistent-net.rules
> was removed.

Seems reasonable.  I do a little bit more than that but mostly things
specific to what I have installed.  Such as configuring Postfix for
the new hostname and so forth.  Both /etc/hostname and /etc/mailname
get updated.  I assign static addresses and therefore
/etc/network/interfaces is updated.  I use a single ssh server key
among the collective because they are intended to be identical.  So I
ensure that /etc/ssh/ssh_host_*_key* files are updated appropriately.
And I think that is sufficient.
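
Roughly speaking the per-clone fixup amounts to something like this
sketch (the hostname, domain, and key location are placeholders, not
my actual values):

  # Sketch of the per-clone customization described above.
  NEWNAME=client01                              # placeholder hostname
  echo "$NEWNAME" > /etc/hostname
  echo "$NEWNAME.example.com" > /etc/mailname   # placeholder domain
  rm -f /etc/udev/rules.d/70-persistent-net.rules
  # Edit the static address in /etc/network/interfaces by hand, then
  # install the shared ssh host keys (this path is illustrative):
  cp /srv/clone-keys/ssh_host_*_key* /etc/ssh/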

> So, all of them should behave the same way. However, some
> of them boot ok most of the time, others present NIS server bind
> timeout every time. Quite confusing...

If the hardware isn't completely identical then it is reasonable to
expect differences in the parallel boot timings.  With the new
parallel boot there are forks and joins in the process flow during
boot.  IIRC it is implemented in the spirit of 'make -jX', running
scripts concurrently whenever the declared dependencies allow it.
And since the behavior is new there are bound to be bugs that affect
people off the mainstream paths.  Using it with NIS/YP is not so
common, so I would not be surprised if there is a bug lurking there.
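
For what it is worth, the dependency information the parallel boot
works from is the LSB header at the top of each /etc/init.d script,
something like this (these particular names are illustrative, not
the actual autofs header):

  ### BEGIN INIT INFO
  # Provides:          autofs
  # Required-Start:    $network $remote_fs nis
  # Required-Stop:     $network $remote_fs
  # Default-Start:     2 3 4 5
  # Default-Stop:      0 1 6
  ### END INIT INFO

If "nis" were missing from Required-Start then the boot orderer
would be free to start autofs before or alongside nis, which would
match your symptoms.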

In particular I think I have seen cases, though I never verified
them, where an init.d script completed but the service it started
was not yet ready to serve.  For example I am pretty sure I have
seen bind start up and not be ready to answer queries immediately.
I can't confirm this, but it seems suspicious given your symptoms.
The nis startup may behave similarly.

> > In either case, I use the following configuration line for hosts in
> > /etc/nsswitch.conf.
> 
> Tried that also. No improvement. In fact, I started getting some DNS
> trouble with a few older hosts. Looks like our DNS infrastructure is
> completely messed up

That sounds like a completely separate issue.  It would probably be
best to split the two problems apart and address each one
individually.  I would be happy to help with the DNS configuration
too.  Describe how it is set up and the list can provide feedback on
how to improve it.
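
For reference, a typical hosts line on an NIS client looks something
like the following.  This is just an illustration and not
necessarily the exact line from my earlier message:

  hosts: files dns nis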

DNS is a marvelously designed distributed database system.  It isn't
perfect.  There are a few problems.  They didn't think of everything
when it was designed.  It is a huge improvement over the previous
system.  But it is only as good as the configured network around it.

> Now, what really puzzles me: as I told before, "Restarting nis and
> autofs, in this order *does* solve the issue", and that's quite fast!
> Why doesn't it work at boot time?

Try this experiment.  At the last point in the /etc/init.d/nis
startup script add a short sleep.  That will give the daemons time
to finish initializing and get ready to go.  It is possible that
they are not quite ready yet when the script exits, so the next
script to run hits them too early.

I suggest changing this in file /etc/init.d/nis:

  case "$1" in
    start)
          do_start
          ;;
    stop)

To this as an experiment:

  case "$1" in
    start)
          do_start
          sleep 5   # <-- Add this sleep to give things more time.
          ;;
    stop)

I would do the same thing for /etc/init.d/bind9 too.  Then see if that
resolves the problem.  I am not proposing this as a full solution nor
even saying that must be the problem.  But I would definitely try it
as an experiment to gain data and characterize the problem.  And if it
works then that might be a good enough workaround for you until the
problem really is resolved.  (Or it might be the 'allow-hotplug'
described below.)
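
Before rebooting you can exercise the modified scripts by hand, in
the same order the boot would run them:

  invoke-rc.d nis restart
  invoke-rc.d autofs restart

Since restarting nis and then autofs by hand already fixes it for
you, the interesting question is whether the added sleep makes the
boot-time path behave like the by-hand path.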

> > I...sounds like
> > some incorrectly specified dependency in the /etc/init.d/* scripts.
> 
> I agree with you, but I took a look at the scripts, and they look fine
> - autofs seems to depend on nis (I'm afraid I don't know this new init
> scheme very well, however).

Traditionally Sun systems stored automount maps in nis, making them
available over nis/yp to client machines, for example through
'ypcat -k auto.master'.  The autofs startup script obtains its
configuration this way dynamically at start time.  This is optional,
not required: you may have configured autofs either with real map
files on disk or with maps served over nis/yp.  If the maps live in
nis/yp then the autofs script will try to fetch them from nis at
boot, which means it needs nis to be up and bound first.
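
You can check which case you are in by asking nis directly.  These
are standard ypwhich/ypcat invocations; auto.master is the
conventional map name:

  ypwhich                 # which NIS server is this client bound to?
  ypcat -k auto.master    # does NIS actually serve the master map?

If the ypcat shows your master map then autofs depends on ypbind
having finished its binding before autofs starts, which ties back to
the boot ordering question above.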

> Anyway, this kind of issue would probably
> break things for a lot of people...

I have something else to try that I have learned in the last year
since your first note.  :-)

In /etc/network/interfaces it probably says:

  allow-hotplug eth0

Change that to:

  auto eth0

The allow-hotplug line enables the event driven startup; auto
enables the traditional startup.  I have had similar issues with the
event driven startup, where things block for a long time at boot
waiting for various events to happen.  Using auto instead forces the
previously hard-set flow and avoids the problem, specifically when
using nfs mounts in /etc/fstab.  Again, as an experiment, I would
switch to 'auto' for the network startup.  That by itself might be
your solution.  (Or it might be the startup sleep delay described
above.)
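
For completeness, a full static stanza using 'auto' would look
something like this, where the addresses are placeholders for your
real static assignment:

  auto eth0
  iface eth0 inet static
      address 192.168.1.10
      netmask 255.255.255.0
      gateway 192.168.1.1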

> > But because it is so annoying before too long someone
> > will have debugged it and gotten the offenders removed from the
> > mailing list.
> 
> Got a probe email a few days ago - someone worked on it. Hope the
> issue is already solved.

Unfortunately the problem persists.  I conversed briefly with the
listmasters and they are aware of it, but no one has been able to
identify the offender.  The joe1assistly spam has also affected some
of the Cygwin mailing lists.  I have examined the spam coming my way
and can't see a clear solution.  Of course I could block it for
myself by dropping any Message-Id: containing joegiglio.org, but
that wouldn't help the mailing list at large.

Bob
