
Re: I suspect the kernel: `ping', and name resolution in general, hangs



Last month I had a problem: in short, I installed potato, and noticed
that name resolution hung, although things worked fine if I used a
numeric IP address.

I've included below the plea for help that I sent last month.  It
describes the problem in detail.

Well, in case anyone's interested, I have some more information that
leads me to suspect that the problem is in the kernel (and thus,
presumably, in the Vortex driver).  Here's what I did:

* I installed potato from scratch.  I did this by installing slink
  from an official Debian 2.1 CD, and then doing `apt-get
  dist-upgrade' with my /etc/apt/sources.list pointing at

	http://http.us.debian.org/debian unstable

  Thus I wound up with the latest (as of this morning) binaries, but
  with the 2.0.36 kernel from the CD.  (Apparently `apt-get
  dist-upgrade' didn't automatically give me a new kernel.)  This
  system worked flawlessly; in particular, name resolution worked
  fine.

* I installed kernel-image-2.2.12 (version 2.2.12-3), and rebooted.
  Name resolution hung exactly as described below.

* I reinstalled kernel-image-2.0.36, and rebooted; name resolution
  worked just fine.

So it seems to me that the newer kernel is doing something wrong.  If
anyone would like me to perform some experiments, so as to isolate the
problem, I'd be happy to do them; just tell me what you need done.
Unfortunately, I know nothing about how the net card driver works, so
I don't know how to investigate this on my own.

Here's the plea that I sent last month:

    Can anyone tell me what's wrong with my system?  At first I assumed it
    was a bug in the resolver library, and opened a bug against libc6 in
    Debian potato (http://www.debian.org/Bugs/db/45/45912.html); but the
    Debian libc6 maintainer is sure that my system is merely
    misconfigured.

    Here's the problem:

    When I type `ping blarg.net' at a shell, `ping' hangs.  I expect it to display

	    PING blarg.net (206.124.128.1): 56 data bytes
	    64 bytes from 206.124.128.1: icmp_seq=0 ttl=62 time=25.7 ms
	    ...

    Other name resolution also fails.  For example, Netscape hangs when
    trying to visit web pages on machines other than mine.

    On the other hand, if I type `ping 206.124.128.1', that works fine.
    So I know that IP and the network card aren't entirely broken.

    I've never sat around and waited to see if `ping' eventually gets
    unstuck; I've always given up and hit control-C after no more than
    perhaps a minute.
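
    A bounded wait makes the hang easier to measure than sitting at a
    stuck `ping'.  What follows is a minimal Python sketch, not part of
    the original setup: the function name resolve_with_timeout and the
    10-second default are assumptions chosen for illustration.  It runs
    the system resolver in a background thread and gives up after the
    timeout.

```python
import queue
import socket
import threading


def resolve_with_timeout(name, timeout_s=10.0):
    """Return a sorted list of IPv4 addresses for name, or None on timeout.

    Hypothetical helper: the lookup runs in a daemon thread so that a
    resolver that never answers cannot block the caller forever.
    """
    results = queue.Queue(maxsize=1)

    def worker():
        try:
            infos = socket.getaddrinfo(
                name, None, socket.AF_INET, socket.SOCK_STREAM
            )
            # Each entry's sockaddr is (address, port); keep unique addresses.
            results.put(sorted({info[4][0] for info in infos}))
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results.put(exc)

    threading.Thread(target=worker, daemon=True).start()
    try:
        outcome = results.get(timeout=timeout_s)
    except queue.Empty:
        return None  # the resolver did not answer in time
    if isinstance(outcome, Exception):
        raise outcome
    return outcome
```

    On the broken system, resolve_with_timeout("blarg.net") returning
    None while resolve_with_timeout("206.124.128.1") returns at once
    would match the symptom described above: numeric addresses need no
    lookup, names do.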

    I'm using potato (that is, the still-unreleased version of Debian
    GNU/Linux), which I installed by first installing slink (i.e., Debian
    2.1) from an official CD-ROM, and then using `apt-get dist-upgrade'
    from

	     http://http.us.debian.org/debian unstable main

    I did that update around 24 September.

    Here is some information about the broken system:

    Package: netbase
    Version: 3.16-2

    Package: kernel-image-2.2.9
    Version: 2.2.9-2

    My network card driver is 3c59x:

	Sep 24 07:21:13 potato kernel: 3c59x.c:v0.99H 11/17/98 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/vortex.html
	Sep 24 07:21:13 potato kernel: eth0: 3Com 3Com Boomerang (unknown version) at 0xb800,  00:50:04:1b:f6:df, IRQ 11
	Sep 24 07:21:13 potato kernel:   8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
	Sep 24 07:21:13 potato kernel:   MII transceiver found at address 24, status 182d.
	Sep 24 07:21:13 potato kernel:   Enabling bus-master transmits and whole-frame receives.

    This problem didn't always happen, although I don't remember exactly
    when it started.  I know for certain that it didn't happen immediately
    after I installed slink, nor did it happen immediately after I
    upgraded to potato the first time.

    I've also seen this problem on a different installation of slink (on
    the same machine with the same hardware), but that problem
    mysteriously went away.  I now have both slink and potato on this
    machine, and slink works flawlessly.  Only potato has this
    name-resolution problem.

    I haven't noticed any error messages -- certainly none at the shell on
    which I ran `ping', and none in /var/log.

    I connect to the Internet via DSL, using a Cisco 675 router, which is
    a little grey box that sits on the floor (the phone company gave it to
    me when I signed up for DSL).  I have a phone cord that connects the
    router and my phone jack; I have an Ethernet cable that connects the
    router and my network card.

    The router is quite configurable, and perhaps its configuration is
    relevant: 

    * I've got it set to act as a DHCP server, although since I don't know
      how to make Debian use DHCP, I've told Debian to use a static IP
      address.  Since I only have one computer, there is no risk of having
      two IP addresses conflict.

    * It's doing something called `network address translation', which, as
      I understand it, means that my machine "appears" to the outside
      world to have a different IP address than what the machine thinks.
      That is (as you can see below in my network configuration files), my
      machine thinks its IP address is 10.0.0.2, but the outside world
      uses 206.124.128.30 (that address might change from time to time,
      because the router might be a DHCP client of my ISP).  Also, if I
      were to connect other machines to the router (with an Ethernet hub),
      they would get IP addresses like 10.0.0.3, 10.0.0.4, etc.; but they
      would *all* appear to the outside world as 206.124.128.30.  You
      might expect this to cause total confusion, but it doesn't;
      somehow this `network address translation' keeps things from getting
      confused.  I don't understand how it does this, but it seems to work
      OK.  (The place I work used to have a similar setup; they had five
      machines connected to the Internet, all "sharing" an outside IP
      address; the machines all worked fine.)  The one tradeoff that I
      know of is that nobody in the outside world can connect to any
      servers that I run, because the network address translation
      apparently futzes with port numbers.  For example, my SMTP server
      listens on port 25, but someone who tries to connect to that port
      using my outside IP address 206.124.128.30 won't be able to.
      Presumably, if they could guess the port to which the router has
      "mapped" port 25, they could connect to that port.

      There may be some more information about the configuration of this
      box that is relevant.  Please feel free to ask me about it, if you
      think it would help.
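
    The address sharing described above can be pictured with a toy
    model.  The sketch below is Python and purely illustrative: the
    class name ToyNat, the starting port 40000, and the table layout
    are assumptions, not a description of how the Cisco 675 is
    actually implemented.  It shows how a translation table lets
    replies find their way back to the right inside machine, and why
    an unsolicited connection from outside has no table entry to
    follow.

```python
class ToyNat:
    """Toy model of port-based network address translation (hypothetical)."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.out = {}   # (private_ip, private_port) -> public_port
        self.back = {}  # public_port -> (private_ip, private_port)

    def outbound(self, private_ip, private_port):
        """Translate an outgoing connection; create a mapping if needed."""
        key = (private_ip, private_port)
        if key not in self.out:
            self.out[key] = self.next_port
            self.back[self.next_port] = key
            self.next_port += 1
        return (self.public_ip, self.out[key])

    def inbound(self, public_port):
        """Find the inside endpoint for a reply, or None if unmapped."""
        return self.back.get(public_port)


nat = ToyNat("206.124.128.30")
first = nat.outbound("10.0.0.2", 1025)   # this machine
second = nat.outbound("10.0.0.3", 1025)  # a second machine, same local port
```

    In this model the two inside machines share one outside address
    but get different outside ports, so nat.inbound(second[1]) leads
    back to 10.0.0.3.  A connection arriving on port 25 finds nothing
    in the table unless something inside opened it first.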

    Perhaps some of the following network configuration files are
    relevant:

    /etc/resolv.conf:
	nameserver 206.124.128.1
	nameserver 206.124.128.3

    /etc/hosts:
	127.0.0.1   localhost loopback
	 10.0.0.1   cisco-router
	 10.0.0.2   potato

    /etc/init.d/network:
	#! /bin/sh
	ifconfig lo 127.0.0.1
	route add -net 127.0.0.0
	IPADDR=10.0.0.2
	NETMASK=255.255.255.0
	NETWORK=10.0.0.0
	BROADCAST=10.0.0.255
	GATEWAY=10.0.0.1
	ifconfig eth0 ${IPADDR} netmask ${NETMASK} broadcast ${BROADCAST}
	route add -net ${NETWORK}
	[ "${GATEWAY}" ] && route add default gw ${GATEWAY} metric 1

    Note that those three files are almost-exact copies of the same files
    on my slink system, which as I said works fine.  The only differences
    are 
	--- /slink/etc/resolv.conf	Sun Sep 12 04:06:13 1999
	+++ /potato/etc/resolv.conf	Mon Sep 20 22:00:49 1999
	@@ -1,3 +1,2 @@
	-search hanchrow.org
	 nameserver 206.124.128.1
	 nameserver 206.124.128.3

    (I don't know what that `search' line is doing on my slink system; I
    assume that it got put there when I installed the system.)

	--- /slink/etc/hosts	Sun Sep 12 12:49:07 1999
	+++ /potato/etc/hosts	Tue Sep 21 22:29:44 1999
	@@ -1,3 +1,4 @@
	 127.0.0.1	localhost loopback
	  10.0.0.1	cisco-router
	- 10.0.0.2	snowball
	\ No newline at end of file
	+ 10.0.0.2	potato
	+
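
    The static values in /etc/init.d/network shown earlier can be
    cross-checked mechanically.  This is a small Python sketch using
    the standard ipaddress module; the function name
    check_static_config is an assumption made for illustration.  It
    only tests that the numbers agree with each other, so it says
    nothing about the cause of the hang.

```python
import ipaddress


def check_static_config(ipaddr, netmask, network, broadcast, gateway):
    """Return a list of inconsistencies between static network settings."""
    net = ipaddress.ip_interface(f"{ipaddr}/{netmask}").network
    problems = []
    if str(net.network_address) != network:
        problems.append(f"NETWORK should be {net.network_address}")
    if str(net.broadcast_address) != broadcast:
        problems.append(f"BROADCAST should be {net.broadcast_address}")
    if ipaddress.ip_address(gateway) not in net:
        problems.append(f"GATEWAY {gateway} is outside {net}")
    return problems


# Values copied from the /etc/init.d/network script above.
problems = check_static_config(
    ipaddr="10.0.0.2",
    netmask="255.255.255.0",
    network="10.0.0.0",
    broadcast="10.0.0.255",
    gateway="10.0.0.1",
)
```

    An empty list means the address, netmask, network, broadcast, and
    gateway values are mutually consistent.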

    Now, here's the kicker: the problem goes away if I run `tcpdump': I do

	   tcpdump &
	   ping blarg.net

    and `ping' responds correctly.  I can then kill `tcpdump', and until
    the next time I boot, the network works fine.  It's as if `tcpdump'
    changed something, and that change allows name resolution to work.

    So that's the deal.  Any ideas why my system is behaving this way, and
    what I can do about it?

    Thanks

