
Re: Weird No route to host



Thanks for the detailed description.  Your description of arp failures being on jg-tower makes sense, and it also makes sense given that I've had these computers for a couple of years and just started having troubles when I switched the tower to wireless (new house--no phone jack in the office for my highspeed).

Unfortunately the driver used by jg-tower is ath9k, and I see no reference to this type of problem being fixed by nohwcrypt=1 (on the ath9k website).

To make matters more confusing, I now find that I am unable to reliably reproduce my problem--partly because I don't understand when a host will send out packets announcing itself.  I know that if jg-tower attempts connecting to jg-laptop, my problem disappears, but it also seems that when jg-tower brings up its wireless, something ends up at jg-laptop to provide initial "awareness" of jg-tower.

I thought I had this addressed by booting jg-tower and enabling its wireless before booting jg-laptop, but I'm now seeing situations where jg-laptop knows jg-tower's address right away, and the entry doesn't disappear.

Long story short, I can't reliably reproduce the problem, thus am unable to reliably say something has fixed it!  I even tried ndiswrapper, but it appeared to exhibit the same problem, which leaves me totally confused!

Back to my cron ping solution....
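
A keepalive of that kind can be sketched as follows. This is an illustration, not the actual cron entry: it assumes the script runs on jg-tower and pings jg-laptop (hostnames from this thread), it relies on the Linux iputils `ping` options `-c` and `-W`, and the function names are made up for the example.

```python
import subprocess


def ping_command(host, count=1, timeout_s=2):
    """Build a Linux iputils ping command.

    -c is the number of echo requests, -W the reply timeout in seconds.
    """
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def ping_once(host):
    """Send one ping and return True if the host answered.

    The useful side effect here is that traffic initiated from this
    machine lets the peer learn (or refresh) this machine's address.
    """
    result = subprocess.run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling `ping_once("jg-laptop")` from cron every few minutes would be one way to use it; the interval is an assumption, picked to stay under the roughly 10 minute STALE period described in the quoted message below.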

Glen Reesor


Kevin Buhr wrote:
Glen Reesor <Glen.Reesor@telus.net> writes:
  
It is indeed a wireless problem.  I have confirmed that there is no
problem when jg-laptop is wired.  Based on my limited knowledge of
ARP, this appears to make sense--ARP is failing on jg-laptop, but
the arp cache gets populated when it receives packets from jg-tower.
    
Glen, I had the same problem and can confirm that Michael has
diagnosed it correctly. Note that, in your configuration, the bug is
in *jg-tower*'s wireless driver, not the ath5k driver on the laptop.
The tower is failing to receive broadcast ARPs from the laptop, and
I'm willing to bet that your tower is running one of the rt2x00
drivers (rt61, rt73, rt2x00) and that the problem disappears if you
turn on the tower driver's "nohwcrypt=1" module parameter.

I ran into the problem with the rt73usb driver, and when I looked at
the RT driver code, I found that the hardware encryption support was
full of bugs: data was written to the wrong registers or the wrong bit
fields within registers, keys overwrote each other, and it was a
miracle that any of it could work at all. (In fact an old Hardy kernel
driver worked okay and a Karmic upgrade broke it because, in Hardy, a
fatal bug prevented hardware encryption from being used at all, and
when this bug was fixed all the other bugs came to the surface.)

The exact nature of the problem is that, on a WPA network, different
keys are used for unicast and broadcast traffic (unicast uses a key
specific to the pair of machines, broadcast uses a group key).
Because of the above bugs, it was easy for the driver to get into a
state where it was unable to decrypt and accept broadcast traffic.

In this state, the symptoms are exactly as you described. The machine
with the bad driver (in your case jg-tower) becomes unreachable to any
other machines on the wireless that don't already know its Ethernet
address because it does not see any broadcast ARPs. Of course,
connections can be initiated from the affected machine, which can send
broadcast ARP requests and receive ARP replies and then maintain a
unicast connection (with unicast ARPs) indefinitely.

The precise details of ARP timeout look pretty complicated, but I
believe an ARP entry will stay REACHABLE for 15-45 seconds and then,
as experimentation shows, STALE for about 10 minutes before being
deleted entirely. While the entry is STALE, it will be used to unicast
new ARPs before falling back on broadcast ARPs, and these unicast ARPs
work fine, so I'm not surprised you're seeing a 10 minute delay with
no traffic before the problem crops up again.
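
One way to see these state transitions while trying to reproduce the problem is to poll the kernel's neighbour table. The sketch below assumes a Linux machine with iproute2, where `ip neigh show` prints one entry per line with the state as the last field; the function names and the sample address are invented for illustration.

```python
import subprocess
import time


def parse_neigh_line(line):
    """Parse one line of `ip neigh show` output into a dict.

    Example line (made-up address):
        192.168.1.20 dev wlan0 lladdr 00:11:22:33:44:55 STALE
    Entries in FAILED or INCOMPLETE state have no lladdr field.
    """
    tokens = line.split()
    if not tokens:
        return None
    entry = {"ip": tokens[0], "dev": None, "lladdr": None, "state": tokens[-1]}
    for key in ("dev", "lladdr"):
        # Look only before the final token so key + 1 is always valid.
        if key in tokens[:-1]:
            entry[key] = tokens[tokens.index(key) + 1]
    return entry


def neigh_state(ip):
    """Return the current neighbour state for ip, or None if there is no entry."""
    out = subprocess.run(
        ["ip", "neigh", "show"], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        entry = parse_neigh_line(line)
        if entry and entry["ip"] == ip:
            return entry["state"]
    return None


def watch(ip, interval_s=5):
    """Print a timestamped line whenever the state for ip changes."""
    last = object()  # sentinel that never equals a real state
    while True:
        state = neigh_state(ip)
        if state != last:
            print(time.strftime("%H:%M:%S"), ip, state or "no entry")
            last = state
        time.sleep(interval_s)
```

Running `watch()` on jg-laptop with jg-tower's address should show when the entry moves from REACHABLE to STALE and when it is finally dropped, which may make the failure window easier to catch.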

You could try disabling hardware encryption on jg-tower, with a
"nohwcrypt=1" module parameter, and see if that fixes the
problem.
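
Whether a loaded driver actually has the parameter set can usually be checked through sysfs. The sketch below assumes the driver exports its parameters under /sys/module/<module>/parameters/ (not every module parameter is exported there), and uses rt73usb, the module named above, purely as an example; the function name is made up.

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys/module"):
    """Return the current value of a loaded module's parameter as a string.

    Returns None if the module is not loaded or does not export the
    parameter through sysfs.
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None
```

For example, `read_module_param("rt73usb", "nohwcrypt")` returning None would suggest the module is not loaded or does not expose the parameter this way.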

  
