Re: Weird No route to host
Glen Reesor <Glen.Reesor@telus.net> writes:
>
> It is indeed a wireless problem.  I have confirmed that there is no
> problem when jg-laptop is wired.  Based on my limited knowledge of
> ARP, this appears to make sense--ARP is failing on jg-laptop, but
> the arp cache gets populated when it receives packets from jg-tower.

Glen, I had the same problem and can confirm that Michael has
diagnosed it correctly. Note that, in your configuration, the bug is
in *jg-tower*'s wireless driver, not the ath5k driver on the laptop.
The tower is failing to receive broadcast ARPs from the laptop, and
I'm willing to bet that your tower is running one of the rt2x00
drivers (rt61, rt73, rt2x00) and that the problem disappears if you
turn on the tower driver's "nohwcrypt=1" module parameter.

I ran into the problem with the rt73usb driver, and when I looked at
the RT driver code, I found that the hardware encryption support was
full of bugs: data was written to the wrong registers or the wrong bit
fields within registers, keys overwrote each other, and it was a
miracle that any of it could work at all. (In fact, an old Hardy kernel
driver worked okay and a Karmic upgrade broke it: in Hardy, a
fatal bug prevented hardware encryption from being used at all, and
once that bug was fixed, all the other bugs came to the surface.)

The exact nature of the problem is that, on a WPA network, different
keys are used for unicast and broadcast traffic (unicast uses a key
specific to the pair of machines, broadcast uses a group key).
Because of the above bugs, it was easy for the driver to get into a
state where it was unable to decrypt and accept broadcast traffic.

In this state, the symptoms are exactly as you described. The machine
with the bad driver (in your case jg-tower) becomes unreachable to any
other machine on the wireless network that doesn't already know its
Ethernet address, because it never sees their broadcast ARP requests.
Connections can still be initiated from the affected machine, of
course: it can send broadcast ARP requests, receive the unicast ARP
replies, and then maintain the connection (with unicast ARPs)
indefinitely.
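
One way to watch this happen from the laptop side is to look at its
neighbour table while the tower is unreachable. The following is only
a rough sketch: it assumes the usual one-entry-per-line output of
"ip neigh show" (address first, optional "lladdr <mac>", state last),
and the function names and sample addresses are made up for
illustration.

```python
import subprocess


def parse_neighbors(text):
    """Map each address in `ip neigh show` output to (lladdr or None, state).

    Assumes the state (REACHABLE, STALE, FAILED, ...) is the last field
    on each line and that the hardware address, if any, follows "lladdr".
    """
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        lladdr = None
        if "lladdr" in fields:
            idx = fields.index("lladdr")
            if idx + 1 < len(fields):
                lladdr = fields[idx + 1]
        table[fields[0]] = (lladdr, fields[-1])
    return table


def read_neighbors():
    """Run `ip neigh show` on this machine and parse the result."""
    out = subprocess.run(
        ["ip", "neigh", "show"], capture_output=True, text=True, check=True
    ).stdout
    return parse_neighbors(out)
```

If the diagnosis above is right, the tower's entry on the laptop
should have no hardware address (INCOMPLETE or FAILED) while the
problem is visible, and gain one as soon as the tower initiates
traffic to the laptop.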

The precise details of ARP timeout look pretty complicated, but I
believe an ARP entry will stay REACHABLE for 15-45 seconds and then,
as experimentation shows, STALE for about 10 minutes before being
deleted entirely. While the entry is STALE, it will be used to unicast
new ARPs before falling back on broadcast ARPs, and these unicast ARPs
work fine, so I'm not surprised you're seeing a 10 minute delay with
no traffic before the problem crops up again.
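
To make the arithmetic concrete, here is a deliberately simplified
model of that timeline. It assumes the Linux default
base_reachable_time of 30 seconds (randomized between half and one
and a half times that value, which is where 15-45 comes from) and
uses the roughly 10 minute STALE lifetime I observed, which is an
experimental figure and not a kernel constant. The function names
are mine.

```python
def reachable_window(base_reachable_s=30.0):
    """Bounds (in seconds) of the randomized REACHABLE lifetime.

    Assumes the lifetime is drawn between 0.5x and 1.5x the base value.
    """
    return (0.5 * base_reachable_s, 1.5 * base_reachable_s)


def predicted_state(idle_s, reachable_s, stale_lifetime_s=600.0):
    """Rough state of an ARP entry after `idle_s` seconds without traffic.

    reachable_s      -- the REACHABLE lifetime actually drawn for this entry
    stale_lifetime_s -- how long a STALE entry survives (observed, ~10 min)
    """
    if idle_s < reachable_s:
        return "REACHABLE"
    if idle_s < reachable_s + stale_lifetime_s:
        return "STALE"
    return "deleted"
```

With these numbers, an entry that has been idle for five minutes is
still STALE (so unicast ARP keeps working), while one idle for more
than about eleven minutes is gone and the next connection attempt
has to fall back on a broadcast ARP that the tower never hears.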

You could try disabling hardware encryption on jg-tower, with a
"nohwcrypt=1" module parameter, and see if that fixes the
problem.
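
If you want to confirm that the parameter actually took effect after
reloading the driver, something like the sketch below should do. It
assumes the driver exposes the parameter read-only under
/sys/module/<module>/parameters/nohwcrypt and that a boolean
parameter reads back as "Y" or "N"; the module name "rt73usb" is
just my case, so substitute whatever jg-tower is really using.

```python
from pathlib import Path


def hwcrypt_disabled(module, sysfs_root="/sys/module"):
    """Return True/False for the module's nohwcrypt setting.

    Returns None if the module is not loaded or does not expose the
    parameter. The sysfs layout is an assumption, see above.
    """
    param = Path(sysfs_root) / module / "parameters" / "nohwcrypt"
    try:
        value = param.read_text().strip()
    except OSError:
        return None
    return value in ("Y", "1")
```

For example, hwcrypt_disabled("rt73usb") returning True would mean
the driver is doing encryption in software. To make the setting
persistent, the usual approach is a line such as
"options rt73usb nohwcrypt=1" in a file under /etc/modprobe.d/.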
-- 
Kevin <buhr+debian@asaurus.net>