Thanks for the detailed description. Your description of arp failures
being on jg-tower makes sense, and it also makes sense given that I've
had these computers for a couple of years and just started having
troubles when I switched the tower to wireless (new house--no phone
jack in the office for my highspeed).

Unfortunately the driver used by jg-tower is ath9k, and I see no reference (on the ath9k website) to this type of problem being fixed by nohwcrypt=1.

To make matters more confusing, I now find that I am unable to reliably reproduce my problem--partly because I don't understand when a host will send out packets announcing itself. I know that if jg-tower attempts connecting to jg-laptop, my problem disappears, but it also seems that when jg-tower brings up its wireless, something ends up at jg-laptop to provide initial "awareness" of jg-tower. I thought I had this addressed by booting jg-tower and enabling its wireless before booting jg-laptop, but I'm now seeing situations where jg-laptop knows of jg-tower's address right away, and the entry doesn't disappear.

Long story short, I can't reliably reproduce the problem, and so can't reliably say that anything has fixed it! I even tried ndiswrapper, but it appeared to exhibit the same problem, which leaves me totally confused!

Back to my cron ping solution....

Glen Reesor

Kevin Buhr wrote:

Glen Reesor <Glen.Reesor@telus.net> writes:

It is indeed a wireless problem. I have confirmed that there is no problem when jg-laptop is wired. Based on my limited knowledge of ARP, this appears to make sense--ARP is failing on jg-laptop, but the arp cache gets populated when it receives packets from jg-tower.

Glen, I had the same problem and can confirm that Michael has diagnosed it correctly. Note that, in your configuration, the bug is in *jg-tower*'s wireless driver, not the ath5k driver on the laptop. The tower is failing to receive broadcast ARPs from the laptop, and I'm willing to bet that your tower is running one of the rt2x00 drivers (rt61, rt73, rt2x00) and that the problem disappears if you turn on the tower driver's "nohwcrypt=1" module parameter.

I ran into the problem with the rt73usb driver, and when I looked at the RT driver code, I found that the hardware encryption support was full of bugs: data was written to the wrong registers or the wrong bit fields within registers, keys overwrote each other, and it was a miracle that any of it could work at all. (In fact, an old Hardy kernel driver worked okay and a Karmic upgrade broke it because, in Hardy, a fatal bug prevented hardware encryption from being used at all, and when this bug was fixed all the other bugs came to the surface.)

The exact nature of the problem is that, on a WPA network, different keys are used for unicast and broadcast traffic (unicast uses a key specific to the pair of machines, broadcast uses a group key). Because of the above bugs, it was easy for the driver to get into a state where it was unable to decrypt and accept broadcast traffic. In this state, the symptoms are exactly as you described. The machine with the bad driver (in your case jg-tower) becomes unreachable to any other machines on the wireless that don't already know its Ethernet address because it does not see any broadcast ARPs. Of course, connections can be initiated from the affected machine, which can send broadcast ARP requests and receive ARP replies and then maintain a unicast connection (with unicast ARPs) indefinitely.

The precise details of ARP timeout look pretty complicated, but I believe an ARP entry will stay REACHABLE for 15-45 seconds and then, as experimentation shows, STALE for about 10 minutes before being deleted entirely. While the entry is STALE, it will be used to unicast new ARPs before falling back on broadcast ARPs, and these unicast ARPs work fine, so I'm not surprised you're seeing a 10 minute delay with no traffic before the problem crops up again.

You could try disabling hardware encryption on jg-tower, with a "nohwcrypt=1" module parameter, and see if that fixes the problem.
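
For anyone trying to reproduce this, the REACHABLE and STALE states described above can be watched on the machine that loses contact (jg-laptop here) by reading the kernel's neighbour table. What follows is only a rough sketch: it assumes the iproute2 `ip neigh show` command is installed and that each output line starts with the IP address and ends with the entry state, and the function names are made up for illustration.

```python
import subprocess


def parse_neigh(output):
    """Map each IP in `ip neigh show` output to a (mac_or_None, state) pair.

    Assumes the usual iproute2 layout, for example:
        192.168.1.10 dev wlan0 lladdr 00:11:22:33:44:55 STALE
        192.168.1.11 dev wlan0  FAILED
    """
    entries = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or unexpected line
        ip, state = fields[0], fields[-1]
        mac = None
        # Entries that never resolved (FAILED, INCOMPLETE) carry no lladdr.
        if "lladdr" in fields[:-1]:
            mac = fields[fields.index("lladdr") + 1]
        entries[ip] = (mac, state)
    return entries


def current_neighbors():
    """Run `ip neigh show` and return the parsed neighbour table."""
    result = subprocess.run(
        ["ip", "neigh", "show"],
        capture_output=True, text=True, check=True,
    )
    return parse_neigh(result.stdout)
```

Calling `current_neighbors()` every few seconds and printing the entry for jg-tower's address should show when the entry goes from REACHABLE to STALE and when it is finally dropped, which makes it easier to tell whether a change such as nohwcrypt=1 actually fixed anything or the cache just happened to be populated.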