Hi list,
I'm new to batman-adv, and I'm setting up some basic tests to verify my use case. I'm running into a roaming issue.
Consider a basic 2-node network running batman-adv 2014.2 with default settings on OpenWRT AA. Let the routing nodes be A and B. Each node has 2 wireless interfaces: wlan0 is for the mesh, and wlan1 is the AP. The interfaces are bridged per the batman start guide.
Let there be 2 wifi clients, client 1 and 2. Initially, both clients are wirelessly attached to node A. Client 2 can ping 1 and the nodes. Client 2 can also Telnet into node A and B, so all is fine.
I take client 2 and roam to node B. Client 2 can no longer ping client 1 and that is the issue.
If client 2 roams back to A, pings to client 1 work again. A few more observations while pings are failing:
1) Client 2 can ping the nodes, and the Telnet sessions to the nodes are fine.
2) Node B's local translation table says 2 is at B; node A's local translation table says 1 is at A. So the local translation tables check out.
3) Both nodes A and B can ping client 1, so client 1 is still up.
4) Running 'batctl td wlan1' on node A shows ICMP requests and replies, but 'batctl td bat0' shows only requests. So client 1 is getting the ICMP packets and is responding.
5) If I run 'iw wlan1 station del <client 2 mac>' on node A, pings work again.
It almost looks like the wifi driver (ath5k) is blocking data for roamed clients that were once attached to it. So this issue might not be a batman thing but a driver thing. Has anyone run into this problem?
Thanks,
Simon
Hi Simon,
On Mon, Jun 30, 2014 at 07:22:51PM -0700, Simon Wong wrote:
Let there be 2 wifi clients, client 1 and 2. Initially, both clients are wirelessly attached to node A. Client 2 can ping 1 and the nodes. Client 2 can also Telnet into node A and B, so all is fine.
I take client 2 and roam to node B. Client 2 can no longer ping client 1 and that is the issue.
At the Wireless Battle Mesh a few months ago we discussed just such a (until now?) hypothetical problem. Maybe it applies here, maybe it doesn't:
It could be a problem with a not yet updated MAC address table in the bridge, therefore the bridge on node A not forwarding ICMP requests from client 1 towards client 2.
Questions: Are your clients using IPv4, IPv6 or both? Are your clients issuing gratuitous ARP replies or ICMPv6 unsolicited Neighbor Advertisements upon roaming? Is this a permanent problem or are clients 1 and 2 able to reach each other again after a while? In your tests, did client 1 ping client 2 or the other way round?
To check whether it is a problem with the bridge's MAC learning, you could try turning the bridges on node A and node B into dumb hubs:
$ brctl setageing br0 0
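Before disabling learning, it can also help to look at where the bridge currently thinks a client lives. The following is a minimal sketch, assuming the bridge is named br0 and that `brctl showmacs` prints its usual columns (port no, mac addr, is local?, ageing timer). The helper name `fdb_port_for_mac` is hypothetical.

```shell
# fdb_port_for_mac MAC
# Reads `brctl showmacs <bridge>` output on stdin and prints the bridge port
# number on which MAC was last learned (prints nothing if the MAC is unknown).
# Assumption: column 1 is the port number and column 2 is the MAC address.
fdb_port_for_mac() {
    awk -v mac="$(printf '%s' "$1" | tr 'A-Z' 'a-z')" \
        'tolower($2) == mac { print $1 }'
}

# Example, run on node A after the roam (br0 is an assumed bridge name):
#   brctl showmacs br0 | fdb_port_for_mac <client 2 mac>
```

If the port printed after the roam still belongs to the AP interface rather than to bat0, the bridge table is stale.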
Cheers, Linus
On 01/07/14 10:50, Linus Lüssing wrote:
It could be a problem with a not yet updated MAC address table in the bridge, therefore the bridge on node A not forwarding ICMP requests from client 1 towards client 2.
Hey Linus,
I agree that the problem is probably in the bridge, but how can it be an inconsistency in the table given that the bridge is receiving the Echo requests from client 2 through bat0?
Shouldn't this immediately update the bridge table to reflect the client movement (client2 --is-behind--> bat0)?
@Simon: are you sure that the client is not associated anymore with node A at that moment (maybe it was jumping here and there)? You said that you can fix this situation by deleting the station entry, but is this station entry obsolete at that point? (meaning: is the inactivity time high? - you can see this through the "iw dev wlan0 station get <client2 mac>" command before deleting it) If not, it may be that something wrong is happening at the wifi layer, and given the driver you are using (ath5k) it would not be totally unexpected.
I am asking this because I expect the station to disappear immediately in case of roaming (the client usually deauthenticates itself before associating with the new AP). Still, we can have cases when this does not happen, but the AP should be able to react properly.
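The inactivity check described above can be scripted. This is a rough sketch, assuming the `iw ... station get` output contains a line of the form "inactive time: <N> ms". The helper names `inactive_ms` and `station_is_stale` and the threshold are hypothetical.

```shell
# inactive_ms
# Reads `iw dev <ap-if> station get <mac>` output on stdin and prints the
# inactive time in milliseconds (assumes a line "inactive time: <N> ms").
inactive_ms() {
    awk '/inactive time:/ { print $3 }'
}

# station_is_stale THRESHOLD_MS
# Succeeds if the station output on stdin reports an inactive time above
# THRESHOLD_MS, i.e. the entry looks obsolete.
station_is_stale() {
    ms=$(inactive_ms)
    [ -n "$ms" ] && [ "$ms" -gt "$1" ]
}

# Example, run on node A (interface name is a placeholder):
#   iw dev <ap-if> station get <client2 mac> | station_is_stale 5000 && echo stale
```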
Cheers,
-- Antonio Quartulli
A few more observations:
- Client 1 is a Win 7 machine, and for Client 2 I have tried both a Win 7 and an OS X machine. In both cases the behavior is repeatable.
- All clients are on IPv4 only.
- I ran a Wireshark cap on the roaming client - no gratuitous ARP replies seen during the roam
- Client 2 is doing the pinging to Client 1
- The problem is permanent, and can be fixed by any one of the following:
  - Manually deleting the roaming client via 'iw station del'
  - Restarting node A's network stack (/etc/init.d/network restart), though which client then attaches to which AP is not deterministic
  - Client 2 roaming back to A
- I tried 'brctl setageing 0' on node A's bridge; that didn't affect the behavior
- Running 'iw station get' on the 2 nodes during the problem yields some interesting results. On both nodes, the inactive time resets to 0 while the ping is running. If I stop the ping, the inactive time on both nodes rises as expected.
- Even more strange with 'iw station get' during the problem: interacting with the Telnet connection from Client 2 to Node A will also reset the inactive time count for Client 2, and this is while Client 2 is roamed to node B. On node A, only the tx {bytes, packets} counters will increase. rx counts do not. On node B, the tx/rx counts increase as expected.
- I am in a relatively small area, so even if Client 2 roamed to B, it is still within RF range of both nodes.
I mentioned before that both nodes' local translation tables were accurate after the roam. I also mentioned that doing a 'iw station del' will fix the problem. So, I took advantage of this and wrote a quick hack script to verify. The pseudo code is as follows:
while true
    run batctl tl and get current local client list
    compare current local client list with the last client list
    if (old list has clients that the new list doesn't have)
        run iw station del for those clients
    save current list to last client list
    sleep
done
Terrible hack, but I was able to roam successfully while this script is running.
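The pseudo code above could look roughly like this in POSIX shell. It is a sketch, not the script that was actually run. It assumes that client entries in the `batctl tl` output start with "*" followed by the MAC address, which may differ between batctl versions. The function names, the AP interface argument, and the polling interval are all hypothetical.

```shell
# extract_macs
# Reads `batctl tl` output on stdin and prints the sorted, unique client MACs.
# Assumption: client lines look like " * aa:bb:cc:dd:ee:ff ...".
extract_macs() {
    awk '$1 == "*" { print tolower($2) }' | sort -u
}

# departed OLD_LIST NEW_LIST
# Prints every MAC that is in OLD_LIST but not in NEW_LIST.
# Both lists are whitespace-separated strings.
departed() {
    new_flat=" $(echo $2) "
    for mac in $1; do
        case "$new_flat" in
            *" $mac "*) ;;          # still a local client, keep it
            *) echo "$mac" ;;       # gone from the local table
        esac
    done
}

# watch_and_purge AP_IF [INTERVAL_SECONDS]
# Polls the local translation table and deletes the AP station entry of any
# client that has disappeared from it (i.e. has roamed to another node).
watch_and_purge() {
    ap_if=$1
    interval=${2:-1}
    last=$(batctl tl | extract_macs)
    while true; do
        sleep "$interval"
        current=$(batctl tl | extract_macs)
        for gone in $(departed "$last" "$current"); do
            # Fails harmlessly for MACs that are not wifi stations.
            iw dev "$ap_if" station del "$gone" 2>/dev/null || true
        done
        last=$current
    done
}

# Example: watch_and_purge wlan1 1
```

Polling leaves a window of up to one interval in which traffic to the roamed client is still lost, so this remains a workaround rather than a fix.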
Thanks,
- Simon
Simon,
On 02/07/14 03:24, Simon Wong wrote:
- Even more strange with 'iw station get' during the problem:
interacting with the Telnet connection from Client 2 to Node A will also reset the inactive time count for Client 2, and this is while Client 2 is roamed to node B. On node A, only the tx {bytes, packets} counters will increase. rx counts do not. On node B, the tx/rx counts increase as expected.
very stupid question: but the two nodes have different MAC addresses for wlan1, right ? I expect the answer to be yes, otherwise this would have probably created more problems...but just to be sure...
However, there is something strange with the AP interface (as you already pointed out). Did you see any deauth sent by the client while roaming?
Cheers,
On 02/07/14 07:58, Antonio Quartulli wrote:
Simon,
On 02/07/14 03:24, Simon Wong wrote:
- Even more strange with 'iw station get' during the problem:
interacting with the Telnet connection from Client 2 to Node A will also reset the inactive time count for Client 2, and this is while Client 2 is roamed to node B. On node A, only the tx {bytes, packets} counters will increase. rx counts do not. On node B, the tx/rx counts increase as expected.
very stupid question: but the two nodes have different MAC addresses for wlan1, right ?
ehm, here I meant wlan0 (the AP interface where client connect to).
Antonio,
I haven't tried monitoring for deauths yet, but I have tried another device for the AP interface (a USB stick using ath9k_htc, on wlan2). I am able to repeat the same inter-AP roaming problem.
I was thinking this could have been a problem with the ath5k drivers, but that seems less likely.
Another observation: Let's say the roaming client 2 is attached to node A. I am monitoring client 2 on node A via `iw wlan1 station dump`. If I turn off client 2's WiFi or switch SSID, client 2 disappears from the station list as expected - a deauth probably got sent. I am guessing roaming might not trigger a deauth on the client. In any case, we can't count on the deauth being received anyways.
Hypothesis: It seems as if the wireless driver/hardware has an internal forwarding rule. If the AP interface thinks it's got the client, it'll forward data internally to it and batman never sees the data and thus can't route it. But since the roam happened and another node has picked up the roaming client, translation table updates are still triggered and states are still synchronized.
What do you think?
Thanks, - Simon
Simon,
On 04/07/14 09:36, Simon Wong wrote:
I am guessing roaming might not trigger a deauth on the client.
at least a disassoc should be sent.
In any case, we can't count on deauth being received anyways.
of course, but we should be able to rely on the layer below working consistently.
Hypothesis: It seems as if the wireless driver/hardware has an internal forwarding rule. If the AP interface thinks it's got the client, it'll forward data internally to it and batman never sees the data and thus can't route it.
this is exactly how AP mode is supposed to work: if source and destination are connected to the same interface unicast traffic will not be delivered to the upper layer but will directly be forwarded to the destination.
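As a toy model of that decision (not the actual driver or mac80211 code; the function name and return strings are made up for illustration), the AP's handling of a received unicast frame can be sketched as follows. It shows why a stale station entry for client 2 on node A keeps client 1's replies from ever reaching the bridge and bat0.

```shell
# ap_rx_decision DST_MAC STATION_MAC...
# Models what an AP interface does with a unicast frame received from one of
# its stations: if the destination is in its own station table, the frame is
# sent straight back over the air; otherwise it is handed to the upper layer
# (here, the bridge and then bat0).
ap_rx_decision() {
    dst=$1
    shift
    for sta in "$@"; do
        if [ "$sta" = "$dst" ]; then
            echo "retransmit-on-air"
            return
        fi
    done
    echo "deliver-to-bridge"
}

# Node A still lists client 2 after the roam, so the reply never leaves wlan:
#   ap_rx_decision client2 client1 client2
# After `iw ... station del <client 2 mac>` the reply reaches the bridge:
#   ap_rx_decision client2 client1
```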
But since the roam happened and another node has picked up the roaming client, translation tables updates are still triggered and states are still synchronized.
What do you think?
Looks like there is a problem at the wifi layer. batman-adv here is only playing the role of a generic Distribution System. The current behaviour would break any other backbone that you would have instead of batman-adv. The inactivity time getting reset when the client is connected to another AP is definitely a bogus behaviour and points towards a wifi problem.
At this point I would suggest involving the linux-wireless guys (they have their own mailing list) and describing the problem to them. What I can say here is that batman-adv seems to be unrelated.
Cheers,
b.a.t.m.a.n@lists.open-mesh.org