in my small office-network we have _often_ the situation, that my laptop 8-) has no internet. if this happens, i can connect via ssh hop by hop and can see, that the network itself (wireless/wired) is working, but the 'routes' are wrong.
here the transglobal-table in the master, my laptop is '00:21:6a:32:7c:1c'
root@EG-labor-AP:~# batctl tg
Globally announced TT entries received via the mesh bat0
   Client (TTVN) Originator (Curr TTVN) (CRC ) Flags
 * 02:00:c0:ca:c0:1a ( 8) via 02:00:ca:b1:00:15 ( 8) (0xe285) [...]
 * 06:2f:65:8a:d2:b7 ( 1) via 02:00:ca:b1:00:02 ( 1) (0x50f6) [...]
 * 46:0a:75:3c:f2:47 ( 1) via 02:00:ca:b1:00:15 ( 8) (0xe285) [...]
 * 5e:27:29:d8:ee:b4 ( 1) via 02:00:ca:b1:00:76 ( 1) (0xc9fc) [...]
 * 72:f7:18:80:9d:9d ( 86) via 02:00:de:ad:00:03 ( 86) (0x7651) [...]
 * ae:d9:0f:ef:01:c3 ( 1) via 02:00:ca:b1:02:22 ( 5) (0x6456) [...]
 * 56:fb:55:27:b2:63 ( 1) via 02:00:ca:b1:00:58 ( 5) (0xac7c) [...]
 * 46:13:bf:2a:53:1e ( 1) via 02:00:ca:b1:00:13 ( 5) (0x8133) [...]
 * 0a:c6:fd:60:5d:5f ( 1) via 02:00:de:ad:02:23 ( 1) (0xa1c1) [...]
### interesting part:
 * 00:21:6a:32:7c:1c ( 4) via 02:00:ca:b1:02:22 ( 5) (0x6456) [.W.]
 + 00:21:6a:32:7c:1c ( 5) via 02:00:ca:b1:00:13 ( 5) [.W.]
###
 * e6:ad:ca:24:f6:10 ( 1) via 02:00:ca:b1:00:45 ( 3) (0x6182) [...]
root@EG-labor-AP:~# batctl -v
batctl 2013.4.0 [batman-adv: 2013.4.0]
root@EG-labor-AP:~# cat /etc/openwrt_version
r38568
the interesting thing is, that my laptop seems to be reachable via *:02:22 and *:00:13 - the 2nd entry has no hash (?), but 'batctl t 00:21:6a:32:7c:1c' outputs *:00:13 as originator. from the topology, it is impossible to be near this node, so no roaming can happen AND i can see on my laptop, that there was no roaming. the situation recovers without interaction after some minutes. the transglobal table does not change, but 'batctl t 00:21:6a:32:7c:1c' outputs the correct *:02:22
what can i do for more debugging or is this bug already solved in trunk?
bye, bastian
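On the "more debugging" question in the message above: one way to spot clients like the laptop, which the global table lists behind two originators at once, is to filter the `batctl tg` output. The following is only a sketch. It assumes the line layout shown in this message (entry lines start with `*` or `+`, and the originator follows the word `via`), and the function name `tt_multi_originator` is made up for illustration; it is not part of batctl.

```shell
# Hypothetical helper (not part of batctl): list clients that appear behind
# more than one originator. Reads `batctl tg` output on stdin.
tt_multi_originator() {
    awk '
        # TT entry lines start with "*" or "+"; everything else is skipped.
        $1 == "*" || $1 == "+" {
            client = $2
            # The originator is the field right after "via". Search for it,
            # because the "( 4)" TTVN column may split into one or two fields.
            for (i = 3; i < NF; i++) {
                if ($i == "via") {
                    orig = $(i + 1)
                    if (!((client, orig) in seen)) {
                        seen[client, orig] = 1
                        count[client]++
                    }
                    break
                }
            }
        }
        END {
            for (c in count)
                if (count[c] > 1)
                    print c
        }
    ' | sort
}

# On a node: batctl tg | tt_multi_originator
```

A client printed here is not necessarily broken (a roaming client can show up twice for a while), so the output is a list of candidates to look at, not a verdict.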
Hello Bastian,
On Fri, Nov 01, 2013 at 08:55:58AM +0100, Bastian Bittorf wrote:
> in my small office-network we have _often_ the situation, that my laptop 8-) has no internet. if this happens, i can connect via ssh hop by hop and can see, that the network itself (wireless/wired) is working, but the 'routes' are wrong.
> here the transglobal-table in the master, my laptop is '00:21:6a:32:7c:1c'
what is master?
[..]
> the interesting thing is, that my laptop seems to be reachable via *:02:22 and *:00:13 - the 2nd entry has no hash (?), but 'batctl t 00:21:6a:32:7c:1c' outputs *:00:13 as originator. from the topology, it is impossible to be near this node, so no roaming can happen AND i can see on my laptop, that there was no roaming. the situation recovers without interaction after some minutes. the transglobal table does not change, but 'batctl t 00:21:6a:32:7c:1c' outputs the correct *:02:22
Here[1] you have an explanation about the translation table output.
My guess is that you have more than one node connected to the same LAN and BLA2 is properly enabled but some kind of L3 tricks on top of batman-adv is creating confusion in the network. I'd suggest to read (if you have not done it yet) [2].
[1] http://www.open-mesh.org/projects/batman-adv/wiki/Understand-your-batman-adv...
[2] http://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance-II
Cheers,
* Antonio Quartulli antonio@meshcoding.com [01.11.2013 13:53]:
> > here the transglobal-table in the master, my laptop is '00:21:6a:32:7c:1c'
> what is master?
the node which has internet connectivity / default gateway.
> > the interesting thing is, that my laptop seems to be reachable via *:02:22 and *:00:13 - the 2nd entry has no hash (?), but 'batctl t 00:21:6a:32:7c:1c' outputs *:00:13 as originator. from the topology, it is impossible to be near this node, so no roaming can happen AND i can see on my laptop, that there was no roaming. the situation recovers without interaction after some minutes. the transglobal table does not change, but 'batctl t 00:21:6a:32:7c:1c' outputs the correct *:02:22
> Here[1] you have an explanation about the translation table output.
thanks, this helps: so [.W.] means: "this client is connected to the node through a wireless device"
   Client (TTVN) Originator (Curr TTVN) (CRC ) Flags
 * 00:21:6a:32:7c:1c ( 4) via 02:00:ca:b1:02:22 ( 5) (0x6456) [.W.]
 + 00:21:6a:32:7c:1c ( 5) via 02:00:ca:b1:00:13 ( 5) [.W.]
but i can be sure, that my laptop "00:21:6a:32:7c:1c" was never connected to '02:00:ca:b1:00:13'. both nodes are not connected via cable and are nodes in hybrid-mode (ap+adhoc). no special tricks, 'only' macvlan. BLA2 is active on all nodes.
then again: why does batman-adv think, that the client (my laptop) is/was reachable over 02:00:ca:b1:00:13 - the laptop was never there? a hash-collision?
what i also see now: a laptop is connected via wifi to NodeA, but if i ask the 'transglobal' table, batman-adv says it is at another location, and 'batctl tr $laptop' also works. to explain:
NodeA = 192.168.99.1/16 ~~~ Laptop with 192.168.222.51/16
(air)
NodeB = 192.168.222.1/16
The Laptop is connected to Node A, but has an IP from Node B. batman-adv thinks that the Laptop is on NodeB, but in fact it is on NodeA. Why is this? On Node A 'wlan0' is bridged to bat0.
I can also see via pinging from Laptop 'dups' (2 answers).
bye, bastian
On Fri, Nov 01, 2013 at 03:33:05PM +0100, Bastian Bittorf wrote:
> thanks, this helps: so [.W.] means: "this client is connected to the node through a wireless device"
It also explains why you have more than one entry for the same client.
>    Client (TTVN) Originator (Curr TTVN) (CRC ) Flags
>  * 00:21:6a:32:7c:1c ( 4) via 02:00:ca:b1:02:22 ( 5) (0x6456) [.W.]
>  + 00:21:6a:32:7c:1c ( 5) via 02:00:ca:b1:00:13 ( 5) [.W.]
> but i can be sure, that my laptop "00:21:6a:32:7c:1c" was never connected to '02:00:ca:b1:00:13'. both nodes are not connected via cable and are nodes in hybrid-mode (ap+adhoc). no special tricks, 'only' macvlan. BLA2 is active on all nodes.
> then again: why does batman-adv think, that the client (my laptop) is/was reachable over 02:00:ca:b1:00:13 - the laptop was never there? a hash-collision?
No. This happens when bat0 on one node and bat0 on the other are bridged together. The common scenario for this is that you have the two nodes connected to an Ethernet switch and you have bat0 bridged into this LAN. At this point the "two bat0s" will get in touch with each other. Like the first picture in this page[1].
The "only" macvlan thing is probably something we should try to investigate further :-) You are the first reporting strange issues like this and the fact that this happens quite often means that there is something in the network setup that is triggering this problem.
Do you mind explaining in a bit more detail how you structured the node (which interface is bridged with what, where macvlan is connected)?
Can you also provide the output of "batctl bbt" ?
> what i also see now: a laptop is connected via wifi to NodeA, but if i ask the 'transglobal' table, batman-adv says it is at another location, and 'batctl tr $laptop' also works. to explain:
> NodeA = 192.168.99.1/16 ~~~ Laptop with 192.168.222.51/16
> (air)
> NodeB = 192.168.222.1/16
> The Laptop is connected to Node A, but has an IP from Node B. batman-adv thinks that the Laptop is on NodeB, but in fact it is on NodeA. Why is this? On Node A 'wlan0' is bridged to bat0.
I guess you roamed from NodeB to NodeA? Is the entry in the global table flagged with an "R"?
Cheers,
[1] http://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance-II
* Antonio Quartulli antonio@meshcoding.com [01.11.2013 16:04]:
> > both nodes are not connected via cable and are nodes in hybrid-mode (ap+adhoc). no special tricks, 'only' macvlan. BLA2 is active on all nodes.
> > then again: why does batman-adv think, that the client (my laptop) is/was reachable over 02:00:ca:b1:00:13 - the laptop was never there? a hash-collision?
> No. This happens when bat0 on one node and bat0 on the other are bridged together. The common scenario for this is that you have the two nodes connected to an Ethernet switch and you have bat0 bridged into this LAN. At this point the "two bat0s" will get in touch with each other. Like the first picture in this page[1].
> The "only" macvlan thing is probably something we should try to investigate further :-) You are the first reporting strange issues like this and the fact that this happens quite often means that there is something in the network setup that is triggering this problem.
All nodes are in 'hybrid' mode, so adhoc+ap on 1 or more radios. Each interface, e.g. LAN/WAN/ADHOC, is a batman-adv interface, and each AP-mode/hostapd interface is bridged to bat0, so it looks like:
root@node15hybrid:~# batctl interface
eth0.1: active    # LAN
eth0.2: active    # WAN
wlan0-1: active   # adhoc-2.4ghz
wlan1-1: active   # adhoc-5ghz
root@node15hybrid:~# brctl show
bridge name     bridge id           STP enabled   interfaces
br-mybridge     7fff.460a753cf247   no            bat0
                                                  wlan0   # AP-2.4ghz
                                                  wlan1   # AP-5ghz
A few nodes are coupled via wire (this works). Each node has an IP of 192.168.x.1/16 where x is a unique number.
Each node has a macvlan called 'gateway0' which has the IP 192.168.0.1/32. This is just an IP which every DHCP client gets as "default-gateway" (so the gateway is the node itself and not the internet-offering node). This looks like this:
root@node222hybrid:~# ip address show dev gateway0
15: gateway0@br-mybridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 02:00:c0:ca:c0:1a brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.1/32 scope global gateway0
       valid_lft forever preferred_lft forever
    inet6 fe80::c0ff:feca:c01a/64 scope link
       valid_lft forever preferred_lft forever
Each node is a batman-adv gateway, so 'batctl gwl' outputs every node (so DHCP requests are not forwarded but answered locally).
The backbone-table seems to be empty on every node.
Does this help? bye, bastian
On 11/01/2013 04:16 PM, Bastian Bittorf wrote:
> root@node222hybrid:~# ip address show dev gateway0
> 15: gateway0@br-mybridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
>     link/ether 02:00:c0:ca:c0:1a brd ff:ff:ff:ff:ff:ff
if this mac (02:00:c0:ca:c0:1a) exists on several different nodes that are not connected by a real ethernet backbone (and BLA2 enabled) then batman goes mushroom tripping, since it 'sees' that MAC as a non-mesh-client that is everywhere at the same time, and tries to roam it around, creating funny symptoms (like DUPs and such)
if all nodes are actually connected to an ethernet backbone, then BLA2 is supposed to save the day, by properly handling the situation. (haven't actually tried it, tho)
what we did is keep those packets (as best we can) from being sent over bat0, using ebtables
# cat /etc/firewall.user
ebtables -A FORWARD -j DROP -d 02:00:c0:ca:c0:1a
ebtables -t nat -A POSTROUTING -o bat0 -j DROP -s 02:00:c0:ca:c0:1a
hope that helps!
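The two rules above are tied to one hard-coded MAC. As a rough sketch, they could be generated for whatever shared "gateway" MAC a deployment uses. The helper name `anycast_drop_rules` and its optional interface argument are assumptions for illustration; the rule text itself is copied from the firewall.user lines in this message. The function only prints the commands so they can be reviewed before being run.

```shell
# Hypothetical helper: print (do not execute) the two DROP rules for one
# shared MAC. $1 = MAC address, $2 = mesh interface (defaults to bat0).
anycast_drop_rules() {
    mac=$1
    iface=${2:-bat0}
    # Refuse anything that is not six colon-separated hex octets.
    if ! printf '%s\n' "$mac" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'; then
        echo "invalid MAC: $mac" >&2
        return 1
    fi
    # Do not bridge frames addressed to the shared MAC between ports.
    echo "ebtables -A FORWARD -j DROP -d $mac"
    # Do not let frames sourced from the shared MAC leave via the mesh.
    echo "ebtables -t nat -A POSTROUTING -o $iface -j DROP -s $mac"
}

# Review first, then apply: anycast_drop_rules 02:00:c0:ca:c0:1a | sh
```

Printing instead of executing keeps the sketch harmless on a machine without ebtables, and makes it easy to append the output to /etc/firewall.user.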
>     inet 192.168.0.1/32 scope global gateway0
>        valid_lft forever preferred_lft forever
>     inet6 fe80::c0ff:feca:c01a/64 scope link
>        valid_lft forever preferred_lft forever
> Each node is a batman-adv gateway, so 'batctl gwl' outputs every node (so DHCP requests are not forwarded but answered locally).
i wouldn't be so sure... AFAIU when a request arrives at a gw_mode=master, bat0 passes it upstream (to br-lan) as a broadcast, so that it will reach either a local dnsmasq, or another DHCP server running on the lan behind (say, connected to eth0 which is part of br-lan)
(i used that setup several times; a batadv gw_mode=master node with no local dnsmasq, but another dhcp server connected via ethernet behind)
* Gui Iribarren gui@altermundi.net [03.11.2013 09:52]:
> > Each node has a macvlan called 'gateway0' which has the IP 192.168.0.1/32. This is just an IP which every DHCP client gets as "default-gateway" (so the gateway is the node itself and not the internet-offering node). This looks like this:
> > root@node222hybrid:~# ip address show dev gateway0
> > 15: gateway0@br-mybridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
> >     link/ether 02:00:c0:ca:c0:1a brd ff:ff:ff:ff:ff:ff
> if this mac (02:00:c0:ca:c0:1a) exists on several different nodes that are not connected by a real ethernet backbone (and BLA2 enabled) then batman goes mushroom tripping, since it 'sees' that MAC as a non-mesh-client that is everywhere at the same time, and tries to roam it around, creating funny symptoms (like DUPs and such)
BINGO! thank you Gui - if i read the old mails, i can even see it in the transglobal table. if i look into the mesh, it pops up on random nodes with random originators. yes: mushroom tripping 8-)
i will try the ebtables approach, but i don't like it. IMHO it's more elegant to just 'ignore' this mac by the daemon itself:
To the devs: is this possible?
bye, bastian
On 11/03/2013 10:18 AM, Bastian Bittorf wrote:
> BINGO! thank you Gui - if i read the old mails, i can even see it in the transglobal table. if i look into the mesh, it pops up on random nodes with random originators. yes: mushroom tripping 8-)
> i will try the ebtables approach, but i don't like it. IMHO it's more elegant to just 'ignore' this mac by the daemon itself:
According to the (misleading, by definition :P) docs
http://www.open-mesh.org/projects/open-mesh/wiki/Connecting-Batman-adv-cloud...
it shouldn't quite ignore it, but instead properly support this "anycast" MAC, taking advantage of the "Bridge Loop Avoidance II component" even when there's no physical backbone between nodes.
meanwhile, at the batcave... (2013/10/14)
d0tslash: guii: we don't have anycast support
d0tslash: yet
guii: ...?
d0tslash: if you have the same mac address on multiple nodes (without bla), it will roam
guii: oh!
d0tslash: hm, it might work if you have bla enabled and the nodes are connected via the same ethernet
d0tslash: but otherwise it won't work
d0tslash: because it is supposed to roam :)
d0tslash: so you won't have that feature for now, i'm afraid
d0tslash: it is still on our "feature todo" list
so, given there will be MACs that roam (laptops, phones...) and MACs that don't (anycast), i can imagine some kind of regexp matching that will say "don't roam these kind of MACs, instead, consider them anycast macs and use bla2 magic"
# batctl anycast EE:C4:57:00:00:00/32
...until then, ebtables WORKSFORME :D and all this doesn't make batman-adv any less awesome than what it was already ;)
btw, even with the ebtables rule, we had to turn off DAT in a scenario equivalent to yours, because the DAT cache was also acting funny (DUP arp replies from each node in the cloud) haven't got around to properly debug it / report it, but still, be warned :)
* Gui Iribarren gui@altermundi.net [03.11.2013 20:37]:
> ...until then, ebtables WORKSFORME :D and all this doesn't make batman-adv any less awesome than what it was already ;)
> btw, even with the ebtables rule, we had to turn off DAT in a scenario equivalent to yours, because the DAT cache was also acting funny (DUP arp replies from each node in the cloud) haven't got around to properly debug it / report it, but still, be warned :)
for a few days now we have been running these 2 ebtables rules on all nodes:
ebtables -A FORWARD -j DROP -d "$mac_gateway"
ebtables -t nat -A POSTROUTING -o bat0 -j DROP -s "$mac_gateway"
it looks like this:
root@box:~# ebtables -L FORWARD --Lc
Bridge table: filter
Bridge chain: FORWARD, entries: 1, policy: ACCEPT
-d 2:0:c0:ca:c0:1a -j DROP , pcnt = 9581 -- bcnt = 1116077
root@box:~# ebtables -t nat -L POSTROUTING --Lc
Bridge table: nat
Bridge chain: POSTROUTING, entries: 1, policy: ACCEPT
-s 2:0:c0:ca:c0:1a -o bat0 -j DROP , pcnt = 4 -- bcnt = 352
so most of the time it is working fine. but we have seen another issue, and i am unsure where it is coming from:
"clients time out in translocal-table"
A laptop that is always connected to the same router (no roaming involved) times out in the 'translocal-table', and so it also times out in the 'transglobal-table' on the other nodes, so it is not reachable anymore.
a bad translocal-table/dat-cache with this client looks like this: (i have removed other clients, for better readability)
root@box:~# batctl tl
Locally retrieved addresses (from bat0) announced via TT (TTVN: 2 CRC: 0x6023):
   Client             Flags    Last seen
 * 00:21:6a:32:7c:1c [....W]   0.010
root@box:~# batctl dc
Distributed ARP Table (bat0):
          IPv4             MAC           last-seen
 * 192.168.222.61 00:21:6a:32:7c:1c      3:50
after some seconds the client disappears from DAT-cache:
root@box:~# batctl tl
Locally retrieved addresses (from bat0) announced via TT (TTVN: 2 CRC: 0x6023):
   Client             Flags    Last seen
 * 00:21:6a:32:7c:1c [....W]   0.010
root@box:~# batctl dc
Distributed ARP Table (bat0):
          IPv4             MAC           last-seen
after some time even the 'translocal-table' is empty, although with 'iw dev wlan0 station dump' i can see the active client. i'm normally connected, can ping/ssh the node itself but not further. (only hop by hop)
how does batman detect whether a client is active? (can i trigger it somehow?) what can i do to debug further?
thanks & bye, bastian
On Sat, Nov 23, 2013 at 10:24:45AM +0100, Bastian Bittorf wrote:
> * Gui Iribarren gui@altermundi.net [03.11.2013 20:37]:
> > btw, even with the ebtables rule, we had to turn off DAT in a scenario equivalent to yours, because the DAT cache was also acting funny (DUP arp replies from each node in the cloud) haven't got around to properly debug it / report it, but still, be warned :)
I think all these strange behaviours are coming from the fact that what you guys are trying to do is not really supported by the underlying layer (batman-adv).
I think a better idea is to start thinking about how to bring anycast support to batman-adv rather than trying to mess up the rest :) That would surely help the entire community.
After the last WBM we concentrated our efforts in creating a starting point for a "more general" solution and we collected the results in this page [*].
This page describes what you probably want to achieve at the end, so working all together to make it possible would probably be the best option (instead of trying to workaround unsupported setup and then asking for help to debug inconsistent behaviours....).
> root@box:~# batctl dc
> Distributed ARP Table (bat0):
>           IPv4             MAC           last-seen
>  * 192.168.222.61 00:21:6a:32:7c:1c      3:50
> after some seconds the client disappears from DAT-cache:
As you can imagine, DAT is a cache, and if an entry does not get refreshed often enough it will expire. Right now the timeout is 4 minutes, which is why your entry goes away "after a few seconds" (it is at 3:50 at that moment). If you have not yet read the documentation, [1] explains the mechanism behind it.
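To make the arithmetic explicit: with a 4 minute timeout, an entry last seen 3:50 ago has about 10 seconds left. A small sketch of that calculation follows; it assumes the "last-seen" column of `batctl dc` is in M:SS form as in the quoted output, takes the 240 second value from the statement above rather than reading it from batman-adv, and uses a made-up function name, `dat_seconds_left`.

```shell
# Hypothetical helper: seconds left before a DAT entry expires.
# $1 = "last-seen" value from `batctl dc` (M:SS)
# $2 = timeout in seconds (defaults to 240, the 4 minutes quoted above)
dat_seconds_left() {
    echo "$1" | awk -F: -v timeout="${2:-240}" '{ print timeout - ($1 * 60 + $2) }'
}

# Example: dat_seconds_left 3:50
```

A result near zero simply means the entry has not been refreshed for almost the whole timeout, which matches the entry vanishing shortly after it showed 3:50.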
> root@box:~# batctl tl
> Locally retrieved addresses (from bat0) announced via TT (TTVN: 2 CRC: 0x6023):
>    Client             Flags    Last seen
>  * 00:21:6a:32:7c:1c [....W]   0.010
> root@box:~# batctl dc
> Distributed ARP Table (bat0):
>           IPv4             MAC           last-seen
> after some time even the 'translocal-table' is empty, although with 'iw dev wlan0 station dump' i can see the active client. i'm normally connected, can ping/ssh the node itself but not further. (only hop by hop)
> how does batman detect whether a client is active? (can i trigger it somehow?) what can i do to debug further?
As written in [2]: "Every client MAC address that is recognized through the mesh interface will be stored in a node local table called "local translation table" which will contain all the clients the node is currently serving."
So if your client is timing out it means that no packet originated by it is reaching your mesh interface.
If you want to debug further, you now have to ask yourself what you are doing that prevents packets from reaching bat0 :-)
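One way to narrow this down is to compare what the AP sees with what batman-adv sees: any station that is associated on wlan0 but missing from the local translation table is a client whose frames are not arriving at bat0. The sketch below assumes the usual `Station <mac> (on <dev>)` lines of `iw dev wlan0 station dump` and the `*`-prefixed entry lines of `batctl tl` shown earlier in the thread. The function name `stations_missing_from_tl` and the file names in the usage comment are made up for illustration.

```shell
# Hypothetical helper: print stations that are associated with the AP but
# absent from the local translation table.
# $1 = file with `iw dev wlan0 station dump` output
# $2 = file with `batctl tl` output
stations_missing_from_tl() {
    awk '
        # First file: remember every MAC on a "Station <mac> (on <dev>)" line.
        NR == FNR { if ($1 == "Station") sta[tolower($2)] = 1; next }
        # Second file: TT entry lines start with "*"; those MACs are known.
        $1 == "*" { delete sta[tolower($2)] }
        # Whatever is left is associated but unknown to batman-adv.
        END { for (m in sta) print m }
    ' "$1" "$2" | sort
}

# On a node:
#   iw dev wlan0 station dump > /tmp/sta
#   batctl tl > /tmp/tl
#   stations_missing_from_tl /tmp/sta /tmp/tl
```

If a client is listed here while it is actively sending traffic, the next place to look would be whatever sits between wlan0 and bat0 on that node (bridge membership, ebtables rules).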
Cheers,
[*] http://www.open-mesh.org/projects/open-mesh/wiki/Connecting-Batman-adv-cloud...
[1] http://www.open-mesh.org/projects/batman-adv/wiki/DistributedArpTable-techni...
[2] http://www.open-mesh.org/projects/batman-adv/wiki/Client-announcement
b.a.t.m.a.n@lists.open-mesh.org