My apologies up front for a newbie question in this apparently very technical list. If there is a more appropriate list or forum please direct me to it.
I'm running batman-adv (Chaos Calmer, r47065) on OpenWRT on the GL-AR150 platform. It all works swimmingly, except that sometimes, for no apparent reason, layer 3 connectivity across the mesh (inside it) is lost. All nodes are still up and running and I can log into them via their APs or their LAN interfaces. batctl on any node still sees all the other nodes, and batctl can ping them by MAC address. Each node still has its layer 3 address and can ping it locally. The ARP data for the other nodes is correct (at least in my two-node test system). I don't seem to be able to delete ARP entries.
But no node can ping any other node by IP address. Restarting networking on any or even all nodes doesn't help - all that helps is rebooting everything.
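The symptom described above (layer 2 ping via batctl succeeds, IP ping fails) can be checked mechanically. What follows is only a minimal sketch, not something from this thread: the function names `command_ok`, `classify` and `probe` are made up for illustration, it assumes Python is available wherever batctl runs, and it assumes that `batctl ping` exits non-zero when no reply arrives.

```python
import subprocess


def command_ok(argv, timeout=15):
    """Return True if the command exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or hung: treat as a failed probe.
        return False
    return result.returncode == 0


def classify(l2_ok, l3_ok):
    """Map the two probe results onto a short label."""
    if l2_ok and l3_ok:
        return "ok"
    if l2_ok:
        # The case from this thread: mesh reachable by MAC, not by IP.
        return "l3-only"
    if l3_ok:
        return "l2-probe-failed"
    return "mesh-down"


def probe(mac, ip, run=command_ok):
    """Ping a peer by MAC (batctl) and by IP, then classify the outcome.

    `run` is injectable so the logic can be exercised without a mesh.
    """
    l2_ok = run(["batctl", "ping", "-c", "3", mac])
    l3_ok = run(["ping", "-c", "3", ip])
    return classify(l2_ok, l3_ok)
```

Calling `probe()` with a peer's MAC and IP address from a cron job would at least record when a node enters the "l3-only" state, which may help correlate the failure with network restarts.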
As long as my mesh is six nodes on a table, rebooting everything is an option. Once deployed - not so much :-)
It's possible to engineer this failure in a two-node mesh by just restarting networking a few times quickly.
So I'm wondering a) if this is a known issue, b) if this is an obvious symptom of some stuff-up on my part, c) if there's a quick fix :-) and d) failing all of the above, whether there is some way out of the situation without having to reboot every time.
The mesh has one node configured as a gateway, connected to an upstream, and running DHCP on the bat interface. The other nodes are configured as gateway clients and have nothing connected on their LAN or WAN ports. All nodes have a static address on the bat0 interface, all have a static route to the gateway's (inside) IP address. All have an AP on the same radio as the mesh, with an RFC1918 network on it masqueraded into the mesh.
And mostly, it works :-)
Yours hopefully, K.
On Monday 02 May 2016 21:57:49 Karl Auer wrote:
My apologies up front for a newbie question in this apparently very technical list. If there is a more appropriate list or forum please direct me to it.
I'm running batman-adv (Chaos Calmer, r47065) on OpenWRT on the GL-AR150 platform.
Are you using v2016.1 or some older version of batman-adv? If you use something like v2014.4.
What kind of layer 3 are you using? IPv4/IPv6/...? What is your current configuration (for example, have you enabled DAT, BLA, ...)? Did you check what exactly goes over the air and what the device (the adhoc one) receives/sends? What data is sent/received over the batman-adv devices?
Did you hardcode the MAC address of the batman-adv device or do you let it change to a random value on each device creation? Is the device part of a bridge or is the IP configured directly on the batman-adv device?
Are you sure that the conntrack for the masquerade over the mesh isn't broken? Why are you masquerading over the mesh anyway?
Kind regards, Sven
I've run into what sounds like a similar problem, but dove in and found more details. Here's the setup:
- 19 nodes running BATMAN-adv 2014.14.0 on OpenWrt Chaos Calmer; various hardware (D-Link, BBB, Open-Mesh, WRTnode, TP-Link).
- bat0 using ad-hoc interface on each node
- bat0 bridged (br-lan) with Ethernet
- br-lan on all nodes (except node 1) get DHCPv4 address from dnsmasq running on node 1
- A few PCs are hard-wired to LAN on node 1, one PC wired to LAN at node 11, all other nodes completely standalone
- WAN on node 1 connected to building network - only connection to any outside network
- DAT enabled
- BLA disabled
Problem: After several weeks uptime, node 1 could no longer SSH or ping (L3) node 3. Tcpdumps showed ping rec'd at node 3 and node 3 replied, but reply never arrived at node 1. Linux PC wired to LAN at node 1 successfully pings node 3. L2 ping (via batctl) works between nodes 1 and 3. Further investigation showed two entries for node 1's br-lan MAC in the global translation table at node 3. Secondary entry was correct; primary entry pointed to node 4. Node 4's tables (local and global) were both correct.
root@WifiMesh-03:~# batctl tg | grep c8:d3:a3:70:a9:b0
 * 42:5e:78:f3:50:7e 0 ( 2) via c8:d3:a3:70:a9:b0 ( 25) (0xd7886ba8) [....]
 * c8:d3:a3:70:a9:b0 -1 ( 19) via c8:d3:a3:70:a9:53 ( 19) (0x10e4856e) [....]
 + c8:d3:a3:70:a9:b0 -1 ( 2) via c8:d3:a3:70:a9:b0 ( 25) (0x352c5b78) [....]
 * 42:5e:78:f3:50:7e -1 ( 2) via c8:d3:a3:70:a9:b0 ( 25) (0x352c5b78) [....]
(Yes, br-lan and adhoc0 have same MAC on node 1. Yes, these are D-Link routers.) ...50:7e is bat0 at node 1, ...a9:b0 is adhoc0/br-lan at node 1, ...a9:53 is adhoc0/br-lan at node 4
This part may be odd: problem persisted for a few days while I investigated, but resolved immediately after viewing the tables on node 4. May be coincidence, though, because it didn't work for the following nodes.
At same time, same problem existed with two other nodes on the mesh: node 13 (an OM2P-HS) matched node 3's global table; node 9 (a WRTnode) showed a primary entry for node 1's br-lan using yet another originator. Rebooted nodes to resolve. Problem happened again more recently, but the destination MAC was that of the Linux PC mentioned above, attached to the LAN on node 1. In this case, most nodes' global tables showed two entries for that MAC, though the originator in the primary entry was not consistent.
I plan to test with a more recent version of BATMAN-adv once I standardize on one model of hardware for the nodes (should be within the next few months). In the meantime, I plan to watch for secondary entries in the global translation tables, since the current configuration should never result in one client being accessible through multiple nodes.
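Watching for secondary entries could be automated along these lines. This is a rough sketch only: the function name `find_multi_originator_clients` is invented here, the regular expression assumes the column layout of the `batctl tg` output quoted earlier in this message (newer batctl releases may print the table differently), and it follows this message's reading of "*" as the primary entry and "+" as a secondary one.

```python
import re
from collections import defaultdict

MAC = r"[0-9a-f]{2}(?::[0-9a-f]{2}){5}"

# One row of the `batctl tg` layout quoted in this thread (assumed format):
#   <marker> <client MAC> <VID> (<ttvn>) via <originator MAC> ...
ROW = re.compile(
    r"^\s*(?P<marker>[*+])\s+(?P<client>" + MAC + r")\s+"
    r"(?P<vid>-?\d+)\s+\(\s*\d+\)\s+via\s+(?P<orig>" + MAC + r")",
    re.IGNORECASE,
)


def find_multi_originator_clients(text):
    """Return {(client, vid): [(marker, originator), ...]} for every client
    that is announced by more than one distinct originator."""
    seen = defaultdict(list)
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header lines, shell prompts, blank lines
        key = (match.group("client").lower(), int(match.group("vid")))
        seen[key].append((match.group("marker"), match.group("orig").lower()))
    return {
        key: entries
        for key, entries in seen.items()
        if len({orig for _, orig in entries}) > 1
    }


# The four table rows captured on node 3, copied from this message.
SAMPLE = """\
 * 42:5e:78:f3:50:7e 0 ( 2) via c8:d3:a3:70:a9:b0 ( 25) (0xd7886ba8) [....]
 * c8:d3:a3:70:a9:b0 -1 ( 19) via c8:d3:a3:70:a9:53 ( 19) (0x10e4856e) [....]
 + c8:d3:a3:70:a9:b0 -1 ( 2) via c8:d3:a3:70:a9:b0 ( 25) (0x352c5b78) [....]
 * 42:5e:78:f3:50:7e -1 ( 2) via c8:d3:a3:70:a9:b0 ( 25) (0x352c5b78) [....]
"""

if __name__ == "__main__":
    for (client, vid), entries in find_multi_originator_clients(SAMPLE).items():
        print(client, "vid", vid, "->", entries)
```

Entries are grouped by client MAC and VID together, so the same client appearing once per VID through a single originator is not flagged; only node 1's br-lan MAC, announced via both node 4 and node 1, is reported for the sample.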
Thanks, -Nick
-----Original Message----- From: B.A.T.M.A.N [mailto:b.a.t.m.a.n-bounces@lists.open-mesh.org] On Behalf Of Sven Eckelmann Sent: Monday, May 02, 2016 9:55 AM To: b.a.t.m.a.n@lists.open-mesh.org Subject: Re: [B.A.T.M.A.N.] mesh losing internal Ilayer3 connectivity
On Thursday 19 May 2016 14:22:58 Nick Schaf wrote:
I've run into what sounds like a similar problem, but dove in and found more details. Here's the setup:
-19 nodes running BATMAN-adv 2014.14.0 on OpenWrt Chaos Calmer; various hardware (D-Link, BBB, Open-Mesh, WRTnode, TP-Link).
Only scrolled through your mail. But there are two things which I find odd. First you use a really old (actually not existing) version of batman-adv.
Then you have some TT problems. I think we had many fixes since then which may be related to your problem. But going through 2 years of fixes might be quite a hard (or at least very cumbersome) journey.
Maybe it is really a good idea to try to upgrade to the recent version (2016.1+fixes) from the Chaos Calmers openwrt-routing feed on all your nodes. Maybe Antonio remembers one special TT/roaming bug and can recommend one to test. But most likely testing the current version is easier.
But thanks for gathering all the info
Kind regards, Sven
On Fri, May 20, 2016 at 09:23:24AM +0200, Sven Eckelmann wrote:
On Thursday 19 May 2016 14:22:58 Nick Schaf wrote:
I've run into what sounds like a similar problem, but dove in and found more details. Here's the setup:
-19 nodes running BATMAN-adv 2014.14.0 on OpenWrt Chaos Calmer; various hardware (D-Link, BBB, Open-Mesh, WRTnode, TP-Link).
Only scrolled through your mail. But there are two things which I find odd. First you use a really old (actually not existing) version of batman-adv.
Then you have some TT problems. I think we had many fixes since then which may be related to your problem. But going through 2 years of fixes might be quite a hard (or at least very cumbersome) journey.
Maybe it is really a good idea to try to upgrade to the recent version (2016.1+fixes) from the Chaos Calmers openwrt-routing feed on all your nodes. Maybe Antonio remembers one special TT/roaming bug and can recommend one to test. But most likely testing the current version is easier.
I don't recall any superfix which might magically solve the problems you are seeing. Therefore I'd just follow Sven's suggestion and try running a recent version of batman-adv.
Cheers,