I've run into what sounds like a similar problem, but dove in and found more details. Here's the setup:
- 19 nodes running BATMAN-adv 2014.14.0 on OpenWrt Chaos Calmer; various hardware (D-Link, BBB, Open-Mesh, WRTnode, TP-Link).
- bat0 using ad-hoc interface on each node
- bat0 bridged (br-lan) with Ethernet
- br-lan on all nodes (except node 1) get DHCPv4 address from dnsmasq running on node 1
- A few PCs are hard-wired to LAN on node 1, one PC wired to LAN at node 11, all other nodes completely standalone
- WAN on node 1 connected to building network - only connection to any outside network
- DAT enabled
- BLA disabled
Problem: After several weeks uptime, node 1 could no longer SSH or ping (L3) node 3. Tcpdumps showed ping rec'd at node 3 and node 3 replied, but reply never arrived at node 1. Linux PC wired to LAN at node 1 successfully pings node 3. L2 ping (via batctl) works between nodes 1 and 3. Further investigation showed two entries for node 1's br-lan MAC in the global translation table at node 3. Secondary entry was correct; primary entry pointed to node 4. Node 4's tables (local and global) were both correct.
root@WifiMesh-03:~# batctl tg | grep c8:d3:a3:70:a9:b0
 * 42:5e:78:f3:50:7e 0 ( 2) via c8:d3:a3:70:a9:b0 ( 25) (0xd7886ba8) [....]
 * c8:d3:a3:70:a9:b0 -1 ( 19) via c8:d3:a3:70:a9:53 ( 19) (0x10e4856e) [....]
 + c8:d3:a3:70:a9:b0 -1 ( 2) via c8:d3:a3:70:a9:b0 ( 25) (0x352c5b78) [....]
 * 42:5e:78:f3:50:7e -1 ( 2) via c8:d3:a3:70:a9:b0 ( 25) (0x352c5b78) [....]
(Yes, br-lan and adhoc0 have same MAC on node 1. Yes, these are D-Link routers.) ...50:7e is bat0 at node 1, ...a9:b0 is adhoc0/br-lan at node 1, ...a9:53 is adhoc0/br-lan at node 4
This part may be odd: problem persisted for a few days while I investigated, but resolved immediately after viewing the tables on node 4. May be coincidence, though, because it didn't work for the following nodes.
At same time, same problem existed with two other nodes on the mesh: node 13 (an OM2P-HS) matched node 3's global table; node 9 (a WRTnode) showed a primary entry for node 1's br-lan using yet another originator. Rebooted nodes to resolve. Problem happened again more recently, but the destination MAC was that of the Linux PC mentioned above, attached to the LAN on node 1. In this case, most nodes' global tables showed two entries for that MAC, though the originator in the primary entry was not consistent.
I plan to test with a more recent version of BATMAN-adv once I standardize on one model of hardware for the nodes (should be within the next few months). In the meantime, I plan to watch for secondary entries in the global translation tables, since the current configuration should never result in one client being accessible through multiple nodes.
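A minimal sketch of how that watch could be scripted, assuming the `batctl tg` line layout shown in the output above (flag, client MAC, VID, then `via` and the originator MAC). The layout differs between batman-adv releases, and the names `parse_tg` and `find_duplicates` are hypothetical helpers, not part of batctl:

```python
import re
from collections import defaultdict

# Assumed entry layout, taken from the output quoted in this mail:
#   <flag> <client MAC> <vid> (<ttvn>) via <originator MAC> (<ttvn>) (<crc>) [flags]
MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"
ENTRY = re.compile(rf"^\s*([*+])\s+({MAC})\s+(-?\d+)\s.*?\bvia\s+({MAC})", re.IGNORECASE)

# The four lines from node 3, used here as sample input.
SAMPLE = """\
 * 42:5e:78:f3:50:7e 0 ( 2) via c8:d3:a3:70:a9:b0 ( 25) (0xd7886ba8) [....]
 * c8:d3:a3:70:a9:b0 -1 ( 19) via c8:d3:a3:70:a9:53 ( 19) (0x10e4856e) [....]
 + c8:d3:a3:70:a9:b0 -1 ( 2) via c8:d3:a3:70:a9:b0 ( 25) (0x352c5b78) [....]
 * 42:5e:78:f3:50:7e -1 ( 2) via c8:d3:a3:70:a9:b0 ( 25) (0x352c5b78) [....]
"""

def parse_tg(text):
    """Return (flag, client, vid, originator) tuples for every entry line."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:  # header lines and anything else that does not match are skipped
            flag, client, vid, orig = m.groups()
            entries.append((flag, client.lower(), int(vid), orig.lower()))
    return entries

def find_duplicates(entries):
    """Map (client, vid) -> [(flag, originator), ...] for clients listed more than once."""
    by_client = defaultdict(list)
    for flag, client, vid, orig in entries:
        by_client[(client, vid)].append((flag, orig))
    return {key: seen for key, seen in by_client.items() if len(seen) > 1}

for (client, vid), seen in find_duplicates(parse_tg(SAMPLE)).items():
    print(f"{client} (vid {vid}) is listed {len(seen)} times:")
    for flag, orig in seen:
        print(f"  {flag} via {orig}")
```

On the sample it flags only c8:d3:a3:70:a9:b0 with VID -1, since the two ...50:7e lines carry different VIDs. To use it for real, replace `SAMPLE` with the text of `batctl tg` collected from each node.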
Thanks,
-Nick
-----Original Message-----
From: B.A.T.M.A.N [mailto:b.a.t.m.a.n-bounces@lists.open-mesh.org] On Behalf Of Sven Eckelmann
Sent: Monday, May 02, 2016 9:55 AM
To: b.a.t.m.a.n@lists.open-mesh.org
Subject: Re: [B.A.T.M.A.N.] mesh losing internal Ilayer3 connectivity
On Monday 02 May 2016 21:57:49 Karl Auer wrote:
> My apologies up front for a newbie question in this apparently very technical list. If there is a more appropriate list or forum please direct me to it.
> I'm running batman-adv (Chaos Calmer, r47065) on OpenWRT on the GL-AR150 platform.
Are you using v2016.1 or some older version of batman-adv? If you use something like v2014.4.
What kind of layer 3 are you using? IPv4/IPv6/...? What is your current configuration (for example, have you enabled DAT, BLA, ...)? Did you check what exactly goes over the air and what the device (the adhoc one) receives/sends? What data is sent/received over the batman-adv devices?
Did you hardcode the MAC address of the batman-adv device, or are you letting it change to a random value on each device creation? Is the device part of a bridge, or is the IP configured directly on the batman-adv device?
Are you sure that the conntrack for the masquerade over the mesh isn't broken? Why are you masquerading over the mesh anyway?
Kind regards,
Sven