I'm very grateful for your helpful attention to my bizarre problem, Simon.
Two days ago, my efforts to instrument the nodes had the weird side-effect of making the problem go away. So now I have no problem; only the mystery remains, and I have little hope of resolving it.
For the record, below are some details that seem relevant, at least to me.
On 2/19/20 4:50 AM, Simon Wunderlich wrote:
When nodes become unreachable, they do so only partially. Consider this weirdness I encountered two days ago: Given nodes a, b, c, d, from the perspective of a, d has disappeared; in other words, "a# batctl ping d" doesn't work. But I ssh'd from a to b, then from b to c, then from c to d, all successfully. And "a# batctl ping d" still wasn't working, even though I was talking to d through that chain of ssh pipes. Any ideas on what that might mean? (When I reboot a -- the gateway -- everything always works again, usually for many hours, but never as long as a whole day.)
Hmm, that's strange indeed. Did you have a good connection between all those devices? There is a certain "horizon": e.g., if you have many weak links in a daisy chain, the OGMs are dropped before they reach the end of the path.
Did you see node D in the originator table of node A?
As discussed below, when I added instrumentation, the problem disappeared. (*insert muffled scream here*)
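For anyone who hits the same symptom while it is still live, the originator-table check could be scripted roughly as below. This is only a sketch under assumptions: it assumes the table text comes from "batctl o", that each entry line carries the originator's MAC address as the first MAC on the line, and the function names and the sample text are made up for illustration (the sample is not real batctl output).

```python
import re

# Matches a MAC address such as 02:00:00:00:00:0d (case-insensitive).
MAC_RE = re.compile(r"\b(?:[0-9a-f]{2}:){5}[0-9a-f]{2}\b", re.IGNORECASE)

def originators(table_text):
    """Collect the first MAC address on each line, lowercased.

    Assumption: the originator column is the first MAC on a line.
    Any header line that contains a MAC is picked up as well.
    """
    found = set()
    for line in table_text.splitlines():
        match = MAC_RE.search(line)
        if match:
            found.add(match.group(0).lower())
    return found

def originator_seen(table_text, mac):
    """Return True if mac appears as an originator in table_text."""
    return mac.lower() in originators(table_text)

# Hypothetical table text, for illustration only.
sample = """
 * 02:00:00:00:00:0b  via 02:00:00:00:00:0b
   02:00:00:00:00:0c  via 02:00:00:00:00:0b
"""

print(originator_seen(sample, "02:00:00:00:00:0C"))
print(originator_seen(sample, "02:00:00:00:00:0d"))
```

On node a, the table text could be captured with subprocess.run(["batctl", "o"], capture_output=True, text=True).stdout and passed to originator_seen together with d's MAC address.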
Do I have a problem because the two meshes, and everything connected to them, all share the same LAN? I note "received a claim frame from another group" in the above log excerpt. (I don't know what that means, but I'm guessing that the two meshes are getting each other's maintenance traffic.) Should the two meshes be separate subnets?
It's possible and perfectly fine if you have two meshes connected to the same LAN like this:
https://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance-Tes...
Just make sure that the meshes are properly disconnected and do not rejoin from time to time (e.g. by having different SSIDs).
I think I had missed this page. Thanks for pointing it out.
In a similar vein: Should each node be running its own subnet?
Should I try changing all nodes over to BATMAN_V, rebooting them all, and hoping they re-establish contact? (It would be massively inconvenient to have to reset them all physically.)
No, BATMAN V will not magically fix this.
Then I won't switch to BATMAN_V. "If it ain't broke, don't fix it."
Should I try turning off bridge loop avoidance?
Bridge loop avoidance should be on as soon as you have any two nodes connected to the same LAN and mesh at one time.
Then I guess I don't need BLA. I'm tempted to turn it off just to avoid the overhead, because only the gateways have wired access to the LAN, and all other nodes have only their respective meshes.
I think we should work on your a - b - c - d chain and find out why a can't talk to d. That seems like the most obvious symptom.
I would do that if it were still broken!
Here's what I did, in some detail: rosepark dot us hash Feb182020