On Friday, 12 April 2019 08:00:11 CEST Lucas Pickstone wrote:
We have a problem where our mesh is failing after a node is removed from it. We have a test setup of 3 computers (A, B and C) that can all directly see each other in an ad hoc wireless network.
When we remove a node, the mesh between the remaining two nodes continues to work for a time (batman pings get through), but "batctl o" shows nothing recently last-seen. Once it times out after 200 seconds, the mesh dies (no further pings via batman get through).
It is normal that the mesh closes down when there are no remaining paths to the originators (your originators table was completely empty when the mesh broke down). Since it looks like it is working in unicast while the last-seen goes up, I would guess that broadcast packets on the underlying layer are either not correctly sent or not correctly received by the wifi driver/firmware on the remaining nodes.
Unsure how this is triggered by removing a single node. But in case you are using meshpoint interfaces with mesh_fwding=1 (which is wrong) and a higher multicast rate, it could be that the removed node was actually relaying the bcast/mcast packets (which might be sent with a higher rate as suggested on open-mesh.org - or which maybe was automatically selected by the ath10k firmware) and the remaining nodes B+C cannot really communicate directly with each other at the selected mcast/bcast rate.
If we bring the disconnected node back up, everything goes back to normal.
What interface mode are you using? meshpoint without mesh_fwding (yes, make sure that the mesh parameter mesh_fwding is really set to 0 [1] during your tests)? Or IBSS (which would usually mean an ancient firmware or Ben Greear's firmware)?
We're a little bit stuck on how to diagnose/debug what is happening. We thought that maybe the underlying ad hoc network was causing issues, but it seems okay - we can assign IP addresses on the wlan interfaces and ping without interruption through the entire test.
Please check whether it can receive/send broadcast packets in both directions during the test (according to your TQ with at least 2/3 of success with a bcast ping-pong). OGMs, which generate the content of the originators table, are sent as broadcasts. Also make sure that you receive+send the stuff as actual broadcast packets. There are ways to let the wifi layer convert these broadcast/mcast frames to unicast frames - and this is not allowed for batman-adv because B.A.T.M.A.N. IV needs to measure the loss for broadcast packets - unicast packets are retransmitted automatically by the wifi components (which we don't want).
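If it helps to put a number on the bcast check, here is a rough sketch in Python. It only does the arithmetic on counts you have to collect yourself (for example by counting broadcast pings on the sender and in a packet capture on the receiver). The function name, its parameters and the choice to apply the 2/3 threshold to each direction separately are my assumptions for illustration; nothing here is provided by batman-adv or batctl.

```python
from fractions import Fraction

# Threshold taken from the "at least 2/3 of success" remark above.
# Applying it per direction is an assumption made for this sketch.
REQUIRED_SUCCESS = Fraction(2, 3)


def bcast_link_ok(sent_ab, received_ab, sent_ba, received_ba):
    """Return True if broadcasts get through often enough in BOTH directions.

    sent_ab / received_ab: broadcast frames sent by node A and seen by node B.
    sent_ba / received_ba: broadcast frames sent by node B and seen by node A.
    All four counts are hypothetical inputs gathered by hand.
    """
    if sent_ab <= 0 or sent_ba <= 0:
        raise ValueError("need at least one broadcast sent in each direction")
    # Exact fractions avoid float rounding right at the 2/3 boundary.
    ratio_ab = Fraction(received_ab, sent_ab)
    ratio_ba = Fraction(received_ba, sent_ba)
    return ratio_ab >= REQUIRED_SUCCESS and ratio_ba >= REQUIRED_SUCCESS
```

A link that looks perfect for unicast can still fail this check, because only the unicast frames are retransmitted by the wifi layer.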
[...]
Environment (identical on all computers):
- WiFi Card : SparkLAN WPEQ-160ACN(BT)
It is a QCA9377-7 miniPCIe, right? Is it using the PCI pins or the USB pins [2]?
- WiFi driver: nl80211
This cannot be the driver - this is the module which allows userspace to communicate with cfg80211 (and vice versa). Maybe it is ath10k, ath10k-ct or the proprietary QCA monster (which recently received semi-working cfg80211 support).
Kind regards, Sven
[1] iw dev wlan0 get mesh_param mesh_fwding
    iw dev wlan0 set mesh_param mesh_fwding 0
[2] https://www.qualcomm.com/media/documents/files/qca9377-product-brief.pdf
PS: Not the best idea to send from a DMARC (quarantine policy) domain to a mailing list.