The current default settings for optional features in batman-adv seems to be based around the idea that the user only compiles what he requires. They will automatically enabled when they are compiled in. For example the network coding part of batman-adv is by default disabled in the out-of-tree module but will be enabled when the code is compiled during the module build.
But distributions like Debian just enable all features of the batman-adv kernel module and hope that more experimental features or features with possible negative effects have to be enabled using some runtime configuration interface.
The network_coding feature can help in specific setups but also has drawbacks and is not disabled by default in the out-of-tree module. Disabling by default in the runtime config seems to be also quite sane.
The distributed_arp_table is in theory a good solution to reduce connection problems in large networks caused by ARP packet loss. Unfortunatelly, it seems to also break ARP resolution in simple mesh setups. The only solution which seems to be used by AP firmwares seems to be the deactivation of this feature. Disabling this feature by default until the problem was understood and fixed may help new deployments to create a working mesh. Tuning of the mesh can still be done by them in case DAT works in their setup.
The bridge_loop_avoidance is the only feature which is disabled by default but may be necessary even in simple setups. Packet loops may even be created during the initial node setup when this is not enabled. This is different than STP on bridges because mesh is usually used on Adhoc WiFi. Having two nodes (by accident) in the same LAN segment and in the same mesh network is rather common in this situation.
Signed-off-by: Sven Eckelmann sven@narfation.org -- This patchset is based on my comments in https://lists.open-mesh.org/pipermail/b.a.t.m.a.n/2014-November/012543.html
The patch doesn't touch the multicast feature because I have no idea how well it works and what the current multicast mode is actual doing. --- network-coding.c | 2 +- soft-interface.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/network-coding.c b/network-coding.c index 127cc4d..be005b2 100644 --- a/network-coding.c +++ b/network-coding.c @@ -155,7 +155,7 @@ err: */ void batadv_nc_init_bat_priv(struct batadv_priv *bat_priv) { - atomic_set(&bat_priv->network_coding, 1); + atomic_set(&bat_priv->network_coding, 0); bat_priv->nc.min_tq = 200; bat_priv->nc.max_fwd_delay = 10; bat_priv->nc.max_buffer_time = 200; diff --git a/soft-interface.c b/soft-interface.c index 8748987..ae96cee 100644 --- a/soft-interface.c +++ b/soft-interface.c @@ -738,10 +738,10 @@ static int batadv_softif_init_late(struct net_device *dev) atomic_set(&bat_priv->aggregated_ogms, 1); atomic_set(&bat_priv->bonding, 0); #ifdef CONFIG_BATMAN_ADV_BLA - atomic_set(&bat_priv->bridge_loop_avoidance, 0); + atomic_set(&bat_priv->bridge_loop_avoidance, 1); #endif #ifdef CONFIG_BATMAN_ADV_DAT - atomic_set(&bat_priv->distributed_arp_table, 1); + atomic_set(&bat_priv->distributed_arp_table, 0); #endif #ifdef CONFIG_BATMAN_ADV_MCAST bat_priv->mcast.flags = BATADV_NO_FLAGS;
Hi Sven, thanks for this patch!
On 05/02/15 17:22, Sven Eckelmann wrote:
The distributed_arp_table is in theory a good solution to reduce connection problems in large networks caused by ARP packet loss. Unfortunatelly, it seems to also break ARP resolution in simple mesh setups. The only solution which seems to be used by AP firmwares seems to be the deactivation of this feature.
Could you please be a bit more specific about the issue you are talking about?
So far the only situation where I personally switched DAT off did not really get fixed and continued to show the issue. Therefore I am not really sure that DAT can be blamed for this.
So far I don't know anybody else who complained about DAT.
Cheers,
On Friday 06 February 2015 08:26:04 Antonio Quartulli wrote:
Could you please be a bit more specific about the issue you are talking about?
I can only talk about the simple two node installation I had when I was forced to write the wireshark-batman-adv dissector for v15. I just placed them next to each other and tried to ping the other one (not sure anymore if it was the node directly or two other devices plugged in the ethernet port of the devices). And I think there was a reboot of one node and a ARP cache flush involved. I've expected to see the 4addr packets when capturing on the wireless and then having both nodes talking to each other. This was not working. I could only communicate after disabling DAT... but of course the communication which I wanted to capture wasn't happening.
Either I got a dump from you or I've tried again until it worked.
So far the only situation where I personally switched DAT off did not really get fixed and continued to show the issue. Therefore I am not really sure that DAT can be blamed for this.
So far I don't know anybody else who complained about DAT.
You have to ask Marek why DAT is disabled in many installations for which he is more or less responsible. Maybe he has more to contribute than me.
Kind regards, Sven
On 06/02/15 13:38, Sven Eckelmann wrote:
On Friday 06 February 2015 08:26:04 Antonio Quartulli wrote:
Could you please be a bit more specific about the issue you are talking about?
I can only talk about the simple two node installation I had when I was forced to write the wireshark-batman-adv dissector for v15. I just placed them next to each other and tried to ping the other one (not sure anymore if it was the node directly or two other devices plugged in the ethernet port of the devices). And I think there was a reboot of one node and a ARP cache flush involved. I've expected to see the 4addr packets when capturing on the wireless and then having both nodes talking to each other. This was not working. I could only communicate after disabling DAT... but of course the communication which I wanted to capture wasn't happening.
Thanks for the explanation Sven.
To me this looks like the caching effect of DAT: bat0 is likely to change MAC address everytime you recreate the interface (unless you statically assign the MAC) therefore I presume that the other node (the one which did not reboot) was still caching the old bat0 MAC address (each entry requires some minutes before being invalidated/refreshed).
This means that any communication willing to contact the rebooted node was targetting the old address and therefore the two were not be able talk until the cache was refreshed.
Can this be the case?
Cheers,
On Friday 06 February 2015 14:20:28 Antonio Quartulli wrote:
To me this looks like the caching effect of DAT: bat0 is likely to change MAC address everytime you recreate the interface (unless you statically assign the MAC) therefore I presume that the other node (the one which did not reboot) was still caching the old bat0 MAC address (each entry requires some minutes before being invalidated/refreshed).
Thanks for the explanation.
So you are basically saying that DAT cannot detect when some nodes found a conflict in their ARP table (when receiving IP packets with a different MAC for an IP or similar things)? And there is also the problem when no local information is stored and only a conflict with some data data on another unrelated node is happened.
The first conflict (the conflict in the local ARP table) is not detected because David wanted that batman-adv isn't accessing the ARP table?
This means that any communication willing to contact the rebooted node was targetting the old address and therefore the two were not be able talk until the cache was refreshed.
Can this be the case?
Yes, unfortunately I hadn't the time to analyze it further and have no logs of any similar problem. Thats why I cannot check if this is contradicted by anything I've done (or not done). So it is a very plausible scenario which you've described.
Maybe Marek can tell me more why he chose to disable it.
Kind regards, Sven
On 07/02/15 08:27, Sven Eckelmann wrote:
On Friday 06 February 2015 14:20:28 Antonio Quartulli wrote:
To me this looks like the caching effect of DAT: bat0 is likely to change MAC address everytime you recreate the interface (unless you statically assign the MAC) therefore I presume that the other node (the one which did not reboot) was still caching the old bat0 MAC address (each entry requires some minutes before being invalidated/refreshed).
Thanks for the explanation.
So you are basically saying that DAT cannot detect when some nodes found a conflict in their ARP table (when receiving IP packets with a different MAC for an IP or similar things)?
Right. Well, the scenario "incoming packets altering the local ARP table and generating a conflict" was never really considered.
And there is also the problem when no local information is stored and only a conflict with some data data on another unrelated node is happened.
I haven't really understood this issue.
The first conflict (the conflict in the local ARP table) is not detected because David wanted that batman-adv isn't accessing the ARP table?
Even without reading the the ARP table we could "detect the conflict" by checking incoming packets and by matching their IP/MAC couple with what we have in DAT...At the moment we only assume that the IP/MAC couple is stable enough and that in the worst case the user will wait for the cache to get invalidated.
Still, I think this is an optimization that we may want to implement, but not a real issue that pushes us to disable DAT by default.
This means that any communication willing to contact the rebooted node was targetting the old address and therefore the two were not be able talk until the cache was refreshed.
Can this be the case?
Yes, unfortunately I hadn't the time to analyze it further and have no logs of any similar problem. Thats why I cannot check if this is contradicted by anything I've done (or not done). So it is a very plausible scenario which you've described.
No problem, I just wanted to get your feeling/feedback about that :)
Thanks a lot.
Cheers,
On Thursday, February 05, 2015 17:22:53 Sven Eckelmann wrote:
The current default settings for optional features in batman-adv seems to be based around the idea that the user only compiles what he requires. They will automatically enabled when they are compiled in. For example the network coding part of batman-adv is by default disabled in the out-of-tree module but will be enabled when the code is compiled during the module build.
But distributions like Debian just enable all features of the batman-adv kernel module and hope that more experimental features or features with possible negative effects have to be enabled using some runtime configuration interface.
Interesting point. Based on what you are saying we definitely should review our policy and agree on sane defaults.
The network_coding feature can help in specific setups but also has drawbacks and is not disabled by default in the out-of-tree module. Disabling by default in the runtime config seems to be also quite sane.
This feature requires the wifi driver to support promisc mode. We should keep it disabled.
The distributed_arp_table is in theory a good solution to reduce connection problems in large networks caused by ARP packet loss. Unfortunatelly, it seems to also break ARP resolution in simple mesh setups. The only solution which seems to be used by AP firmwares seems to be the deactivation of this feature. Disabling this feature by default until the problem was understood and fixed may help new deployments to create a working mesh. Tuning of the mesh can still be done by them in case DAT works in their setup.
I vote for keeping DAT enabled.
The bridge_loop_avoidance is the only feature which is disabled by default but may be necessary even in simple setups. Packet loops may even be created during the initial node setup when this is not enabled. This is different than STP on bridges because mesh is usually used on Adhoc WiFi. Having two nodes (by accident) in the same LAN segment and in the same mesh network is rather common in this situation.
Agreed.
Cheers, Marek
Hi Sven, Marek,
On 2015-02-18 08:33, Marek Lindner wrote:
On Thursday, February 05, 2015 17:22:53 Sven Eckelmann wrote:
The current default settings for optional features in batman-adv seems to be based around the idea that the user only compiles what he requires. They will automatically enabled when they are compiled in. For example the network coding part of batman-adv is by default disabled in the out-of-tree module but will be enabled when the code is compiled during the module build.
But distributions like Debian just enable all features of the batman-adv kernel module and hope that more experimental features or features with possible negative effects have to be enabled using some runtime configuration interface.
Interesting point. Based on what you are saying we definitely should review our policy and agree on sane defaults.
Sane defaults have always been one of the great values of batman-adv...
The network_coding feature can help in specific setups but also has drawbacks and is not disabled by default in the out-of-tree module. Disabling by default in the runtime config seems to be also quite sane.
This feature requires the wifi driver to support promisc mode. We should keep it disabled.
I think the problem Sven mentions, is that some distro's build their kernels with all_yes, and so network coding is always built in. Since NC is enabled by default (runtime) when compiled in, it may run even when promisc mode is not enabled/supported.
So I suggest NC being disabled runtime by default.
The distributed_arp_table is in theory a good solution to reduce connection problems in large networks caused by ARP packet loss. Unfortunatelly, it seems to also break ARP resolution in simple mesh setups. The only solution which seems to be used by AP firmwares seems to be the deactivation of this feature. Disabling this feature by default until the problem was understood and fixed may help new deployments to create a working mesh. Tuning of the mesh can still be done by them in case DAT works in their setup.
I vote for keeping DAT enabled.
++
The bridge_loop_avoidance is the only feature which is disabled by default but may be necessary even in simple setups. Packet loops may even be created during the initial node setup when this is not enabled. This is different than STP on bridges because mesh is usually used on Adhoc WiFi. Having two nodes (by accident) in the same LAN segment and in the same mesh network is rather common in this situation.
Agreed.
++
b.a.t.m.a.n@lists.open-mesh.org