[B.A.T.M.A.N.] broadcast storms

Jake.Harris＠zf.com

22 Oct 2018

1:07 p.m.

I'm sure a similar question to this has been answered, but I am new to this mailing list format and don't know an efficient way to search https://lists.open-mesh.org/pipermail/b.a.t.m.a.n/

I'm having problems with broadcast messages effectively echoing around a network of 50ish nodes. I attached a few seconds of the batctl tcpdump output. I can't find a pattern to what causes this; it tends to happen once every two or three weeks. During the storm, nodes drop all their neighbors indefinitely (batctl n shows an empty list). I have worked around that with a batch script that reloads batman if the neighbor list is empty. Reloading successfully reconnects the node to the network, but the storm still persists.

The only fix I've found is to reboot all the nodes at the same time, so that the whole network is down and the echoes die out.

I believe I had this problem much more frequently (every 4 days or so) a while ago on the same network, when the nodes communicated using discrete TCP destinations. The storm frequency dropped to what it is now after I switched to broadcast packets and reduced the communication rate from once every 12 seconds to once every 40 seconds.

Rebooting the nodes that are responsible for the echoing messages has no effect. I rebooted 192.168.1.230 before running the attached tcpdump, and as it shows, packets from 230 continued to bounce around while the node was powered off and after it rejoined the network. Broadcast doesn't appear to use a time-to-live parameter to limit the number of hops the packets make.

I'm at a loss for a way to remedy this; there seem to be only multicast optimizations.

Attachments:

batdump.a (application/octet-stream — 51.5 KB)

Simon Wunderlich

22 Oct 2018

2:26 p.m.

Hi Jake,

could you make some pcap dumps on the wlan device where batman runs, and provide them to us? Just the full tcpdump (tcpdump -s 2000 -w /tmp/my.pcap -i wlan0, assuming that wlan0 is your interface), not the batctl dump. Then we can check sequence numbers etc. in Wireshark.

Do you have some of your mesh nodes connected and bridged to Ethernet? If yes, you should check bridge loop avoidance: if you use such a topology without it enabled, that could also be causing this effect:

https://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance-II

Cheers, Simon

Jake.Harris＠zf.com

5:27 p.m.

Generated this via: sudo tcpdump -s 2000 -w /media/pi/KINGSTON/my.pcap -i wlx681ca2083fa4

Message after ^C:
12090 packets captured
12251 packets received by filter
0 packets dropped by kernel
27 packets dropped by interface

Simon Wunderlich

6:17 p.m.

Hello Jake,

I've checked your pcap files. I couldn't find a culprit directly, but it seems like you are having so many repetitions / the network is getting so overloaded that broadcasts stay in the queues of your WiFi driver for longer than 30 seconds (possibly in different devices, accumulated). At this point, batman-adv assumes that the device has rebooted and the sequence number is validly re-used, thus circumventing the broadcast duplicate check.

You could increase the define of BATADV_RESET_PROTECTION_MS to something higher like 120000 (120 seconds) and see if that helps. But the "right" way would be to avoid those deep queues in the first place.

Do you set a multicast rate higher than the default 1 MBit/s? If not, that's worth a try. :) If you are using iw, there is a "mcast-rate" parameter, and there is something equivalent in wpa_supplicant.

Cheers, Simon
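
Editor's sketch: the reset-protection behaviour described in this reply can be illustrated with a toy model. This is not the batman-adv kernel code; the class name, the window size, the jump-ahead limit and the timing of the replayed packet are all assumptions chosen for illustration, and sequence-number wraparound is ignored. Only the 30-second default and the 120-second suggestion come from the reply above.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

WINDOW_SIZE = 64               # hypothetical duplicate-window size (assumption)
EXPECTED_SEQNO_RANGE = 65536   # hypothetical limit for a plausible jump ahead (assumption)


@dataclass
class BroadcastDupCheck:
    """Toy per-originator duplicate filter with a reset-protection timer."""
    reset_protection_ms: int = 30_000   # the 30 s mentioned in the reply above
    last_seqno: Optional[int] = None
    last_reset_ms: int = 0
    seen: Set[int] = field(default_factory=set)

    def accept(self, seqno: int, now_ms: int) -> bool:
        """Return True if the packet would be forwarded, False if dropped."""
        if self.last_seqno is None:
            # First packet from this originator: start tracking it.
            self.last_seqno = seqno
            self.last_reset_ms = now_ms
            self.seen = {seqno}
            return True

        diff = seqno - self.last_seqno
        if diff <= -WINDOW_SIZE or diff >= EXPECTED_SEQNO_RANGE:
            # Far outside the window: a rebooted sender or a very stale copy.
            if now_ms - self.last_reset_ms < self.reset_protection_ms:
                return False  # too soon after the last reset: drop
            # Treated as a restart: the window is reset around this seqno.
            self.last_reset_ms = now_ms
            self.last_seqno = seqno
            self.seen = {seqno}
            return True

        if seqno in self.seen:
            return False  # ordinary duplicate inside the window
        self.seen.add(seqno)
        self.last_seqno = max(self.last_seqno, seqno)
        # Forget sequence numbers that have fallen out of the window.
        self.seen = {s for s in self.seen if s > self.last_seqno - WINDOW_SIZE}
        return True


def replay_stale_copy(reset_protection_ms: int) -> bool:
    """Feed seqnos 1..100, then a copy of seqno 5 that was delayed until t = 40 s."""
    check = BroadcastDupCheck(reset_protection_ms=reset_protection_ms)
    for seqno in range(1, 101):
        check.accept(seqno, now_ms=seqno * 100)   # one packet every 100 ms
    return check.accept(5, now_ms=40_000)


print(replay_stale_copy(30_000))    # True: the stale copy is accepted as a "restart"
print(replay_stale_copy(120_000))   # False: the stale copy is dropped
```

In this model a copy that sat in a queue for longer than the protection time is indistinguishable from a sender that rebooted, which is why a larger value only hides the symptom while the deep queues remain.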

Jake.Harris＠zf.com

12 Nov 2018

2:29 p.m.

My apologies for bringing this back, but I'm still having trouble with this. Each of my 50 nodes sends a 100-byte packet via broadcast every 30 seconds. Even if they all coincidentally transmit at the same time, that's only 5 kB of data to move, far below the 1 Mbit/s (default, I believe?) bandwidth for multicast. Unless there's a large amount of broadcast traffic outside of my Python program, I don't see this being the culprit.

The storms tend to start when the connection to a few nodes gets flaky, so I can see the reset protection setting being helpful. How do I set BATADV_RESET_PROTECTION_MS? Is this an environment variable, or do I set it with batctl?

Thanks again for your help, I believe this has a good chance of remedying the issue.
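
Editor's sketch: the payload estimate in this message can be checked with a few lines of arithmetic, taking the 1 Mbit/s default multicast rate mentioned earlier in the thread. The function name is hypothetical, and the calculation counts only the payload bits of the original transmissions. It ignores frame headers, preambles, rebroadcasts by other mesh nodes and beacons, so it is a lower bound on the airtime actually used.

```python
NODES = 50
PAYLOAD_BYTES = 100
RATE_BITS_PER_S = 1_000_000   # 1 Mbit/s, the default multicast rate named in the thread
INTERVAL_MS = 30_000          # each node sends once every 30 seconds


def payload_airtime_ms(nodes: int, payload_bytes: int, rate_bits_per_s: int) -> float:
    """Time needed to send one payload per node, back to back, at the given rate."""
    bits = nodes * payload_bytes * 8
    return bits * 1000 / rate_bits_per_s


burst_ms = payload_airtime_ms(NODES, PAYLOAD_BYTES, RATE_BITS_PER_S)
print(burst_ms)                          # 40.0 (milliseconds per round of broadcasts)
print(f"{burst_ms / INTERVAL_MS:.2%}")   # 0.13% of each 30-second interval
```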

Simon Wunderlich

5:13 p.m.

Hi Jake,

On Monday, November 12, 2018 2:29:07 PM CET Jake.Harris@zf.com wrote:

> My apologies for bringing this back, but I'm still having trouble with this. Each of my 50 nodes sends a 100-byte packet via broadcast every 30 seconds. Even if they all coincidentally transmit at the same time, that's only 5 kB of data to move, far below the 1 Mbit/s (default, I believe?) bandwidth for multicast. Unless there's a large amount of broadcast traffic outside of my Python program, I don't see this being the culprit.

Mhm, this is really not much data ... did you try the multicast as suggested in an earlier reply?

> The storms tend to start when the connection to a few nodes gets flaky, so I can see the reset protection setting being helpful. How do I set BATADV_RESET_PROTECTION_MS? Is this an environment variable, or do I set it with batctl?

BATADV_RESET_PROTECTION_MS is a define in the batman-adv C code, so it can't be set at runtime, only at compile time.

Cheers, Simon

Jake.Harris＠zf.com

13 Nov 2018

2:55 p.m.

> Mhm, this is really not much data ... did you try the multicast as suggested in an earlier reply?

What earlier reply are you referring to? The only one I'm noticing is the tip to boost the multicast bandwidth, but I can't see updating the configuration of all 50 nodes being fruitful when, worst case, I'm using less than 1% of the max throughput.

> BATADV_RESET_PROTECTION_MS is a define in the batman-adv C code, so it can't be set at runtime, only at compile time.

While recompiling and updating the code on all the nodes sounds like an utter pain in the butt, I believe this change has a far better chance of alleviating the issue. I'm looking into how to do this; I've never compiled anything myself, but I can't see it being too difficult.

One observation I made when rebooting the swarm all at once: about a minute after all the Pis go down, the laptop I work from (running batctl td bat0) reports a whole bunch of backbone unannounced messages, I believe. I'm assuming there is one message per node but have not verified this. My guess is that this is normal and not the cause of these issues?

Again, thank you

Simon Wunderlich

3:26 p.m.

On Tuesday, November 13, 2018 2:55:31 PM CET Jake.Harris@zf.com wrote:

>> Mhm, this is really not much data ... did you try the multicast as suggested in an earlier reply?

> What earlier reply are you referring to? The only one I'm noticing is the tip to boost the multicast bandwidth, but I can't see updating the configuration of all 50 nodes being fruitful when, worst case, I'm using less than 1% of the max throughput.

One aspect is that the multicast rate also changes the modulation rate of beacons. If you have more than 50 nodes beaconing at 1 Mbit/s, you are already filling up your airtime with beacons. Do the math: one beacon takes about 1 ms at 1 Mbit/s, each node sends about 10 beacons per second ...

This is actually very important and will most likely help already. It would be a better fix than changing the protection window.
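
Editor's sketch: the "do the math" step above, worked out with the figures given in this reply (50 nodes, about 10 beacons per second each, about 1 ms per beacon at 1 Mbit/s). The function name and the 12 Mbit/s comparison are assumptions for illustration. The second calculation also assumes that beacon duration scales inversely with the rate, which overstates the gain because the fixed preamble does not shrink.

```python
def beacon_airtime_fraction(nodes: int, beacons_per_s: float, beacon_ms: float) -> float:
    """Fraction of every second spent on beacons, if none of them overlap."""
    return nodes * beacons_per_s * beacon_ms / 1000


# Figures from the reply above: about 1 ms per beacon at 1 Mbit/s.
at_1_mbit = beacon_airtime_fraction(nodes=50, beacons_per_s=10, beacon_ms=1.0)
print(at_1_mbit)            # 0.5, i.e. half of the airtime goes to beacons

# Hypothetical 12 Mbit/s rate, assuming the duration shrinks proportionally.
at_12_mbit = beacon_airtime_fraction(nodes=50, beacons_per_s=10, beacon_ms=1.0 / 12)
print(f"{at_12_mbit:.1%}")  # 4.2%
```

Under these assumptions the beacons alone take roughly half of the channel at 1 Mbit/s, regardless of how little application data the nodes send.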
