Hi all,
We're running a mesh network made of a cloud of clients and multiple gateways on two separate VLANs (on eth0, not on top of BATMAN). The setup is similar to the one shown in this figure: https://www.open-mesh.org/attachments/download/132/Test_2xLAN.dia.png
We noticed that sometimes, when new gateways are added to the already running infrastructure, network loops appear on the VLANs. We dumped the VLAN traffic during one of these loops and saw a storm of BLA frames that collapsed the network. It seems that a frame (an ANNOUNCE one, in this case) was first generated by a gateway and started to loop inside the LAN, and then the other gateways propagated the same frame as well. After a few seconds, other frames (coming from different gateways) also started to loop.
Our hypothesis is that one of the gateways injects BLA frames directly into the mesh, which leads to an unmanageable loop. So we have two questions:
- Are BLA frames (except for LOOP DETECT) allowed to flow only on the LAN?
- If so, is our hypothesis reasonable?
You can see the situation described above in the screenshot below. http://oi63.tinypic.com/v7wl1w.jpg
Both clients and gateways are Raspberry Pi 3 Model B boards running BATMAN v2017.3.
Thank you.
Hi Francesco,
On Tuesday, September 11, 2018 4:38:13 PM CEST Francesco Salvatore [fabbricadigitale] wrote:
> Our hypothesis is that one of gateways directly injects BLA frames inside mesh and that lead to an unmanageable loop. So, we have 2 questions:
> - Are BLA frames (except for LOOP DETECT) allowed to flow only on LAN?
Yes, all frames except LOOP DETECT are blocked in BATMAN.
> - If so, is our hypothesis reasonable?
> You can see the situation described above in the screenshot below. http://oi63.tinypic.com/v7wl1w.jpg
Unfortunately the screenshot doesn't describe which packets looped exactly.
Are you sure it's an announce frame? It could also be a claim frame where two hosts try to claim hosts from each other.
BATMAN has a grace period: broadcasts from the LAN are allowed only after 1 minute of operation. This ensures that the mesh is properly established and that other gateways and their claims are detected before any potentially looping traffic is allowed onto it. Therefore, you should make sure (e.g. in your firmware or setup scripts) that the LAN is operational once batman is brought up.
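One way a setup script could check that the LAN is operational before batman is brought up is to wait for carrier on the wired port. The following is only a minimal sketch under stated assumptions: the function name wait_for_carrier, the 30-second default timeout and the SYSFS_NET override are hypothetical and not taken from this thread; the only thing relied on is the standard sysfs "carrier" attribute of a network interface.

```sh
#!/bin/sh
# Sketch: block until a wired interface reports carrier.
# SYSFS_NET is overridable (assumption, for testing without real hardware).
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# wait_for_carrier IFACE [TIMEOUT_SECONDS]
# Returns 0 as soon as IFACE reports carrier, 1 if the timeout expires first.
wait_for_carrier() {
    iface="$1"
    timeout="${2:-30}"
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # The carrier file reads "1" when the link is up. Reading it fails
        # while the interface is administratively down, which is treated
        # here the same as "no carrier".
        state="$(cat "$SYSFS_NET/$iface/carrier" 2>/dev/null)"
        if [ "$state" = "1" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

A startup script could then call something like "wait_for_carrier eth0 30" before "batctl if add wlan0", and decide for itself what to do if the wait times out.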
If the mesh isn't fully established, or is actually split due to different channels or similar, then you may run into an unresolved limitation of BLA:
https://www.open-mesh.org/projects/batman-adv/wiki/Bridge-loop-avoidance-II#...
For this reason we have the loop detect packets. If a loop is detected, a uevent is sent to userspace, and the firmware should react appropriately, e.g. by shutting down batman-adv.
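As an illustration of reacting to that uevent, here is a hedged sketch of a handler that a udev rule or hotplug script might call. It assumes the event arrives with the environment variables BATTYPE=bla and BATACTION=loopdetect (plus BATDATA); those names should be verified against the batman-adv sources for the version in use. The function name, the MESH_IF variable and the choice to take bat0 down are hypothetical. By default the sketch only prints the command it would run.

```sh
#!/bin/sh
# RUN defaults to "echo" (dry run: print instead of execute).
# Export RUN="" before running to really execute the command (needs root).
RUN="${RUN-echo}"

# handle_bla_uevent: inspect one batman-adv uevent passed via the environment.
# Returns 0 if it was a loop-detect event and a reaction was triggered,
# 1 for any other event.
handle_bla_uevent() {
    if [ "$BATTYPE" = "bla" ] && [ "$BATACTION" = "loopdetect" ]; then
        echo "bridge loop detected (event data: ${BATDATA:-unknown})" >&2
        # Assumed reaction: take the mesh interface down to break the loop.
        $RUN ip link set dev "${MESH_IF:-bat0}" down
        return 0
    fi
    return 1
}
```

Taking the interface down is only one possible reaction; a firmware might instead prefer to detach bat0 from the bridge or to raise an alert.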
Cheers, Simon
Hi Simon,
As you can see here (http://oi66.tinypic.com/ofo5jn.jpg), the frame that's looping is an ANNOUNCE one, and so are the ones coming from Legra_55:3c:dc. The last ANNOUNCE frames from those MACs were sent 10 seconds before they started looping, so it seems that at some point one of the gateways started to forward BLA broadcast traffic from the LAN to the mesh.
We start gateways with this script placed in rc.local:
    sudo pkill wpa_supplicant
    sudo modprobe batman-adv
    sudo ip link set wlan0 down
    sleep 2s
    sudo iwconfig wlan0 mode ad-hoc
    sudo iwconfig wlan0 essid mesh-network
    sudo iwconfig wlan0 ap any
    sudo iwconfig wlan0 channel 44
    sudo ip link set wlan0 up
    sudo batctl if add wlan0
    sleep 1s
    sudo ip addr flush dev eth0
    sudo ip link add name br-lan type bridge
    sudo ip link set dev eth0 master br-lan
    sudo ip link set dev bat0 master br-lan
    sudo ip link set up dev br-lan
    sudo batctl bl 1
    sudo batctl gw server
As far as I can see, the bridge interface gets IP/connectivity from the LAN a few seconds after the script quits. Are these steps correct, or are there possible timing issues? We're using the same ESSID/channel for all originators.
Cheers, Francesco
Hi Francesco,
On Wednesday, September 12, 2018 10:44:36 AM CEST Francesco Salvatore [fabbricadigitale] wrote:
That certainly looks like an announce frame. Do you see any other frames in between, like claim frames?
Announces are also sent after a couple of claim frames upon a request (check batadv_bla_answer_request). We actually had a bug where inconsistencies among the BLA tables could happen, but that was fixed before 2017.3 ...
It would be good to do "batctl bl 1" before adding bat0 to the bridge, otherwise you are not protected. Other than that, it looks fine to me.
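The reordering suggested above could look roughly like the sketch below. It reuses the commands from the gateway script posted earlier in the thread and only moves "batctl bl 1" ahead of the step that adds bat0 to the bridge. The start_gateway wrapper and the RUN dry-run switch are hypothetical additions for illustration, and sudo is dropped on the assumption that the script runs as root from rc.local.

```sh
#!/bin/sh
# RUN defaults to "echo" (dry run: print each command instead of executing).
# Export RUN="" before running to really execute the commands (needs root).
RUN="${RUN-echo}"

start_gateway() {
    $RUN pkill wpa_supplicant
    $RUN modprobe batman-adv

    # Put the wireless interface into ad-hoc mode for the mesh.
    $RUN ip link set wlan0 down
    $RUN sleep 2
    $RUN iwconfig wlan0 mode ad-hoc
    $RUN iwconfig wlan0 essid mesh-network
    $RUN iwconfig wlan0 ap any
    $RUN iwconfig wlan0 channel 44
    $RUN ip link set wlan0 up

    $RUN batctl if add wlan0
    $RUN sleep 1

    # Enable bridge loop avoidance BEFORE bat0 joins the bridge.
    $RUN batctl bl 1

    $RUN ip addr flush dev eth0
    $RUN ip link add name br-lan type bridge
    $RUN ip link set dev eth0 master br-lan
    $RUN ip link set dev bat0 master br-lan
    $RUN ip link set up dev br-lan

    $RUN batctl gw server
}
```

Calling start_gateway with RUN left at its default prints the command sequence, which makes it easy to review the order before running it for real.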
Cheers, Simon
Hi Simon,
BLA traffic seems regular. This (https://mega.nz/#!9ZkmharA!S9mFxvpnnnseu_l8H7MPfoZ7X1Ef0lNrJLVQOpgTg4w) is a dump of the broadcast traffic captured from the LAN ports of four gateways (on two separate VLANs). As you can see, the loop starts at packet 2660. The four gateways are:
- 00:0f:00:68:97:e4 (Bridge IP 10.140.0.61)
- 00:0f:00:68:9f:4b (Bridge IP 10.140.0.17)
- 00:0f:00:68:96:66 (Bridge IP 10.140.16.19)
- 00:0f:00:55:3c:dc (Bridge IP 10.140.16.61)
Am I wrong, or is "batctl bl 1" redundant? As far as I can see, according to batctl, BLA is turned on by default in gw mode.
Regards, Francesco
Hi Francesco,
On Monday, September 17, 2018 3:44:53 PM CEST Francesco Salvatore [fabbricadigitale] wrote:
Hmm. There are already other packets looping in the beginning. There are some ARP requests which are repeated 4 times (packets 39 and following). Are those MACs on the network?
I don't really know what's going on from staring at this dump. You may want to remove components that are not vital and check whether it still happens. For example, you may want to connect the Raspis with a simple switch first (if you don't already do that). But it seems the loop is already present before that announce loop; BLA would normally avoid repetitions.
> Am I wrong or "batctl bl 1" is redundant? As far as I can see, according to batctl, BLA is turned on by default in gw mode.
Hm, you are right, it's probably enabled by default. That should be independent of the gateway feature, though.
Cheers, Simon
b.a.t.m.a.n@lists.open-mesh.org