While I'm still in Europe I've observed that the network in Quintana has started performing very poorly today. It was working perfectly fine until yesterday.
The logs on every router have started showing entries like these:
Oct 13 18:09:43 frigorifico kern.warn kernel: [12018.150000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.040000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.040000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.550000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.550000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.570000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:45 frigorifico kern.warn kernel: [12020.580000] br-lan: received packet on bat0 with own address as source address
Oct 13 18:09:46 frigorifico kern.warn kernel: [12021.040000] br-lan: received packet on bat0 with own address as source address
As you can see there are many per second.
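To put a number on "many per second", the bridge warnings quoted above can be tallied per host and per second. The following is only a minimal sketch: the regular expression assumes the syslog layout seen in the sample lines, and the helper name `warnings_per_second` is made up for illustration.

```python
import re
from collections import Counter

# Matches the bridge warning; the "Mon DD HH:MM:SS host ..." prefix
# is an assumption based on the sample log lines in this message.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+.*"
    r"received packet on (?P<iface>\S+) with own address as source address"
)


def warnings_per_second(lines):
    """Count matching warnings per (host, timestamp) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("ts"))] += 1
    return counts


# Rebuild the eight sample lines from their (time, kernel uptime) pairs.
TEMPLATE = (
    "Oct 13 {t} frigorifico kern.warn kernel: [{up}] br-lan: "
    "received packet on bat0 with own address as source address"
)
SAMPLE = [
    TEMPLATE.format(t=t, up=up)
    for t, up in [
        ("18:09:43", "12018.150000"),
        ("18:09:45", "12020.040000"),
        ("18:09:45", "12020.040000"),
        ("18:09:45", "12020.550000"),
        ("18:09:45", "12020.550000"),
        ("18:09:45", "12020.570000"),
        ("18:09:45", "12020.580000"),
        ("18:09:46", "12021.040000"),
    ]
]

for (host, ts), n in sorted(warnings_per_second(SAMPLE).items()):
    print(host, ts, n)
```

On the sample this reports six warnings within the single second 18:09:45. Running the same count over the logs of several routers would show whether some nodes see the warning at a noticeably higher rate than others, although nothing here guarantees that such a difference points at the faulty node.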
I've pasted a bit of batctl ll batman; batctl log here:
...it's only showing the "originator packet from myself" lines and one line before. (the sample is less than 5 secs of logs)
Every node I checked is showing the same.
Last time this happened it was due to a router that had been affected by a nearby lightning bolt. The switch went crazy. It took a while to detect it and the network was 15 nodes big. Now it's 40 and we are quite far away :)
If anyone has an idea of how to better test where the problem originates, I'll be glad to hear it. Also, if any batman developer wishes to log in to the net to check first hand, just let me know.
Cheers! Nico
PS: batman version is 2012.4
Back in Quintana... this problem is still showing on every node. The network is unstable, so it's difficult to debug. If anyone has a clue as to where to look for the origin, I'll be glad to read your thoughts.
cheers, Nico
Hi Nico,
I have no real clue, but is it possible that there is a loop somewhere? I imagine you have checked already, but I can't come up with anything more useful at the moment.
Cheers,
On 12/11/13 18:54, Antonio Quartulli wrote:
> Hi Nico,
> I have no real clue, but is it possible that there is a loop somewhere? I imagine you have checked already, but I can't come up with anything more useful at the moment.
Ok, I've spent the afternoon turning off semi-inaccessible nodes one by one until I found the one causing the problem.
It's installed on a public lighting post, so it may take a while to take it down for inspection.
I don't know if you guys remember that I brought a crazy node (nicknamed Jocker) to the battlemesh, one that had started misbehaving after a lightning bolt hit nearby. The symptom was the same one I'm observing now: every node in the net would repeatedly show the message "received packet on bat0 with own address as source address".
I was in Europe during the time this second node started behaving like this so I still don't know much about the moment it started.
Do you think this matter could be addressed at batman level somehow? In a 50 node network this is already quite difficult to diagnose. I can't imagine how a bigger network where no single person has remote access to every node would coordinate to isolate the problematic router...
If you are interested in looking at this first hand we can try to set up an isolated test-bed with IPv6 connectivity for you to log in and play around.
Am I the only one who has bumped into this (twice)?
cheers.
* Nicolás Echániz nicoechaniz@altermundi.net [13.11.2013 08:59]:
> Am I the only one who has bumped into this (twice)?
I have also seen a lot of these messages with an indoor mesh, so no lightning involved 8-) but with v2013.04 this is gone. (same network).
bye, bastian
On Wed, Nov 13, 2013 at 09:04:05AM +0100, Bastian Bittorf wrote:
> * Nicolás Echániz nicoechaniz@altermundi.net [13.11.2013 08:59]:
>> Am I the only one who has bumped into this (twice)?
> I have also seen a lot of these messages with an indoor mesh, so no lightning involved 8-) but with v2013.04 this is gone. (same network).
This message is the symptom of a loop. There can be gazillions of causes.
Cheers,
On 13/11/13 05:01, Antonio Quartulli wrote:
> On Wed, Nov 13, 2013 at 09:04:05AM +0100, Bastian Bittorf wrote:
>> * Nicolás Echániz nicoechaniz@altermundi.net [13.11.2013 08:59]:
>>> Am I the only one who has bumped into this (twice)?
>> I have also seen a lot of these messages with an indoor mesh, so no lightning involved 8-) but with v2013.04 this is gone. (same network).
> This message is the symptom of a loop. There can be gazillions of causes.
Well... it took about a week to finally find the node creating this problem. As before, it's failing hardware that caused the issue.
When this happens, every node in the net repeatedly shows that message. I don't believe this is the same as just any "loop symptom"... At least, I've never seen something else cause this on every node.
I really would like to find out more about how this condition comes to happen and how to diagnose and prevent it. The whole batman-adv cloud dies when this happens and it's a pain in the ass to "debug".
All the failing routers are WR842ND. There are many more of the same model working just fine.
I now have three routers which produce this symptom, so if anyone who can understand the problem better is willing to test, I can set up a dedicated mini-test-bed.
Cheers, Nico Echániz
Hi Nico,
On Tue, Nov 26, 2013 at 12:56:29AM -0300, Nicolás Echániz wrote:
> On 13/11/13 05:01, Antonio Quartulli wrote:
>> On Wed, Nov 13, 2013 at 09:04:05AM +0100, Bastian Bittorf wrote:
>>> * Nicolás Echániz nicoechaniz@altermundi.net [13.11.2013 08:59]:
>>>> Am I the only one who has bumped into this (twice)?
>>> I have also seen a lot of these messages with an indoor mesh, so no lightning involved 8-) but with v2013.04 this is gone. (same network).
>> This message is the symptom of a loop. There can be gazillions of causes.
> Well... it took about a week to finally find the node creating this problem. As before, it's failing hardware that caused the issue.
Interesting. Could you be more specific about how the hardware fails? Does it reboot frequently? Does it send broken OGM packets?
Could you compute a checksum of the flashed squashfs? Does it differ from the one you built?
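The checksum comparison suggested here could be sketched as follows. This is only an illustration under stated assumptions: both the built squashfs image and a dump of the flashed partition are assumed to be available as local files (how the dump is obtained from the router is not covered), and a raw dump is assumed to possibly carry padding past the end of the image, so only the first bytes up to the built image's size are hashed. The function names and file names are hypothetical.

```python
import hashlib
import os


def sha256_of(path, limit=None, chunk_size=65536):
    """Return the SHA-256 hex digest of the first `limit` bytes of a file
    (of the whole file when `limit` is None)."""
    digest = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as fh:
        while remaining is None or remaining > 0:
            size = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = fh.read(size)
            if not chunk:
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def images_match(built_path, dumped_path):
    """Compare the built image with the same-length prefix of the dump.

    Assumption: anything in the dump beyond the built image's size is
    padding and can be ignored.
    """
    built_size = os.path.getsize(built_path)
    return sha256_of(built_path) == sha256_of(dumped_path, limit=built_size)
```

A call such as `images_match("built-squashfs.bin", "dumped-rootfs.bin")` (both names hypothetical) returning `False` would suggest the flash contents differ from what was built, which would be consistent with failing hardware but would not by itself prove it.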
> When this happens every node in the net is repeatedly showing that message. It is not the same with any "loop symptom" I believe... At least I've never seen this happen on every node being caused by something else.
> I really would like to find out more about how this condition comes to happen and how to diagnose and prevent it. The whole batman-adv cloud dies when this happens and it's a pain in the ass to "debug".
> All the failing routers are WR842ND. There are many more of the same model working just fine.
We are also using quite a lot of 842NDs, 841NDs and 3600NDs, as well as some 741ND, 1043ND and 4300NDs. We've never had the issue of one broken node taking down the whole network yet, not in Hamburg, Kiel or Lübeck.
Would be interesting to figure out the differences between our setups. Maybe I missed it so far, did you say you were using bridge loop avoidance (we don't)? We are using batman-adv 2013.1.0 mostly with a few still on 2012.4.0 and some on 2013.4.0.
> I now have three routers which produce this symptom, so if anyone who can understand the problem better is willing to test, I can set up a dedicated mini-test-bed.
> Cheers, Nico Echániz
Cheers, Linus
b.a.t.m.a.n@lists.open-mesh.org