(which roughly translates as "batman gone nuts?") Hey great devs! we've been having a particular issue in deltalibre and quintanalibre (local WCN) with batman-adv, but so far we haven't found a precise way to reproduce it. The symptom is that (after some reboots or physical displacements?) one batman-adv host becomes unreachable on layer3, although it is seen on originators table, and can be batctl ping'ed or batctl tracerout'ed with no problem whatsoever.
Even more, it is not unreachable from the whole network, but only from a few other nodes. So, let's say that the nearer nodes can layer3 ping it, but some others farther away cannot (although I can't be sure it depends on the hop distance). All of them can batctl ping it (layer2). A hard reboot of all the nodes solves it; connectivity is restored in all directions.
Thing is, I've just come across it again, and managed to do some tests to aid in description / debugging. As an aid in understanding the network topology, I'm attaching the wonderful output of "batctl vd dot |grep -v TT" for your viewing delight.
The problem node is ana. It can be reached from ruth and hquilla (direct neighbours), but arping behaves erratically from colmena or charly, and normal ping (v4 or v6) doesn't receive any reply at all when run from colmena or charly.
I used arping, with and without -b, and it seemed like I could narrow the problem down to incoming broadcast packet handling, but further tests just left me more puzzled!
All nodes are tl-mr3220 running openwrt trunk r31316 with batman-adv 2012.2.0, driver ath9k. Secondary interfaces named _wlan1 are all tl-wn722n, which uses driver ath9k_htc. Nodes are around 100 meters (+/- 50 m) apart from each other.
this behaviour has been observed (but not reported) in dissimilar setups, using ubnt bullet2 mixed with mr3220, running r29936 with batman-adv 2011.4.0 , with nodes 1 or 2km apart from each other.
Tests are the combined crude output of batctl td and arping, so to make this email easy on the eye, I'm publishing them elsewhere: http://pastebin.com/6PPwN3PS
The live openwrt configuration can be analysed in detail at https://bitbucket.org/guidoi/deltalibre-configs/src (it's a free, open network after all! :D ), in particular:
ana -> https://bitbucket.org/guidoi/deltalibre-configs/src/6de4ce970fe2/mac/54_E6_F...
hquilla -> https://bitbucket.org/guidoi/deltalibre-configs/src/6de4ce970fe2/mac/54_E6_F...
colmena -> https://bitbucket.org/guidoi/deltalibre-configs/src/6de4ce970fe2/mac/54_E6_F...
Thanks a lot for the attention, Hope that you are having fun, and that I'm not spoiling it :)
Cheers!
Gui
Well, it seems colmena is the uncooperative bathost. Another log: http://pastebin.com/FMD9Lieq, which can be summarized as follows:
### From COLMENA-CASA, can ping bochita but not ana
### From PEREYRA, can ping bochita but not ana
### From COLMENA, works perfectly to both destinations
colmena-casa and pereyra must pass through colmena, which is for some reason letting batctl pings, OGMs and whatnot pass through on their way to ana, but no ICMP echo requests or TCP traffic whatsoever if the final destination is ana. If the final destination is bochita, everything works as expected.
Any ideas?
I'm going to delay rebooting colmena as long as i can, in case someone comes up with an insightful test to run :)
Gui
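For anyone trying to tabulate the same kind of test, here is a minimal sketch of how the results summarized above could be organized. It only classifies outcomes that were already collected by hand; it does not run batctl or ping. The function names and the dictionary layout are hypothetical, and the layer-2 column is assumed to be True for every pair because batctl ping was reported as working everywhere.

```python
def classify(l2_ok: bool, l3_ok: bool) -> str:
    """Classify one source/destination pair from two manual probes.

    l2_ok: result of a layer-2 probe (e.g. batctl ping)
    l3_ok: result of a layer-3 probe (e.g. a normal ICMP ping)
    """
    if l2_ok and l3_ok:
        return "ok"
    if l2_ok and not l3_ok:
        return "l3-only-failure"  # the symptom discussed in this thread
    if not l2_ok and not l3_ok:
        return "unreachable"
    return "l3-without-l2"  # not expected on a batman-adv mesh


def suspects(results):
    """Return the (source, destination) pairs that show the symptom, sorted."""
    return sorted(
        pair for pair, (l2_ok, l3_ok) in results.items()
        if classify(l2_ok, l3_ok) == "l3-only-failure"
    )


# Values transcribed from the summary above; the layer-2 results are assumed.
observed = {
    ("colmena-casa", "ana"): (True, False),
    ("colmena-casa", "bochita"): (True, True),
    ("pereyra", "ana"): (True, False),
    ("pereyra", "bochita"): (True, True),
    ("colmena", "ana"): (True, True),
    ("colmena", "bochita"): (True, True),
}

for src, dst in suspects(observed):
    print(f"{src} -> {dst}: reachable on layer 2 but not on layer 3")
```

Listing the failing pairs side by side makes it easier to spot a common hop, which in this case would be colmena.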
On Mon, Jul 02, 2012 at 10:57:57AM -0300, Guido Iribarren wrote:
Well, seems colmena is the uncooperative bathost another log: http://pastebin.com/FMD9Lieq that can be summarized as follows
### From COLMENA-CASA, can ping bochita but not ana ### From PEREYRA, can ping bochita but not ana ### From COLMENA, works perfect to both destinations
colmena-casa and pereyra must pass through colmena, which is for some reason allowing batctl pings , ogms , and whatnot passthrough in its way to ana, but no ICMP echo requests, or tcp traffic whatsoever if it's final destination is ana. if final destination is bochita, everything works as expected.
Any ideas?
I'm going to delay rebooting colmena as long as i can, in case someone comes up with an insightful test to run :)
Hello!
Has debug support been compiled into batman-adv? If yes, it would be interesting to see the output of the tt log (batctl ll tt; batctl l)
Recently we fixed a bug whose fix has not been released yet. If we are sure that this is the cause, you could eventually try an upgrade to a more recent dev version. But let's see the log first (if possible)
Cheers,
Gui
Hello Antonio! thanks for your time,
On Mon, Jul 2, 2012 at 11:36 AM, Antonio Quartulli ordex@autistici.org wrote:
Hello!
Has debug support been compiled in batman-adv? IF yes, it would be interesting so see the output of the tt log (batctl ll tt; batctl l)
unfortunately, no :(
root@colmena:~# batctl ll
Error - can't open file '/sys/class/net/bat0/mesh/log_level': No such file or directory
The option you called seems not to be compiled into your batman-adv kernel module.
Will compile that option on next firmware cooking :)
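As a quick pre-check before asking someone for logs, the error above suggests simply testing whether the log_level file exists. This is a minimal sketch under that assumption: the path is the one shown in the error message, while the function name and its parameters are hypothetical.

```python
import os


def debug_log_available(iface="bat0", sysfs_root="/sys/class/net"):
    """Return True if the batman-adv log_level file exists for `iface`.

    The path layout is taken from the batctl error message quoted above;
    a missing file is read here as "debug support not compiled in".
    """
    path = os.path.join(sysfs_root, iface, "mesh", "log_level")
    return os.path.exists(path)


if __name__ == "__main__":
    if debug_log_available():
        print("log_level found: 'batctl ll tt' should be usable")
    else:
        print("log_level missing: debug support seems not to be compiled in")
```

The sysfs_root parameter exists only so the check can be tried against a scratch directory instead of a live node.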
Recently we fixed a bug that which fix has not been released yet. If we are sure that this is the cause, you could eventually try an upgrade to a more recente dev-version. But let's see the log first (if possible)
Problem is, it's not easy to reproduce. I haven't come across it for several weeks. Nicolas Echaniz told me he suffered it recently, but I don't think either of us can spend the time to try to recreate it on purpose :(
Having debug support enabled and waiting for the bug to crop up is probably the best we can hope for :)
Thanks!
Gui
On Monday, July 02, 2012 16:36:04 Antonio Quartulli wrote:
Recently we fixed a bug that which fix has not been released yet. If we are sure that this is the cause, you could eventually try an upgrade to a more recente dev-version. But let's see the log first (if possible)
You don't need the development version. I pushed these fixes into the latest batman-adv trunk package. If you update your package you should get them.
Cheers, Marek
Hi Marek! Just to confirm and avoid useless compiling:
PKG_VERSION:=2012.2.0
BATCTL_VERSION:=2012.2.0
PKG_MD5SUM:=68967ed1df709de18ab795722dde9341
BATCTL_MD5SUM:=7abd284098c514d3f2858e8a956c495e
~/trunk/feeds/packages/net/batman-adv$ svn info .
Path: .
URL: svn://svn.openwrt.org/openwrt/packages/net/batman-adv
Repository Root: svn://svn.openwrt.org/openwrt
Repository UUID: 3c298f89-4303-0410-b956-a3cf2f4a3e73
Revision: 32578
Node Kind: directory
Schedule: normal
Last Changed Author: marek
Last Changed Rev: 32578
Last Changed Date: 2012-07-02 12:51:27 -0300 (Mon, 02 Jul 2012)
Given the date and the author ;) I assume this rev should do the trick, right?
Thanks a lot!
Gui
On Mon, Jul 2, 2012 at 12:52 PM, Marek Lindner lindner_marek@yahoo.de wrote:
You don't need the development version. I pushed these fixes into the latest batman-adv trunk package. If you update your package you should get them.
Cheers, Marek
On Monday, July 02, 2012 18:11:24 Guido Iribarren wrote:
Hi Marek! Just to confirm and avoid useless compiling PKG_VERSION:=2012.2.0 BATCTL_VERSION:=2012.2.0 PKG_MD5SUM:=68967ed1df709de18ab795722dde9341 BATCTL_MD5SUM:=7abd284098c514d3f2858e8a956c495e
~/trunk/feeds/packages/net/batman-adv$ svn info . Path: . URL: svn://svn.openwrt.org/openwrt/packages/net/batman-adv Repository Root: svn://svn.openwrt.org/openwrt Repository UUID: 3c298f89-4303-0410-b956-a3cf2f4a3e73 Revision: 32578 Node Kind: directory Schedule: normal Last Changed Author: marek Last Changed Rev: 32578 Last Changed Date: 2012-07-02 12:51:27 -0300 (Mon, 02 Jul 2012)
Given the date and the author ;) I assume this rev should do the trick, right?
Yes, that looks about right. If you wish to update only the package and not the full image, you should update one more time, because Jow reminded me to increase the package version.
Cheers, Marek
Resurrecting thread...
On Mon, Jul 2, 2012 at 11:36 AM, Antonio Quartulli ordex@autistici.org wrote:
Hello!
Has debug support been compiled in batman-adv? IF yes, it would be interesting so see the output of the tt log (batctl ll tt; batctl l)
Ah, I should have re-read this before :(
Recently we fixed a bug that which fix has not been released yet. If we are sure that this is the cause, you could eventually try an upgrade to a more recente dev-version. But let's see the log first (if possible) -- Antonio Quartulli
Last week I came across this bug again, with the latest firmware, which includes the fixes mentioned, pushed by Marek. We were kind of in a hurry, so I didn't have much time to check it thoroughly, and there's a *slim* chance it was just a coincidence, such as a very poor signal giving erratic results. But if I recall correctly Nico Echaniz did stumble on this too, using the latest firmware. So, although I can't confirm it 100%, it seems so far the fixes didn't help :(
We'll keep an eye on it and try a "batctl l"
Cheers!
Gui
Hi Guido,
On Fri, Jul 20, 2012 at 05:25:46PM -0300, Guido Iribarren wrote:
Last week I came across this bug again, with the latest firm which includes the fixes mentioned, pushed by Marek. We were kinda in a hurry so i didn't have much time to check it thoroughly, so there's a *slim* chance it was just a coincidence, such as very poor signal giving erratic results. But if I recall correctly Nico Echaniz did stump on this too, using the latest firm.
How did you solve it then? Rebooting?
So, although i can't confirm it 100%, it seems so far the fixes didn't help :(
We'll keep an eye on it and try a "batctl l"
Yes, please. Remember to set the TT log level (batctl ll tt) before launching batctl l. Actually it would be very interesting to see the log of the involved nodes during the "wrong behaviour period".
However, please keep an eye on the log anyway and report if you get any message matching "*inconsistency*" (but report the whole surrounding part of the log, not only this message). When you see those messages, please make sure that no client is connecting at that time (if one is, it could be the normal procedure). If you get this message, you should also check which node is involved in the inconsistency (it is reported in the message too) and please report the tt log from that node too.
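If the log has been saved to a file, something along these lines could pull out the matching messages together with their surroundings. It is only a sketch: the substring "inconsistency" is the one mentioned above, while the function name, the window sizes and the sample lines are made up for illustration and are not real batman-adv output.

```python
def inconsistency_context(lines, needle="inconsistency", before=5, after=5):
    """Return the log lines around every line containing `needle`.

    Matching is case-insensitive. Overlapping windows are merged and the
    original order of the lines is preserved.
    """
    keep = set()
    for i, line in enumerate(lines):
        if needle.lower() in line.lower():
            first = max(0, i - before)
            last = min(len(lines), i + after + 1)
            keep.update(range(first, last))
    return [lines[i] for i in sorted(keep)]


# Placeholder lines standing in for a saved "batctl l" capture.
sample = [
    "placeholder line 1",
    "placeholder line 2",
    "placeholder line mentioning an inconsistency",
    "placeholder line 4",
    "placeholder line 5",
]

for line in inconsistency_context(sample, before=1, after=1):
    print(line)
```

Keeping a few lines on each side follows the request to report the whole part of the log rather than the single message.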
Thank you very much!
On Sat, Jul 21, 2012 at 6:38 PM, Antonio Quartulli ordex@autistici.org wrote:
How did you solve it then? Rebooting?
A reboot did, yes.
Yes, please. Remember to set the TT log level (batctl ll tt) before launching batctl l. Actually it would be very interesting to see the log of the involved nodes during the "wrong behaviour period".
This time it solved itself after some brief time (a minute) but the symptoms were the same. So I could catch some logs, http://pastebin.com/MEENj94i
sadly, i wasn't fast enough to get a live log from the node involved in the inconsistency as you suggested, so the report might be pretty useless. But at least now I got an idea where we are heading :)
Thank you very much!
Thanks a lot for your support people!
Gui
On Sun, Jul 22, 2012 at 7:57 AM, Guido Iribarren guidoiribarren@buenosaireslibre.org wrote:
This time it solved itself after some brief time (a minute) but the symptoms were the same. So I could catch some logs, http://pastebin.com/MEENj94i
sadly, i wasn't fast enough to get a live log from the node involved in the inconsistency as you suggested, so the report might be pretty useless.
On this particular node, the one I ran the previous report from (colmena-casa), which was rebooted recently, L3 ping to all of the network had the same issue (no replies for a minute or so), so I had the chance to "recreate" the situation several times. Turns out, a "batctl ll tt ; batctl l" on the nodes mentioned in the inconsistencies gave no output at all, so the previous pastebin report is in fact complete :P Looks like the inconsistency is being resolved locally between neighbours, without the need to contact the far end of the network (which is consistent with what's described in the wiki).
In any case, AFAIR previous occurrences of the bug didn't resolve by themselves (in a reasonable amount of time), so what I'm looking at now might be perfectly normal behaviour? (tt tables take some time to propagate?)
On Sun, Jul 22, 2012 at 08:20:21AM -0300, Guido Iribarren wrote:
from this particular node i ran previous report (colmena-casa) that was rebooted recently, L3 ping to all of the network had the same issue, (no replies for a minute or so) so i had the chance to "recreate" the situation several times. Turns out, a "batctl ll tt ; batctl l" on the nodes mentioned in the inconsistencies gave no output at all, so the previous pastebin report is in fact complete :P Looks like the inconsistency is being resolved locally between neighbours, without the need to contact the far end of the network (which is coherent with what's described in the wiki)
Exactly! If the neighbour has the needed information, the node can directly get answered without bothering the real destination ;)
In any case, AFAIR previous ocurrences of the bug didn't resolve by themselves (in a reasonable amount of time) so what I'm looking at now might be perfectly normal behaviour? (tt tables take some time to propagate?)
Well, the log you posted is perfectly correct. You missed some OGMs, therefore the node is asking for an update that it missed.
It would be interesting to run batctl ll tt; batctl l all the time on the node that usually experiences the "problem". The log should not be that big, unless the bug happens.
Cheers,
On Mon, Jul 23, 2012 at 2:28 PM, Antonio Quartulli ordex@autistici.org wrote:
Well, the log you posted is perfectly correct. You missed some OGMs, therefore the node is asking for an update that he missed.
it would be interesting to run batctl ll tt; batctl l all the time on the node that usually experiences the "problem". The log should be not so big, unless the bug happens.
I admit I haven't left this running as instructed, but on the other hand, so far I haven't come across the original bug again, and a few days ago I asked Nico Echaniz, who confirmed that he's not suffering it as before. He does bump from time to time into [a few moments | a few minutes] of "nodes majaretas" (at first sight), but it resolves by itself quickly[*], which indicates normal behaviour: missing OGMs and consequently a delay in TT table updating, as you explained.
[*] "quickly" means under 15 minutes, at most. Previously, the problem would never resolve by itself, with nodes being L3-unreachable for hours or days until a manual reboot was done.
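To tell the two cases apart with something better than memory, a periodic ping record could be reduced to its longest outage. The sketch below assumes such a record exists as (timestamp, reply received) pairs; the 15 minute limit is the figure from the footnote, and every name in the code is hypothetical.

```python
from datetime import datetime, timedelta

# Threshold taken from the footnote: outages shorter than this healed "quickly".
SELF_HEAL_LIMIT = timedelta(minutes=15)


def longest_outage(samples):
    """Return the longest run of failed probes as a timedelta.

    samples: (timestamp, reply_received) pairs sorted by time. An outage runs
    from the first failed probe to the next successful one; if it never
    recovers, it is measured up to the last sample (a lower bound).
    """
    longest = timedelta(0)
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            longest = max(longest, ts - start)
            start = None
        last_ts = ts
    if start is not None and last_ts is not None:
        longest = max(longest, last_ts - start)
    return longest


def looks_transient(samples):
    """True if every outage in the record stayed under SELF_HEAL_LIMIT."""
    return longest_outage(samples) < SELF_HEAL_LIMIT


t0 = datetime(2012, 8, 5, 2, 0)
brief = [
    (t0, True),
    (t0 + timedelta(minutes=1), False),
    (t0 + timedelta(minutes=3), False),
    (t0 + timedelta(minutes=4), True),
]
print(longest_outage(brief), looks_transient(brief))
```

An outage that is still open at the end of the record is only a lower bound, so a long capture is needed before calling anything persistent.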
In conclusion, so far so good; I think we can close this as fixed for lack of evidence to the contrary, heh. I hope gioacchino managed to recompile the ninux images and is enjoying the same stability as we do :)
Gui
On Sun, Aug 05, 2012 at 02:34:15AM -0300, Gui Iribarren wrote:
I admit i haven't left this running as instructed, but on the other hand, so far I haven't come across the original bug again, and a few days ago I asked Nico Echaniz which confirmed that he's not suffering it as previously. he does bump from time to time with [a few moments | a few minutes] of "nodes majaretas" (at first sight) but it resolves by itself quickly[*], which indicates normal behaviour, of missing OGMs and consequently a delay in TT table updating, as you explained.
[*] "quickly" means under 15 minutes , at most. Previously, problem would never resolve by itself, being L3-unreachable for hours or days until manual reboot was done.
In conclusion, so far so good, i think we can close this as fixed for lack of evidence stating the contrary, heh. I hope gioacchino managed to recompile ninux images and is having the same stableness as we do :)
Gui
Hello Guido and thank you for reporting back your results :) However, even if the "behaviour" is good (table gets recovered and everything starts working again) it is a bit strange that it takes 15 minutes to do so.
If you accidentally see the bug, it would be interesting to get the log of the "non-working" node and see why it is taking so long.
Thank you very much!
Cheers,
That bug has been happening in Pisa sometimes; I have discussed it with Antonio too.
Hope more test cases can help to understand what is happening!
On 07/02/12 15:30, Guido Iribarren wrote:
(which roughly translates as "batman gone nuts?") Hey great devs! we've been having a particular issue in deltalibre and quintanalibre (local WCN) with batman-adv, but so far we haven't found a precise way to reproduce it. The symptom is that (after some reboots or physical displacements?) one batman-adv host becomes unreachable on layer3, although it is seen on originators table, and can be batctl ping'ed or batctl tracerout'ed with no problem whatsoever.
Even more, it not unreachable from the whole network, but instead from just a few other nodes. So, let's say that the nearer nodes can layer3 ping it , but some others farther away cannot (although i can't assure it depends on the hop distance) All of them can batctl ping it (layer2) A hard reboot of all the nodes solves it, connectivity is restored in all directions.
Thing is, I've just came across it again, and managed to do some tests to aid in description / debugging As an aid in understanding network topology, I'm attaching the wonderful output of "batctl vd dot |grep -v TT" for your viewing delight
problem node is ana it can be reached from ruth and hquilla (direct neighbours) but arping behaves erratically from colmena or charly and normal ping (v4 or v6) doesn't receive any reply at all when run from colmena or charly
I used arping, with and without -b , and seemed like i could narrow the problem down to incoming broadcast packet handling, but further tests just left me more puzzled!
all nodes are tl-mr3220 running openwrt trunk r31316 with batman-adv 2012.2.0 , driver ath9k secondary interfaces named _wlan1 are all tl-wn722n which uses driver ath9k_htc nodes are around 100meters (+/-50mts) apart from each other
this behaviour has been observed (but not reported) in dissimilar setups, using ubnt bullet2 mixed with mr3220, running r29936 with batman-adv 2011.4.0 , with nodes 1 or 2km apart from each other.
Tests are the combined crude output of batctl td and arping, so to make this email ease on the eye, i'm publishing them elsewhere: http://pastebin.com/6PPwN3PS
The live openwrt configuration can be analysed in detail at https://bitbucket.org/guidoi/deltalibre-configs/src (it's a free, open network after all! :D ) in particular: ana -> https://bitbucket.org/guidoi/deltalibre-configs/src/6de4ce970fe2/mac/54_E6_F... hquilla -> https://bitbucket.org/guidoi/deltalibre-configs/src/6de4ce970fe2/mac/54_E6_F... colmena -> https://bitbucket.org/guidoi/deltalibre-configs/src/6de4ce970fe2/mac/54_E6_F...
Thanks a lot for the attention, Hope that you are having fun, and that I'm not spoiling it :)
Cheers!
Gui
On Mon, Jul 02, 2012 at 06:39:49PM +0200, Gioacchino Mazzurco wrote:
That bug was happening in Pisa some times I have discussed about that antonio too
yeah, it was pretty much the same! I hope Guido can give us good results after testing the new patches :-)
You may want to give them a try too?? :):)
Cheers,
On 07/02/2012 01:42 PM, Antonio Quartulli wrote:
yeah, it was pretty much the same! I hope Guido can give us good results after testing the new patches :-)
You may want to give them a try too?? :):)
I just wanted to confirm that I've come across this bug quite often but my setup is less tidy than guido's so it's more complex to debug.
I can add that quite recently we started in a nearby town a new WCN project and we hit this bug the same day we put the first two nodes online; they could bat-ping alright but no ping at all. All started working after a reboot of one of the nodes.
Guido noted that this bug frequently shows up when we configure a node at some point of the mesh (an admin's home for instance) and then move this node to its final location. If I get the time to do so I'll try to test if this is really the case or just a coincidence so far.
Cheers, NicoEchániz
On Tue, Jul 3, 2012 at 9:34 AM, Nicolás Echániz nicoechaniz@codigosur.org wrote:
I just wanted to confirm that I've come across this bug quite often but my setup is less tidy than guido's so it's more complex to debug.
I can add that quite recently we started in a nearby town a new WCN project and we hit this bug the same day we put the first two nodes online; they could bat-ping alright but no ping at all. All started working after a reboot of one of the nodes.
Guido noted that this bug is frequently apparent when we configure a node in some point of the mesh (an admin's home for instance) and then move this node to it's final location. If I get the time to do so I'll try to test if this is really the case or just a coincidence so far.
Admittedly, using the older version, in my 25-node mesh, I have also wondered why nodes seemingly disappear without a trace when doing an nmap. As L2 throughput still works I haven't bothered to investigate. On the upgrade note, is there a way to upgrade to 2012 without reflashing the node?
Wayne A
On Tuesday, July 03, 2012 09:52:09 Wayne Abroue wrote:
Admittedly, using the older version, In my 25 node mesh, I have also wondered why nodes seemingly disappear without trace when doing a nmap. As L2 throughput still works I haven't bothered to investigate. On the upgrade note, Is there a way to upgrade to 2012 without reflashing the node?
You can build a new package and install that. Note that you should build this package with the exact same build environment you currently have running.
Regards, Marek
On Tue, Jul 3, 2012 at 10:07 AM, Marek Lindner lindner_marek@yahoo.de wrote:
You can build a new package and install that. Note that you should build this package with the exact same build environment you currently have running.
Thanks Marek. Unfortunately all my nodes run one or another default openwrt version, depending on ubnt/Mp/wrt driver compatibility. Would it maybe be viable to add a package to older versions of the openwrt repos, i.e. Batman-adv-new_stable, to make upgrading an easier exercise for us non-build-oriented types?
Wayne
On Tuesday, July 03, 2012 10:27:55 Wayne Abroue wrote:
Thanks Marek, Unfortunately all my nodes run one or other default openwrt version depending on ubnt/Mp/wrt driver compat . Would it maybe be viable to add a package to older versions of openwrt repo's i.e. Batman-adv-new_stable?
I don't quite follow you. Your nodes run pre-compiled images from the openwrt snapshot directory? If so, there isn't much we can do. As far as I know the openwrt team builds snapshots from time to time. Whenever they do, they also upgrade the entire platform - these packages may or may not be backward compatible.
Take this with a grain of salt. I don't really know how they are doing it. You should contact the OpenWrt developers because you use their binaries (unless I am totally on the wrong track).
Regards, Marek
Same bug today in ninux pisa: after a node was turned off, the entire network went crazy for 2 hours; to solve it I had to restart a lot of nodes... :|
Hello Gioacchino,
On Sat, Jul 21, 2012 at 08:54:05PM +0200, Gioacchino Mazzurco wrote:
Same bug today in ninux pisa after a node was turned off the entire network became crazy for 2 hours, to solve i had to restart a lot of nodes... :|
Which version are you using? The latest openwrt package version (so with all the new patches)?
Could you provide the log of the involved nodes whenever you get these problems? I wrote something about the desired logs to Guido; you could follow the same instructions. It would really be appreciated!
Thank you!
Cheers,
I'll compile batman-adv with debug support for the next firmware update. I hope it will not affect performance too much...
b.a.t.m.a.n@lists.open-mesh.org