[B.A.T.M.A.N.] batman majareta? I can batctl ping but not ping


Guido Iribarren

2 Jul 2012

1:30 p.m.

(which roughly translates as "batman gone nuts?") Hey great devs! We've been having a particular issue in deltalibre and quintanalibre (local WCNs) with batman-adv, but so far we haven't found a precise way to reproduce it. The symptom is that (after some reboots or physical displacements?) one batman-adv host becomes unreachable on layer 3, although it is seen in the originators table and can be batctl ping'ed or batctl traceroute'd with no problem whatsoever.

What's more, it is not unreachable from the whole network, but only from a few other nodes. So, let's say that the nearer nodes can layer 3 ping it, but some others farther away cannot (although I can't be sure it depends on the hop distance). All of them can batctl ping it (layer 2). A hard reboot of all the nodes solves it; connectivity is restored in all directions.
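The symptom above (layer 2 reachable, layer 3 not) can be checked from any node with a small script. This is only a rough sketch: the function names are made up for illustration, and it assumes that both `batctl ping -c` and the regular `ping -c` exit with a non-zero status when no reply arrives, which should be verified on the installed versions.

```python
import subprocess


def classify(l2_ok, l3_ok):
    """Map the two probe results onto the cases described in the report."""
    if l2_ok and l3_ok:
        return "reachable"
    if l2_ok:
        return "l2-only"      # the symptom reported in this thread
    if l3_ok:
        return "l3-only"      # unexpected, worth a second look
    return "unreachable"


def probe(cmd, timeout=30.0):
    """Run a command quietly and report whether it exited with status 0."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def check_node(mac, ip, count=3):
    """Probe one node on layer 2 (batctl ping) and layer 3 (ping)."""
    # Assumption: a zero exit status means at least one reply was received.
    l2_ok = probe(["batctl", "ping", "-c", str(count), mac])
    l3_ok = probe(["ping", "-c", str(count), ip])
    return classify(l2_ok, l3_ok)
```

Running `check_node()` for the same target from several nodes would give the "who can reach whom" picture described above; a result of `"l2-only"` corresponds to the problem case.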

Thing is, I've just come across it again, and managed to do some tests to aid in description / debugging. As an aid in understanding the network topology, I'm attaching the wonderful output of "batctl vd dot |grep -v TT" for your viewing delight.

The problem node is ana. It can be reached from ruth and hquilla (direct neighbours), but arping behaves erratically from colmena or charly, and normal ping (v4 or v6) doesn't receive any reply at all when run from colmena or charly.

I used arping, with and without -b, and it seemed like I could narrow the problem down to incoming broadcast packet handling, but further tests just left me more puzzled!

All nodes are tl-mr3220 running openwrt trunk r31316 with batman-adv 2012.2.0, driver ath9k. Secondary interfaces named _wlan1 are all tl-wn722n, which uses driver ath9k_htc. Nodes are around 100 meters (+/- 50 m) apart from each other.

This behaviour has been observed (but not reported) in dissimilar setups, using ubnt bullet2 mixed with mr3220, running r29936 with batman-adv 2011.4.0, with nodes 1 or 2 km apart from each other.

The tests are the combined crude output of batctl td and arping, so to make this email easy on the eye, I'm publishing them elsewhere: http://pastebin.com/6PPwN3PS

The live openwrt configuration can be analysed in detail at https://bitbucket.org/guidoi/deltalibre-configs/src (it's a free, open network after all! :D ), in particular:
ana -> https://bitbucket.org/guidoi/deltalibre-configs/src/6de4ce970fe2/mac/54_E6_F...
hquilla -> https://bitbucket.org/guidoi/deltalibre-configs/src/6de4ce970fe2/mac/54_E6_F...
colmena -> https://bitbucket.org/guidoi/deltalibre-configs/src/6de4ce970fe2/mac/54_E6_F...

Thanks a lot for the attention, Hope that you are having fun, and that I'm not spoiling it :)

Cheers!

Gui

Attachments:

04.png (image/png — 137.6 KB)


Guido Iribarren

2 Jul

1:57 p.m.

...

I used arping, with and without -b , and seemed like i could narrow the problem down to incoming broadcast packet handling, but further tests just left me more puzzled!

Well, it seems colmena is the uncooperative bathost. Another log: http://pastebin.com/FMD9Lieq which can be summarized as follows:

### From COLMENA-CASA, can ping bochita but not ana
### From PEREYRA, can ping bochita but not ana
### From COLMENA, works perfect to both destinations

colmena-casa and pereyra must pass through colmena, which for some reason lets batctl pings, OGMs and whatnot pass through on their way to ana, but no ICMP echo requests or TCP traffic whatsoever if the final destination is ana. If the final destination is bochita, everything works as expected.

Any ideas?

I'm going to delay rebooting colmena as long as i can, in case someone comes up with an insightful test to run :)

Gui

Antonio Quartulli

2:36 p.m.

On Mon, Jul 02, 2012 at 10:57:57AM -0300, Guido Iribarren wrote:

...

...
I used arping, with and without -b , and seemed like i could narrow the problem down to incoming broadcast packet handling, but further tests just left me more puzzled!

Well, seems colmena is the uncooperative bathost another log: http://pastebin.com/FMD9Lieq that can be summarized as follows

### From COLMENA-CASA, can ping bochita but not ana ### From PEREYRA, can ping bochita but not ana ### From COLMENA, works perfect to both destinations

colmena-casa and pereyra must pass through colmena, which is for some reason allowing batctl pings , ogms , and whatnot passthrough in its way to ana, but no ICMP echo requests, or tcp traffic whatsoever if it's final destination is ana. if final destination is bochita, everything works as expected.

Any ideas?

I'm going to delay rebooting colmena as long as i can, in case someone comes up with an insightful test to run :)

Hello!

Has debug support been compiled into batman-adv? If yes, it would be interesting to see the output of the TT log (batctl ll tt; batctl l)

Recently we fixed a bug, but the fix has not been released yet. If we are sure that this is the cause, you could try upgrading to a more recent dev version. But let's see the log first (if possible)

Cheers,

...

Gui

-- Antonio Quartulli ..each of us alone is worth nothing.. Ernesto "Che" Guevara

Guido Iribarren

2:47 p.m.

Hello Antonio! thanks for your time,

On Mon, Jul 2, 2012 at 11:36 AM, Antonio Quartulli ordex@autistici.org wrote:

...

Hello!

Has debug support been compiled in batman-adv? IF yes, it would be interesting so see the output of the tt log (batctl ll tt; batctl l)

unfortunately, no :(

root@colmena:~# batctl ll
Error - can't open file '/sys/class/net/bat0/mesh/log_level': No such file or directory
The option you called seems not to be compiled into your batman-adv kernel module.

Will compile that option on next firmware cooking :)
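Before asking for logs on many nodes, the same check can be done without calling batctl, by looking for the file named in the error message above. A minimal sketch, assuming the mesh interface is called bat0 as in that output; the function name and the `sysfs_root` parameter are made up for illustration:

```python
import os


def debug_log_available(mesh_iface="bat0", sysfs_root="/sys/class/net"):
    """Return (available, path) for the batman-adv log_level file.

    The path layout is taken from the batctl error message quoted above;
    other batman-adv versions may expose the setting elsewhere.
    """
    path = os.path.join(sysfs_root, mesh_iface, "mesh", "log_level")
    return os.path.exists(path), path


if __name__ == "__main__":
    ok, path = debug_log_available()
    print("%s: %s" % (path, "present" if ok else "missing (no debug log support?)"))
```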

...

Recently we fixed a bug that which fix has not been released yet. If we are sure that this is the cause, you could eventually try an upgrade to a more recente dev-version. But let's see the log first (if possible)

Problem is, it's not easy to reproduce. I hadn't come across it for several weeks. Nicolas Echaniz told me he suffered from it recently, but I don't think either of us can spend the time to try to recreate it on purpose :(

Having debug support enabled and waiting for the bug to crop up is probably the best we can hope for :)

Thanks!

Gui

Marek Lindner

3:52 p.m.

On Monday, July 02, 2012 16:36:04 Antonio Quartulli wrote:

...

Recently we fixed a bug that which fix has not been released yet. If we are sure that this is the cause, you could eventually try an upgrade to a more recente dev-version. But let's see the log first (if possible)

You don't need the development version. I pushed these fixes into the latest batman-adv trunk package. If you update your package you should get them.

Cheers, Marek

Guido Iribarren

4:11 p.m.

Hi Marek! Just to confirm and avoid useless compiling:

PKG_VERSION:=2012.2.0
BATCTL_VERSION:=2012.2.0
PKG_MD5SUM:=68967ed1df709de18ab795722dde9341
BATCTL_MD5SUM:=7abd284098c514d3f2858e8a956c495e

~/trunk/feeds/packages/net/batman-adv$ svn info .
Path: .
URL: svn://svn.openwrt.org/openwrt/packages/net/batman-adv
Repository Root: svn://svn.openwrt.org/openwrt
Repository UUID: 3c298f89-4303-0410-b956-a3cf2f4a3e73
Revision: 32578
Node Kind: directory
Schedule: normal
Last Changed Author: marek
Last Changed Rev: 32578
Last Changed Date: 2012-07-02 12:51:27 -0300 (Mon, 02 Jul 2012)

Given the date and the author ;) I assume this rev should do the trick, right?

Thanks a lot!

Gui


Marek Lindner

4:26 p.m.

On Monday, July 02, 2012 18:11:24 Guido Iribarren wrote:

...

Hi Marek! Just to confirm and avoid useless compiling PKG_VERSION:=2012.2.0 BATCTL_VERSION:=2012.2.0 PKG_MD5SUM:=68967ed1df709de18ab795722dde9341 BATCTL_MD5SUM:=7abd284098c514d3f2858e8a956c495e

~/trunk/feeds/packages/net/batman-adv$ svn info . Path: . URL: svn://svn.openwrt.org/openwrt/packages/net/batman-adv Repository Root: svn://svn.openwrt.org/openwrt Repository UUID: 3c298f89-4303-0410-b956-a3cf2f4a3e73 Revision: 32578 Node Kind: directory Schedule: normal Last Changed Author: marek Last Changed Rev: 32578 Last Changed Date: 2012-07-02 12:51:27 -0300 (Mon, 02 Jul 2012)

Given the date and the author ;) I assume this rev should do the trick, right?

Yes, that looks about right. If you wish to update the package and not the full image you should update one more time, because Jow reminded me to increase the package version.

Cheers, Marek

Guido Iribarren

20 Jul

8:25 p.m.

Resurrecting thread...

On Mon, Jul 2, 2012 at 11:36 AM, Antonio Quartulli ordex@autistici.org wrote:

...

Hello!

Has debug support been compiled in batman-adv? IF yes, it would be interesting so see the output of the tt log (batctl ll tt; batctl l)

Ah, I should have re-read this before :(

...

Recently we fixed a bug that which fix has not been released yet. If we are sure that this is the cause, you could eventually try an upgrade to a more recente dev-version. But let's see the log first (if possible) -- Antonio Quartulli

Last week I came across this bug again, with the latest firmware, which includes the fixes mentioned, pushed by Marek. We were kinda in a hurry so I didn't have much time to check it thoroughly, so there's a *slim* chance it was just a coincidence, such as very poor signal giving erratic results. But if I recall correctly Nico Echaniz did stumble on this too, using the latest firmware. So, although I can't confirm it 100%, it seems so far the fixes didn't help :(

We'll keep an eye on it and try a "batctl l"

Cheers!

Gui

Antonio Quartulli

21 Jul

9:38 p.m.

Hi Guido,

On Fri, Jul 20, 2012 at 05:25:46PM -0300, Guido Iribarren wrote:

...

Resurrecting thread...

On Mon, Jul 2, 2012 at 11:36 AM, Antonio Quartulli ordex@autistici.org wrote:

...
Hello!

Has debug support been compiled in batman-adv? IF yes, it would be interesting so see the output of the tt log (batctl ll tt; batctl l)

Ah, I should have re-read this before :(

...
Recently we fixed a bug that which fix has not been released yet. If we are sure that this is the cause, you could eventually try an upgrade to a more recente dev-version. But let's see the log first (if possible) -- Antonio Quartulli

Last week I came across this bug again, with the latest firm which includes the fixes mentioned, pushed by Marek. We were kinda in a hurry so i didn't have much time to check it thoroughly, so there's a *slim* chance it was just a coincidence, such as very poor signal giving erratic results. But if I recall correctly Nico Echaniz did stump on this too, using the latest firm.

How did you solve it then? Rebooting?

...

So, although i can't confirm it 100%, it seems so far the fixes didn't help :(

We'll keep an eye on it and try a "batctl l"

Yes, please. Remember to set the TT log level (batctl ll tt) before launching batctl l. Actually it would be very interesting to see the log of the involved nodes during the "wrong behaviour period".

However, please keep an eye on the log anyway and report if you get any message matching "*inconsistency*" (but report the whole surrounding part of the log, not only this message). When you see those messages, please make sure that no client is connecting at that time (if one is, it could be the normal procedure). If you get this message, you should also see which node is involved in the inconsistency (it is reported in the message too); please report the TT log from that node too.
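Pulling the relevant parts out of a saved log could be sketched like this. It only does a case-insensitive substring match on "inconsistency" and keeps a few lines of context around each hit, so that the surrounding part of the log is reported and not just the matching message. The function name and the default of 5 context lines are arbitrary choices, and nothing here depends on the exact batman-adv log format:

```python
def inconsistency_windows(lines, context=5, needle="inconsistency"):
    """Return chunks of the log around every line containing `needle`.

    Overlapping or touching chunks are merged so no line is reported twice.
    """
    needle = needle.lower()
    hits = [i for i, line in enumerate(lines) if needle in line.lower()]

    windows = []
    for i in hits:
        start = max(0, i - context)
        end = min(len(lines), i + context + 1)
        if windows and start <= windows[-1][1]:
            # Extend the previous window instead of opening a new one.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))

    return [lines[s:e] for s, e in windows]
```

For a log saved to a file, something like `inconsistency_windows(open("tt.log").read().splitlines())` would return the chunks to paste into a report (the file name is just an example).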

Thank you very much!

-- Antonio Quartulli ..each of us alone is worth nothing.. Ernesto "Che" Guevara

Guido Iribarren

22 Jul

10:57 a.m.

On Sat, Jul 21, 2012 at 6:38 PM, Antonio Quartulli ordex@autistici.org wrote:

...

...
Last week I came across this bug again, with the latest firm which includes the fixes mentioned, pushed by Marek. We were kinda in a hurry so i didn't have much time to check it thoroughly, so there's a *slim* chance it was just a coincidence, such as very poor signal giving erratic results. But if I recall correctly Nico Echaniz did stump on this too, using the latest firm.

How did you solve it then? Rebooting?

A reboot did, yes.

...

...
So, although i can't confirm it 100%, it seems so far the fixes didn't help :(

We'll keep an eye on it and try a "batctl l"

Yes, please. Remember to set the TT log level (batctl ll tt) before launching batctl l. Actually it would be very interesting to see the log of the involved nodes during the "wrong behaviour period".

This time it solved itself after some brief time (a minute) but the symptoms were the same. So I could catch some logs, http://pastebin.com/MEENj94i

sadly, i wasn't fast enough to get a live log from the node involved in the inconsistency as you suggested, so the report might be pretty useless. But at least now I got an idea where we are heading :)

...

Thank you very much!

Thanks a lot for your support people!

Gui

Guido Iribarren

11:20 a.m.

On Sun, Jul 22, 2012 at 7:57 AM, Guido Iribarren guidoiribarren@buenosaireslibre.org wrote:

...

This time it solved itself after some brief time (a minute) but the symptoms were the same. So I could catch some logs, http://pastebin.com/MEENj94i

sadly, i wasn't fast enough to get a live log from the node involved in the inconsistency as you suggested, so the report might be pretty useless.

On the node I ran the previous report from (colmena-casa), which was rebooted recently, L3 ping to the whole network had the same issue (no replies for a minute or so), so I had the chance to "recreate" the situation several times. It turns out that "batctl ll tt ; batctl l" on the nodes mentioned in the inconsistencies gave no output at all, so the previous pastebin report is in fact complete :P It looks like the inconsistency is being resolved locally between neighbours, without the need to contact the far end of the network (which is consistent with what's described in the wiki).

In any case, AFAIR previous occurrences of the bug didn't resolve by themselves (in a reasonable amount of time), so what I'm looking at now might be perfectly normal behaviour? (TT tables take some time to propagate?)

Antonio Quartulli

23 Jul

5:28 p.m.

On Sun, Jul 22, 2012 at 08:20:21AM -0300, Guido Iribarren wrote:

...

On Sun, Jul 22, 2012 at 7:57 AM, Guido Iribarren guidoiribarren@buenosaireslibre.org wrote:

...
This time it solved itself after some brief time (a minute) but the symptoms were the same. So I could catch some logs, http://pastebin.com/MEENj94i

sadly, i wasn't fast enough to get a live log from the node involved in the inconsistency as you suggested, so the report might be pretty useless.

from this particular node i ran previous report (colmena-casa) that was rebooted recently, L3 ping to all of the network had the same issue, (no replies for a minute or so) so i had the chance to "recreate" the situation several times. Turns out, a "batctl ll tt ; batctl l" on the nodes mentioned in the inconsistencies gave no output at all, so the previous pastebin report is in fact complete :P Looks like the inconsistency is being resolved locally between neighbours, without the need to contact the far end of the network (which is coherent with what's described in the wiki)

Exactly! If the neighbour has the needed information, the node can directly get answered without bothering the real destination ;)

...

In any case, AFAIR previous ocurrences of the bug didn't resolve by themselves (in a reasonable amount of time) so what I'm looking at now might be perfectly normal behaviour? (tt tables take some time to propagate?)

Well, the log you posted is perfectly correct. You missed some OGMs, therefore the node is asking for the update that it missed.

It would be interesting to run batctl ll tt; batctl l all the time on the node that usually experiences the "problem". The log should not get very big unless the bug happens.
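One way to leave the log running unattended is a small wrapper that appends each line to a file with a wall-clock timestamp, so the log can later be lined up with the moment somebody noticed the problem. This is only a sketch: it assumes the TT log level was already set with `batctl ll tt`, that `batctl l` keeps printing new lines until interrupted, and the function names and output file are hypothetical.

```python
import subprocess
import time


def stamp(lines, clock=time.time):
    """Prefix every line with the current Unix time in whole seconds."""
    for line in lines:
        yield "%.0f %s" % (clock(), line.rstrip("\n"))


def capture(outfile, cmd=("batctl", "l")):
    """Append the timestamped output of `cmd` to `outfile` until it exits."""
    proc = subprocess.Popen(list(cmd), stdout=subprocess.PIPE, text=True)
    with open(outfile, "a") as out:
        for entry in stamp(proc.stdout):
            out.write(entry + "\n")
            out.flush()   # keep the file current in case the node is rebooted
```

On a node with little flash it is probably better to point `outfile` at a RAM-backed or remote location.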

Cheers,

-- Antonio Quartulli ..each of us alone is worth nothing.. Ernesto "Che" Guevara

Gui Iribarren

5 Aug

5:34 a.m.

On Mon, Jul 23, 2012 at 2:28 PM, Antonio Quartulli ordex@autistici.org wrote:

...

On Sun, Jul 22, 2012 at 08:20:21AM -0300, Guido Iribarren wrote:

...
On Sun, Jul 22, 2012 at 7:57 AM, Guido Iribarren guidoiribarren@buenosaireslibre.org wrote:

...
This time it solved itself after some brief time (a minute) but the symptoms were the same. So I could catch some logs, http://pastebin.com/MEENj94i

sadly, i wasn't fast enough to get a live log from the node involved in the inconsistency as you suggested, so the report might be pretty useless.

from this particular node i ran previous report (colmena-casa) that was rebooted recently, L3 ping to all of the network had the same issue, (no replies for a minute or so) so i had the chance to "recreate" the situation several times. Turns out, a "batctl ll tt ; batctl l" on the nodes mentioned in the inconsistencies gave no output at all, so the previous pastebin report is in fact complete :P Looks like the inconsistency is being resolved locally between neighbours, without the need to contact the far end of the network (which is coherent with what's described in the wiki)

Exactly! If the neighbour has the needed information, the node can directly get answered without bothering the real destination ;)

...
In any case, AFAIR previous ocurrences of the bug didn't resolve by themselves (in a reasonable amount of time) so what I'm looking at now might be perfectly normal behaviour? (tt tables take some time to propagate?)

Well, the log you posted is perfectly correct. You missed some OGMs, therefore the node is asking for an update that he missed.

it would be interesting to run batctl ll tt; batctl l all the time on the node that usually experiences the "problem". The log should be not so big, unless the bug happens.

I admit I haven't left this running as instructed, but on the other hand, so far I haven't come across the original bug again, and a few days ago I asked Nico Echaniz, who confirmed that he's not suffering from it as before. He does run into [a few moments | a few minutes] of "nodes majaretas" from time to time (at first sight), but it resolves by itself quickly[*], which indicates normal behaviour: missing OGMs and consequently a delay in TT table updating, as you explained.

[*] "quickly" means under 15 minutes , at most. Previously, problem would never resolve by itself, being L3-unreachable for hours or days until manual reboot was done.
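To tell the "resolves by itself" case from the old "stuck until reboot" case, periodic L3 ping results could be reduced to outage intervals and compared against the 15-minute figure above. A rough sketch; the sample format (timestamp in seconds, success flag) and the function names are assumptions made for illustration:

```python
SELF_HEAL_LIMIT = 15 * 60  # seconds; the "quickly" threshold mentioned above


def outages(samples):
    """Turn time-ordered (timestamp, ok) samples into (start, end) outage intervals.

    `end` is the timestamp of the first successful sample after the outage,
    or None if the node was still unreachable at the last sample.
    """
    result = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            result.append((start, ts))
            start = None
    if start is not None:
        result.append((start, None))
    return result


def needs_attention(outage, limit=SELF_HEAL_LIMIT):
    """True for outages that are still open or lasted longer than `limit`."""
    start, end = outage
    return end is None or (end - start) > limit
```

Outages flagged by `needs_attention()` are the ones for which the TT log of the affected node would be worth collecting.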

In conclusion, so far so good; I think we can close this as fixed, for lack of evidence to the contrary, heh. I hope Gioacchino managed to recompile the ninux images and is enjoying the same stability as we are :)

Gui

Antonio Quartulli

7:58 a.m.


Hello Guido and thank you for reporting back your results :) However, even if the "behaviour" is good (table gets recovered and everything starts working again) it is a bit strange that it takes 15 minutes to do so.

If you accidentally see the bug, it would be interesting to get the log of the "non-working" node and see why it is taking so long.

Thank you very much!

Cheers,

-- Antonio Quartulli ..each of us alone is worth nothing.. Ernesto "Che" Guevara

Gioacchino Mazzurco

2 Jul

4:39 p.m.

That bug was happening in Pisa sometimes; I have discussed it with Antonio too.

Hope more test cases can help to understand what is happening!


Antonio Quartulli

4:42 p.m.

On Mon, Jul 02, 2012 at 06:39:49PM +0200, Gioacchino Mazzurco wrote:

...

That bug was happening in Pisa some times I have discussed about that antonio too

yeah, it was pretty much the same! I hope Guido can give us good results after testing the new patches :-)

You may want to give them a try too?? :):)

Cheers,

-- Antonio Quartulli ..each of us alone is worth nothing.. Ernesto "Che" Guevara

Nicolás Echániz

3 Jul

7:34 a.m.

On 07/02/2012 01:42 PM, Antonio Quartulli wrote:

...

On Mon, Jul 02, 2012 at 06:39:49PM +0200, Gioacchino Mazzurco wrote:

...
That bug was happening in Pisa some times I have discussed about that antonio too

yeah, it was pretty much the same! I hope Guido can give us good results after testing the new patches :-)

You may want to give them a try too?? :):)

I just wanted to confirm that I've come across this bug quite often but my setup is less tidy than guido's so it's more complex to debug.

I can add that quite recently we started in a nearby town a new WCN project and we hit this bug the same day we put the first two nodes online; they could bat-ping alright but no ping at all. All started working after a reboot of one of the nodes.

Guido noted that this bug frequently shows up when we configure a node at some point in the mesh (an admin's home, for instance) and then move the node to its final location. If I get the time, I'll try to test whether this is really the case or just a coincidence so far.

Cheers, NicoEchániz

Wayne Abroue

7:52 a.m.

On Tue, Jul 3, 2012 at 9:34 AM, Nicolás Echániz nicoechaniz@codigosur.org wrote:

...

On 07/02/2012 01:42 PM, Antonio Quartulli wrote:

...
On Mon, Jul 02, 2012 at 06:39:49PM +0200, Gioacchino Mazzurco wrote:

...
That bug was happening in Pisa some times I have discussed about that antonio too

yeah, it was pretty much the same! I hope Guido can give us good results after testing the new patches :-)

You may want to give them a try too?? :):)

I just wanted to confirm that I've come across this bug quite often but my setup is less tidy than guido's so it's more complex to debug.

I can add that quite recently we started in a nearby town a new WCN project and we hit this bug the same day we put the first two nodes online; they could bat-ping alright but no ping at all. All started working after a reboot of one of the nodes.

Guido noted that this bug is frequently apparent when we configure a node in some point of the mesh (an admin's home for instance) and then move this node to it's final location. If I get the time to do so I'll try to test if this is really the case or just a coincidence so far.

Admittedly, using the older version in my 25-node mesh, I have also wondered why nodes seemingly disappear without a trace when running nmap. As L2 throughput still works I haven't bothered to investigate. On the upgrade note, is there a way to upgrade to 2012 without reflashing the node?

Wayne A

Marek Lindner

8:07 a.m.

On Tuesday, July 03, 2012 09:52:09 Wayne Abroue wrote:

...

Admittedly, using the older version, In my 25 node mesh, I have also wondered why nodes seemingly disappear without trace when doing a nmap. As L2 throughput still works I haven't bothered to investigate. On the upgrade note, Is there a way to upgrade to 2012 without reflashing the node?

You can build a new package and install that. Note that you should build this package with the exact same build environment you currently have running.

Regards, Marek

Wayne Abroue

8:27 a.m.

On Tue, Jul 3, 2012 at 10:07 AM, Marek Lindner lindner_marek@yahoo.de wrote:

...

On Tuesday, July 03, 2012 09:52:09 Wayne Abroue wrote:

...
Admittedly, using the older version, In my 25 node mesh, I have also wondered why nodes seemingly disappear without trace when doing a nmap. As L2 throughput still works I haven't bothered to investigate. On the upgrade note, Is there a way to upgrade to 2012 without reflashing the node?

You can build a new package and install that. Note that you should build this package with the exact same build environment you currently have running.

Thanks Marek. Unfortunately all my nodes run one or another default openwrt version, depending on ubnt/Mp/wrt driver compat. Would it maybe be viable to add a package to older versions of the openwrt repos, e.g. Batman-adv-new_stable, to make upgrading an easier exercise for us non-build-oriented types?

Wayne

...

Regards, Marek

Marek Lindner

8:37 a.m.

On Tuesday, July 03, 2012 10:27:55 Wayne Abroue wrote:

...

Thanks Marek, Unfortunately all my nodes run one or other default openwrt version depending on ubnt/Mp/wrt driver compat . Would it maybe be viable to add a package to older versions of openwrt repo's i.e. Batman-adv-new_stable?

I don't quite follow you. Do your nodes run pre-compiled images from the openwrt snapshot directory? If so, there isn't much we can do. As far as I know the openwrt team builds snapshots from time to time. Whenever they do, they also upgrade the entire platform; these packages may or may not be backward compatible.

Take this with a grain of salt. I don't really know how they are doing it. You should contact the OpenWrt developers because you use their binaries (unless I am totally on the wrong track).

Regards, Marek

Gioacchino Mazzurco

21 Jul

6:54 p.m.

Same bug today in ninux pisa: after a node was turned off, the entire network went crazy for 2 hours. To solve it I had to restart a lot of nodes... :|


Antonio Quartulli

9:40 p.m.

Hello Gioacchino,

On Sat, Jul 21, 2012 at 08:54:05PM +0200, Gioacchino Mazzurco wrote:

...

Same bug today in ninux pisa: after a node was turned off, the entire network went crazy for 2 hours. To solve it I had to restart a lot of nodes... :|

Which version are you using? The latest openwrt package version (so with all the new patches)?

Could you provide the logs of the involved nodes whenever you get these problems? I wrote something about the desired logs to Guido; you could follow the same instructions. It would really be appreciated!

Thank you!

Cheers,


-- Antonio Quartulli ..each of us alone is worth nothing.. Ernesto "Che" Guevara

Gioacchino Mazzurco

22 Jul 2012

10:54 a.m.

I'll compile batman-adv with debug support for the next firmware update. I hope it will not affect performance too much...



b.a.t.m.a.n@lists.open-mesh.org

23 comments

8 participants

participants (8)

Antonio Quartulli
Gioacchino Mazzurco
Gioacchino Mazzurco
Gui Iribarren
Guido Iribarren
Marek Lindner
Nicolás Echániz
Wayne Abroue