Hi all,
I have set up a mesh network with batman-adv running on about 10 Foneras and 4 TP-Links on OpenWRT.
At first everything seemed to work. A node on one end could ping a node on the other end over the mesh network. The ping was hopping from node to node as expected.
But sometimes some paths stop working.
Some nodes can only reach their direct neighbors via a "normal ping". A ping to a node one hop away does not work. A "batctl ping" does work!
This only happens to parts of the network and is not permanent. If I wait it will recover, but then the problem appears at another node.
Neither dmesg nor the syslog reports any errors.
Can anyone give me a hint where to look?
Tobias
Are you Tobias from tai?
Hi,
Since "batctl ping" works, I'd say your mesh works fine and you have a problem in your higher layers. Maybe a MAC address collision or an ARP timeout?
Can you provide specific examples we can go through? For instance, provide the batctl ping output to the neighbor in question, the ping error message (does it say timeout / host could not be found / etc.), a batctl traceroute to the neighbor in question and the output of the global translation table.
Are you trying to ping a 'fixed' node or a node that is roaming?
Regards, Marek
I'd also check signal strength. I have experienced this when levels are fluctuating, i.e. batctl ping works, IP does not, then it comes back.
Wayne A
Could it be a TT problem? Please try to enable the TT-related log only (using "batctl ll x", I can't remember the correct x value) and copy/paste the output of "batctl l" from the blackout until the moment when everything starts working again.
It is honorable that you want to take the blame but 2011.2.0 does not have the tt patches yet. ;-)
Cheers, Marek
Oops, sorry! I always confuse the release numbers :-) (phew)
cheers, Antonio
Sorry, I mixed up the versions I installed. I have the problems with 2011.3.0. Before, I tried with 2011.2.0, but when I flashed the nodes I took 2011.3.0.
So I tried activating the log (batctl ll 4), but it is way too much to keep running for more than a few seconds. If you really need the complete log I have to set up a syslog server to store it.
A short snippet:
[ 25706] Received TT_RESPONSE from bat53 for ttvn 1 t_size: 1 [F]
[ 25706] Deleting global tt entry 04:11:80:db:48:c8 (via 0a:18:84:1e:f6:05): originator time out
[ 25706] Creating new global tt entry: 00:02:81:b9:4c:c8 (via 0a:18:84:1e:f6:05)
[ 25707] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 20405 num_changes: 0)
[ 25707] Sending TT_REQUEST to bat53 via bat51 [F]
[ 25707] Received TT_RESPONSE from bat53 for ttvn 1 t_size: 1 [F]
[ 25707] Deleting global tt entry 00:02:81:b9:4c:c8 (via 0a:18:84:1e:f6:05): originator time out
[ 25707] Creating new global tt entry: 18:84:1e:f6:05:01 (via 0a:18:84:1e:f6:05)
[ 25707] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 19671 num_changes: 0)
[ 25707] Sending TT_REQUEST to bat53 via bat51 [F]
[ 25707] Received TT_RESPONSE from bat53 for ttvn 1 t_size: 1 [F]
[ 25707] Deleting global tt entry 18:84:1e:f6:05:01 (via 0a:18:84:1e:f6:05): originator time out
[ 25707] Creating new global tt entry: 04:11:80:e4:a8:c8 (via 0a:18:84:1e:f6:05)
[ 25708] TT inconsistency for 0a:18:84:26:35:9d. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 40594 last_crc: 53273 num_changes: 0)
Tobias
On Wed, Aug 31, 2011 at 07:05:31 +0200, Tobias wrote:
> Sorry, I mixed the versions i installed. I have the problems with 2011.3.0. Before - I tried with 2011.2.0 but when i flashed the nodes i took 2011.3.0.
Ok :-)
> [ 25706] Creating new global tt entry: 00:02:81:b9:4c:c8 (via 0a:18:84:1e:f6:05)
> [ 25707] Creating new global tt entry: 18:84:1e:f6:05:01 (via 0a:18:84:1e:f6:05)
> [ 25707] Creating new global tt entry: 04:11:80:e4:a8:c8 (via 0a:18:84:1e:f6:05)
There is something strange here. Do you know where these MACs belong (00:02:81:b9:4c:c8, 04:11:80:db:48:c8, 18:84:1e:f6:05:01)? Do you have clients connected to the mesh through the batman-adv nodes?
Cheers, Antonio
Couldn't help noticing that one MAC is a one-byte right shift of the other:
  via:   0a:18:84:1e:f6:05
  entry:    18:84:1e:f6:05:01
Javier
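Javier's observation can be checked mechanically. The following is only a minimal sketch in Python: the helper names `parse_mac` and `byte_shift_offset` are made up for illustration, and the addresses are the ones from the log snippet above. It reports by how many bytes a suspicious entry looks like a shifted copy of the "via" address, if at all.

```python
from typing import List, Optional


def parse_mac(mac: str) -> List[int]:
    """Turn 'aa:bb:cc:dd:ee:ff' into a list of six byte values."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError("not a 6-byte MAC: %r" % mac)
    return [int(p, 16) for p in parts]


def byte_shift_offset(reference: str, candidate: str) -> Optional[int]:
    """Return k (1..5) if candidate begins with reference[k:], else None.

    k == 1 means the candidate looks like the reference read one byte
    too late, with one foreign byte appended at the end.
    """
    ref, cand = parse_mac(reference), parse_mac(candidate)
    for k in range(1, 6):
        if cand[: 6 - k] == ref[k:]:
            return k
    return None


via = "0a:18:84:1e:f6:05"
for entry in ("18:84:1e:f6:05:01", "00:02:81:b9:4c:c8", "04:11:80:e4:a8:c8"):
    print(entry, byte_shift_offset(via, entry))
```

Only the second-created entry lines up with the "via" address this way; the other two bogus addresses do not, so the shift alone would not explain all of them. Large offsets (k close to 5) compare very few bytes and can match by coincidence.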
I just looked on every node in the network and can't find any nodes with these MACs:
ifconfig | grep -i "4c:c8|05:01|48:c8" ; arp | grep -i "4c:c8|05:01|48:c8"
ATM there are only the mesh-nodes, one internet-router and a camera connected. The router is in the middle and the camera on one end. (3-4 hops from the internet-router)
Well, I can't reach the camera right now - maybe I have to reset the node...
When everything is working I get an HD picture from the camera every second. So the connection is good (when it's working...)
I noticed another MAC, 04:11:81:b9:42:c8, and looked in the log of node bat53:
root@fon-53:~# batctl l | grep "42:c8"
[ 87390] Creating new global tt entry: 00:02:81:b9:42:c8 (via 5e:e6:fc:ae:55:a8)
[ 87390] Deleting global tt entry 00:02:81:b9:42:c8 (via 5e:e6:fc:ae:55:a8): originator time out
The "via" is from bat51. Unfortunately I don't have logging on that node; I'll install a different image later.
The funny thing is that bat49, which is directly connected to bat51...
root@1043-49:~# batctl o
[B.A.T.M.A.N. adv 2011.3.0, MainIF/MAC: wlan2/9e:0c:6d:ee:7c:ba (bat0)]
Originator last-seen (#/255) Nexthop [outgoingIF]: Potential nexthops ...
bat60 7.640s ( 79) bat51 [ wlan2]: bat59 ( 7) bat51 ( 79)
bat58 0.160s ( 75) bat51 [ wlan2]: bat60 ( 0) bat59 ( 10) bat51 ( 75)
bat51 0.190s (230) bat51 [ wlan2]: bat59 ( 0) bat51 (230)
bat67 4.750s ( 83) bat51 [ wlan2]: bat59 ( 0) bat51 ( 83)
bat53 0.330s ( 86) bat51 [ wlan2]: bat60 ( 0) bat59 ( 12) bat51 ( 86)
bat52 4.070s (137) bat51 [ wlan2]: bat59 ( 4) bat51 (137)
bat54 0.610s ( 83) bat51 [ wlan2]: bat59 ( 7) bat51 ( 83)
bat59 9.220s ( 93) bat51 [ wlan2]: bat59 ( 10) bat51 ( 93)
bat55 14.470s ( 81) bat51 [ wlan2]: bat59 ( 6) bat51 ( 81)
...hears different broadcasts:
root@1043-49:~# batctl l | grep "42:c8"
[ 80123] Creating new global tt entry: 04:11:81:b9:42:c8 (via 0a:18:84:1e:f6:05)
[ 80124] Deleting global tt entry 04:11:81:b9:42:c8 (via 0a:18:84:1e:f6:05): originator time out
[ 80168] Creating new global tt entry: 00:02:81:b9:42:c8 (via 0a:18:84:1e:f6:05)
[ 80168] Deleting global tt entry 00:02:81:b9:42:c8 (via 0a:18:84:1e:f6:05): originator time out
[ 80182] Creating new global tt entry: 00:02:81:b9:42:c8 (via 0a:18:84:1e:f6:05)
[ 80183] Deleting global tt entry 00:02:81:b9:42:c8 (via 0a:18:84:1e:f6:05): originator time out
[ 80197] Creating new global tt entry: 00:02:81:b9:42:c8 (via 0a:18:84:1e:f6:05)
[ 80198] Deleting global tt entry 00:02:81:b9:42:c8 (via 0a:18:84:1e:f6:05): originator time out
These "via" MACs point back to bat53, but in the log of bat53 I only see that they are coming "via" bat51.
There is nothing connected to bat51 or bat53. Both are wifi-only and nobody can connect to them at the moment (except the mesh).
Why does it create a global tt entry and delete it a second later?
Tobias
Hello,
On Thu, Sep 01, 2011 at 10:15:45 +0200, Tobias wrote:
> I just looked on every node in the network and can't find any nodes with these macs. ifconfig | grep -i "4c:c8|05:01|48:c8" ; arp | grep -i "4c:c8|05:01|48:c8"
> ATM there are only the mesh-nodes, one internet-router and a camera connected. The router is in the middle and the camera on one end. (3-4 hops from the internet-router)
OK. It seems there is a bug somewhere :-) Can you copy/paste this creating/deleting part of the node log again and tell me the correct bat0 MAC address of the node announcing the change?
E.g. on fon-53 you have this log:
[ 87390] Creating new global tt entry: 00:02:81:b9:42:c8 (via 5e:e6:fc:ae:55:a8)
[ 87390] Deleting global tt entry 00:02:81:b9:42:c8 (via 5e:e6:fc:ae:55:a8): originator time out
Then go to node 5e:e6:fc:ae:55:a8 and write down its bat0 MAC address, please. Moreover, is bat0 bridged with any other interface (e.g. bat0 + eth0 within br0)?
> Why does it create a global tt entry and delete it a second later?
1 second is the OGM interval. It seems there is some inconsistency in the CRC check, which is done on OGM receipt (one per second). In case of an inconsistency the node asks for an update. When receiving the response, the node deletes the whole table (in this case) and refills it with the information contained in the response message.
Cheers, Antonio
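The check-and-refill cycle Antonio describes can be sketched roughly as follows. This is a toy Python model, not batman-adv code: the class `GlobalTable`, the function `table_crc`, and the choice of XOR-ing one CRC16 per client are assumptions made purely for illustration. It only shows why a table that never matches the announced checksum would be dropped and rebuilt on every OGM.

```python
import binascii
from typing import Callable, Iterable, Set


def table_crc(clients: Iterable[str]) -> int:
    """Order-independent 16-bit checksum over a set of client MACs.

    XOR-ing one CRC16 per entry is an assumption of this sketch.
    """
    crc = 0
    for mac in clients:
        crc ^= binascii.crc_hqx(bytes.fromhex(mac.replace(":", "")), 0)
    return crc


class GlobalTable:
    """A receiver's copy of the clients announced by one originator."""

    def __init__(self) -> None:
        self.clients: Set[str] = set()

    def on_ogm(self, announced_crc: int,
               fetch_full_table: Callable[[], Iterable[str]]) -> bool:
        """Handle one OGM; return True if a full refresh was needed."""
        if table_crc(self.clients) == announced_crc:
            return False
        # Mismatch: delete the whole local copy and refill it from the
        # response, as described in the message above.
        self.clients = set(fetch_full_table())
        return True


# Hypothetical originator that announces only its own bat0 address.
origin_clients = {"da:54:2c:e4:b9:1e"}
table = GlobalTable()
table.clients = {"04:11:80:50:38:c8"}  # stale or corrupted entry
print(table.on_ogm(table_crc(origin_clients), lambda: origin_clients))
print(table.on_ogm(table_crc(origin_clients), lambda: origin_clients))
```

If the response itself carried a wrong address, the refilled table would again fail the comparison on the next OGM, which would look like the create/delete pairs roughly one second apart in the logs.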
Sure:
root@1043-49:~# batctl l | grep "0a:18:84:1e:f6:05"
[ 125592] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 20443 num_changes: 0)
[ 125592] Deleting global tt entry 04:11:80:ca:6a:c8 (via 0a:18:84:1e:f6:05): originator time out
[ 125592] Creating new global tt entry: 04:11:80:df:ce:c8 (via 0a:18:84:1e:f6:05)
[ 125593] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 19376 num_changes: 0)
[ 125593] Deleting global tt entry 04:11:80:df:ce:c8 (via 0a:18:84:1e:f6:05): originator time out
[ 125593] Creating new global tt entry: 04:11:80:2f:c6:c8 (via 0a:18:84:1e:f6:05)
[ 125595] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 47287 num_changes: 0)
[ 125595] Deleting global tt entry 04:11:80:2f:c6:c8 (via 0a:18:84:1e:f6:05): originator time out
[ 125595] Creating new global tt entry: 04:11:80:50:3a:c8 (via 0a:18:84:1e:f6:05)
[ 125597] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 24775 num_changes: 0)
[ 125597] Creating new global tt entry: 04:11:80:2f:c6:c8 (via 0a:18:84:1e:f6:05)

root@fon-53:~# brctl show
bridge name     bridge id               STP enabled     interfaces
br-lan          8000.0018841ef604       no              eth0
                                                        edge1
                                                        wlan1
mesh            8000.0018841ef605       no              wlan0
                                                        edge0

root@fon-53:~# ifconfig
bat0      Link encap:Ethernet HWaddr DA:54:2C:E4:B9:1E
          inet addr:192.168.111.53 Bcast:192.168.111.255 Mask:255.255.255.0
br-lan    Link encap:Ethernet HWaddr 00:18:84:1E:F6:04
          inet addr:192.168.1.53 Bcast:192.168.1.255 Mask:255.255.255.0
edge0     Link encap:Ethernet HWaddr 0A:EA:66:19:D8:72
          inet addr:10.0.2.53 Bcast:10.0.2.255 Mask:255.255.255.0
edge1     Link encap:Ethernet HWaddr 6A:D6:C4:C0:D6:76
          inet addr:10.0.1.53 Bcast:10.0.1.255 Mask:255.255.255.0
eth0      Link encap:Ethernet HWaddr 00:18:84:1E:F6:04
mesh      Link encap:Ethernet HWaddr 00:18:84:1E:F6:05
          inet addr:10.0.0.53 Bcast:10.255.255.255 Mask:255.0.0.0
mon.wlan0 Link encap:UNSPEC HWaddr 00-18-84-1E-F6-05-00-47-00-00-00-00-00-00-00-00
wlan0     Link encap:Ethernet HWaddr 00:18:84:1E:F6:05
wlan1     Link encap:Ethernet HWaddr 06:18:84:1E:F6:05
wlan2     Link encap:Ethernet HWaddr 0A:18:84:1E:F6:05

As you can see we have a bridge to an n2n tunnel, but the tunnel is down at the moment. And the other wlan interfaces (0 and 1) are secured and no one can connect (no RX traffic).
I even removed the n2n tunnel, but bat53 is still causing these inconsistencies.

root@1043-49:~# batctl l | grep "inconsis"
[ 126479] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 47287 num_changes: 0)
[ 126480] TT inconsistency for 0a:18:84:26:35:9d. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 40594 last_crc: 26555 num_changes: 0)
[ 126480] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 26555 num_changes: 0)
[ 126481] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 11185 num_changes: 0)
[ 126481] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 24775 num_changes: 0)
[ 126482] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 19671 num_changes: 0)
[ 126483] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 16977 num_changes: 0)
[ 126484] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 26555 num_changes: 0)
[ 126485] TT inconsistency for 0a:18:84:26:35:9d. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 40594 last_crc: 47287 num_changes: 0)

On some nodes I get a bunch of those inconsistencies within seconds; on other nodes I don't get any within minutes.
Let me know if I can provide any further information that might help.
Tobias
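Since the full log is too large to read by hand, a small script could summarize it per originator. This is only a sketch in Python: the regular expression is written against the "TT inconsistency" lines exactly as they appear in this thread, the `summarize` helper is hypothetical, and other batman-adv versions may word the message differently.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# Matches log lines of the form seen in this thread, e.g.
# [ 126479] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve ...
#           (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 47287 num_changes: 0)
INCONSISTENCY = re.compile(
    r"TT inconsistency for (?P<orig>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})\."
    r".*?crc: (?P<crc>\d+) last_crc: (?P<last_crc>\d+)"
)


def summarize(lines: Iterable[str]) -> Tuple[Counter, Dict[str, Set[int]]]:
    """Count inconsistencies per originator and collect distinct last_crc values."""
    counts: Counter = Counter()
    last_crcs: Dict[str, Set[int]] = {}
    for line in lines:
        for m in INCONSISTENCY.finditer(line):
            counts[m.group("orig")] += 1
            last_crcs.setdefault(m.group("orig"), set()).add(int(m.group("last_crc")))
    return counts, last_crcs


sample = [
    "[ 126479] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 47287 num_changes: 0)",
    "[ 126480] TT inconsistency for 0a:18:84:26:35:9d. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 40594 last_crc: 26555 num_changes: 0)",
    "[ 126480] TT inconsistency for 0a:18:84:1e:f6:05. Need to retrieve the correct information (ttvn: 1 last_ttvn: 1 crc: 51609 last_crc: 26555 num_changes: 0)",
]
counts, last_crcs = summarize(sample)
for orig, n in counts.most_common():
    print(orig, n, sorted(last_crcs[orig]))
```

On a node, the input could come from something like `batctl l | python3 summarize.py` with the loop reading `sys.stdin` instead of `sample`; the script name is made up. In the excerpt above the `crc` value stays constant per originator while `last_crc` keeps changing, and the same `last_crc` (26555) shows up for two different originators, which may be worth a closer look.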
On Thu, Sep 01, 2011 at 11:11:14 +0200, Tobias wrote:
> root@1043-49:~# batctl l | grep "0a:18:84:1e:f6:05"
Can you give me the output of batctl tg and batctl tl on this node (1043-49)?
> root@fon-53:~# brctl show
And the output of batctl tl on this other node (fon-53)?
Because it seems that we are dealing with more than one bug, I would like to distinguish the possible causes.
@Andrew: operations on MAC addresses should be bytewise, so the architecture should not be involved in these problems, no?
Thanks, Antonio
Uncut output:

root@1043-49:~# batctl tg
Globally announced TT entries received via the mesh bat0
Client (TTVN) Originator (Curr TTVN)
 * 42:c9:62:7e:91:f2 ( 1) via bat57 ( 1)
 * d6:0f:24:f1:43:3c ( 1) via bat59 ( 1)
 * 04:11:80:50:38:c8 ( 1) via bat53 ( 1)
 * 7a:7b:f4:ed:5c:24 ( 1) via bat52 ( 1)
 * 04:11:80:2f:c4:c8 ( 1) via bat67 ( 1)
 * be:57:9e:5b:e0:3e ( 1) via bat60 ( 1)
 * fa:45:8f:a5:2c:75 ( 1) via bat55 ( 1)
 * c2:90:a3:3b:4e:c9 ( 1) via bat58 ( 1)
 * ea:4b:dd:55:f3:1f ( 1) via bat54 ( 1)
 * 76:18:e9:ab:f9:40 ( 1) via bat51 ( 1)

root@1043-49:~# batctl tl
Locally retrieved addresses (from bat0) announced via TT (TTVN: 1):
 * 9e:90:fc:dc:99:09

> and ouput of batctl tl of this other node (fon-53)?

root@fon-53:~# batctl tl
Locally retrieved addresses (from bat0) announced via TT (TTVN: 1):
 * da:54:2c:e4:b9:1e
Tobias
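The two tables above already disagree: 1043-49 lists 04:11:80:50:38:c8 as a client of bat53, while fon-53 itself announces only da:54:2c:e4:b9:1e. A cross-check like this could be scripted. The sketch below is Python; it assumes that host fon-53 is the originator shown as bat53, and that the `batctl tg` / `batctl tl` text looks as it does in this thread. The function name `unexpected_clients` is made up.

```python
import re
from typing import Dict, List, Tuple

MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"
# "* <client> ( <ttvn>) via <originator> ..." as printed by `batctl tg` here.
TG_ENTRY = re.compile(r"\*\s+(" + MAC + r")\s+\(\s*\d+\)\s+via\s+(\S+)")
# "* <client>" as printed by `batctl tl` here.
TL_ENTRY = re.compile(r"\*\s+(" + MAC + r")")


def unexpected_clients(tg_text: str,
                       local_tables: Dict[str, str]) -> List[Tuple[str, str]]:
    """List (client, originator) pairs that the originator does not announce itself.

    local_tables maps an originator name to the `batctl tl` output
    collected on that node; originators without an entry are skipped.
    """
    mismatches = []
    for client, originator in TG_ENTRY.findall(tg_text):
        tl_text = local_tables.get(originator)
        if tl_text is None:
            continue  # no local table collected for this originator
        if client not in TL_ENTRY.findall(tl_text):
            mismatches.append((client, originator))
    return mismatches


tg_on_1043_49 = """
 * 04:11:80:50:38:c8 ( 1) via bat53 ( 1)
 * 76:18:e9:ab:f9:40 ( 1) via bat51 ( 1)
"""
tl_by_originator = {
    # Assumption: host fon-53 is the originator shown as bat53.
    "bat53": "Locally retrieved addresses (from bat0) announced via TT (TTVN: 1): * da:54:2c:e4:b9:1e",
}
print(unexpected_clients(tg_on_1043_49, tl_by_originator))
```

Only originators whose local table was collected are compared, so an empty result does not prove the global table is clean.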
Tobias,
I have tried to reproduce your issue but was unsuccessful. Could you help narrow down the cause by providing some more information? For instance:
* What is the simplest setup that can provoke the issue? Does it also happen if you only involve 2 nodes? Or 3 / 4 / 5? Does the problem go away if you reduce the number of interfaces on each node?
* You provided some logs showing additions / deletions of a non-existing MAC address and you said this would go away after a while. What do the logs say when the problem is about to disappear?
* Are the nodes that often exhibit this flakiness far apart or direct neighbors? What does the topology look like? Again, the simpler the setup the better.
Regards, Marek
Hello Marek,
I needed to get this working, so I switched to 2011.2 last weekend and the problems are gone.
So it seems to be a bug introduced in 2011.3.
The network on which I had the problems is live and in use, so I can't run any further tests on it right now.
I have some similar devices as spare parts. I'll try to set them up and reproduce the problems, but that can take a while.
What I can tell you is:
- every node has only one wlan network on which batman runs, and only the node itself uses the bat0 interface - no bridging, no routing
- in the end, most of the time only the directly connected nodes were reachable via a normal ping - the "hopping" was not working
- I had the *feeling* that the problems got bigger the more nodes we added
- also I *think* the problems began when we mixed Fonera and TP-Link devices (slightly different architecture, I think)
- the log was always spammed with messages - because of the tiny devices I use it's not easy to save the log and inspect it when the connection is working again - which can take an hour...
Thank you for your help. I'm sorry that I can't provide more information to track down the bug.
Tobias
Hi,
> i needed to get this working so i switched to 2011.2 last weekend and the problems are gone.
OK.
> I have some similar devices as spare parts. I'll try to set them up and reproduce the problems - but that can take a while.
Understood. We will hopefully come up with a patch soon that will need to be tested.
> - every node has only one wlan-network on which batman runs and only the node itself uses the bat0 interface - no bridging, no routing
> - at the end, most of the time only the directly connected nodes were reachable via a normal ping - the "hopping" was not working
> - i had the *feeling* that the problems got bigger the more nodes we added
> - also i *think* the problems began when we mixed Fonera- and TP-Link-devices (sightly different architecture i think)
Can you tell us which TP-Link model you are using (name, hardware revision, etc.)? Or post the output of /proc/cpuinfo?
> - the log was always spammed with messages - because of the tiny devices i use it's not easy to save the log and inspect it when the connection is working again - which can take an hour...
It is somewhat troublesome that the problem went away after a while. If we had a bug in the architecture handling, you would expect it to be a permanent problem. But who knows ...
Regards, Marek
Hi Marek,
the TP-Links are 1043:
root@1043-50:~# cat /proc/cpuinfo
system type : Atheros AR9132 rev 2
machine : TP-LINK TL-WR1043ND
processor : 0
cpu model : MIPS 24Kc V7.4
BogoMIPS : 265.42
wait instruction : yes
microsecond timers : yes
tlb_entries : 16
extra interrupt vector : yes
hardware watchpoint : yes, count: 4, address/irw mask: [0x0000, 0x0ff8, 0x0ff8, 0x0ff8]
ASEs implemented : mips16
shadow register sets : 1
kscratch registers : 0
core : 0
VCED exceptions : not available
VCEI exceptions : not available

and 841:
root@841-52:~# cat /proc/cpuinfo
system type : Atheros AR7241 rev 1
machine : TP-LINK TL-WR741ND
processor : 0
cpu model : MIPS 24Kc V7.4
BogoMIPS : 265.42
wait instruction : yes
microsecond timers : yes
tlb_entries : 16
extra interrupt vector : yes
hardware watchpoint : yes, count: 4, address/irw mask: [0x0000, 0x08f8, 0x07c0, 0x0ba8]
ASEs implemented : mips16
shadow register sets : 1
kscratch registers : 0
core : 0
VCED exceptions : not available
VCEI exceptions : not available

the others are Fonera2100:
root@fon-53:~# cat /proc/cpuinfo
system type : Atheros AR2315
processor : 0
cpu model : MIPS 4KEc V6.4
BogoMIPS : 183.50
wait instruction : yes
microsecond timers : yes
tlb_entries : 16
extra interrupt vector : yes
hardware watchpoint : no
ASEs implemented :
shadow register sets : 1
core : 0
VCED exceptions : not available
VCEI exceptions : not available
Tobias
Hi,
Thanks for providing the info. So far, we can't spot the issue no matter how long we stare at the code. I'd like to make some patches that will spit out additional debug info to isolate the cause. As we can't reproduce the problem at our end, we depend on you to test these patches. Will it be possible for you to have a mix of your nodes running in debug mode?
Cheers, Marek
My gut feeling is it's either a compiler issue, a caching issue, or maybe the packing of structures is somehow different between different architectures. However, you seem to have only MIPS-based systems and I guess you use the same compiler on all platforms, so I lean towards a compiler problem....
Andrew
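To illustrate what "packing of structures" means here, the following Python sketch compares the size of the same two fields with and without the platform's alignment padding. The field layout (a 1-byte type followed by a 2-byte sequence number) is hypothetical and is not taken from batman-adv's packet definitions.

```python
import struct

# Hypothetical header: a 1-byte type followed by a 2-byte sequence number.
FIELDS = "BH"

# "=" uses standard sizes with no padding between fields.
packed_size = struct.calcsize("=" + FIELDS)
# "@" applies the C compiler's native alignment rules on this machine.
native_size = struct.calcsize("@" + FIELDS)

print("packed:", packed_size)
print("native:", native_size)
if native_size != packed_size:
    print("the native layout inserts", native_size - packed_size, "padding byte(s)")
```

The packed layout is 3 bytes; the native size depends on the platform's alignment rules. If two builds disagreed about such a layout, the fields after the padding would be read at a shifted offset, which is the kind of effect this hypothesis is about.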
Hello Marek,
I flashed 5 Foneras and a TP-Link yesterday, and today I tried to reproduce the problem on a separate network.
It's not easy to reproduce the problem here, but if I try, reboot and "move" some devices, the "wrong MACs" also appear and some hops do not work.
So, when your patches are ready I can test with them.
It's no longer a problem to capture big logs here.
Tobias
OK, a little update:
I sent a small patch to Tobias in order to get a verbose log of the TT update dialogue (sorry for sending it privately, but I thought it was not interesting for the list. I won't do it again :-) )
Now I'm going to investigate the issue further, hoping the new log can help!
Cheers,
Hi Wayne,
The signal level is not very good, but it should be sufficient (around 40-60%).
Even when I keep both pings running at the same time, only the "batctl p" works.
Tobias
On Wed, Aug 31, 2011 at 6:44 PM, Tobias tracer@robotech.de wrote:
Hi Wayne,
the signal level is not very good - but it should be sufficient. (around 40-60%)
Even when i keep both pings running at the same time, only the "batctl p" works.
Tobias
That's your problem. I have experienced the same in the past. You will find access very erratic; upgrade your antenna if you can.
Wayne A
Hello Marek,
thanks for your response. I'll try to give you an example. I'll cut out the parts that are not relevant (I hope).
First i have to correct the version - it seems to be 2011.3 - not 2011.2 as the subject says.
root@fon-58:~# dmesg | grep "batman_adv"
batman_adv: B.A.T.M.A.N. advanced 2011.3.0 (compatibility version 14) loaded
The route from bat49 to bat58 is not working. It should hop via bat59.
root@1043-49:~# batctl o
[B.A.T.M.A.N. adv 2011.3.0, MainIF/MAC: wlan2/9e:0c:6d:ee:7c:ba (bat0)]
Originator  last-seen  (#/255)  Nexthop [outgoingIF]:  Potential nexthops ...
bat58       3.080s     (168)    bat59 [ wlan2]:        bat59 (168) bat51 ( 0) bat60 (127)
bat59       3.130s     (202)    bat59 [ wlan2]:        bat60 (155) bat51 (134) bat59 (202)

root@fon-59:~# batctl o
[B.A.T.M.A.N. adv 2011.3.0, MainIF/MAC: wlan2/0a:18:84:80:87:9d (bat0)]
Originator  last-seen  (#/255)  Nexthop [outgoingIF]:  Potential nexthops ...
bat58       4.740s     (210)    bat58 [ wlan2]:        bat55 ( 0) bat51 ( 0) bat49 ( 0) bat52 ( 0) bat67 (148) bat53 (120) bat54 (191) bat60 (170) bat58 (210)
bat49       0.040s     (192)    bat49 [ wlan2]:        bat52 ( 36) bat55 ( 0) bat67 (106) bat58 (148) bat54 (129) bat53 ( 80) bat60 (152) bat49 (192) bat51 (112)

root@fon-58:~# batctl o
[B.A.T.M.A.N. adv 2011.3.0, MainIF/MAC: wlan2/0a:18:84:81:a1:0d (bat0)]
Originator  last-seen  (#/255)  Nexthop [outgoingIF]:  Potential nexthops ...
bat49       0.570s     (174)    bat59 [ wlan2]:        bat51 ( 4) bat52 ( 9) bat55 ( 5) bat54 (149) bat53 (140) bat67 (156) bat60 ( 95) bat59 (174) bat49 ( 0)
bat59       0.990s     (245)    bat59 [ wlan2]:        bat55 ( 8) bat51 ( 3) bat52 ( 8) bat60 (137) bat53 (186) bat67 (217) bat54 (206) bat59 (245)
ifconfigs:

root@1043-49:~# ifconfig bat0
bat0      Link encap:Ethernet HWaddr 9E:90:FC:DC:99:09
          inet addr:192.168.111.49 Bcast:192.168.111.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:9410 errors:0 dropped:0 overruns:0 frame:0
          TX packets:64693 errors:0 dropped:2560 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1622914 (1.5 MiB) TX bytes:13322553 (12.7 MiB)
root@1043-49:~# ifconfig wlan2
wlan2     Link encap:Ethernet HWaddr 9E:0C:6D:EE:7C:BA
          UP BROADCAST RUNNING MULTICAST MTU:1528 Metric:1
          RX packets:84071 errors:0 dropped:78 overruns:0 frame:0
          TX packets:112446 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:6693056 (6.3 MiB) TX bytes:20633620 (19.6 MiB)

root@fon-59:~# ifconfig bat0
bat0      Link encap:Ethernet HWaddr D6:0F:24:F1:43:3C
          inet addr:192.168.111.59 Bcast:192.168.111.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:23493 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5078 errors:0 dropped:8 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:4006301 (3.8 MiB) TX bytes:726585 (709.5 KiB)
root@fon-59:~# ifconfig wlan2
wlan2     Link encap:Ethernet HWaddr 0A:18:84:80:87:9D
          UP BROADCAST RUNNING MULTICAST MTU:1528 Metric:1
          RX packets:298487 errors:0 dropped:748 overruns:0 frame:0
          TX packets:176654 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:18665398 (17.7 MiB) TX bytes:14335354 (13.6 MiB)

root@fon-58:~# ifconfig bat0
bat0      Link encap:Ethernet HWaddr C2:90:A3:3B:4E:C9
          inet addr:192.168.111.58 Bcast:192.168.111.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:23159 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7759 errors:0 dropped:2298 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1115758 (1.0 MiB) TX bytes:737874 (720.5 KiB)
root@fon-58:~# ifconfig wlan2
wlan2     Link encap:Ethernet HWaddr 0A:18:84:81:A1:0D
          UP BROADCAST RUNNING MULTICAST MTU:1528 Metric:1
          RX packets:3475063 errors:0 dropped:1422 overruns:0 frame:0
          TX packets:1601622 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:154462355 (147.3 MiB) TX bytes:100565468 (95.9 MiB)
this is working:
root@1043-49:~# batctl p bat59
PING bat59 (0a:18:84:80:87:9d) 20(48) bytes of data
20 bytes from bat59 icmp_seq=1 ttl=49 time=6.12 ms

and this:
root@1043-49:~# batctl p bat58
PING bat58 (0a:18:84:81:a1:0d) 20(48) bytes of data
20 bytes from bat58 icmp_seq=1 ttl=48 time=17.61 ms

and this too:
root@1043-49:~# ping 192.168.111.59
PING 192.168.111.59 (192.168.111.59): 56 data bytes
64 bytes from 192.168.111.59: seq=0 ttl=64 time=7.621 ms

this NOT:
root@1043-49:~# ping 192.168.111.58
PING 192.168.111.58 (192.168.111.58): 56 data bytes

the route seems ok:
root@1043-49:~# batctl tr bat58
traceroute to bat58 (0a:18:84:81:a1:0d), 50 hops max, 20 byte packets
 1: bat59 (0a:18:84:80:87:9d) 4.297 ms 31.777 ms 0.938 ms
 2: bat58 (0a:18:84:81:a1:0d) 7.868 ms 4.153 ms 3.352 ms
I see the pings going out on bat49:
root@1043-49:~# batctl td wlan2 | grep "ICMP"
13:16:39.026715 BAT bat49 > bat58: UCAST, ttvn 1, ttl 50, IP 192.168.111.49 > 192.168.111.58: ICMP echo request, id 9467, seq 16, length 64

i even see the packet come into bat58:
root@fon-58:~# batctl td wlan2 | grep "ICMP"
13:18:39.715935 BAT bat59 > bat58: UCAST, ttvn 1, ttl 48, IP 192.168.111.49 > 192.168.111.58: ICMP echo request, id 9467, seq 159, length 64
but no reply.

in the bat0-interface i can see the reply:
root@fon-58:~# batctl td bat0 | grep "ICMP"
13:19:15.730081 IP 192.168.111.49 > 192.168.111.58: ICMP echo request, id 9467, seq 195, length 64
13:19:15.732864 IP 192.168.111.58 > 192.168.111.49: ICMP echo reply, id 9467, seq 195, length 64
the arp-table of bat58 looks good:
root@fon-58:~# arp -a
IP address       HW type  Flags  HW address         Mask  Device
192.168.111.49   0x1      0x2    9e:90:fc:dc:99:09  *     bat0

the other direction does not work either:
root@fon-58:~# ping 192.168.111.49
PING 192.168.111.49 (192.168.111.49): 56 data bytes

the packets go out on bat58 on the bat0 interface:
root@fon-58:~# batctl td bat0 | grep "ICMP"
13:54:15.727222 IP 192.168.111.58 > 192.168.111.49: ICMP echo request, id 1961, seq 112, length 64

but they are *NOT* visible on the wlan interface:
root@fon-58:~# batctl td wlan2 | grep "ICMP"
A ping from bat58 to bat59 works:
root@fon-58:~# ping 192.168.111.59
PING 192.168.111.59 (192.168.111.59): 56 data bytes
64 bytes from 192.168.111.59: seq=0 ttl=64 time=15.729 ms

and appears in both dumps:
root@fon-58:~# batctl td wlan2 | grep "ICMP"
14:00:50.522992 BAT bat58 > bat59: UCAST, ttvn 1, ttl 50, IP 192.168.111.58 > 192.168.111.59: ICMP echo request, id 1997, seq 3, length 64
14:00:50.530158 BAT bat59 > bat58: UCAST, ttvn 1, ttl 50, IP 192.168.111.59 > 192.168.111.58: ICMP echo reply, id 1997, seq 3, length 64

root@fon-58:~# batctl td bat0 | grep "ICMP"
14:01:05.563243 IP 192.168.111.58 > 192.168.111.59: ICMP echo request, id 1997, seq 18, length 64
14:01:05.567195 IP 192.168.111.59 > 192.168.111.58: ICMP echo reply, id 1997, seq 18, length 64
Why is the ICMP ping from 58 to 49 not sent on the wlan?
Does the "TX-dropped" count in ifconfig mean anything?
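To put the "TX-dropped" numbers from the ifconfig output into proportion, the counters can be pulled out of the text and compared per node. This is only a minimal sketch: the function names are made up for illustration, and it assumes that dropped frames are counted separately from `TX packets`, which this thread does not establish.

```python
import re

# Matches the TX counter line of classic net-tools ifconfig output, e.g.
#   "TX packets:7759 errors:0 dropped:2298 overruns:0 carrier:0"
TX_LINE = re.compile(r"TX packets:(\d+) errors:(\d+) dropped:(\d+)")

def tx_counters(ifconfig_output):
    """Extract the TX packet, error and dropped counters as integers."""
    m = TX_LINE.search(ifconfig_output)
    if m is None:
        raise ValueError("no TX counter line found")
    packets, errors, dropped = map(int, m.groups())
    return {"packets": packets, "errors": errors, "dropped": dropped}

def dropped_share(counters):
    """Dropped frames as a fraction of (sent + dropped); 0.0 if idle.

    Assumption: 'packets' does not already include the dropped frames.
    """
    total = counters["packets"] + counters["dropped"]
    return counters["dropped"] / total if total else 0.0

# bat0 TX lines copied from the ifconfig output in this message.
bat0_tx = {
    "1043-49": "TX packets:64693 errors:0 dropped:2560 overruns:0 carrier:0",
    "fon-59": "TX packets:5078 errors:0 dropped:8 overruns:0 carrier:0",
    "fon-58": "TX packets:7759 errors:0 dropped:2298 overruns:0 carrier:0",
}
for node, line in bat0_tx.items():
    c = tx_counters(line)
    print(f"{node}: dropped {c['dropped']} of {c['packets'] + c['dropped']}"
          f" ({dropped_share(c):.1%})")
```

Running this on the bat0 counters of the nodes that fail and of one that works shows whether the drop count is notably higher on the failing ones; it does not by itself say why the frames were dropped.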
I don't understand the "batctl tg" output. If I repeat the command, it gives me different results:

root@fon-58:~# batctl tg |grep "49"
 * 04:11:80:f4:40:c8 ( 1) via bat49 ( 1)
root@fon-58:~# batctl tg |grep "49"
 * 0c:6d:ee:7c:ba:01 ( 1) via bat49 ( 1)
root@fon-58:~# batctl tg |grep "49"
 * 04:11:80:f4:40:c8 ( 1) via bat49 ( 1)
root@fon-58:~# batctl tg |grep "49"
 * 18:84:80:34:51:01 ( 1) via bat49 ( 1)
root@fon-58:~# batctl tg |grep "49"
 * 04:30:48:60:6c:dd ( 1) via bat49 ( 1)
root@fon-58:~# batctl tg |grep "49"
 * 04:11:80:f4:40:c8 ( 1) via bat49 ( 1)

I have not yet found a device with the MAC "04:11:80:f4:40:c8".

If I look in the logs on bat49, it keeps creating and deleting an entry with this address:
root@1043-49:~# batctl l | grep "40:c8"
[ 9726] Creating new global tt entry: 04:11:80:f4:40:c8 (via 0a:18:84:1e:f6:05)
[ 9726] Deleting global tt entry 04:11:80:f4:40:c8 (via 0a:18:84:1e:f6:05): originator time out
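The changing "batctl tg" output is easier to compare if several samples are grouped by announcing originator and checked for one specific client MAC, for example the bat0 address that the ARP table resolved. A minimal Python sketch, assuming the entry layout pasted above; the function names are illustrative only. As far as I understand it, batman-adv needs the destination MAC in the global translation table to pick an originator for a payload frame, but treat that as an assumption rather than something confirmed in this thread.

```python
import re
from collections import defaultdict

# Matches one entry line of `batctl tg` as pasted in this thread, e.g.
#   " * 04:11:80:f4:40:c8 ( 1) via bat49 ( 1)"
TG_LINE = re.compile(
    r"\*\s+(?P<client>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})\s+\(\s*\d+\)"
    r"\s+via\s+(?P<orig>\S+)",
    re.IGNORECASE,
)

def clients_by_originator(tg_output):
    """Return {originator: set of client MACs} for one `batctl tg` dump."""
    table = defaultdict(set)
    for line in tg_output.splitlines():
        m = TG_LINE.search(line)
        if m:
            table[m.group("orig")].add(m.group("client").lower())
    return dict(table)

def announced_in_every_sample(samples, originator, mac):
    """True only if `mac` is listed via `originator` in each sample."""
    mac = mac.lower()
    return bool(samples) and all(
        mac in clients_by_originator(s).get(originator, set())
        for s in samples
    )

# Two of the outputs pasted above, checked against the bat0 MAC of
# 1043-49 taken from its ifconfig output.
samples = [
    " * 04:11:80:f4:40:c8 ( 1) via bat49 ( 1)",
    " * 0c:6d:ee:7c:ba:01 ( 1) via bat49 ( 1)",
]
print(announced_in_every_sample(samples, "bat49", "9E:90:FC:DC:99:09"))
```

Note that the samples above were filtered with grep "49", so they only show lines containing that string; feeding the unfiltered output of each run gives a more complete picture.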
The nodes are fixed and not moving. Do I have to specify them as non-roaming somehow?
We have problems keeping the correct (or at least the same) time on all devices. Is that a problem for batman?
Tobias
Hi all, Hi Tobias,
On Wed, Aug 31, 2011 at 12:53:23AM +0200, Tobias wrote:
Hi all,
i have setup a mesh-network with batman-adv running on about 10 Foneras and 4 TP-Link on OpenWRT.
At first everything seemed to work. A node on the one end could ping a node on the other end over the mesh-network. The ping was hopping from node to node as expected.
But sometimes some paths do not work anymore.
Some nodes can only reach their direct neighbors via a "normal ping". A ping to a node via one hop does not work. A "batctl ping" does work!
This only happens to parts of the network and is not permanent. If i wait it will recover, but then the problem appears at another node.
dmesg or the syslog does not report any errors.
Can anyone give me a hint where to look?
Tobias
Yesterday batman-adv-2011.3.1 was released. As I verified with Laurent (tests are still going on), this version should fix the bug reported in this thread.
It would be appreciated if any of the people affected by this bug could test the new release and give us any feedback!
Cheers,
b.a.t.m.a.n@lists.open-mesh.org