Hi Marek
I finally had time to dig into our problems with loops in our chain.
Some background for the list. Yang and I have been using User Mode Linux (UML) to build a test network for batman advanced. We connect a number of uml machines together using a modified version of uml_switch. The modifications allow us to change the packet drop probability between any two nodes. We have been testing using simple chains as shown in the attached gif. The black lines show the currently used links. The red lines are other links which are currently not used by batman. The black links have a packet drop probablilty of 0% and the red of 20%.
Our test was to remove uml5 from the network and see how long batman-adv took to re-route around it. We ping from uml4 to uml6 and from uml1 to uml9.
We found that uml4->uml6 would recover in around 14 seconds. However uml1->uml9 took much longer, 65 seconds.
Looking at the routing, we found it went into loops. When sending from uml1 to uml9, uml1 routes to uml2, uml2 routes back to uml1.
Here are the logs from uml2. I've cut out most of the packets, just showing OGMs from uml9. There is a simple relationship between the MAC address and the uml number:
fe:fe:00:00:01:01 - uml1 fe:fe:00:00:02:01 - uml2 fe:fe:00:00:03:01 - uml3 etc...
[ 42949558] Received BATMAN packet via NB: fe:fe:00:00:03:01, IF: eth1 [fe:fe:00:00:02:01] (from OG: fe:fe:00:00:09:01, via old OG: fe:fe:00:00:04:01, seqno 146, tq 218, TTL 44, V 7, IDF 0) [ 42949558] bidirectional: orig = fe:fe:00:00:09:01 neigh = fe:fe:00:00:03:01 => own_bcast = 64, real recv = 64, local tq: 255, asym_penalty: 255, total tq: 218 [ 42949558] update_originator(): Searching and updating originator entry of received packet [ 42949558] Updating existing last-hop neighbour of originator [ 42949558] Drop packet: duplicate packet received
This has been received from uml3 origionally from uml4. The TQ is 218 to uml9 via uml3.
[ 42949559] Received BATMAN packet via NB: fe:fe:00:00:01:01, IF: eth1 [fe:fe:00:00:02:01] (from OG: fe:fe:00:00:09:01, via old OG: fe:fe:00:00:03:01, seqno 146, tq 209, TTL 42, V 7, IDF 0) [ 42949559] bidirectional: orig = fe:fe:00:00:09:01 neigh = fe:fe:00:00:01:01 => own_bcast = 64, real recv = 64, local tq: 255, asym_penalty: 255, total tq: 209 [ 42949559] update_originator(): Searching and updating originator entry of received packet [ 42949559] Updating existing last-hop neighbour of originator [ 42949559] Drop packet: duplicate packet received
This is where is starts to get interesting. This is from uml1, origionally from uml3. So it has jumped uml2, it used the 20% packet drop link which exists between uml1 and uml3. Because this is not an echo, uml2 processes it, and now knows that with a TQ of 209 it can get to uml9 via uml1.
[ 42949559] Sending own packet (originator fe:fe:00:00:02:01, seqno 155, TQ 255, TTL 50, IDF off) on interface eth1 [fe:fe:00:00:02:01] [ 42949559] Forwarding aggregated packet (originator fe:fe:00:00:06:01, seqno 152, TQ 232, TTL 46, IDF off) on interface eth1 [fe:fe:00:00:02:01] [ 42949559] Forwarding aggregated packet (originator fe:fe:00:00:09:01, seqno 146, TQ 215, TTL 43, IDF off) on interface eth1 [fe:fe:00:00:02:01] [ 42949559] Forwarding packet (originator fe:fe:00:00:01:01, seqno 156, TQ 250, TTL 49, IDF on) on interface eth1 [fe:fe:00:00:02:01]
[ 42949560] Received BATMAN packet via NB: fe:fe:00:00:03:01, IF: eth1 [fe:fe:00:00:02:01] (from OG: fe:fe:00:00:09:01, via old OG: fe:fe:00:00:04:01, seqno 148, tq 150, TTL 45, V 7, IDF 0) [ 42949560] updating last_seqno: old 146, new 148 [ 42949560] bidirectional: orig = fe:fe:00:00:09:01 neigh = fe:fe:00:00:03:01 => own_bcast = 64, real recv = 64, local tq: 255, asym_penalty: 255, total tq: 150 [ 42949560] update_originator(): Searching and updating originator entry of received packet [ 42949560] Updating existing last-hop neighbour of originator [ 42949560] Changing route towards: fe:fe:00:00:09:01 (now via fe:fe:00:00:01:01 - was via fe:fe:00:00:03:01) [ 42949560] Forwarding packet: rebroadcast originator packet [ 42949560] Forwarding packet: tq_orig: 150, tq_avg: 209, tq_forw: 204, ttl_orig: 44, ttl_forw: 255
Now things go none optimal :-(
This is from uml3, origionally from uml4. The TQ value has dropped to 150. This will be when we have removed uml5, so the TQ naturally does drop.
The TQ value via uml3 is now less than the TQ value via uml1. So it changes its route to go via uml1.
Looking at the logs of uml1, uml1 is always routing to uml9 via uml2. The problem here i think is to do with the asymetric links algorithms. When sending out an OGM, the node uses the TQ for its best link to the originator, not the link the OGM came in on. If the OGM from uml1 origionally from UML3 reported the TQ via that route, the TQ would very likely be lower. uml2 would then not of choosen to swap to uml1. However, uml1 reports its best route, which is via uml2. uml2 does not know this, decides to use uml1, and we have a loop.
Does this all hang together correctly? I'm i interpreting this all right...
How would you suggest fix this?
Thanks Andrew
Hi,
Looking at the logs of uml1, uml1 is always routing to uml9 via uml2. The problem here i think is to do with the asymetric links algorithms. When sending out an OGM, the node uses the TQ for its best link to the originator, not the link the OGM came in on. If the OGM from uml1 origionally from UML3 reported the TQ via that route, the TQ would very likely be lower. uml2 would then not of choosen to swap to uml1. However, uml1 reports its best route, which is via uml2. uml2 does not know this, decides to use uml1, and we have a loop.
Does this all hang together correctly? I'm i interpreting this all right...
How would you suggest fix this?
I would tend to say it is a bug in the routing selection code. UML2 switches the route because it compare thes "negative" TQ message of seqno N with a TQ of seqno N - x. I suggest the following fix:
If we receive a TQ value via a neighbor that is smaller than the previous TQ that we received via that neighbor we don't change the route to a neighbor which did not send us the same or newer seqno.
That way your scenario should not happen because uml2 would not switch. On the other hand if uml1 really has a better route uml2 would switch as soon as the packet with a new seqno via uml1 arrives. What do you think ?
I attached a patch that should do exactly that. As I'm travelling right now I'm not able to test it before next week. If you find the time to do so, please let me know about your findings. This patch is not as strict as it could be. It might be necessary to rework it as soon as the concept has been proven.
Thanks again for this thorough analysis. Let us know if you find more. :-)
Regards, Marek
- More information about the UML based test setup: a chain consisting of 9 nodes (each node is an uml instatnce, uml1, uml2 ... uml9). Direct neighbors (distance=1 hop) can talk to each other with 0% link loss rate. Nodes that are 2 hop away can talk to each other with 20% link loss rate. Nodes that are more than 2 hops away cannot communicate to each other. We shut down the node in the middle (uml5) and measure (with ping) the time it takes for batman to re-setup the end-to-end routes (from uml1 to uml9).
- We observe extensive tmp routing loops during the rerouting events. It often takes batman more than 1 minutes to get any packet through after the rerouting and it takes even longer (several minutes) before batman stabilizes.
- The cause for the loops:
Let's study the interaction between uml1 and uml2:
OGMs from uml9 may arrive at uml1 and uml2 at roughly the same time (because both of them has a shortest path of 6-hop from uml9). uml1 and uml2 will forward those OGMs to each other. Depending on the timing relations between uml1 and uml2, there are four possible combinations:
Case 1: first uml1 forwards the OGM received from uml3 to uml2, then uml 2 forwards the OGM received from uml3 to uml1.
Case 2: first uml1 forwards the OGM received from uml3 to uml2, then uml 2 forwards the OGM received from uml1 back to uml1.
Case 3: first uml2 forwards the OGM received from uml3 to uml1, then uml1 forwards the OGM received from uml3 to uml2.
Case 4: first uml2 forwards the OGM received from uml3 to uml1, then uml1 forwards the OGM received from uml2 back to uml2.
When the quality of the path from uml1 to uml9 drops down dramatically (e.g., in our case, path loss rate increases from 0% to 20% after we remove uml5), 3 out of 4 cases (case 1,2,4) may cause looping between uml1 and uml2. With the interaction of echo cancellation mechanism, case 2 may even lead batman into very deep looping.
- To fix the problem:
I explored two ideas:
1. Similar to what Marek proposed. But push it to the extreme: switch to a neighbor only when it has the best tq AND it has the newest seqno. This fix along seems to already solve the looping problem in the test cases. It reduces the rerouting time from more than 1 minutes to less than 15 seconds. Marek: I also tried the patch you sent. It didn't help in this setup.
2. Relaxed echo cancellation. This is based on the following observation: the TQ value that a node puts into OGM is completely decoupled with "from which neighbor this OGM is received". As a result, the TQ value contained in the echoed OGM represent the real TQ value at the neighbor which echoed this OGM. The current echo cancellation implementation just drops all the echoed OGM. This may prevent the node from updating the information towards the neighbor that echos the OGM. In the extreme case, the information towards that neighbor may becomes completely stale (similar to what happens in case 2). The change I made: Always check the TQ contained in the echoed OGMs. When it is worse than the avg TQ towards that neighbor, we use this TQ reading to update the avg TQ towards that neighbor. This change didn't show any effect during the chain tests. However, I still include this change in the patch to bring up the discussion.
In the attachment, I append a patch which contains a preliminary implementation for both ideas.
Cheers, Yang (See attached file: routing-loops.patch)
|------------> | From: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Marek Lindner lindner_marek@yahoo.de |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | To: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|b.a.t.m.a.n@lists.open-mesh.net |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Cc: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Andrew Lunn andrew.lunn@ascom.ch, Yang Su yang.su@ascom.ch |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Date: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|27.07.2009 18:21 |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Subject: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Re: [B.A.T.M.A.N.] batman goes looping... |
--------------------------------------------------------------------------------------------------------------------------------------------------|
Hi,
Looking at the logs of uml1, uml1 is always routing to uml9 via uml2. The problem here i think is to do with the asymetric links algorithms. When sending out an OGM, the node uses the TQ for its best link to the originator, not the link the OGM came in on. If the OGM from uml1 origionally from UML3 reported the TQ via that route, the TQ would very likely be lower. uml2 would then not of choosen to swap to uml1. However, uml1 reports its best route, which is via uml2. uml2 does not know this, decides to use uml1, and we have a loop.
Does this all hang together correctly? I'm i interpreting this all right...
How would you suggest fix this?
I would tend to say it is a bug in the routing selection code. UML2 switches the route because it compare thes "negative" TQ message of seqno N with a TQ of seqno N - x. I suggest the following fix:
If we receive a TQ value via a neighbor that is smaller than the previous TQ that we received via that neighbor we don't change the route to a neighbor which did not send us the same or newer seqno.
That way your scenario should not happen because uml2 would not switch. On the other hand if uml1 really has a better route uml2 would switch as soon as the packet with a new seqno via uml1 arrives. What do you think ?
I attached a patch that should do exactly that. As I'm travelling right now
I'm not able to test it before next week. If you find the time to do so, please let me know about your findings. This patch is not as strict as it could be. It might be necessary to rework it as soon as the concept has been proven.
Thanks again for this thorough analysis. Let us know if you find more. :-)
Regards, Marek
On Thursday 30 July 2009 22:32:04 Yang Su wrote:
- Similar to what Marek proposed. But push it to the extreme: switch to a
neighbor only when it has the best tq AND it has the newest seqno. This fix along seems to already solve the looping problem in the test cases. It reduces the rerouting time from more than 1 minutes to less than 15 seconds. Marek: I also tried the patch you sent. It didn't help in this setup.
Changing the route based on the (fastest) sequence number has some drawbacks which we experienced before (BATMAN III). It tends to discriminate longer but better routes and favors short paths. In some asymetric link (worst case) scenarios it would route against all odds, simply because receiving a fast packet (newest seqno) does not mean we can send the same way back. If possible I'd like to avoid this strict seqno check.
I attached another patch which will conduct a route switch only if the TQ of the sending neighbor is better than our current best route (no negative switching anymore). I tested the patch here and it works so far but my environment is less controlled than yours. Could you perform the same test on your setup ?
If you intend to experiment with other ideas watch out for the sequence number as it will overflow. A simple "greater than" check might lead to strange results. Also, updates_routes() checks for changed HNA messages even if the next hop does not change.
- Relaxed echo cancellation. This is based on the following observation:
the TQ value that a node puts into OGM is completely decoupled with "from which neighbor this OGM is received". As a result, the TQ value contained in the echoed OGM represent the real TQ value at the neighbor which echoed this OGM. The current echo cancellation implementation just drops all the echoed OGM. This may prevent the node from updating the information towards the neighbor that echos the OGM. In the extreme case, the information towards that neighbor may becomes completely stale (similar to what happens in case 2). The change I made: Always check the TQ contained in the echoed OGMs. When it is worse than the avg TQ towards that neighbor, we use this TQ reading to update the avg TQ towards that neighbor. This change didn't show any effect during the chain tests. However, I still include this change in the patch to bring up the discussion.
Right, every node will emit his currently best TQ value but I did not understand how we can use that. If we send him a better TQ he will send back that number. If we send a bad TQ he will send his good number. Furthermore, each hop will apply some asymetric / hop / wifi penalty that we pull into our routing database ?
Regards, Marek
Hi,
I have 3 linksys wrt54 flashed with openwrt 7.09, webif'ed and installed BATMAN 0.2-rv478.1, all OK I think.
I have arranged the 3 nodes such that node 1 can only see 2 , and 2 only 3, so no direct contact is available from 1 to 3 (except via the mesh)
I have not broken the br-lan configuration of these devices and assigned 172.16.1.1/24 to node 1 172.16.1.2/24 to node 2 172.16.1.3/24 to node 3, so I think that means all packets recvd on wl0 or eh0 are bridged.
I have a pc hooked into the lan port of node 3 From here I can ssh into node3 (via the ethernet) and into node2 via the wifi (but not .1) Netstat -rn on node 3 indicates that 172.16.1.1 (node1) is via 172.16.1.2 and this is confirmed by the batman web if that shows .1 is reached via .2 - all great.
From the ssh prompt on .3 I can ping .2 with no loss and ping .1 with ~ 6% loss. However, if I try to ping .1 from the pc connected on the ethernet port of .3 I get no route to host. Putting wireshark on the ethernet ifc shows the arp going out but nothing coming back. Interestingly enough if I ssh into .3 and ping .1, with wireshark running on the ethernet of .3 Is see the outgoing ICMP but not the reply (which is there because the ping succeeds) . It seems as if the local host traffic is not truly bridged.
This is the first attempt so I might have some incorrect/missing config or done something stupid, any ideas would be welcome!
Thanks very much
nick
_________________________________________________________________ Windows Live Messenger: Celebrate 10 amazing years with free winks and emoticons. http://clk.atdmt.com/UKM/go/157562755/direct/01/
|------------> | From: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Marek Lindner lindner_marek@yahoo.de |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | To: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|"The list for a Better Approach To Mobile Ad-hoc Networking" b.a.t.m.a.n@lists.open-mesh.net |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Date: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|03.08.2009 12:23 |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Subject: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Re: [B.A.T.M.A.N.] batman goes looping... |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Sent by: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|b.a.t.m.a.n-bounces@lists.open-mesh.net |
--------------------------------------------------------------------------------------------------------------------------------------------------|
On Thursday 30 July 2009 22:32:04 Yang Su wrote:
- Similar to what Marek proposed. But push it to the extreme: switch to
a
neighbor only when it has the best tq AND it has the newest seqno. This fix along seems to already solve the looping problem in the test cases.
It
reduces the rerouting time from more than 1 minutes to less than 15 seconds. Marek: I also tried the patch you sent. It didn't help in this setup.
Changing the route based on the (fastest) sequence number has some drawbacks which we experienced before (BATMAN III). It tends to discriminate longer but better routes and favors short paths. In some asymetric link (worst case) scenarios it would route against all odds, simply because receiving a fast packet (newest seqno) does not mean we can send the same way back. If possible I'd like to avoid this strict seqno check.
- You are right. Under certain circumstance, this approach may lead to less optimal route selection. Just to understand batman algorithm better, I try to make an example (please correct me if something's wrong here): There are two direct neighbors via which a sender communicates with a receiver. Via the good neighbor, their is a single good route with long delay. Between the bad neighbor and the receiver there are many parallel bad routes. Each bad route has high packet loss rate but very short delay. The receiver's OGMs arrive at the sender first via the bad neighbor. Although there is substantial OGM loss on each individual bad routes, by aggregation over those parallel bad routes, the bad neighbor is still able to relay most of sender's OGM to the receiver. As a result, if the sender already chooses the bad neighbor as the next hop, it will stick to it for quite long time and does not switch to the good neighbor.
I attached another patch which will conduct a route switch only if the TQ of the sending neighbor is better than our current best route (no negative switching anymore). I tested the patch here and it works so far but my environment is less controlled than yours. Could you perform the same test on your setup ?
- This approach achieves the similar goal as the seqno-based one: switch to a neighbor only when this neighbor is better and has more up-to-date information than the current best route. But this approach is less constrained and seems not to suffer from the above problem for the seqno-based approach. I have tested the patch on our test environment. It works pretty well on the test cases.
- Relaxed echo cancellation. This is based on the following
observation:
the TQ value that a node puts into OGM is completely decoupled with "from which neighbor this OGM is received". As a result, the TQ value
contained
in the echoed OGM represent the real TQ value at the neighbor which
echoed
this OGM. The current echo cancellation implementation just drops all the echoed OGM. This may prevent the node from updating the information
towards
the neighbor that echos the OGM. In the extreme case, the information towards that neighbor may becomes completely stale (similar to what
happens
in case 2). The change I made: Always check the TQ contained in the echoed OGMs. When it is worse than the avg TQ towards that neighbor, we
use
this TQ reading to update the avg TQ towards that neighbor. This change didn't show any effect during the chain tests. However, I still include this change in the patch to bring up the discussion.
Right, every node will emit his currently best TQ value but I did not understand how we can use that. If we send him a better TQ he will send back that number. If we send a bad TQ he will send his good number. Furthermore,
each hop will apply some asymetric / hop / wifi penalty that we pull into our routing database ?
- The problem might happen when the neighbor thinks that I am his best neighbor (his good number is actually from me). In this case, my estimation of that neighbor's TQ is actually decided by my own TQ (my TQ minus one hop penalty). When my TQ drops down, the correct behavior should be that my estimation of that neighbor's TQ also drops down accordingly. However, when echo cancellation steps in, my estimation of that neighbor's TQ will not be updated and remains to be the stale value. This might cause problems (e.g. looping) in corner cases even with the new patch. Does this analysis make sense or I miss some details of the echo cancellation?
Regards, Marek
[attachment "switch-route-with-better-tq.patch" deleted by Yang Su/NSUYAN/CH/Ascom] _______________________________________________ B.A.T.M.A.N mailing list B.A.T.M.A.N@lists.open-mesh.net https://lists.open-mesh.net/mm/listinfo/b.a.t.m.a.n
On Thursday 06 August 2009 16:44:28 Yang Su wrote:
- You are right. Under certain circumstance, this approach may lead to less
optimal route selection. Just to understand batman algorithm better, I try to make an example (please correct me if something's wrong here): There are two direct neighbors via which a sender communicates with a receiver. Via the good neighbor, their is a single good route with long delay. Between the bad neighbor and the receiver there are many parallel bad routes. Each bad route has high packet loss rate but very short delay. The receiver's OGMs arrive at the sender first via the bad neighbor. Although there is substantial OGM loss on each individual bad routes, by aggregation over those parallel bad routes, the bad neighbor is still able to relay most of sender's OGM to the receiver. As a result, if the sender already chooses the bad neighbor as the next hop, it will stick to it for quite long time and does not switch to the good neighbor.
It is possble to simplify your example more to clearly illustrate the problem. Scenario: Host A wants to send data to host B. Consider the beautiful graphic I attached. ;-) The link between A and B is quite asymetric. No packet loss from B to A but heavy packet loss the other way round. Now, B periodically sends its OGMs and all OGMs will arrive at A directly and via the neighbor whereas the OGMs via the neighbor always be slower than the directly received ones. A should choose the longer path via the neighbor because it has no packet loss but the "fastest sequence number wins" algorithm won't allow that.
- This approach achieves the similar goal as the seqno-based one: switch to
a neighbor only when this neighbor is better and has more up-to-date information than the current best route. But this approach is less constrained and seems not to suffer from the above problem for the seqno-based approach. I have tested the patch on our test environment. It works pretty well on the test cases.
Great! I comitted the patch immediately. Thanks again for your thorough analysis and ideas. :-)
- The problem might happen when the neighbor thinks that I am his best
neighbor (his good number is actually from me). In this case, my estimation of that neighbor's TQ is actually decided by my own TQ (my TQ minus one hop penalty). When my TQ drops down, the correct behavior should be that my estimation of that neighbor's TQ also drops down accordingly. However, when echo cancellation steps in, my estimation of that neighbor's TQ will not be updated and remains to be the stale value. This might cause problems (e.g. looping) in corner cases even with the new patch. Does this analysis make sense or I miss some details of the echo cancellation?
Unless you found another bug the echo cancellation should not hinder the propagation of our own TQ value. Once we (uml2) received a packet we will rebroadcast it. Our neighbor (uml1) will receive it (unless the packet got lost) and update its routing tables accordingly. Now, he will rebroadcast the packet as well which will be killed by the echo cancellation on uml1 as this packet does not contain new information. But even if the TQ received via uml2 is worse uml1 might still have a good TQ from uml3 in its routing table which might get rebroadcasted.
Regards, Marek
- This approach achieves the similar goal as the seqno-based one: switch
to
a neighbor only when this neighbor is better and has more up-to-date information than the current best route. But this approach is less constrained and seems not to suffer from the above problem for the seqno-based approach. I have tested the patch on our test environment. It works pretty well on the test cases.
Great! I comitted the patch immediately. Thanks again for your thorough analysis and ideas. :-)
- Some updates: I did more thorough testing. It turns out that this patch does really solve the problem. This method seems not able to completely eliminate all possible forms of looping which happen in the test cases. This result in an special pattern: the rerouting is fast (~15 seconds), then (after 3~ 5 seconds) the network enters the looping state and it normally takes long time to recover. Last time when I did tests, I stop the tests right after I observed the successful rerouting. I didn't look into the cause of this problem yet. Any inputs are welcome.
- The problem might happen when the neighbor thinks that I am his best
neighbor (his good number is actually from me). In this case, my
estimation
of that neighbor's TQ is actually decided by my own TQ (my TQ minus one
hop
penalty). When my TQ drops down, the correct behavior should be that my estimation of that neighbor's TQ also drops down accordingly. However,
when
echo cancellation steps in, my estimation of that neighbor's TQ will not
be
updated and remains to be the stale value. This might cause problems
(e.g.
looping) in corner cases even with the new patch. Does this analysis
make
sense or I miss some details of the echo cancellation?
Unless you found another bug the echo cancellation should not hinder the propagation of our own TQ value. Once we (uml2) received a packet we will rebroadcast it. Our neighbor (uml1) will receive it (unless the packet got lost) and update its routing tables accordingly. Now, he will rebroadcast the packet as well which will be killed by the echo cancellation on uml1 as this packet does not contain new information. But even if the TQ received via uml2 is worse uml1 might still have a good TQ from uml3 in its routing table which might get rebroadcasted.
- In the following example (also appended in the attachment :) ), if I understand the current echo cancellation implementation correctly, batman will enter permanent looping between A and B. In this example, A send to F, all the links are perfect and have the same delay. Only exception is link A-E. It is an asymmetric link.
Regards, Yang (See attached file: looping.pdf)
On Friday 07 August 2009 23:13:46 Yang Su wrote:
- Some updates: I did more thorough testing. It turns out that this patch
does really solve the problem. This method seems not able to completely eliminate all possible forms of looping which happen in the test cases. This result in an special pattern: the rerouting is fast (~15 seconds), then (after 3~ 5 seconds) the network enters the looping state and it normally takes long time to recover. Last time when I did tests, I stop the tests right after I observed the successful rerouting. I didn't look into the cause of this problem yet. Any inputs are welcome.
Can you provide logs from the nodes involved ? You also can send them off-list to avoid sending too big attachments.
- In the following example (also appended in the attachment :) ), if I
understand the current echo cancellation implementation correctly, batman will enter permanent looping between A and B. In this example, A send to F, all the links are perfect and have the same delay. Only exception is link A-E. It is an asymmetric link.
Why do you think it would loop permanently and why do you think it is the fault of the echo cancellation ? Actually, this example should[tm] be pretty easy. Every time the OGM from E or F arrive at A via the asymetric link A will apply a severe asymetric link penalty. All nodes will route via the longer path. Did you try to run this scenario in your testbed ?
Regards, Marek
For the echo cancellation example: the problem happens with the interaction between A and B. Because of the asym link, A chooses B as its next neighbor towards F. Because the asym path is short, B always receives F's OGM from A first (before B receives the OGM with the same seqno from C). As a result, from A's point of view, F's OGMs rebroadcasted by B are always echoes and A will never update the TQ information towards F via B.
When the quality of any links on the path B to F drops down, B might choose A as its next neighbor towards F and a loop forms between A and B.
Regards, Yang
On Friday 07 August 2009 23:13:46 Yang Su wrote:
- Some updates: I did more thorough testing. It turns out that this patch
does really solve the problem. This method seems not able to completely eliminate all possible forms of looping which happen in the test cases. This result in an special pattern: the rerouting is fast (~15 seconds), then (after 3~ 5 seconds) the network enters the looping state and it normally takes long time to recover. Last time when I did tests, I stop the tests right after I observed the successful rerouting. I didn't look into the cause of this problem yet. Any inputs are welcome.
Can you provide logs from the nodes involved ? You also can send them off-list to avoid sending too big attachments.
- In the following example (also appended in the attachment :) ), if I
understand the current echo cancellation implementation correctly, batman will enter permanent looping between A and B. In this example, A send to
F,
all the links are perfect and have the same delay. Only exception is link A-E. It is an asymmetric link.
Why do you think it would loop permanently and why do you think it is the fault of the echo cancellation ? Actually, this example should[tm] be pretty easy. Every time the OGM from E or F arrive at A via the asymetric link A will apply a severe asymetric link penalty. All nodes will route via the longer path. Did you try to run this scenario in your testbed ?
Regards, Marek
_______________________________________________ B.A.T.M.A.N mailing list B.A.T.M.A.N@lists.open-mesh.net https://lists.open-mesh.net/mm/listinfo/b.a.t.m.a.n
On Monday 10 August 2009 15:20:28 Yang Su wrote:
For the echo cancellation example: the problem happens with the interaction between A and B. Because of the asym link, A chooses B as its next neighbor towards F. Because the asym path is short, B always receives F's OGM from A first (before B receives the OGM with the same seqno from C). As a result, from A's point of view, F's OGMs rebroadcasted by B are always echoes and A will never update the TQ information towards F via B.
When the quality of any links on the path B to F drops down, B might choose A as its next neighbor towards F and a loop forms between A and B.
Ah, here we have the misunderstanding. Let me briefly outline how the echo cancellation works (I refer to your PDF-example):
1. Node C emits a packet. 2. Node B receives it and writes the address of Node C in the previous sender field because it received that packet via C. Now, B rebroadcasts the OGM. 3. Node C receives the packet it sent before and discards it as the packet contains its own address in the previous sender field. 4. Node A happily receives the packet (and goes back to step 2).
The echo cancellation will not kill the packets coming via the longer path.
Does this help ?
Regards, Marek
|------------> | From: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Marek Lindner lindner_marek@yahoo.de |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | To: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|"The list for a Better Approach To Mobile Ad-hoc Networking" b.a.t.m.a.n@lists.open-mesh.net |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Date: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|10.08.2009 09:50 |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Subject: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Re: [B.A.T.M.A.N.] batman goes looping... |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Sent by: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|b.a.t.m.a.n-bounces@lists.open-mesh.net |
--------------------------------------------------------------------------------------------------------------------------------------------------|
On Monday 10 August 2009 15:20:28 Yang Su wrote:
For the echo cancellation example: the problem happens with the
interaction
between A and B. Because of the asym link, A chooses B as its next
neighbor
towards F. Because the asym path is short, B always receives F's OGM from
A
first (before B receives the OGM with the same seqno from C). As a
result,
from A's point of view, F's OGMs rebroadcasted by B are always echoes and
A
will never update the TQ information towards F via B.
When the quality of any links on the path B to F drops down, B might
choose
A as its next neighbor towards F and a loop forms between A and B.
Ah, here we have the misunderstanding. Let me briefly outline how the echo cancellation works (I refer to your PDF-example):
1. Node C emits a packet. 2. Node B receives it and writes the address of Node C in the previous sender field because it received that packet via C. Now, B rebroadcasts the OGM.
-- B has already rebroadcasted an earlier OGM with the same sequence number (received from A). Will B rebroadcast the OGM with the same sequence number (received from C) again?
3. Node C receives the packet it sent before and discards it as the packet contains its own address in the previous sender field. 4. Node A happily receives the packet (and goes back to step 2).
Cheers, Yang
On Monday 10 August 2009 15:57:29 Yang Su wrote:
-- B has already rebroadcasted an earlier OGM with the same sequence number (received from A). Will B rebroadcast the OGM with the same sequence number (received from C) again?
Excellent question - it should. I will check the code later to verify that this is the case. However, this is not related to the echo cancellation anymore.
Regards, Marek
On Monday 10 August 2009 16:16:49 Marek Lindner wrote:
Excellent question - it should. I will check the code later to verify that this is the case. However, this is not related to the echo cancellation anymore.
I inspected the code for behaviour in the case you outlined. I think our dropping policy might be a bit too strict. I suggest the following behaviour:
* New OGMs are processed and rebroadcasted as usual. * Known OGMs (the sequence number is not new but we did not hear this sequence number via that neighbor) are processed and rebroadcasted as well. * Duplicates (known sequence number via a neighbor that sent us this sequence number before are dropped.
Objections / ideas ?
Regards, Marek
Given the DV nature of Batman, current strict dropping policy seems to make sense, because sending known OGMs (as suggested) does introduce significant overhead but does not propagate so much new information.
In addition, loosing the dropping policy may introduce new problems which might not be easily predictable at this moment, e.g., stability, looping (propagating stale information within the network might cause even more nasty looping problems).
Another way (and maybe simpler ) to fix the problem: instead of loosing the OGM dropping policy, loosing the echo cancellation policy.
It is a very simple fix: in addition to the usual echo cancellation process, I also check the TQ value contained in the echoed OGM. This TQ is announced by the neighbor that broadcasts this echo. If this TQ is worse than my current TQ estimation towards that neighbor, I update the estimation with this TQ.
Does that make sense?
Regards, Yang
|------------> | From: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Marek Lindner lindner_marek@yahoo.de |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | To: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|"The list for a Better Approach To Mobile Ad-hoc Networking" b.a.t.m.a.n@lists.open-mesh.net |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Date: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|11.08.2009 18:44 |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Subject: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Re: [B.A.T.M.A.N.] batman goes looping... |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Sent by: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|b.a.t.m.a.n-bounces@lists.open-mesh.net |
--------------------------------------------------------------------------------------------------------------------------------------------------|
On Monday 10 August 2009 16:16:49 Marek Lindner wrote:
Excellent question - it should. I will check the code later to verify
that
this is the case. However, this is not related to the echo cancellation anymore.
I inspected the code for behaviour in the case you outlined. I think our dropping policy might be a bit too strict. I suggest the following behaviour:
* New OGMs are processed and rebroadcasted as usual. * Known OGMs (the sequence number is not new but we did not hear this sequence number via that neighbor) are processed and rebroadcasted as well. * Duplicates (known sequence number via a neighbor that sent us this sequence number before are dropped.
Objections / ideas ?
Regards, Marek
_______________________________________________ B.A.T.M.A.N mailing list B.A.T.M.A.N@lists.open-mesh.net https://lists.open-mesh.net/mm/listinfo/b.a.t.m.a.n
On Wednesday 12 August 2009 22:34:17 Yang Su wrote:
Given the DV nature of Batman, current strict dropping policy seems to make sense, because sending known OGMs (as suggested) does introduce significant overhead but does not propagate so much new information.
In addition, loosing the dropping policy may introduce new problems which might not be easily predictable at this moment, e.g., stability, looping (propagating stale information within the network might cause even more nasty looping problems).
I agree that this might introduce new problems which is the reason for posting it here. I definitely see the overhead created by that methog. Every better idea will be accepted immediately. :-)
Another way (and maybe simpler ) to fix the problem: instead of loosing the OGM dropping policy, loosing the echo cancellation policy.
It is a very simple fix: in addition to the usual echo cancellation process, I also check the TQ value contained in the echoed OGM. This TQ is announced by the neighbor that broadcasts this echo. If this TQ is worse than my current TQ estimation towards that neighbor, I update the estimation with this TQ.
In your testbed it fixed the issue ? Looking at the example you gave I would expect that the problem is only solved partially. You argued the OGMs from E via A would always arrive faster at B compared to the longer path. Using your method would probably avoid the loop between A and B in the event of a packet loss on the longer path but it would not let route A via the longer path. Am I mistaken ?
Regards, Marek
Maybe I miss some thing here. What do you think would prevent A from choosing via the long route (via B) ?
Cheers, Yang
|------------> | From: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Marek Lindner lindner_marek@yahoo.de |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | To: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|"The list for a Better Approach To Mobile Ad-hoc Networking" b.a.t.m.a.n@lists.open-mesh.net |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Date: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|13.08.2009 11:59 |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Subject: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Re: [B.A.T.M.A.N.] batman goes looping... |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Sent by: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|b.a.t.m.a.n-bounces@lists.open-mesh.net |
--------------------------------------------------------------------------------------------------------------------------------------------------|
On Wednesday 12 August 2009 22:34:17 Yang Su wrote:
Given the DV nature of Batman, current strict dropping policy seems to
make
sense, because sending known OGMs (as suggested) does introduce
significant
overhead but does not propagate so much new information.
In addition, loosing the dropping policy may introduce new problems which might not be easily predictable at this moment, e.g., stability, looping (propagating stale information within the network might cause even more nasty looping problems).
I agree that this might introduce new problems which is the reason for posting it here. I definitely see the overhead created by that methog. Every better
idea will be accepted immediately. :-)
Another way (and maybe simpler ) to fix the problem: instead of loosing
the
OGM dropping policy, loosing the echo cancellation policy.
It is a very simple fix: in addition to the usual echo cancellation process, I also check the TQ value contained in the echoed OGM. This TQ is announced by the neighbor that broadcasts this echo. If this TQ is
worse
than my current TQ estimation towards that neighbor, I update the estimation with this TQ.
In your testbed it fixed the issue ? Looking at the example you gave I would expect that the problem is only solved partially. You argued the OGMs from E via A would always arrive faster at B
compared to the longer path. Using your method would probably avoid the loop between A and B in the event of a packet loss on the longer path but it would not let route A via the longer path. Am I mistaken ?
Regards, Marek _______________________________________________ B.A.T.M.A.N mailing list B.A.T.M.A.N@lists.open-mesh.net https://lists.open-mesh.net/mm/listinfo/b.a.t.m.a.n
On Thursday 13 August 2009 23:07:06 Yang Su wrote:
Maybe I miss some thing here. What do you think would prevent A from choosing via the long route (via B) ?
The starting point (still using your example): * A points to E as it never learned about the long path. * B points to C because it learned about the long path but does not rebroadcast the packets from C because they are slower than the packets via A. * C uses the longer path.
Your idea: Use the echo cancellation's answer to retrieve more information, especially when the TQ which is coming back is worse than my own I update my routing information.
The situation now: * A still points to E because the echos coming from B do not contain TQ values that are worse than its own. * B will probably never route via A because the TQ values are much worse. * C is not affected.
Did I overlook something ?
Regards, Marek
b.a.t.m.a.n@lists.open-mesh.org