- More information about the UML based test setup: a chain consisting of 9 nodes (each node is an uml instatnce, uml1, uml2 ... uml9). Direct neighbors (distance=1 hop) can talk to each other with 0% link loss rate. Nodes that are 2 hop away can talk to each other with 20% link loss rate. Nodes that are more than 2 hops away cannot communicate to each other. We shut down the node in the middle (uml5) and measure (with ping) the time it takes for batman to re-setup the end-to-end routes (from uml1 to uml9).
- We observe extensive tmp routing loops during the rerouting events. It often takes batman more than 1 minutes to get any packet through after the rerouting and it takes even longer (several minutes) before batman stabilizes.
- The cause for the loops:
Let's study the interaction between uml1 and uml2:
OGMs from uml9 may arrive at uml1 and uml2 at roughly the same time (because both of them has a shortest path of 6-hop from uml9). uml1 and uml2 will forward those OGMs to each other. Depending on the timing relations between uml1 and uml2, there are four possible combinations:
Case 1: first uml1 forwards the OGM received from uml3 to uml2, then uml 2 forwards the OGM received from uml3 to uml1.
Case 2: first uml1 forwards the OGM received from uml3 to uml2, then uml 2 forwards the OGM received from uml1 back to uml1.
Case 3: first uml2 forwards the OGM received from uml3 to uml1, then uml1 forwards the OGM received from uml3 to uml2.
Case 4: first uml2 forwards the OGM received from uml3 to uml1, then uml1 forwards the OGM received from uml2 back to uml2.
When the quality of the path from uml1 to uml9 drops down dramatically (e.g., in our case, path loss rate increases from 0% to 20% after we remove uml5), 3 out of 4 cases (case 1,2,4) may cause looping between uml1 and uml2. With the interaction of echo cancellation mechanism, case 2 may even lead batman into very deep looping.
- To fix the problem:
I explored two ideas:
1. Similar to what Marek proposed. But push it to the extreme: switch to a neighbor only when it has the best tq AND it has the newest seqno. This fix along seems to already solve the looping problem in the test cases. It reduces the rerouting time from more than 1 minutes to less than 15 seconds. Marek: I also tried the patch you sent. It didn't help in this setup.
2. Relaxed echo cancellation. This is based on the following observation: the TQ value that a node puts into OGM is completely decoupled with "from which neighbor this OGM is received". As a result, the TQ value contained in the echoed OGM represent the real TQ value at the neighbor which echoed this OGM. The current echo cancellation implementation just drops all the echoed OGM. This may prevent the node from updating the information towards the neighbor that echos the OGM. In the extreme case, the information towards that neighbor may becomes completely stale (similar to what happens in case 2). The change I made: Always check the TQ contained in the echoed OGMs. When it is worse than the avg TQ towards that neighbor, we use this TQ reading to update the avg TQ towards that neighbor. This change didn't show any effect during the chain tests. However, I still include this change in the patch to bring up the discussion.
In the attachment, I append a patch which contains a preliminary implementation for both ideas.
Cheers, Yang (See attached file: routing-loops.patch)
|------------> | From: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Marek Lindner lindner_marek@yahoo.de |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | To: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|b.a.t.m.a.n@lists.open-mesh.net |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Cc: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Andrew Lunn andrew.lunn@ascom.ch, Yang Su yang.su@ascom.ch |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Date: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|27.07.2009 18:21 |
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------> | Subject: | |------------>
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Re: [B.A.T.M.A.N.] batman goes looping... |
--------------------------------------------------------------------------------------------------------------------------------------------------|
Hi,
Looking at the logs of uml1, uml1 is always routing to uml9 via uml2. The problem here i think is to do with the asymetric links algorithms. When sending out an OGM, the node uses the TQ for its best link to the originator, not the link the OGM came in on. If the OGM from uml1 origionally from UML3 reported the TQ via that route, the TQ would very likely be lower. uml2 would then not of choosen to swap to uml1. However, uml1 reports its best route, which is via uml2. uml2 does not know this, decides to use uml1, and we have a loop.
Does this all hang together correctly? I'm i interpreting this all right...
How would you suggest fix this?
I would tend to say it is a bug in the routing selection code. UML2 switches the route because it compare thes "negative" TQ message of seqno N with a TQ of seqno N - x. I suggest the following fix:
If we receive a TQ value via a neighbor that is smaller than the previous TQ that we received via that neighbor we don't change the route to a neighbor which did not send us the same or newer seqno.
That way your scenario should not happen because uml2 would not switch. On the other hand if uml1 really has a better route uml2 would switch as soon as the packet with a new seqno via uml1 arrives. What do you think ?
I attached a patch that should do exactly that. As I'm travelling right now
I'm not able to test it before next week. If you find the time to do so, please let me know about your findings. This patch is not as strict as it could be. It might be necessary to rework it as soon as the concept has been proven.
Thanks again for this thorough analysis. Let us know if you find more. :-)
Regards, Marek