Re: [B.A.T.M.A.N.] batman goes looping...

30 Jul 2009


      -  More information about the UML based test setup: a chain consisting of 9
nodes (each node is an uml instatnce, uml1, uml2 ... uml9). Direct
neighbors (distance=1 hop) can talk to each other with 0% link loss rate.
Nodes that are 2 hop away can talk to each other with 20% link loss rate.
Nodes that are more than 2 hops away cannot communicate to each other. We
shut down the node in the middle (uml5) and measure (with ping) the time it
takes for batman to re-setup the end-to-end routes (from uml1 to uml9).
-  We observe extensive tmp routing loops during the rerouting events. It
often takes batman more than 1 minutes to get any packet through after the
rerouting and it takes even longer (several minutes) before batman
stabilizes.
-  The cause for the loops:
Let's study the interaction between uml1 and uml2:
OGMs from uml9 may arrive at uml1 and uml2 at roughly the same time
(because both of them has a shortest path of 6-hop from uml9).  uml1 and
uml2 will forward those OGMs to each other.  Depending on the timing
relations between uml1 and uml2, there are four possible combinations:
Case 1: first uml1 forwards the OGM received from uml3 to uml2,  then  uml
2 forwards the OGM received from uml3 to uml1.
Case 2: first uml1 forwards the OGM received from uml3 to uml2, then uml 2
forwards the OGM received from uml1 back to uml1.
Case 3: first uml2 forwards the OGM received from uml3 to uml1, then uml1
forwards the OGM received from uml3 to uml2.
Case 4: first uml2 forwards the OGM received from uml3 to uml1, then uml1
forwards the OGM received from uml2 back to uml2.
When the quality of the path from uml1 to uml9 drops down dramatically
(e.g., in our case, path loss rate increases from 0% to 20% after we remove
uml5), 3 out of 4 cases (case 1,2,4) may cause looping between uml1 and
uml2. With the interaction of echo cancellation mechanism, case 2 may even
lead batman into very deep looping.
- To fix the problem:
I explored two ideas:
1.  Similar to what Marek proposed. But push it to the extreme: switch to a
neighbor only when it has the best tq AND it has the newest seqno.  This
fix along seems to already solve the looping problem in the test cases. It
reduces the rerouting time from more than 1 minutes to less than 15
seconds.  Marek: I also tried the patch you sent. It didn't help in this
setup.
2. Relaxed echo cancellation. This is based on  the following observation:
the TQ value that a node puts into OGM is completely decoupled with "from
which neighbor this OGM is received".   As a result, the TQ value contained
in the echoed OGM represent the real TQ value at the neighbor which echoed
this OGM. The current echo cancellation implementation just drops all the
echoed OGM. This may prevent the node from updating the information towards
the neighbor that echos the OGM. In the extreme case, the information
towards that neighbor may becomes completely stale (similar to what happens
in case 2).  The change I  made:  Always check the TQ contained in the
echoed OGMs. When it is worse than the avg TQ towards that neighbor, we use
this TQ reading to update the avg TQ towards that neighbor. This change
didn't show any effect during the chain tests. However, I still include
this change in the patch to bring up the discussion.
In the attachment, I append a patch which contains a preliminary
implementation for both ideas.
Cheers,
Yang
(See attached file: routing-loops.patch)
|------------>
| From:      |
|------------>
...
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Marek Lindner lindner_marek@yahoo.de                                                                                                            |
...
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| To:        |
|------------>
...
--------------------------------------------------------------------------------------------------------------------------------------------------|
|b.a.t.m.a.n@lists.open-mesh.net                                                                                                                   |
...
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Cc:        |
|------------>
...
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Andrew Lunn andrew.lunn@ascom.ch, Yang Su yang.su@ascom.ch                                                                                    |
...
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Date:      |
|------------>
...
--------------------------------------------------------------------------------------------------------------------------------------------------|
|27.07.2009 18:21                                                                                                                                  |
...
--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Subject:   |
|------------>
...
--------------------------------------------------------------------------------------------------------------------------------------------------|
|Re: [B.A.T.M.A.N.] batman goes looping...                                                                                                         |
...
--------------------------------------------------------------------------------------------------------------------------------------------------|
Hi,
...
Looking at the logs of uml1, uml1 is always routing to uml9 via uml2.
The problem here i think is to do with the asymetric links algorithms.
When sending out an OGM, the node uses the TQ for its best link to the
originator, not the link the OGM came in on. If the OGM from uml1
origionally from UML3 reported the TQ via that route, the TQ would
very likely be lower. uml2 would then not of choosen to swap to
uml1. However, uml1 reports its best route, which is via uml2. uml2
does not know this, decides to use uml1, and we have a loop.
Does this all hang together correctly? I'm i interpreting this all
right...
How would you suggest fix this?
I would tend to say it is a bug in the routing selection code. UML2
switches
the route because it compare thes "negative" TQ message of seqno N with a
TQ
of seqno N - x. I suggest the following fix:
If we receive a TQ value via a neighbor that is smaller than the previous
TQ
that we received via that neighbor we don't change the route to a neighbor
which did not send us the same or newer seqno.
That way your scenario should not happen because uml2 would not switch. On
the
other hand if uml1 really has a better route uml2 would switch as soon as
the
packet with a new seqno via uml1 arrives. What do you think ?
I attached a patch that should do exactly that. As I'm travelling right now
I'm not able to test it before next week. If you find the time to do so,
please
let me know about your findings. This patch is not as strict as it could
be. It
might be necessary to rework it as soon as the concept has been proven.
Thanks again for this thorough analysis. Let us know if you find more.  :-)
Regards,
Marek

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

Re: [B.A.T.M.A.N.] batman goes looping...