Hi Marek
I finally had time to dig into our problems with loops in our chain.
Some background for the list. Yang and I have been using User Mode
Linux (UML) to build a test network for batman advanced. We connect a
number of uml machines together using a modified version of
uml_switch. The modifications allow us to change the packet drop
probability between any two nodes. We have been testing using simple
chains as shown in the attached gif. The black lines show the
currently used links. The red lines are other links which are
currently not used by batman. The black links have a packet drop
probablilty of 0% and the red of 20%.
Our test was to remove uml5 from the network and see how long
batman-adv took to re-route around it. We ping from uml4 to uml6 and
from uml1 to uml9.
We found that uml4->uml6 would recover in around 14 seconds. However
uml1->uml9 took much longer, 65 seconds.
Looking at the routing, we found it went into loops. When sending from
uml1 to uml9, uml1 routes to uml2, uml2 routes back to uml1.
Here are the logs from uml2. I've cut out most of the packets, just
showing OGMs from uml9. There is a simple relationship between the MAC
address and the uml number:
fe:fe:00:00:01:01 - uml1
fe:fe:00:00:02:01 - uml2
fe:fe:00:00:03:01 - uml3
etc...
[ 42949558] Received BATMAN packet via NB: fe:fe:00:00:03:01, IF: eth1 [fe:fe:00:00:02:01] (from OG: fe:fe:00:00:09:01, via old OG: fe:fe:00:00:04:01, seqno 146, tq 218, TTL 44, V 7, IDF 0)
[ 42949558] bidirectional: orig = fe:fe:00:00:09:01 neigh = fe:fe:00:00:03:01 => own_bcast = 64, real recv = 64, local tq: 255, asym_penalty: 255, total tq: 218
[ 42949558] update_originator(): Searching and updating originator entry of received packet
[ 42949558] Updating existing last-hop neighbour of originator
[ 42949558] Drop packet: duplicate packet received
This has been received from uml3 origionally from uml4. The TQ is 218
to uml9 via uml3.
[ 42949559] Received BATMAN packet via NB: fe:fe:00:00:01:01, IF: eth1 [fe:fe:00:00:02:01] (from OG: fe:fe:00:00:09:01, via old OG: fe:fe:00:00:03:01, seqno 146, tq 209, TTL 42, V 7, IDF 0)
[ 42949559] bidirectional: orig = fe:fe:00:00:09:01 neigh = fe:fe:00:00:01:01 => own_bcast = 64, real recv = 64, local tq: 255, asym_penalty: 255, total tq: 209
[ 42949559] update_originator(): Searching and updating originator entry of received packet
[ 42949559] Updating existing last-hop neighbour of originator
[ 42949559] Drop packet: duplicate packet received
This is where is starts to get interesting. This is from uml1,
origionally from uml3. So it has jumped uml2, it used the 20% packet
drop link which exists between uml1 and uml3. Because this is not an
echo, uml2 processes it, and now knows that with a TQ of 209 it can
get to uml9 via uml1.
[ 42949559] Sending own packet (originator fe:fe:00:00:02:01, seqno 155, TQ 255, TTL 50, IDF off) on interface eth1 [fe:fe:00:00:02:01]
[ 42949559] Forwarding aggregated packet (originator fe:fe:00:00:06:01, seqno 152, TQ 232, TTL 46, IDF off) on interface eth1 [fe:fe:00:00:02:01]
[ 42949559] Forwarding aggregated packet (originator fe:fe:00:00:09:01, seqno 146, TQ 215, TTL 43, IDF off) on interface eth1 [fe:fe:00:00:02:01]
[ 42949559] Forwarding packet (originator fe:fe:00:00:01:01, seqno 156, TQ 250, TTL 49, IDF on) on interface eth1 [fe:fe:00:00:02:01]
[ 42949560] Received BATMAN packet via NB: fe:fe:00:00:03:01, IF: eth1 [fe:fe:00:00:02:01] (from OG: fe:fe:00:00:09:01, via old OG: fe:fe:00:00:04:01, seqno 148, tq 150, TTL 45, V 7, IDF 0)
[ 42949560] updating last_seqno: old 146, new 148
[ 42949560] bidirectional: orig = fe:fe:00:00:09:01 neigh = fe:fe:00:00:03:01 => own_bcast = 64, real recv = 64, local tq: 255, asym_penalty: 255, total tq: 150
[ 42949560] update_originator(): Searching and updating originator entry of received packet
[ 42949560] Updating existing last-hop neighbour of originator
[ 42949560] Changing route towards: fe:fe:00:00:09:01 (now via fe:fe:00:00:01:01 - was via fe:fe:00:00:03:01)
[ 42949560] Forwarding packet: rebroadcast originator packet
[ 42949560] Forwarding packet: tq_orig: 150, tq_avg: 209, tq_forw: 204, ttl_orig: 44, ttl_forw: 255
Now things go none optimal :-(
This is from uml3, origionally from uml4. The TQ value has dropped to
150. This will be when we have removed uml5, so the TQ naturally does
drop.
The TQ value via uml3 is now less than the TQ value via uml1. So it
changes its route to go via uml1.
Looking at the logs of uml1, uml1 is always routing to uml9 via uml2.
The problem here i think is to do with the asymetric links algorithms.
When sending out an OGM, the node uses the TQ for its best link to the
originator, not the link the OGM came in on. If the OGM from uml1
origionally from UML3 reported the TQ via that route, the TQ would
very likely be lower. uml2 would then not of choosen to swap to
uml1. However, uml1 reports its best route, which is via uml2. uml2
does not know this, decides to use uml1, and we have a loop.
Does this all hang together correctly? I'm i interpreting this all
right...
How would you suggest fix this?
Thanks
Andrew