Hi Marek
I finally had time to dig into our problems with loops in our chain.
Some background for the list. Yang and I have been using User Mode Linux (UML) to build a test network for batman advanced. We connect a number of UML machines together using a modified version of uml_switch. The modifications allow us to change the packet drop probability between any two nodes. We have been testing using simple chains as shown in the attached gif. The black lines show the currently used links. The red lines are other links which are currently not used by batman. The black links have a packet drop probability of 0% and the red links 20%.
Our test was to remove uml5 from the network and see how long batman-adv took to re-route around it. We ping from uml4 to uml6 and from uml1 to uml9.
We found that uml4->uml6 would recover in around 14 seconds. However, uml1->uml9 took much longer: 65 seconds.
Looking at the routing, we found it went into loops. When sending from uml1 to uml9, uml1 routes to uml2, uml2 routes back to uml1.
Here are the logs from uml2. I've cut out most of the packets, just showing OGMs from uml9. There is a simple relationship between the MAC address and the uml number:
fe:fe:00:00:01:01 - uml1
fe:fe:00:00:02:01 - uml2
fe:fe:00:00:03:01 - uml3
etc...
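For anyone reading the logs below, a small helper can do the MAC-to-node translation. This is only a sketch: the name `uml_name` is made up for illustration, and it assumes the fifth octet is the node number (read as hex, which makes no difference for uml1 to uml9).

```python
def uml_name(mac: str) -> str:
    """Translate a test-network MAC such as fe:fe:00:00:09:01 into 'uml9'.

    Assumption: the fifth octet carries the node number. It is parsed as
    hex here; for uml1..uml9 hex and decimal give the same result.
    """
    octets = mac.split(":")
    if len(octets) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return f"uml{int(octets[4], 16)}"


if __name__ == "__main__":
    for mac in ("fe:fe:00:00:01:01", "fe:fe:00:00:03:01", "fe:fe:00:00:09:01"):
        print(mac, "->", uml_name(mac))
```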
[ 42949558] Received BATMAN packet via NB: fe:fe:00:00:03:01, IF: eth1 [fe:fe:00:00:02:01] (from OG: fe:fe:00:00:09:01, via old OG: fe:fe:00:00:04:01, seqno 146, tq 218, TTL 44, V 7, IDF 0)
[ 42949558] bidirectional: orig = fe:fe:00:00:09:01 neigh = fe:fe:00:00:03:01 => own_bcast = 64, real recv = 64, local tq: 255, asym_penalty: 255, total tq: 218
[ 42949558] update_originator(): Searching and updating originator entry of received packet
[ 42949558] Updating existing last-hop neighbour of originator
[ 42949558] Drop packet: duplicate packet received
This was received from uml3, originally from uml4. The TQ to uml9 via uml3 is 218.
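The "bidirectional:" line shows how the total TQ is put together. The sketch below is my reading of that calculation, not code taken from batman-adv: the formula, the window size of 64 and the name `total_tq` are all assumptions, chosen so that a perfect link (own_bcast = real recv = 64) leaves the received TQ unchanged, as the log shows.

```python
TQ_MAX = 255
WINDOW = 64  # assumed local window size; matches own_bcast/real recv = 64 in the log


def total_tq(packet_tq: int, own_bcast: int, real_recv: int) -> int:
    """Combine the TQ carried in an OGM with the local link quality.

    Assumed formula (integer arithmetic):
      local_tq     = TQ_MAX * own_bcast / real_recv, capped at TQ_MAX
      asym_penalty = TQ_MAX - TQ_MAX * (WINDOW - real_recv)^3 / WINDOW^3
      total        = packet_tq * local_tq * asym_penalty / TQ_MAX^2
    """
    if real_recv == 0:
        return 0  # nothing heard from this neighbour, link is unusable
    local_tq = min(TQ_MAX, TQ_MAX * own_bcast // real_recv)
    missed = WINDOW - real_recv
    asym_penalty = TQ_MAX - TQ_MAX * missed**3 // WINDOW**3
    return packet_tq * local_tq * asym_penalty // (TQ_MAX * TQ_MAX)


if __name__ == "__main__":
    # Values from the log entry above: tq 218, own_bcast 64, real recv 64.
    print(total_tq(218, 64, 64))
```

With a perfect local link both factors are 255, so the total TQ equals the TQ carried in the packet.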
[ 42949559] Received BATMAN packet via NB: fe:fe:00:00:01:01, IF: eth1 [fe:fe:00:00:02:01] (from OG: fe:fe:00:00:09:01, via old OG: fe:fe:00:00:03:01, seqno 146, tq 209, TTL 42, V 7, IDF 0)
[ 42949559] bidirectional: orig = fe:fe:00:00:09:01 neigh = fe:fe:00:00:01:01 => own_bcast = 64, real recv = 64, local tq: 255, asym_penalty: 255, total tq: 209
[ 42949559] update_originator(): Searching and updating originator entry of received packet
[ 42949559] Updating existing last-hop neighbour of originator
[ 42949559] Drop packet: duplicate packet received
This is where it starts to get interesting. This is from uml1, originally from uml3. So it has skipped uml2, using the 20% packet drop link which exists between uml1 and uml3. Because this is not an echo, uml2 processes it, and now knows that it can get to uml9 via uml1 with a TQ of 209.
[ 42949559] Sending own packet (originator fe:fe:00:00:02:01, seqno 155, TQ 255, TTL 50, IDF off) on interface eth1 [fe:fe:00:00:02:01]
[ 42949559] Forwarding aggregated packet (originator fe:fe:00:00:06:01, seqno 152, TQ 232, TTL 46, IDF off) on interface eth1 [fe:fe:00:00:02:01]
[ 42949559] Forwarding aggregated packet (originator fe:fe:00:00:09:01, seqno 146, TQ 215, TTL 43, IDF off) on interface eth1 [fe:fe:00:00:02:01]
[ 42949559] Forwarding packet (originator fe:fe:00:00:01:01, seqno 156, TQ 250, TTL 49, IDF on) on interface eth1 [fe:fe:00:00:02:01]
[ 42949560] Received BATMAN packet via NB: fe:fe:00:00:03:01, IF: eth1 [fe:fe:00:00:02:01] (from OG: fe:fe:00:00:09:01, via old OG: fe:fe:00:00:04:01, seqno 148, tq 150, TTL 45, V 7, IDF 0)
[ 42949560] updating last_seqno: old 146, new 148
[ 42949560] bidirectional: orig = fe:fe:00:00:09:01 neigh = fe:fe:00:00:03:01 => own_bcast = 64, real recv = 64, local tq: 255, asym_penalty: 255, total tq: 150
[ 42949560] update_originator(): Searching and updating originator entry of received packet
[ 42949560] Updating existing last-hop neighbour of originator
[ 42949560] Changing route towards: fe:fe:00:00:09:01 (now via fe:fe:00:00:01:01 - was via fe:fe:00:00:03:01)
[ 42949560] Forwarding packet: rebroadcast originator packet
[ 42949560] Forwarding packet: tq_orig: 150, tq_avg: 209, tq_forw: 204, ttl_orig: 44, ttl_forw: 255
Now things go non-optimal :-(
This is from uml3, originally from uml4. The TQ value has dropped to 150. This is after we removed uml5, so the TQ naturally drops.
The TQ value via uml3 is now less than the TQ value via uml1, so uml2 changes its route to uml9 to go via uml1.
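The route change and the forwarded TQ can be reproduced with a few lines. This is a simplified sketch: it treats the TQ values from the log as if they were the per-neighbour averages, and the hop penalty of 5 is an assumption (it happens to reproduce the tq_avg 209 -> tq_forw 204 step in the last log line). The function names are made up for illustration.

```python
TQ_MAX = 255
HOP_PENALTY = 5  # assumed; reproduces tq_avg 209 -> tq_forw 204 from the log


def forwarded_tq(tq_avg: int) -> int:
    """TQ written into the rebroadcast OGM: the average scaled by the hop penalty."""
    return tq_avg * (TQ_MAX - HOP_PENALTY) // TQ_MAX


def best_next_hop(tq_by_neigh: dict) -> str:
    """Pick the neighbour with the highest TQ towards the originator."""
    return max(tq_by_neigh, key=tq_by_neigh.get)


# uml2's view of the route to uml9, TQ values taken from the log.
before = {"uml3": 218, "uml1": 209}  # uml5 still present
after = {"uml3": 150, "uml1": 209}   # uml5 removed, TQ via uml3 drops

if __name__ == "__main__":
    print("before:", best_next_hop(before))
    print("after: ", best_next_hop(after))
    print("tq_forw:", forwarded_tq(after[best_next_hop(after)]))
```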
Looking at the logs of uml1, uml1 is always routing to uml9 via uml2. The problem here, I think, is to do with the asymmetric link algorithm. When sending out an OGM, the node uses the TQ for its best link to the originator, not the link the OGM came in on. If the OGM from uml1, originally from uml3, reported the TQ via that route, the TQ would very likely be lower. uml2 would then not have chosen to swap to uml1. However, uml1 reports its best route, which is via uml2. uml2 does not know this, decides to use uml1, and we have a loop.
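To make the suspected mechanism concrete, here is a toy model of that decision. Everything in it is hypothetical: the TQ values in uml1's table (213 via uml2, 140 via the lossy link to uml3), the hop penalty of 5 and the function names are illustrative assumptions, not values from our logs or code from batman-adv. It only compares the two advertising policies described above.

```python
TQ_MAX = 255
HOP_PENALTY = 5  # assumed value


def hop_penalty(tq: int) -> int:
    return tq * (TQ_MAX - HOP_PENALTY) // TQ_MAX


def tq_advertised_by_uml1(uml1_tq: dict, incoming: str, use_best: bool) -> int:
    """TQ uml1 puts in the OGM it rebroadcasts.

    use_best=True : advertise the TQ of uml1's best route (the behaviour I suspect)
    use_best=False: advertise the TQ of the link the OGM actually arrived on
    """
    tq = max(uml1_tq.values()) if use_best else uml1_tq[incoming]
    return hop_penalty(tq)


def uml2_next_hop(tq_via_uml3: int, tq_via_uml1: int) -> str:
    return "uml1" if tq_via_uml1 > tq_via_uml3 else "uml3"


# Hypothetical TQ towards uml9 as seen by uml1, keyed by neighbour.
uml1_tq = {"uml2": 213, "uml3": 140}
uml1_next_hop = max(uml1_tq, key=uml1_tq.get)  # uml1 routes via uml2

if __name__ == "__main__":
    for use_best in (True, False):
        adv = tq_advertised_by_uml1(uml1_tq, "uml3", use_best)
        hop = uml2_next_hop(150, adv)  # 150 is the TQ via uml3 after uml5 is removed
        loop = hop == "uml1" and uml1_next_hop == "uml2"
        print(f"use_best={use_best}: advertised={adv}, uml2 -> {hop}, loop={loop}")
```

In this model, advertising the best route makes uml2 pick uml1 while uml1 still points at uml2, which is the loop we see. Advertising the TQ of the incoming link keeps uml2 on uml3.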
Does this all hang together correctly? Am I interpreting this all right?
How would you suggest we fix this?
Thanks,
Andrew