Hello!
I also want to thank you a lot for your good comments. And by the way add my two cents (mostly inline) :-) I also want to point out that the draft does not represent the latest state of our implementations. When we started to write the draft we had to start with something well tested and understood. So we decided to to take the protocol as implemented in batman-0.2 as a basis. Some time has passed since then...
On Dienstag 05 August 2008, Juliusz Chroboczek wrote:
Dear all,
For some time now, I've been wanting to comment on the BATMAN protocol. The following are my conclusions from a careful reading of the BATMAN draft dated 7 April 2008, as well as a number of enlightening conversions with Aaron Kaplan and Roman Steiner, to both of whom many thanks.
BATMAN is an eminently original protocol, and one that contains a number of very interesting ideas. However, it is my opinion that in its current state, BATMAN has a number of flaws that are serious enough to warrant a redesign.
The above statement must be taken with a pinch of salt, for at least two reasons. First, there are a number of subtle points in BATMAN that I do not fully understand, so my assessment may be wrong. Second, it is not clear to me to what extent the BATMAN implementation corresponds to the draft; it may be the case that the implementation fixes some of the issues outlined below in ways that are not described in the draft.
Summary
The following are the issues I see with BATMAN, in roughly decreasing order of importance:
- BATMAN's convergence is exponential in the diameter of the network in the presence of packet loss.
I agree! In case of packet loss, there is the issue with the exponentially decreasing probability for an OGM to make it up to the nth hop. And in fact, this problem is faced by every other IP packet as well. Usually you dont want to use a tcp session to a distant node over multiple lossy links because chances for you TCP/IP packets to reach its final destination exponentially decrease with every hop. This way, there could be a natural mechanism to not maintain routes to nodes which you do not want to communicate with anyway. Unfortunately, the exponents of these two exponential decreasing examples are different. The reason for the different exponent is that the 802.11 stack usually retransmits unacknowledged unicast packets several times before giving up. BatMan-eXperimental (BMX) aims to approximate the real exponent by also sending OGMs multiple times in independent broadcast datagrams. If packet aggregation is enabled (see later on) the additional protocol-traffic overhead does not hurt to much because the second OGM transmissions can simply be attached to the next independent OGM aggregation send by this node.
- BATMAN does not contain any loop avoidance mechanisms; in the presence of so-called ``optimistic routing'', BATMAN may exhibit persistent routing loops. Even without optimistic routing, BATMAN will exhibit transitory routing loops.
Here I completely agree with what Simon Wunderlich said. Personally, I am not aware of any scenario that can lead to transient routing loops, not even to a dead node! You might know better, then pleas let me know :-)
Each batman node is only supposed to rebroadcast an OGM which it has received from the neighbor which it has configured itself as its best nexthop to the destination. Therefore, the OGM itself must have travelled a valid path with all traversed nodes having a loop free route configured towards the originator of the OGM. Of course, the configured end-to-end route might break down somewhere along the path just after an OGM has been received. This will cause a temporary dead route. But not a loop. Later on, only a newer OGM (with a larger sequence number), and which once again must have traversed a functional and configured path, can reconfigure any node receiving this OGM.
Another remark: The batmand-0.2 implementation is not doing this completely correct and thus can cause temporary routing loops. The reason is, that it also reconfigures its best nexthop due to new OGMs which were not received via its currently *best-known* nexthop.
- The metric used by BATMAN is not realistic, and will cause it to strongly favour long hops over multiple short ones and to favour asymmetric links taken ``the wrong way''.
There has been a lot of discussion about this shortcomings of batmand-0.2 during the WCW2007 in Graz [1,2,3] and the WCW2008 in Berlin. The MARK IV algorithms and 0.3 implementations consider this shortcoming.
- Since BATMAN does not perform any aggregation of messages into packets, it is inefficient on link technologies with an important per-frame overhead (such as IEEE 802.11).
Two week ago I had the chance to look around in the Leipzig Freifunk net. They have recently integrated batman-experimental with packet aggregation by default into the leipzig freifunk firmware (the aggregation wraps upto 25 OGM into a sigle 300 bytes packet). By that date the randomly examined nodes showed about 150 other batman nodes in their routing table (but there were more batman nodes beyond the individual horizon of the examined nodes). The firmware also collects statistics about routing protocol-traffic overhead. The statistics showed a pretty much similar traffic overhead to olsr. In dense areas the overhead of bmx (with an originator interval of 1 second) seemed even a bit less than that of olsr while at the edges of the mesh it was the other way around.
- Jitter is not compulsory, which may lead to persistent series of collisions.
It is indeed a problem in hidden node scenarios where 802.11 jitter does not help ([4] might be interesting). But when packet aggregation is enabled OGMs are hold back before being aggregated and rebroadcasted. So you have an inherent jitter anyway.
ciao, axel
[1] WCW2007 Graz http://open-mesh.net/batman/experience/WCWGraz2007/batman_Graz_WCW_2007.pdf
[2] Tradeoff: Long versus short but poor-link paths: http://downloads.open-mesh.net/batman/development/sources/batmand-exp_0.3-al...
[3] The impact of asymmetric links: http://downloads.open-mesh.net/batman/development/sources/batmand-exp_0.3-al...
[4] Re-broadcast delay: http://downloads.open-mesh.net/batman/development/sources/batmand-exp_0.3-al...
- Exponential convergence
Consider the following network topology, where all links have a 40% packet loss rate:
A1---A2---A3---A4 / \
/ \ S C \ | \ | B1---B2---B3---B4--B5
Assume an OGM broadcast interval of 1, C will receive one OGM from S every 13 seconds through the A route, and one OGM from S through the B route every 21 seconds.
Suppose now that A1 crashes. Then C has no indication that the B route is better than the A route until the next OGM from S through B, which will happen after roughly 10 seconds on average.
More generally, the time for C to switch to the fallback route is proportional to (1-a)^-k, where a is the per-link loss rate, and k is the number of hops on the fallback route.
- Persistent loops
BATMAN does not contain any loop avoidance mechanisms, nor any for loop detection. Because of that, BATMAN will cause routing loops in some cases which will last up to PURGE_TIMEOUT seconds (256 seconds by default).
Indeed, consider the following topology:
A l1/| / | S |l3 \ | l2\| B
Suppose also that l2 is a very lossy link, so that B has selected A as its next hop for S.
S crashes, and A switches to B as its next hop for S. At this point, B is still using A as its next hop, so we have a temporary routing loop.
After one window time, both A and B are performing opportunistic routing (Section 6.3.1 of the draft) and hence form a routing loop. I may be missing something, but as far as I can tell, the loop will only be eliminated after a PURGE_TIMEOUT.
Due to the convergence issues outlined in point (1) above, BATMAN needs the opportunistic routing mechanism. However, even in the absence of opportunistic routing, transient loop will arise for up to one window time.
- Unrealistic metric
BATMAN does not explicitly use a metric, it ranks routes according to the OGM success probability. Since OGM propagation requires correlated successes over all the links of a route, the resulting metric is
1/product(a_i)
where a_i are the loss probabilities of all the links in the route.
This metric does not accurately reflect the cost of a route for two reasons:
(i) it assumes there is no link-layer ARQ, which is not the case in 802.11; (ii) it only reflects probability in the backward direction.
Point (i) will cause BATMAN to favour long hops over multiple short ones. Point (ii) will cause BATMAN to take asymmetric links in the wrong direction.
- No message aggregation
Most routing protocols send their routing tables as a short sequence of large packets. For example, Babel sends up to 100 routing updates in a single full-size packet; similar techniques are used in OLSR, where the link-state database is sent in as few packets as possible.
BATMAN sends each OGM in its own packet. This will cause a signi- ficant supplementary cost on link-layer technologies on which per-frame overhead is siginificant, such as IEEE 802.11 (with no 802.11e frame bursting).
- Jitter is optional
Section 5.1 of the draft says that ``jitter may be applied to avoid collisions''. Since collisions in the absence of jitter tend to cause global synchronisation and persistent series of collisions, I believe that this should be a MUST.
Acknowledgements
Thanks to Aaron Kaplan and Roman Steiner for discussing a number of the points above with me.
Juliusz Chroboczek
Hello Juliusz!
Axel and Simon are right, in the situation you have described the protocol wouldn't loop. It is a broken route.
S crashes, and A switches to B as its next hop for S. At this point, B is still using A as its next hop, so we have a temporary routing loop.
If S crashes A and B are not going to change their routing tables anymore - since S doesn't send any OGMs anymore which could trigger a change in the routing tables of A and B. So A and B stay put until the garbage collector purges the broken route. The assumption that A switches to B is false.
The reason why I believe that Batman is loop free is, that routes are updated upon receiving OGMs from the best ranking next hop. The best ranking next hop knows the best route better than me, because its likelihood of receiving OGMs is greater than the local likelihood. If a locally received OGM - received via the best ranking neighbor - triggers an update it is granted that the best ranking next hop is already informed - naturally,,, So the chain of potential forwarders is loop free until the destination.
This is particularly so because the current implementations are sequence number based, rather than using timers. This is granting that temporary loops can't occur - which could happen if timers could occasionally trigger purges that are not synced.
This is consistent with the test results of Batman-Experimental at Meraka in a physical 49-node grid. I did some inital tests there on the 4 longest possible routes in the grid (from each corner to each opposite corner, from each center of side to center of opposite side) at -30 dBm transmit power, -30dbm receive attenuation. 4 simultaneous traffic streams were saturating the capacity of these routes, all colliding in the center of the grid. Links were very weak, with high packet loss to single hop neighbors even when the whole network was idle. So a round trip from corner to corner often took 12 - 15 hops.
Under no circumstances did the protocol loop. Also we did run a long series of tests with Batman-EXP and OLSR with ETX and Fisheye enabled, that David Johnson had previously used to compare mesh routing protocols, like AODV, DART, OLSR-RFC, OLSR with ETX .
http://wirelessafrica.meraka.org.za/wiki/images/9/98/Batman_ifip.pdf
cu elektra
Hi -
the Meraka server is not available, so the link to the PDF is currently broken. Use http://downloads.open-mesh.net/batman/misc/batman_ifip.pdf for now.
cu elektra
- BATMAN does not contain any loop avoidance mechanisms; in the presence of so-called ``optimistic routing'', BATMAN may exhibit persistent routing loops. Even without optimistic routing, BATMAN will exhibit transitory routing loops.
Here I completely agree with what Simon Wunderlich said. Personally, I am not aware of any scenario that can lead to transient routing loops, not even to a dead node! You might know better, then pleas let me know :-)
Each batman node is only supposed to rebroadcast an OGM which it has received from the neighbor which it has configured itself as its best nexthop to the destination. Therefore, the OGM itself must have travelled a valid path with all traversed nodes having a loop free route configured towards the originator of the OGM. Of course, the configured end-to-end route might break down somewhere along the path just after an OGM has been received. This will cause a temporary dead route. But not a loop. Later on, only a newer OGM (with a larger sequence number), and which once again must have traversed a functional and configured path, can reconfigure any node receiving this OGM.
Another remark: The batmand-0.2 implementation is not doing this completely correct and thus can cause temporary routing loops. The reason is, that it also reconfigures its best nexthop due to new OGMs which were not received via its currently *best-known* nexthop.
Just for the record: Few days ago I recognized that for my gcc a 32-bit-word shifted by 32 bits is NOT zero. This pointed me to another bug in the batmand-0.2.x implementation which could cause a transient routing loop.
This and the above bug should be fixed with the attached patch.
ciao, axel
b.a.t.m.a.n@lists.open-mesh.org