[B.A.T.M.A.N.] Suggestion for routing improvement on poor links


whangarei & opua

19 Feb 2014

8:56 p.m.

Hi all,

I am running batman-adv 2013.4.0 on an out-of-the-box hamburg.freifunk.net router. It has 3 uplinks over WLAN, with trees in the way and some rain at the moment... So not perfect conditions...

Via 'batctl o' I have seen 3 uplinks to the other side of the park (logging was unfortunately not compiled in). Of course, one of them was the best link based on the metric (or whatever you call it) at that time.

But I have also seen that the metric did not change during packet loss => the last-seen time on this best link kept increasing :-(

So there was an entry with a last-seen of up to 180 sec, while the other links had much better update times because of less packet loss, but not as good a metric as the main link at its 'best time'. So the 'old' best link was still active, at least according to the metric shown by 'batctl o'. I expect the traffic will be sent this way...

Would it maybe be a good idea to decrease the metric step by step once the last-seen time of a link has grown to 15 or 20 sec, on the best link (or on all links with an increased last-seen time), so that a link that is more reliable at this time has a chance to be activated? I am only thinking about the local routing decision, not about announcing anything to a neighbor...
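
A minimal sketch of the proposal above, in Python. It is not batman-adv code: the function name `aged_metric`, the 15 sec grace period, the 5 sec step, the penalty of 10 per step and the three example links are all assumptions made up for illustration.

```python
def aged_metric(metric, last_seen_s, grace_s=15, step_s=5, step_penalty=10):
    """Return a locally aged copy of `metric` (hypothetical scheme).

    The metric is left alone while last_seen_s is within the grace period.
    After that, one penalty is subtracted immediately and one more for
    every further `step_s` seconds without an update, never going below 0.
    """
    if last_seen_s <= grace_s:
        return metric
    steps = (last_seen_s - grace_s) // step_s + 1
    return max(0, metric - steps * step_penalty)


# Made-up example: name -> (metric, last-seen in seconds).
links = {
    "uplink1": (240, 180),  # best metric, but no update for 180 sec
    "uplink2": (200, 2),
    "uplink3": (180, 1),
}

# Local routing decision based on the aged metric only;
# nothing is announced to any neighbor.
best = max(links, key=lambda name: aged_metric(*links[name]))
```

With these made-up numbers the stale "uplink1" ages down to 0, so the local decision falls on "uplink2" even though its raw metric is lower.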

Sorry, I don't have much experience with WLAN links and am not familiar with programming, only with some layer 3 routing protocols... I hope you understand my issue anyway :)

Bye, Joe


Antonio Quartulli

19 Feb 2014

9:27 p.m.

If I understood your question properly, I'd say that batman-adv already does (in a similar way) what you are suggesting.

First of all, let me try to summarize your scenario. You have:
- a source node S, where you are reading the output of "batctl o"
- a destination node D, whose entry you check in the output of the command above
- two neighbours of S, say N1 and N2
- N1 is the best nexthop towards D
- at some point the link between S and N1 becomes unusable and we have really high packet loss -> no OGMs are received via this neighbour anymore.

At this point S still receives the OGMs generated by D via N2. Each time one of these packets is received the metric towards D "via N1" decreases a little bit. When this metric "via N1" reaches a value that is smaller than the metric "via N2" we have a route change: N2 becomes the best nexthop. You can check this by monitoring the TQ (name of the batman-adv metric) towards D via N1 in "batctl o" (this command prints the "originator table").

How many seconds it takes to see this switch depends on the current metric values via N1 and via N2.

I hope this clarifies things.

Cheers,

-- Antonio Quartulli

whangarei & opua

10:12 p.m.

Hi Antonio,

I only have a live environment, so I only have access to my own router, but anyway, it looks like you are aware of this issue...

But I have seen no change of the TQ over time (up to 180 sec) on my side with my (maybe old) version of batman. Is it a new feature, or is it maybe not working as expected in my old version? :(

Or maybe I'm a little bit blind sometimes? ;-)

Joe

Antonio Quartulli

10:30 p.m.

On 19/02/14 23:12, whangarei & opua wrote:

> But I have seen no change of the TQ over time (up to 180 sec) on my side with my (maybe old) version of batman. Is it a new feature, or is it maybe not working as expected in my old version? :(

I think this is something that has been part of batman-adv since the beginning (somebody else can confirm this).

> Or maybe I'm a little bit blind sometimes? ;-)

I have the feeling that you are not looking at the right entry in the originator table :-)

You may want to report some output/real numbers (at different times) so that we can comment on those.

Cheers,

-- Antonio Quartulli

Andrew Lunn

20 Feb 2014

8:54 a.m.

On Wed, Feb 19, 2014 at 11:30:36PM +0100, Antonio Quartulli wrote:

> I think this is something that has been part of batman-adv since the beginning (somebody else can confirm this).

The routing protocol is known to have problems when a node suddenly disappears. A received OGM is what triggers updates to the metric. If you stop receiving OGMs, the metric is no longer updated until the node is purged as dead. This causes problems, for example, with nomadic/mobile nodes. They can go around a corner, lose line of sight, and still be considered the best route until purged as dead.
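
The behaviour described in this paragraph can be sketched as follows. This is a simplified Python model, not the kernel code: `NeighborEntry`, `on_ogm`, `is_dead` and the 200 sec purge timeout are illustrative assumptions.

```python
PURGE_TIMEOUT_S = 200  # assumed purge timeout, for illustration only


class NeighborEntry:
    """Metric state for one neighbour; it changes only when an OGM arrives."""

    def __init__(self, metric, now):
        self.metric = metric
        self.last_seen = now

    def on_ogm(self, metric, now):
        # The only place where the metric and the last-seen time are updated.
        self.metric = metric
        self.last_seen = now

    def is_dead(self, now):
        # There is no ageing in between: the entry is either kept or purged.
        return now - self.last_seen > PURGE_TIMEOUT_S


entry = NeighborEntry(metric=250, now=0)
entry.on_ogm(metric=245, now=1)  # last OGM before the node disappears

# (time, metric, dead?) at a few later points in time, with no OGM received.
snapshots = [(t, entry.metric, entry.is_dead(t)) for t in (60, 180, 202)]
```

In this model the metric still reads 245 after 180 sec of silence, and the entry only goes away once the purge timeout has passed.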

This design problem will be fixed with BATMAN V. It has a second protocol, used between one-hop peers, which should quickly detect if a peer has disappeared.

Andrew

Antonio Quartulli

9:03 a.m.

On 20/02/14 09:54, Andrew Lunn wrote:

> The routing protocol is known to have problems when a node suddenly disappears. A received OGM is what triggers updates to the metric. If you stop receiving OGMs, the metric is no longer updated until the node is purged as dead. This causes problems, for example, with nomadic/mobile nodes. They can go around a corner, lose line of sight, and still be considered the best route until purged as dead.

But if you keep receiving OGMs via another neighbour you will have a route switch *before* the old nexthop is considered as dead.

-- Antonio Quartulli

Andrew Lunn

9:09 a.m.

On Thu, Feb 20, 2014 at 10:03:01AM +0100, Antonio Quartulli wrote:

> But if you keep receiving OGMs via another neighbour you will have a route switch *before* the old nexthop is considered as dead.

Hi Antonio

That is not what I have seen in practice. Because the metric is good, and does not degrade, it stays as the best route. That is one of the reasons Linus developed NDP while at Ascom.

Andrew

Antonio Quartulli

9:44 a.m.

On 20/02/14 10:09, Andrew Lunn wrote:

> Hi Antonio

Hi Andrew,

> That is not what I have seen in practice. Because the metric is good, and does not degrade,

The missing degradation is the part where I don't agree.

Just to be sure we understand each other, I am talking about the scenario depicted in this picture:

http://www.open-mesh.org/attachments/download/52/triangle.png

'A' is the source node and 'B' is our destination. B moves and breaks the line-of-sight with A, thus making the A<->B link completely unusable (we assume that packet loss on A<->B is now 100%).

At this point A still receives B's OGMs via N1.

According to batadv_iv_ogm_orig_update() (in bat_iv_ogm.c) each time a packet with a _new_seqno_ is received the global window of _each_ neighbour for the given originator is shifted by one slot and the averages are computed again.

This operation makes the average degrade because we are now averaging N-1 old values and one 0 (with N being the size of the global window). On the next OGM it will be worse: average on N-2 values and two 0s. And so on..

Doesn't this mean that the metric is degrading (consider that the metric is the average)?

Later in the same function, after having shifted all the windows and recomputed all the averages, batman-adv checks if the route switch can now happen:

    if (router_ifinfo->bat_iv.tq_avg > neigh_ifinfo->bat_iv.tq_avg)

(line 1076 of bat_iv_ogm.c: tq_avg of the current router is compared to tq_avg of the neighbour from which we have received the OGM)

At some point this condition will evaluate to false.
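
A simplified Python model of the argument above, assuming a global window of 5 slots and an average that counts empty slots as 0, exactly as this message describes it. The names `shift`, `avg_with_zeros` and `steps_until_switch` and the TQ values are illustrative assumptions; the real logic lives in batadv_iv_ogm_orig_update().

```python
WINDOW = 5  # assumed size of the global window


def shift(window, value):
    """Drop the oldest slot and append the value for the newest seqno."""
    return window[1:] + [value]


def avg_with_zeros(window):
    """Plain average over all slots, empty (0) slots included."""
    return sum(window) // len(window)


def steps_until_switch(router_tq, neigh_tq, window=WINDOW):
    """Count new seqnos until 'router avg > neigh avg' stops being true.

    The current router no longer delivers OGMs (its slots fill with 0),
    while the other neighbour keeps delivering them at a constant TQ.
    """
    router = [router_tq] * window
    neigh = [neigh_tq] * window
    for seqno in range(1, window + 1):
        router = shift(router, 0)        # missed via the current router
        neigh = shift(neigh, neigh_tq)   # received via the other neighbour
        if not avg_with_zeros(router) > avg_with_zeros(neigh):
            return seqno
    return None
```

Under these assumptions `steps_until_switch(250, 240)` returns 1 while `steps_until_switch(250, 40)` returns 5, which illustrates why the time until the switch depends on the metric values via both neighbours.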

> it stays as the best route. That is one of the reasons Linus developed NDP while at Ascom.

Of course the current mechanism is far from being "fast", therefore we all wait for NDP/ELP to make the whole thing much more responsive :-)

Cheers,

-- Antonio Quartulli

Andrew Lunn

10:10 a.m.

> Just to be sure we understand each other, I am talking about the scenario depicted in this picture:
>
> http://www.open-mesh.org/attachments/download/52/triangle.png

Thanks for the diagram. Yes, Linus and I had a somewhat similar setup. We had more nodes involved, and B was walking around the inside of a building.

> According to batadv_iv_ogm_orig_update() (in bat_iv_ogm.c) each time a packet with a _new_seqno_ is received the global window of _each_ neighbour for the given originator is shifted by one slot and the averages are computed again.

It has been a couple of years since Linus investigated this, so maybe things have changed. If it does work like this, great, that helps solve a problem we had. I don't currently have access to a system to test this though.

Andrew

Marek Lindner

10:33 a.m.

On Thursday 20 February 2014 10:44:37 Antonio Quartulli wrote:

> According to batadv_iv_ogm_orig_update() (in bat_iv_ogm.c) each time a packet with a _new_seqno_ is received the global window of _each_ neighbour for the given originator is shifted by one slot and the averages are computed again.
>
> This operation makes the average degrade because we are now averaging N-1 old values and one 0 (with N being the size of the global window). On the next OGM it will be worse: average on N-2 values and two 0s. And so on..
>
> Doesn't this mean that the metric is degrading (consider that the metric is the average)?

Your explanation is mostly correct - one minor objection though: Values of '0' are not considered when the global average is computed (bat_iv_ogm.c line 73). The idea being: The unilateral degradation of TQ values without any network event will eventually lead to loops. Nonetheless, the general idea of your statement still holds true: Since new sequence numbers keep coming in via an alternative, albeit less optimal route, the stale route will be purged as soon as the global TQ window has elapsed (default: 5 seqnos). Long before the neighbor timeout has had the time to purge the neighbor entirely.
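
A small Python sketch of the averaging as described in this reply: slots holding 0 are skipped, so the average of a stale route does not degrade gradually but drops to 0 once the whole window has elapsed. The names `ring_buffer_avg` and `avg_history` and the TQ value 250 are illustrative assumptions; the window size of 5 is the default mentioned above.

```python
TQ_GLOBAL_WINDOW_SIZE = 5  # default global window mentioned in this reply


def ring_buffer_avg(window):
    """Average over the non-zero slots only; 0 if every slot is empty."""
    nonzero = [value for value in window if value != 0]
    if not nonzero:
        return 0
    return sum(nonzero) // len(nonzero)


def avg_history(start_tq, missed):
    """Average of a stale route after each of `missed` new seqnos.

    Each new seqno that arrives only via another neighbour shifts the
    stale route's window by one slot and fills the new slot with 0.
    """
    window = [start_tq] * TQ_GLOBAL_WINDOW_SIZE
    history = []
    for _ in range(missed):
        window = window[1:] + [0]
        history.append(ring_buffer_avg(window))
    return history
```

In this model `avg_history(250, 5)` yields 250 for the first four missed seqnos and 0 on the fifth, which matches the statement that the stale route is replaced as soon as the global window has elapsed.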

@Andrew: The algorithm always worked that way. In fact, it was your suggestion to reduce the global window to 5 seqnos in order to speed it up. Furthermore, ELP only improves reaction time on a local basis (single hop neighborhood). Network-wide route updates are as slow as before which is why we had to devise yet-another-improvement: RIP http://www.open-mesh.org/projects/batman-adv/wiki/RIP

Cheers, Marek
