[PATCH RFC] batman-adv: BATMAN V: use/prefer 11s airtime link metric


Linus Lüssing

18 Jan 2025

12:35 a.m.

With an 11s interface and HWMP, the mesh code already keeps track of a throughput estimate internally, as specified by 802.11-2020, section 14.9.2. The HWMP code even makes use of the Minstrel-provided expected throughput if available, so its estimate is very close to this expected throughput value, except that the specification adds a constant penalty for: "Channel access overhead (in μs), which includes frame headers, training sequences, access protocol frames, etc."

When no expected throughput is available then HWMP keeps track of the average packet delivery error rate and average phy rate to calculate its own expected throughput value.

So the 11s airtime link metric should be a slightly better estimate than the expected throughput provided by Minstrel, and it should be significantly better than our fallback guesstimate of the raw PHY rate divided by 3.

Therefore this should significantly improve the accuracy of BATMAN V when using drivers like ath10k/ath11k/ath12k/mt76, none of which implement/export expected throughput information.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
---
RFC because:
* only tested in a VM with mac80211_hwsim, checked that the value from sinfo.airtime_link_metric is used and that "batctl o"/"batctl n" still (nearly) matches the "expected throughput" in "iw dev wlan0 station dump"
* still needs testing / verification on real devices
* I'm a bit confused about the extra "* 100" I had to apply to make the values match, not quite sure where that comes from?

 net/batman-adv/bat_v_elp.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/net/batman-adv/bat_v_elp.c b/net/batman-adv/bat_v_elp.c
index 1d704574e6bf..014489f7f947 100644
--- a/net/batman-adv/bat_v_elp.c
+++ b/net/batman-adv/bat_v_elp.c
@@ -18,6 +18,7 @@
 #include <linux/if_ether.h>
 #include <linux/jiffies.h>
 #include <linux/kref.h>
+#include <linux/limits.h>
 #include <linux/minmax.h>
 #include <linux/netdevice.h>
 #include <linux/nl80211.h>
@@ -56,6 +57,25 @@ static void batadv_v_elp_start_timer(struct batadv_hard_iface *hard_iface)
 			   msecs_to_jiffies(msecs));
 }
 
+/**
+ * batadv_v_elp_get_throughput_from_11s() - get the throughput from 11s link
+ * @airtime: airtime link metric to a neighbor from an 11s link
+ *
+ * Return: The throughput towards the given neighbour in multiples of 100kpbs
+ *         (a value of '1' equals 0.1Mbps, '10' equals 1Mbps, etc).
+ */
+static u32 batadv_v_elp_get_throughput_from_11s(u32 airtime)
+{
+	const int tu_to_airtime_unit = 100;
+	const int test_frame_len = 8192;
+	const int tu_to_us = 1024;
+
+	if (!airtime)
+		return U32_MAX;
+
+	return test_frame_len * 100 * tu_to_airtime_unit / (airtime * tu_to_us);
+}
+
 /**
  * batadv_v_elp_get_throughput() - get the throughput towards a neighbour
  * @neigh: the neighbour for which the throughput has to be obtained
@@ -69,7 +89,7 @@ static u32 batadv_v_elp_get_throughput(struct batadv_hardif_neigh_node *neigh)
 	struct ethtool_link_ksettings link_settings;
 	struct net_device *real_netdev;
 	struct station_info sinfo;
-	u32 throughput;
+	u32 throughput, airtime;
 	int ret;
 
 	/* if the user specified a customised value for this interface, then
@@ -109,6 +129,11 @@ static u32 batadv_v_elp_get_throughput(struct batadv_hardif_neigh_node *neigh)
 		if (ret)
 			goto default_throughput;
 
+		if (sinfo.filled & BIT(NL80211_STA_INFO_AIRTIME_LINK_METRIC)) {
+			airtime = sinfo.airtime_link_metric;
+			return batadv_v_elp_get_throughput_from_11s(airtime);
+		}
+
 		if (sinfo.filled & BIT(NL80211_STA_INFO_EXPECTED_THROUGHPUT))
 			return sinfo.expected_throughput / 100;
 
-- 
2.47.1
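The arithmetic of the conversion in the patch can be checked in isolation. What follows is a minimal userspace sketch, not the submitted kernel code: the function name `throughput_from_airtime` is hypothetical, and it uses 64-bit intermediates, which the patch does not. In the patch the product `airtime * tu_to_us` is a 32-bit multiplication, so as far as can be told from the code shown it would wrap for large metric values, and for `airtime = 4194304` (2^22) it would wrap to exactly zero and become the divisor.

```c
#include <stdint.h>

/*
 * Hypothetical userspace restatement of the airtime-to-throughput
 * conversion from the patch. The constants are copied from the patch;
 * the 64-bit intermediates are an addition of this sketch so that
 * airtime * 1024 cannot wrap around.
 */
static uint32_t throughput_from_airtime(uint32_t airtime)
{
	const uint64_t tu_to_airtime_unit = 100;
	const uint64_t test_frame_len = 8192;
	const uint64_t tu_to_us = 1024;

	/* same special case as in the patch: a zero metric maps to the maximum */
	if (!airtime)
		return UINT32_MAX;

	/* the numerator is 81,920,000, so the quotient always fits in 32 bits */
	return (uint32_t)(test_frame_len * 100 * tu_to_airtime_unit /
			  ((uint64_t)airtime * tu_to_us));
}
```

With these constants the expression reduces to 80000 / airtime. For comparison, sending 8192 bits in t microseconds corresponds to 81920 / t in units of 100 kbit/s. The two only agree (to within about 2.4%) if the metric value is close to a duration in microseconds. This is an observation about the arithmetic alone and may or may not be related to the extra "* 100" mentioned in the RFC notes.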


Andrew Strohman

18 Jan

2 a.m.

Recently, I've been evaluating routing with multiple-link (multi-radio) in my network with batman v.

I found that adding a 2.4ghz radio to my bat interface has caused instability and poor performance compared to just running batman with the 5ghz radio only.

My board uses mt7915e and mt7622-wmac drivers for 5 and 2.4ghz, respectively.

My first problem is that mt7915e doesn't use minstrel, but mt7622-wmac does. I've noticed that sta_get_expected_throughput() returns a higher rate than (tx rate / 3). Since this is not an apples-to-apples comparison, there is a bias towards using the 2.4ghz next hops. So, I think it would be good if we made an effort to use a consistent method to determine bandwidth across all radios, even if some radios have better methods of doing so than others. In my example, it might be better to use (tx rate / 3) for mt7622-wmac even though that driver/minstrel supports getting the expected throughput.

I think that this change probably does a better job for the alternative to expected bandwidth. That is to say, I think that using (tx rate + considering fail avg) is better than just dividing the tx rate by 3. But this proposed method still results in my expected bandwidth being derived differently based on the radio.

HWMP doesn't need to consider this, because it only supports one radio. Regardless of which method is used to calculate metric, it will be used consistently for all possible next hops, because there is only one wireless driver involved.

The second problem I have seems to be that sta_get_expected_throughput() returns a bandwidth which is an over-estimate. For example, it estimates 150Mb/s. But really, I'm only getting 25Mb/s or less on the link. I *think* the expected bandwidth delivered by minstrel is not considering the fact that the radio cannot tx as often as it would like due to contention. The return value seems to reflect the fact that we tx to the sta at a high rate, but doesn't reflect the fact that it's hard to get an opportunity to tx. I'm not 100% sure about this yet.

Again, this is not something that is as important to HWMP, because there is only one radio, on one frequency. As such, the contention will be somewhat uniform across the stas. But in my multi-radio case (and probably many others), 2.4GHz is way more crowded than 5Ghz. So it would be good to somehow account for this when we are choosing the best next hop among multiple radios.

Andy
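The bias described in this message can be illustrated with a small model of the two existing sources: expected throughput if the driver exports it, otherwise the tx rate divided by 3. This is only a sketch under stated assumptions. `struct link_sample` and `estimate_throughput()` are hypothetical stand-ins, not batman-adv code, and the units (kbit/s for the expected throughput, 100 kbit/s for the tx rate) are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Which source produced an estimate; the names are illustrative only. */
enum tput_source {
	TPUT_SRC_EXPECTED,	/* expected throughput from rate control */
	TPUT_SRC_TXRATE_DIV3,	/* raw tx rate divided by 3 */
};

/* Hypothetical, simplified stand-in for the per-station information. */
struct link_sample {
	bool has_expected;		/* driver exports an expected throughput */
	uint32_t expected_kbps;		/* assumed unit: kbit/s */
	uint32_t txrate_100kbps;	/* assumed unit: 100 kbit/s */
};

/* Returns an estimate in units of 100 kbit/s and reports the source used. */
static uint32_t estimate_throughput(const struct link_sample *s,
				    enum tput_source *src)
{
	if (s->has_expected) {
		*src = TPUT_SRC_EXPECTED;
		return s->expected_kbps / 100;
	}

	*src = TPUT_SRC_TXRATE_DIV3;
	return s->txrate_100kbps / 3;
}
```

With an expected throughput of 150 Mb/s on one radio (the figure quoted above) and a made-up tx rate of 300 Mb/s on a radio without expected throughput, the sketch yields 1500 and 1000 respectively. The two numbers come from different estimators, so comparing them directly favours the first radio.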

Marek Lindner

4:59 a.m.

On Saturday, 18 January 2025 03:00:07 CET Andrew Strohman wrote:

...

> The second problem I have seems to be that sta_get_expected_throughput() returns a bandwidth which is an over-estimate. For example, it estimates 150Mb/s. But really, I'm only getting 25Mb/s or less on the link. I *think* the expected bandwidth delivered by minstrel is not considering the fact that the radio cannot tx as often as it would like due to contention. The return value seems to reflect the fact that we tx to the sta at a high rate, but doesn't reflect the fact that it's hard to get an opportunity to tx.

It is important to point out that batman-adv is not trying to get an 'accurate' knowledge of the throughput. The throughput metric is an estimate and the important aspect is that the method of estimating the throughput is consistent across all radios on the same AP. This is necessary to make the estimated throughput values comparable. At the end of the day, the routing algorithm has to make an informed decision about which route is better, not getting the most accurate throughput measurement.

Please also keep in mind that the accuracy of any 'measured' throughput value over WiFi is temporary (in real world setups). If you measured 5 minutes later you might get a different throughput value due to interference, traffic from other mesh participants, the weather, etc.

FYI, expected throughput and also 802.11 throughput estimation are taking congestion into account. If you believe this isn't sufficient to get an accurate read of the situation, can you please expand on your findings? Note that the data rate fallback (tx rate / 3) is the exception to this rule.

...

> HWMP doesn't need to consider this, because it only supports one radio.

Where do you see the difference to expected throughput? Expected throughput and data rate are also per radio and neighbor.

...

> I found that adding a 2.4ghz radio to my bat interface has caused instability and poor performance compared to just running batman with the 5ghz radio only.

With the batman-adv throughput metric, the 5GHz radio should be preferred due to the higher throughput of the radio. Can you please share details about your setup and highlight why you believe 2.4GHz is chosen over 5GHz?

Cheers, Marek

Andrew Strohman

19 Jan

3:20 a.m.

Hi Marek,

...

> the important aspect is that the method of estimating the throughput is consistent across all radios on the same AP. This is necessary to make the estimated throughput values comparable.

Yes, I agree, and that is what my point is. The current implementation and what is being proposed here prefer to use sta_get_expected_throughput(), if available, and then fall back to examining the tx rate more directly. While both of these methods attempt to estimate throughput, one method may reliably result in over estimation while another method may reliably result in underestimation.

In my case, my 2.4ghz radio driver uses minstrel for rate control, so throughput estimates are derived using sta_get_expected_throughput(). For me, this estimation is chronically an over-estimate. The 5ghz radio does rate control in hardware, so we cannot use the sta_get_expected_throughput() method for it. As such, we fall back to using the less preferred method of determination. Currently, that means tx rate / 3 (which is an under-estimate).

This results in my network preferring 2.4ghz paths when it should be preferring 5ghz paths. The problem is that the throughput calculation method is not consistent across radios.

I know that both these methods of throughput estimation are trying to estimate the same thing, but they are implemented differently. Their implementation details can result in a bias to over or under estimation.

I'm suggesting that we make an effort to make the throughput calculation method consistent across radios. More specifically, if one radio doesn't support sta_get_expected_throughput(), then we shouldn't use that method for any radio -- all radios should use the same fallback mechanism.

Does this make sense?

The more consistent the outcomes of the methods of throughput estimation are, the less problematic what I'm describing becomes.

After this patch, the throughput estimation for 5ghz stas/neighbors in my network will be derived by examining an exponentially weighted average of tx rate with consideration of tx failures. If this new fallback method produces results more similar to sta_get_expected_throughput(), then my problem will be lessened, possibly to the point of my network preferring 5ghz (as should be done).

But as long as we keep an implementation where we have different throughput calculation methods for different radios, we will remain susceptible to what I'm describing.

...

> FYI, expected throughput and also 802.11 throughput estimation are taking congestion into account. If you believe this isn't sufficient to get an accurate read of the situation, can you please expand on your findings?

OK, thanks. If you're confident that sta_get_expected_throughput() returns a result that reflects the recent or likely external contention on the operating frequency, that's good to know. I was worried that my overestimated result was a reflection of how fast we could tx towards a client once the opportunity presented itself. But given your remark here, it sounds like the answer to this is "no" -- the throughput estimate should reflect external congestion, such as tx from other BSS's on the same frequency.

Like I noted in my original message, I was seeing the estimated throughput as 150Mb/s for the sta_get_expected_throughput() method, while really only being able to tx at ~25Mb/s. This problem might be specific to my driver somehow, despite the fact that it uses minstrel for tx. I'll look into this more closely and report back what I find. I'll try out other chipsets (i.e. QCA) to see how they behave.

So in summary, I see one problem that results from different radios on the same router using different throughput determination mechanisms. This problem may get better with this change, but the underlying issue of using different methods per radio remains. In my case, I also found that sta_get_expected_throughput() delivers over-estimates. In my original message, I was considering that this could potentially be due to the fact that sta_get_expected_throughput() was not considering external congestion. But given your feedback, I'll now be debugging under the assumption that something else causes overestimation in my case.

Thanks,

Andy
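One possible reading of the suggestion in this message (use a method for any radio only if every radio supports it) can be sketched as a capability intersection. The flag names and the function `common_source()` are hypothetical and do not exist in batman-adv. The sketch only shows the selection logic, not how the capabilities would be discovered.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability flags, one bit per estimation method. */
enum {
	SRC_EXPECTED	= 1u << 0,	/* expected throughput available */
	SRC_TXRATE_DIV3	= 1u << 1,	/* tx rate / 3 fallback available */
};

/*
 * Return the most preferred method that every radio supports,
 * or 0 if there are no radios or no method is common to all of them.
 */
static unsigned int common_source(const unsigned int *radio_caps, size_t n)
{
	unsigned int common = ~0u;
	size_t i;

	if (!n)
		return 0;

	/* keep only the methods supported by all radios */
	for (i = 0; i < n; i++)
		common &= radio_caps[i];

	if (common & SRC_EXPECTED)
		return SRC_EXPECTED;
	if (common & SRC_TXRATE_DIV3)
		return SRC_TXRATE_DIV3;

	return 0;
}
```

For a node with one radio that offers both methods and one that only offers the tx rate fallback, the sketch selects the fallback for both radios, which is the behaviour asked for in the message above.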

Marek Lindner

3:48 a.m.

On Sunday, 19 January 2025 04:20:46 CET Andrew Strohman wrote:

...

> In my case, my 2.4ghz radio driver uses minstrel for rate control, so throughput estimates are derived using sta_get_expected_throughput(). For me, this estimation is chronically an over-estimate. The 5ghz radio does rate control in hardware, so we cannot use the sta_get_expected_throughput() method for it.
>
> [..]

...

> I'm suggesting that we make an effort to make the throughput calculation method consistent across radios.

That's certainly an interesting observation but seems irrelevant to the patch proposal you are responding to. Feel free to propose a code change that aims to unify the chosen metric source across all radios on the same AP. With the current implementation, this is left to the administrator.

...

> After this patch, the throughput estimation for 5ghz stas/neighbors in my network will be derived by examining an exponentially weighted average of tx rate with consideration of tx failures.

After this patch, the 11s throughput estimation is available as a metric source. That's all. The patch does not even attempt to address your concern.

...

> If this new fallback method produces results more similar to sta_get_expected_throughput(), then my problem will be lessened, possibly to the point of my network preferring 5ghz (as should be done).

Even if the 11s metric source accidentally provides a similar metric in your test setup, there is no guarantee it always will. Again, you are conflating your desired outcome with a random patch which isn't trying to do what you want it to do.

...

> OK, thanks. If you're confident that sta_get_expected_throughput() returns a result that reflects the recent or likely external contention on the operating frequency, that's good to know.

Feel free to read up on how minstrel arrives at the expected throughput.

...

> Like I noted in my original message, I was seeing the estimated throughput as 150Mb/s for the sta_get_expected_throughput() method, while really only being able to tx at ~25Mb/s.

Am I right in assuming this '~25Mb/s' was measured using iperf or some other speed test? The numbers minstrel provides are in a completely different ballpark and cannot be compared to WiFi throughput numbers. You are also not taking into account what I have already explained about why getting 'accurate' throughput numbers is meaningless.

...

> I'll now be debugging under the assumption that something else causes overestimation in my case.

You are still stuck on over / under estimation. In this email alone you are mentioning it 6 times. Whether there is over or under estimation is irrelevant. Consistency is relevant.

Cheers, Marek

Linus Lüssing

4:28 a.m.

Hi Andrew,

Thanks for your feedback!

...

> Currently, that means tx rate / 3 (which is an under-estimate).

If I recall correctly, it was intentional that the fallback typically under-estimates. Generally speaking, it is better to under-estimate than to over-estimate in a fallback mechanism which uses a worse approach. The tx rate / 3 fallback is more pessimistic by design.

...

> This results in my network preferring 2.4ghz paths when it should be preferring 5ghz paths.

Makes sense from the original design idea. The 5ghz radio does not provide us with an accurate expected throughput from its locked-up, hidden rate control, so we are better safe than sorry here and under-estimate it.

But shouldn't this also mean that this patch has a high chance of fixing the issue in your setup? With this patch you should get a higher, more "realistic"/comparable estimate for your 5ghz radio?

...

> The problem is that the throughput calculation method is not consistent across radios.

Full ACK.

...

> I know that both these methods of throughput estimation are trying to estimate the same thing, but they are implemented differently.

ACK.

...

> Their implementation details can result in a bias to over or under estimation.
>
> I'm suggesting that we make an effort to make the throughput calculation method consistent across radios. More specifically, if one radio doesn't support sta_get_expected_throughput(), then we shouldn't use that method for any radio -- all radios should use the same fallback mechanism.

This one I'm not sure of... different radios can still use different rate control algorithms. One radio might prefer to use higher WLAN bitrates and tolerate more loss, while another radio might be more cautious and generally use lower WLAN bitrates, to maybe achieve less loss.

And I'm also wondering if that would result in the wrong overall incentives. Should vendors who give us more useful information really be punished for that, by us falling back to the method used with the worst, most locked-up vendor?

...

> [...] In my case, I also found that sta_get_expected_throughput() delivers over-estimates.

Or the other one under-estimates ;-). Another thing to keep in mind: I think an expected throughput measurement would be closer to a UDP than a TCP test. I guess your measurements were with TCP? On WiFi, UDP and TCP throughput can differ quite a bit, at least in my experience.

Regards, Linus

Linus Lüssing

5:05 a.m.

On Sat, Jan 18, 2025 at 05:59:56AM +0100, Marek Lindner wrote:

...

> FYI, expected throughput and also 802.11 throughput estimation are taking congestion into account.

Are they? At least in minstrel_ht_get_tp_avg() I don't see it: https://elixir.bootlin.com/linux/v6.12.6/source/net/mac80211/rc80211_minstre...

And minstrel_ht_get_expected_throughput() uses minstrel_ht_get_tp_avg(): https://elixir.bootlin.com/linux/v6.12.6/source/net/mac80211/rc80211_minstre...

Seems to me like it uses the transmission duration of the chosen WLAN bitrate and multiplies it with the average transmission success probability on this rate. And then it also factors in aggregation and cuts off the thing between 10%-90% of chosen rate.

(Also, for a rate control algorithm I think factoring in congestion would only make sense if the RC algo were also factoring in the size of the packet to transmit? That is, smaller packets have a higher tolerance to channel congestion. But in the debugfs rc_stats table I don't see any column per packet size (ranges) either. I think Minstrel assumes, for the sake of simplicity, that congestion does not make a difference for which rate to choose?)
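The description in this message can be sketched roughly as follows. This is not the minstrel_ht code: the function name, the parameters and their units are hypothetical, and treating a success probability below 10% as zero throughput is an assumption about what "cuts off" means at the lower end. The point of the sketch is that none of its inputs describe how busy the channel is.

```c
#include <stdint.h>

/*
 * Hypothetical sketch of a rate-control throughput estimate:
 *   prob_permille: average delivery success probability, 0..1000
 *   frame_ns:      air time of one frame at the chosen rate, in ns
 *   overhead_ns:   per-transmission overhead, in ns
 *   ampdu_len:     average number of frames per aggregate
 * Returns the expected number of delivered frames per second.
 */
static uint32_t sketch_tp_avg(uint32_t prob_permille, uint32_t frame_ns,
			      uint32_t overhead_ns, uint32_t ampdu_len)
{
	uint64_t per_frame_ns;

	/* assumption: rates that almost never succeed count as unusable */
	if (prob_permille < 100)
		return 0;

	/* cap the success probability at 90% */
	if (prob_permille > 900)
		prob_permille = 900;

	if (!ampdu_len)
		ampdu_len = 1;

	/* aggregation spreads the overhead over the frames of one aggregate */
	per_frame_ns = (uint64_t)overhead_ns / ampdu_len + frame_ns;
	if (!per_frame_ns)
		return 0;

	return (uint32_t)(1000000000ULL * prob_permille / 1000 / per_frame_ns);
}
```

Congestion could only enter such an estimate indirectly, through a lower measured success probability.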

Linus Lüssing

5:15 a.m.

On Sun, Jan 19, 2025 at 06:05:45AM +0100, Linus Lüssing wrote:

...

> On Sat, Jan 18, 2025 at 05:59:56AM +0100, Marek Lindner wrote:
>
> ...
>> FYI, expected throughput and also 802.11 throughput estimation are taking congestion into account.
>
> Are they? At least in minstrel_ht_get_tp_avg() I don't see it:

On the other hand, if the channel were fully utilized then this should likely, indirectly reduce the average transmission success probability a bit. So in that case I guess congestion / channel utilization could indirectly be factored in.

But still, if a channel is 90% utilized / has 90% airtime usage, then this wouldn't mean that the expected throughput from Minstrel will be about 90% lower compared to a fully free channel, I guess?
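To make the question concrete: under a deliberately simplistic first-order model, an estimate that ignores channel utilization would have to be scaled by the free share of airtime to account for it. The helper below is a hypothetical illustration of that model only. Nothing like it is claimed to exist in Minstrel, mac80211 or batman-adv, and real contention behaviour is more complicated than a linear scale.

```c
#include <stdint.h>

/*
 * Hypothetical helper: scale a throughput estimate by the share of
 * airtime that is not already in use by others.
 *   est:          estimate that ignores channel utilization
 *   busy_percent: channel utilization, 0..100
 */
static uint32_t scale_by_free_airtime(uint32_t est, uint32_t busy_percent)
{
	if (busy_percent >= 100)
		return 0;

	/* 64-bit intermediate so the multiplication cannot overflow */
	return (uint32_t)((uint64_t)est * (100 - busy_percent) / 100);
}
```

In this model an estimate of 1500 (in any unit) on a channel that is 90% busy would shrink to 150, whereas an estimate that only sees congestion through a slightly lower success probability would change far less.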

Marek Lindner

18 Jan

5:08 a.m.

On Saturday, 18 January 2025 01:35:27 CET Linus Lüssing wrote:

...

> When no expected throughput is available then HWMP keeps track of the average packet delivery error rate and average phy rate to calculate its own expected throughput value.

Is this also the case when 11s mesh forwarding is disabled?

...

> So the 11s airtime link metric should be a slightly better estimate than the expected throughput provided by Minstrel. And should be significantly better than our raw PHY rate divided by 3 guesstimate fallback.

Have you tested the airtime metric in real world setups, or what leads you to conclude that the 11s airtime link metric is better than expected throughput?

Generally speaking, I like the idea of adding another link metric source.

...

> +static u32 batadv_v_elp_get_throughput_from_11s(u32 airtime)
> +{
> +	const int tu_to_airtime_unit = 100;
> +	const int test_frame_len = 8192;
> +	const int tu_to_us = 1024;
>
> +	return test_frame_len * 100 * tu_to_airtime_unit / (airtime * tu_to_us);

Are these values constant across all platforms and drivers?

Maybe there should be a function call to an 11s function doing the conversion and handling all cases (instead of doing this in the batman-adv code)?

...

>  	struct station_info sinfo;
> -	u32 throughput;
> +	u32 throughput, airtime;

The Reverse Christmas Tree style should be accounted for.

Cheers, Marek


b.a.t.m.a.n@lists.open-mesh.org
