This patch is not a fix for the issue itself, but a fix for the other nodes, which are also influenced by the issue.
- Some (affected) nodes send TT announcements for an empty vlan (for now only seen with vlan_id = 0). - This behaviour is a bug! Batman-adv nodes must not send TT announcements for empty vlans. - The receiving batman-adv can not handle incoming TT announcements with empty vlans. (The crc check routine batadv_tt_global_check_crc() fails.) - As the receiving node thinks, the crc is broken, it does a full table request then. - The announcing node sends a TT full table to the receiving node then, which also contains the empty vlan, so *batadv_tt_global_check_crc()* fails again on the receiver side. - This causes a lot of unneeded TT traffic. In Freifunk Hannover we decreased the amount of from ~20kpp/s to ~4kpp/s on our supernodes. We have ~800 nodes, which are connected via vpn to one of six supernodes. Those supernodes are connected with each other in a fully connected mesh.
Signed-off-by: Leonardo Mörlein me@irrelefant.net --- net/batman-adv/translation-table.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index 0225616d..d2a843fd 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -2996,8 +2996,21 @@ static bool batadv_tt_global_check_crc(struct batadv_orig_node *orig_node,
vlan = batadv_orig_node_vlan_get(orig_node, ntohs(tt_vlan_tmp->vid)); - if (!vlan) - return false; + if (!vlan) { + /* Due to a bug, some originators send TT + * announcements for empty vlans. As the receiving + * nodes will ignore those empty vlans (they do not + * add a batadv_orig_node_vlan into the transglobal + * for the originating node), crc check will fail + * here. To circumvent this issue, we skip the + * verification for the vlan if the crc is + * equal to 0x00000000. + */ + if (tt_vlan_tmp->crc == 0x00000000) + continue; + else + return false; + }
crc = vlan->tt.crc; batadv_orig_node_vlan_put(vlan);
On Sonntag, 6. Mai 2018 21:55:59 CEST Leonardo Mörlein wrote:
This patch is not a fix for the issue itself, but a fix for the other nodes, which are also influenced by the issue.
- Some (affected) nodes send TT announcements for an empty vlan (for now only seen with vlan_id = 0).
- This behaviour is a bug! Batman-adv nodes must not send TT announcements for empty vlans.
- The receiving batman-adv can not handle incoming TT announcements with empty vlans. (The crc check routine batadv_tt_global_check_crc() fails.)
- As the receiving node thinks, the crc is broken, it does a full table request then.
- The announcing node sends a TT full table to the receiving node then, which also contains the empty vlan, so *batadv_tt_global_check_crc()* fails again on the receiver side.
- This causes a lot of unneeded TT traffic. In Freifunk Hannover we decreased the amount of from ~20kpp/s to ~4kpp/s on our supernodes. We have ~800 nodes, which are connected via vpn to one of six supernodes. Those supernodes are connected with each other in a fully connected mesh.
Can you attach a (small) pcap of actual traffic which show a packet which triggers this behavior?
Thanks, Sven
On 05/07/2018 08:32 AM, Sven Eckelmann wrote:
Can you attach a (small) pcap of actual traffic which show a packet which triggers this behavior?
For sure. See for the originator 5e:25:f7:13:2d:eb and vid=0x8000 in the attached pcap. I selected the interested packets (one ogm, one full table request and one full table response).
Kind regards, Leonardo Mörlein
On Montag, 7. Mai 2018 10:06:20 CEST me wrote:
On 05/07/2018 08:32 AM, Sven Eckelmann wrote:
Can you attach a (small) pcap of actual traffic which show a packet which triggers this behavior?
For sure. See for the originator 5e:25:f7:13:2d:eb and vid=0x8000 in the attached pcap. I selected the interested packets (one ogm, one full table request and one full table response).
Thanks,
These are always from intermediate nodes. I was hoping to see the initial packet which "introduced" the incorrect vid=0x8000 without entry in the first place to better understand the problem which causes this.
At least I would have hoped that this might help the people at WBM to have a better discussion about it. At least to understand whether this is from an intermediate node or from the sender.
On Sonntag, 6. Mai 2018 21:55:59 CEST Leonardo Mörlein wrote: [...]
- The receiving batman-adv can not handle incoming TT announcements with empty vlans. (The crc check routine batadv_tt_global_check_crc() fails.)
This part is not 100% clear in the commit message.
Right now it just sounds like softif_vlan_list (see batadv_tt_prepare_tvlv_local_data) has an "empty" vlan entry (an entry without mac addresses). Which brings me to the point that batadv_softif_create_vlan initializes new vlan's with the CRC 0x0 before it adds it to the list (see tt/ batadv_vlan_tt in batadv_softif_vlan). We know now that the function batadv_tt_local_update_crc is responsible for updating the crc (holding the tt.commit_lock) on the sender site (for non-intermediate nodes). And this crc is calculated over the bat_priv->tt.local_hash - which (according to the crc 0x0) doesn't have a single (non-NEW) entry for this VID. The return (new crc) is therefore 0x0 and it will be submitted this way.
The same should be true for the receiver. The batadv_tt_global_update_crc goes over the list of entries for this vid. And we just received a full table which only contains non-vlan entries and no vlan 0 entries. My assumption would therefore be that the crc on the receiver should also be 0x0 when batadv_tt_global_update_crc was done. But if it is not then it would mean that there was still some entry on the receiver which batadv_tt_global_crc used to calculate the crc for this vid or the crc was just not calculated for it (because it doesn't exist).
The latter would match the content of the full table reply in your pcap, the comment in the patch and the info info I got from Antonio: The vid entry will only be created in the originator vlan_list when an entry was received (I haven't checked this). This would sound to me like the vlan should just be created instead of adding this workaround. But maybe you should discuss the details here with Antonio because at the end it is just about the different ways of handling "!vlan" case from your patch.
Btw. Antonio, wasn't the VLAN 0 always created on each device in the local table by the VLAN driver? See the message
[ 29.866038] 8021q: adding VLAN 0 to HW filter on device bat0
when batman-adv attaches itself to the vlan monitoring. But usually, there is at least one entry in it (the bat0 mac address). Not sure why this isn't the case here. At least batadv_interface_add_vid should add it at the end. Maybe there was an error before the entry was created? Would be interesting to know where exactly.
Kind regards, Sven
n Montag, 7. Mai 2018 12:23:57 CEST Sven Eckelmann wrote:
This would sound to me like the vlan should just be created instead of adding this workaround. But maybe you should discuss the details here with Antonio because at the end it is just about the different ways of handling "!vlan" case from your patch.
Just had a look at the cleanup code for vlans and it seems like each vlan in vlan_list must contain at least one entry (according to vlan->tt.num_entries) and otherwise batadv_tt_global_size_mod will drop it. So it isn't as easy as just adding a simple vlan for each received vlan entry via TT response.
So your patch (or a variant of it) will most likely be the way to go for receivers.
Now to your commit message:
On Sonntag, 6. Mai 2018 21:55:59 CEST Leonardo Mörlein wrote:
batman-adv: mitigate issue when empty vlan is received
Something more specific would be nice. Maybe somebody else has a better idea but what about:
batman-adv: Mitigate TT loop when receiving empty VLAN
This patch is not a fix for the issue itself, but a fix for the other nodes, which are also influenced by the issue.
I personally don't like to read "this patch" in commit messages. This part should maybe be reformulated and put at the end (before the Fixes line). Something like:
While the problem is caused by the sender, the receiver can mitigate the problem by ignoring (potentially) empty VLANs during the CRC check.
[...]
- This behaviour is a bug! Batman-adv nodes must not send TT announcements for empty vlans.
Where is this defined as incorrect message? But yes, the current implementation cannot handle it.
- The receiving batman-adv can not handle incoming TT announcements with empty vlans. (The crc check routine batadv_tt_global_check_crc() fails.)
This is to vague as discussed earlier. Maybe something more like:
- The receiving batman-adv can not handle incoming TT announcements with empty vlans. No new struct batadv_orig_node_vlan entry will be added to batadv_orig_node->vlan_list. This will cause the crc check routine batadv_tt_global_check_crc() to fail because it cannot find the local representation of this empty vlan.
[...]
- This causes a lot of unneeded TT traffic. In Freifunk Hannover we decreased the amount of from ~20kpp/s to ~4kpp/s on our supernodes. We have ~800 nodes, which are connected via vpn to one of six supernodes. Those supernodes are connected with each other in a fully connected mesh.
This part doesn't seem to fit in the list and might just be a new paragraph at the end to summarize the impact of the problem.
A "Fixes: " line is missing here. Please check whether this sounds about right:
Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific")
Kind regards, Sven
On Monday, May 7, 2018 3:55:59 AM HKT Leonardo Mörlein wrote:
if (!vlan)
return false;
if (!vlan) {
/* Due to a bug, some originators send TT
* announcements for empty vlans. As the receiving
* nodes will ignore those empty vlans (they do not
* add a batadv_orig_node_vlan into the transglobal
* for the originating node), crc check will fail
* here. To circumvent this issue, we skip the
* verification for the vlan if the crc is
* equal to 0x00000000.
*/
if (tt_vlan_tmp->crc == 0x00000000)
continue;
else
return false;
}
There are some issues with this approach:
* As you might be aware, a CRC of 0x00000000 is not an invalid or uninitialized CRC. That means it is conceivable the CRC over the translation table entries and flags results in a CRC of 0x00000000 without any bugs being present. Applying this patch would lead to introducing a bug in these conditions.
* The second concern is about how to deal with 'the bug' (referring to your patch comment above). In your case 'the bug' results in missing TT entries for a given VLAN plus a CRC of 0x00000000. What if the CRC was 0x00000001 due to some other bug ? Then you're check would totally miss the mark and we're back at the beginning. If we wish to deal with such misbehavior on a scale where we assume the code should be able to deal with garbage values in some or all fields a different approach is needed. Relying on a magic number (CRC 0x00000000) is as bad as the current behavior (assuming each VLAN has an entry).
Shortly, I will send a patch to prevent sending an empty VLAN around. Please note that this patch still is more of a band-aid. We still don't know how the empty VLAN came into being in the first place. Plus, that patch won't protect against garbage packets which certainly requires more effort.
Cheers, Marek
A translation table TVLV changset sent with an OGM consists of a number of headers (one per VLAN) plus the changeset itself (addition and/or deletion of entries).
The per-VLAN headers are used by OGM recipients for consistency checks. Said consistency check might determine that a full translation table request is needed to restore consistency. If the TT sender adds per-VLAN headers of empty VLANs into the OGM, recipients are led to believe to have reached an inconsistent state and thus request a full table update. The full table does not contain empty VLANs (due to missing entries) the cycle restarts when the next OGM is issued.
Consequently, when the translation table TVLV headers are composed, empty VLANs are to be excluded.
Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Marek Lindner mareklindner@neomailbox.ch --- net/batman-adv/translation-table.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index 0225616d..def8d109 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -855,7 +855,8 @@ batadv_tt_prepare_tvlv_global_data(struct batadv_orig_node *orig_node, s32 *tt_len) { u16 num_vlan = 0; - u16 num_entries = 0; + u16 vlan_entries = 0; + u16 total_entries = 0; u16 change_offset; u16 tvlv_len; struct batadv_tvlv_tt_vlan_data *tt_vlan; @@ -864,8 +865,12 @@ batadv_tt_prepare_tvlv_global_data(struct batadv_orig_node *orig_node,
rcu_read_lock(); hlist_for_each_entry_rcu(vlan, &orig_node->vlan_list, list) { + vlan_entries = atomic_read(&vlan->tt.num_entries); + if (vlan_entries < 1) + continue; + num_vlan++; - num_entries += atomic_read(&vlan->tt.num_entries); + total_entries += vlan_entries; }
change_offset = sizeof(**tt_data); @@ -873,7 +878,7 @@ batadv_tt_prepare_tvlv_global_data(struct batadv_orig_node *orig_node,
/* if tt_len is negative, allocate the space needed by the full table */ if (*tt_len < 0) - *tt_len = batadv_tt_len(num_entries); + *tt_len = batadv_tt_len(total_entries);
tvlv_len = *tt_len; tvlv_len += change_offset; @@ -890,6 +895,10 @@ batadv_tt_prepare_tvlv_global_data(struct batadv_orig_node *orig_node,
tt_vlan = (struct batadv_tvlv_tt_vlan_data *)(*tt_data + 1); hlist_for_each_entry_rcu(vlan, &orig_node->vlan_list, list) { + vlan_entries = atomic_read(&vlan->tt.num_entries); + if (vlan_entries < 1) + continue; + tt_vlan->vid = htons(vlan->vid); tt_vlan->crc = htonl(vlan->tt.crc);
On Mittwoch, 9. Mai 2018 18:06:05 CEST Marek Lindner wrote: [...]
Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific")
I know, my fault but please use 21a57f6e7a3b instead of 7ea7b4a14275. The latter is in the official Linux repository and not batman-adv.git.
[...]
diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index 0225616d..def8d109 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -855,7 +855,8 @@ batadv_tt_prepare_tvlv_global_data(struct batadv_orig_node *orig_node,
What about batadv_tt_prepare_tvlv_local_data?
Kind regards, Sven
On Mittwoch, 9. Mai 2018 19:21:32 CEST Sven Eckelmann wrote: [...]
[...]
diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index 0225616d..def8d109 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -855,7 +855,8 @@ batadv_tt_prepare_tvlv_global_data(struct batadv_orig_node *orig_node,
What about batadv_tt_prepare_tvlv_local_data?
Let me rephrase that: Are we sure that this is from intermediate nodes or is there still the possibility that it is from the actual originator? Regarding the latter: Should batadv_tt_prepare_tvlv_local_data be modified or will you wait for Leonardo to debug why it happens in the first place?
Kind regards, Sven
On Mittwoch, 9. Mai 2018 20:20:58 CEST Sven Eckelmann wrote:
On Mittwoch, 9. Mai 2018 19:21:32 CEST Sven Eckelmann wrote: [...]
[...]
diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index 0225616d..def8d109 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -855,7 +855,8 @@ batadv_tt_prepare_tvlv_global_data(struct batadv_orig_node *orig_node,
What about batadv_tt_prepare_tvlv_local_data?
Let me rephrase that: Are we sure that this is from intermediate nodes or is there still the possibility that it is from the actual originator? Regarding the latter: Should batadv_tt_prepare_tvlv_local_data be modified or will you wait for Leonardo to debug why it happens in the first place?
Wait. When it is a problem of an intermediate node that doesn't have (for whatever reason) not a valid copy of the entries, shouldn't the intermediate node just ignore the request and not answer it? Otherwise the next OGM with the data from batadv_tt_prepare_tvlv_local_data from the actual node will just trigger the next request by the receiver.
And the next thing is that your commit message is talking about OGMs all the time. But the OGM TVLV are not filled by the batadv_tt_prepare_tvlv_global_data but by the batadv_tt_prepare_tvlv_local_data. I am unable to find out what you actually doing here and why you do it.
Kind regards, Sven
On 05/09/2018 10:38 PM, Sven Eckelmann wrote:
Wait. When it is a problem of an intermediate node that doesn't have (for whatever reason) not a valid copy of the entries, shouldn't the intermediate node just ignore the request and not answer it?
Yes. It's forwarded here
https://elixir.bootlin.com/linux/v4.16.8/source/net/batman-adv/translation-t...
if the check here
https://elixir.bootlin.com/linux/v4.16.8/source/net/batman-adv/translation-t...
fails. I also wrote a debug patch which adds counters which differentiate between crc misses, ttvn misses and other misses. I can also prepare them, when it is desired.
And the next thing is that your commit message is talking about OGMs all the time. But the OGM TVLV are not filled by the batadv_tt_prepare_tvlv_global_data but by the batadv_tt_prepare_tvlv_local_data. I am unable to find out what you actually doing here and why you do it.
Yes. IIRC the checksums in the OGMs are actually causing the issue. The checksums in the TT responses are not evaluated for now.
Kind regards, Leonardo Mörlein
On 10/05/18 04:38, Sven Eckelmann wrote:
Let me rephrase that: Are we sure that this is from intermediate nodes or is there still the possibility that it is from the actual originator? Regarding the latter: Should batadv_tt_prepare_tvlv_local_data be modified or will you wait for Leonardo to debug why it happens in the first place?
Wait. When it is a problem of an intermediate node that doesn't have (for whatever reason) not a valid copy of the entries, shouldn't the intermediate node just ignore the request and not answer it? Otherwise the next OGM with the data from batadv_tt_prepare_tvlv_local_data from the actual node will just trigger the next request by the receiver.
And the next thing is that your commit message is talking about OGMs all the time. But the OGM TVLV are not filled by the batadv_tt_prepare_tvlv_global_data but by the batadv_tt_prepare_tvlv_local_data. I am unable to find out what you actually doing here and why you do it.
I believe batadv_tt_prepare_tvlv_local_data() was the actual function that Marek wanted to modify. But we probably spent too much time digging into translation-table.c these days and I managed to confuse him enough.
Thanks for pointing this out, Sven!
Cheers,
On Thursday, May 10, 2018 2:20:58 AM HKT Sven Eckelmann wrote:
On Mittwoch, 9. Mai 2018 19:21:32 CEST Sven Eckelmann wrote: [...]
[...]
diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index 0225616d..def8d109 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -855,7 +855,8 @@ batadv_tt_prepare_tvlv_global_data(struct batadv_orig_node *orig_node,>
What about batadv_tt_prepare_tvlv_local_data?
Let me rephrase that: Are we sure that this is from intermediate nodes or is there still the possibility that it is from the actual originator? Regarding the latter: Should batadv_tt_prepare_tvlv_local_data be modified or will you wait for Leonardo to debug why it happens in the first place?
We believe to have an idea why this happens in the first place and I have another patch in the works which Leonardo should test in his setup.
Cheers, Marek
A translation table TVLV changset sent with an OGM consists of a number of headers (one per VLAN) plus the changeset itself (addition and/or deletion of entries).
The per-VLAN headers are used by OGM recipients for consistency checks. Said consistency check might determine that a full translation table request is needed to restore consistency. If the TT sender adds per-VLAN headers of empty VLANs into the OGM, recipients are led to believe to have reached an inconsistent state and thus request a full table update. The full table does not contain empty VLANs (due to missing entries) the cycle restarts when the next OGM is issued.
Consequently, when the translation table TVLV headers are composed, empty VLANs are to be excluded.
Fixes: 21a57f6e7a3b ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Marek Lindner mareklindner@neomailbox.ch ---
v2: * moved changes into batadv_tt_prepare_tvlv_local_data() * updated 'fixes' commit id
net/batman-adv/translation-table.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index 0225616d..b51c034a 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -931,15 +931,20 @@ batadv_tt_prepare_tvlv_local_data(struct batadv_priv *bat_priv, struct batadv_tvlv_tt_vlan_data *tt_vlan; struct batadv_softif_vlan *vlan; u16 num_vlan = 0; - u16 num_entries = 0; + u16 vlan_entries = 0; + u16 total_entries = 0; u16 tvlv_len; u8 *tt_change_ptr; int change_offset;
rcu_read_lock(); hlist_for_each_entry_rcu(vlan, &bat_priv->softif_vlan_list, list) { + vlan_entries = atomic_read(&vlan->tt.num_entries); + if (vlan_entries < 1) + continue; + num_vlan++; - num_entries += atomic_read(&vlan->tt.num_entries); + total_entries += vlan_entries; }
change_offset = sizeof(**tt_data); @@ -947,7 +952,7 @@ batadv_tt_prepare_tvlv_local_data(struct batadv_priv *bat_priv,
/* if tt_len is negative, allocate the space needed by the full table */ if (*tt_len < 0) - *tt_len = batadv_tt_len(num_entries); + *tt_len = batadv_tt_len(total_entries);
tvlv_len = *tt_len; tvlv_len += change_offset; @@ -964,6 +969,10 @@ batadv_tt_prepare_tvlv_local_data(struct batadv_priv *bat_priv,
tt_vlan = (struct batadv_tvlv_tt_vlan_data *)(*tt_data + 1); hlist_for_each_entry_rcu(vlan, &bat_priv->softif_vlan_list, list) { + vlan_entries = atomic_read(&vlan->tt.num_entries); + if (vlan_entries < 1) + continue; + tt_vlan->vid = htons(vlan->vid); tt_vlan->crc = htonl(vlan->tt.crc);
On Donnerstag, 10. Mai 2018 16:41:37 CEST Marek Lindner wrote:
rcu_read_lock(); hlist_for_each_entry_rcu(vlan, &bat_priv->softif_vlan_list, list) {
vlan_entries = atomic_read(&vlan->tt.num_entries);
if (vlan_entries < 1)
continue;
num_vlan++;
num_entries += atomic_read(&vlan->tt.num_entries);
total_entries += vlan_entries; }
Please submit it on top of the maint branch before submitting it the next time. The rcu_read_lock isn't there anymore.
The rest also applies only with some offsets (which sound like you have some other modifications in your branch).
Kind regards, Sven
On Thursday, May 10, 2018 11:50:48 PM HKT Sven Eckelmann wrote:
Please submit it on top of the maint branch before submitting it the next time. The rcu_read_lock isn't there anymore.
I did not intend this patch to go into maint (hence the missing maint subject prefix) as I considered that to be more of a sanity check. The "fix adding VLANs with partial state" is the real fix.
Though, I am fine with maint too.
Cheers, Marek
b.a.t.m.a.n@lists.open-mesh.org