This is the second iteration of the patches, with the following changes (thanks Antonio for review): * split out style patch no 3 * change 4addr check to use only payload (DATA) subtype in the first patch * count number of VLANs instead of comparing VLANs explicitly in the last patch
Original (slightly updated) description here:
Upon debugging a network with dangling, bogus TT entries, I found a couple of bugs for which I would like to propose fixes. The network showed the following symptoms:
* DAT was enabled * VLANs were used, but not every originator used the same VLANs * I've found global entries assigned to originators which actually never had the client in question connected to them. These false target originators didn't even had the VLANs in use or bridged for which the entry was done. This caused packets sent to be sent to these wrong originators. * The wrong entries also didn't get purged automatically, since they didn't announce the VLAN in question through their TT TLVLs, and the TT code didn't check for excess VLANs. * Furthermore, the temp flag was removed too early from the TT entries so that the purge function was not removing the entry after a timeout as well. * I've found that with DAT, the cached ARP replies may cause these TT entries to be created on behalf of the answering host. This is wrong, since the answering host (usually) doesn't actually has the client connected.
We have not yet tested these patches in the network, and the bug also appears only once in a couple of days. Therefore I'd like to ask to review these patches thoroughly, and if you agree to the fixes we will apply them in this production network.
Thanks, Simon
Simon Wunderlich (3): batman-adv: fix speedy join for DAT cache replies batman-adv: avoid keeping false temporary entry batman-adv: detect local excess vlans in TT request
net/batman-adv/routing.c | 19 +++++++++++++++---- net/batman-adv/translation-table.c | 24 +++++++++++++++++++++--- 2 files changed, 36 insertions(+), 7 deletions(-)
DAT Cache replies are answered on behalf of other clients which are not connected to the answering originator. Therefore, we shouldn't add these clients to the answering originators TT table through speed join to avoid bogus entries.
Reported-by: Alessandro Bolletta alessandro@mediaspot.net Signed-off-by: Simon Wunderlich sw@simonwunderlich.de --- Changes to PATCH: * only allow speedy join for BATADV_P_DATA to avoid adding more corner cases in the future --- net/batman-adv/routing.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/net/batman-adv/routing.c b/net/batman-adv/routing.c index c360c0c..96b5daa 100644 --- a/net/batman-adv/routing.c +++ b/net/batman-adv/routing.c @@ -836,6 +836,7 @@ int batadv_recv_unicast_packet(struct sk_buff *skb, uint8_t *orig_addr; struct batadv_orig_node *orig_node = NULL; int check, hdr_size = sizeof(*unicast_packet); + enum batadv_subtype subtype; bool is4addr;
unicast_packet = (struct batadv_unicast_packet *)skb->data; @@ -863,10 +864,20 @@ int batadv_recv_unicast_packet(struct sk_buff *skb, /* packet for me */ if (batadv_is_my_mac(bat_priv, unicast_packet->dest)) { if (is4addr) { - batadv_dat_inc_counter(bat_priv, - unicast_4addr_packet->subtype); - orig_addr = unicast_4addr_packet->src; - orig_node = batadv_orig_hash_find(bat_priv, orig_addr); + subtype = unicast_4addr_packet->subtype; + batadv_dat_inc_counter(bat_priv, subtype); + + /* Only payload data should be considered for speedy + * join. For example, DAT also uses unicast 4addr + * types, but those packets should not be considered + * for speedy join, since the clients do not actually + * reside at the sending originator. + */ + if (subtype == BATADV_P_DATA) { + orig_addr = unicast_4addr_packet->src; + orig_node = batadv_orig_hash_find(bat_priv, + orig_addr); + } }
if (batadv_dat_snoop_incoming_arp_request(bat_priv, skb,
On 26/08/15 16:41, Simon Wunderlich wrote:
DAT Cache replies are answered on behalf of other clients which are not connected to the answering originator. Therefore, we shouldn't add these clients to the answering originators TT table through speed join to avoid bogus entries.
Reported-by: Alessandro Bolletta alessandro@mediaspot.net Signed-off-by: Simon Wunderlich sw@simonwunderlich.de
Acked-by: Antonio Quartulli antonio@meshcoding.com
In the case when a temporary entry is added first and a proper tt entry is added after that, the temporary tt entry is kept in the orig list. However the temporary flag is removed at this point, and therefore the purge function can not find this temporary entry anymore.
Therefore, remove the previous temp entry before adding the new proper one.
Reported-by: Alessandro Bolletta alessandro@mediaspot.net Signed-off-by: Simon Wunderlich sw@simonwunderlich.de --- net/batman-adv/translation-table.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index 7986ec5..f629c21 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -1416,9 +1416,15 @@ static bool batadv_tt_global_add(struct batadv_priv *bat_priv, }
/* if the client was temporary added before receiving the first - * OGM announcing it, we have to clear the TEMP flag + * OGM announcing it, we have to clear the TEMP flag. Also, + * remove the previous temporary orig node and re-add it + * if required. If the orig entry changed, the new one which + * is a non-temporary entry is preferred. */ - common->flags &= ~BATADV_TT_CLIENT_TEMP; + if (common->flags & BATADV_TT_CLIENT_TEMP) { + batadv_tt_global_del_orig_list(tt_global_entry); + common->flags &= ~BATADV_TT_CLIENT_TEMP; + }
/* the change can carry possible "attribute" flags like the * TT_CLIENT_WIFI, therefore they have to be copied in the
On 26/08/15 16:41, Simon Wunderlich wrote:
In the case when a temporary entry is added first and a proper tt entry is added after that, the temporary tt entry is kept in the orig list. However the temporary flag is removed at this point, and therefore the purge function can not find this temporary entry anymore.
When can this really happen? When a temporary client has roamed to a new originator before it could be claimed by the first one ? Or are there other cases ? I think it is important to make this clear, because the original logic was such as the expected behaviour was to receive an ADD event from the same originator where the temporary client was seen.
Other than this, the patch looks good.
On Tuesday 01 September 2015 11:00:29 Antonio Quartulli wrote:
On 26/08/15 16:41, Simon Wunderlich wrote:
In the case when a temporary entry is added first and a proper tt entry is added after that, the temporary tt entry is kept in the orig list. However the temporary flag is removed at this point, and therefore the purge function can not find this temporary entry anymore.
When can this really happen? When a temporary client has roamed to a new originator before it could be claimed by the first one ? Or are there other cases ? I think it is important to make this clear, because the original logic was such as the expected behaviour was to receive an ADD event from the same originator where the temporary client was seen.
Other than this, the patch looks good.
I've seen the problem in that problematic case which was fixed in PATCHv1. Practically, it is unlikely to happen unless there is a malicious attacker or another bug like the one described earlier - that is, speedy join is triggered without any actual TT update.
The main problem I've perceived here was that the TT entry got the temporary flag removed even if the original sender didn't supply a proper TT announcement yet. The reason for this was that it got a proper reply from another orig node.
The roaming case you've described would also cause the temporary client to be removed, but might be added later through a ''proper'' TT announcement.
Do you think I should include any of this in the commit message?
Cheers, Simon
On 01/09/15 13:00, Simon Wunderlich wrote:
On Tuesday 01 September 2015 11:00:29 Antonio Quartulli wrote:
On 26/08/15 16:41, Simon Wunderlich wrote:
In the case when a temporary entry is added first and a proper tt entry is added after that, the temporary tt entry is kept in the orig list. However the temporary flag is removed at this point, and therefore the purge function can not find this temporary entry anymore.
When can this really happen? When a temporary client has roamed to a new originator before it could be claimed by the first one ? Or are there other cases ? I think it is important to make this clear, because the original logic was such as the expected behaviour was to receive an ADD event from the same originator where the temporary client was seen.
Other than this, the patch looks good.
I've seen the problem in that problematic case which was fixed in PATCHv1. Practically, it is unlikely to happen unless there is a malicious attacker or another bug like the one described earlier - that is, speedy join is triggered without any actual TT update.
The main problem I've perceived here was that the TT entry got the temporary flag removed even if the original sender didn't supply a proper TT announcement yet. The reason for this was that it got a proper reply from another orig node.
The roaming case you've described would also cause the temporary client to be removed, but might be added later through a ''proper'' TT announcement.
Do you think I should include any of this in the commit message?
I think you should mention something that makes this case "real": I'd just say that it is possible that a client detected behind a given originator moves before an actual claim is made, so triggering this particular configuration.
I hope you think this is fine.
Cheers,
On Tuesday 01 September 2015 13:30:43 Antonio Quartulli wrote:
On 01/09/15 13:00, Simon Wunderlich wrote:
On Tuesday 01 September 2015 11:00:29 Antonio Quartulli wrote:
On 26/08/15 16:41, Simon Wunderlich wrote:
In the case when a temporary entry is added first and a proper tt entry is added after that, the temporary tt entry is kept in the orig list. However the temporary flag is removed at this point, and therefore the purge function can not find this temporary entry anymore.
When can this really happen? When a temporary client has roamed to a new originator before it could be claimed by the first one ? Or are there other cases ? I think it is important to make this clear, because the original logic was such as the expected behaviour was to receive an ADD event from the same originator where the temporary client was seen.
Other than this, the patch looks good.
I've seen the problem in that problematic case which was fixed in PATCHv1. Practically, it is unlikely to happen unless there is a malicious attacker or another bug like the one described earlier - that is, speedy join is triggered without any actual TT update.
The main problem I've perceived here was that the TT entry got the temporary flag removed even if the original sender didn't supply a proper TT announcement yet. The reason for this was that it got a proper reply from another orig node.
The roaming case you've described would also cause the temporary client to be removed, but might be added later through a ''proper'' TT announcement.
Do you think I should include any of this in the commit message?
I think you should mention something that makes this case "real": I'd just say that it is possible that a client detected behind a given originator moves before an actual claim is made, so triggering this particular configuration.
Sounds good, I'll extend the commit message.
Thanks! Simon
If the local representation of the global TT table of one originator has more VLAN entries than the respective TT update, there is some inconsistency present. By detecting and reporting this inconsistency, the global table gets updated and the excess VLAN will get removed in the process.
Reported-by: Alessandro Bolletta alessandro@mediaspot.net Signed-off-by: Simon Wunderlich sw@simonwunderlich.de --- Changes to PATCH: * count instead of explicitly comparing VLANs, which makes it much simpler (thank Antonio) --- net/batman-adv/translation-table.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index f629c21..e7f2998 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -2391,8 +2391,8 @@ static bool batadv_tt_global_check_crc(struct batadv_orig_node *orig_node, { struct batadv_tvlv_tt_vlan_data *tt_vlan_tmp; struct batadv_orig_node_vlan *vlan; + int i, orig_num_vlan; uint32_t crc; - int i;
/* check if each received CRC matches the locally stored one */ for (i = 0; i < num_vlan; i++) { @@ -2418,6 +2418,18 @@ static bool batadv_tt_global_check_crc(struct batadv_orig_node *orig_node, return false; }
+ /* check if any excess VLANs exist locally for the originator + * which are not mentioned in the TVLV from the originator. + */ + rcu_read_lock(); + orig_num_vlan = 0; + list_for_each_entry_rcu(vlan, &orig_node->vlan_list, list) + orig_num_vlan++; + rcu_read_unlock(); + + if (orig_num_vlan > num_vlan) + return false; + return true; }
On 26/08/15 16:41, Simon Wunderlich wrote:
If the local representation of the global TT table of one originator has more VLAN entries than the respective TT update, there is some inconsistency present. By detecting and reporting this inconsistency, the global table gets updated and the excess VLAN will get removed in the process.
This a nice catch, but I am not sure this is the right way of implementing the fix.
Imagine this sequence of events: 1) originator O1 sends an OGM 2) client C1 connects to O1 on a newly created VLAN and starts sending traffic 3) originator O2 detects (speedy join) C1 before receiving the O1's OGM 4) O2 receives O1's OGM and the check will kill C1 because its VLAN is not advertised in the OGM. O2 needs to wait for O1's next OGM before getting to know C1 again
Maybe this scenario is rather unlikely? What do you think?
On Tuesday 01 September 2015 12:07:36 Antonio Quartulli wrote:
On 26/08/15 16:41, Simon Wunderlich wrote:
If the local representation of the global TT table of one originator has more VLAN entries than the respective TT update, there is some inconsistency present. By detecting and reporting this inconsistency, the global table gets updated and the excess VLAN will get removed in the process.
This a nice catch, but I am not sure this is the right way of implementing the fix.
Imagine this sequence of events:
- originator O1 sends an OGM
- client C1 connects to O1 on a newly created VLAN and starts sending
traffic 3) originator O2 detects (speedy join) C1 before receiving the O1's OGM 4) O2 receives O1's OGM and the check will kill C1 because its VLAN is not advertised in the OGM. O2 needs to wait for O1's next OGM before getting to know C1 again
Maybe this scenario is rather unlikely? What do you think?
Mhm, I'd argue its rather unlikely. This would imply that the data frame is reaching O2 faster the OGM. I don't think its very likely that it overtakes it, and even if it happens, its probably a very tight race.
Also, I'd think that nodes typically just use or don't use a VLAN, and don't create VLANs all the time ...
Even if that happens, the race is resolved with one originator interval.
I'm open for better ideas, though. :)
Cheers, Simon
On 02/09/15 19:39, Simon Wunderlich wrote:
On Tuesday 01 September 2015 12:07:36 Antonio Quartulli wrote:
On 26/08/15 16:41, Simon Wunderlich wrote:
If the local representation of the global TT table of one originator has more VLAN entries than the respective TT update, there is some inconsistency present. By detecting and reporting this inconsistency, the global table gets updated and the excess VLAN will get removed in the process.
This a nice catch, but I am not sure this is the right way of implementing the fix.
Imagine this sequence of events:
- originator O1 sends an OGM
- client C1 connects to O1 on a newly created VLAN and starts sending
traffic 3) originator O2 detects (speedy join) C1 before receiving the O1's OGM 4) O2 receives O1's OGM and the check will kill C1 because its VLAN is not advertised in the OGM. O2 needs to wait for O1's next OGM before getting to know C1 again
Maybe this scenario is rather unlikely? What do you think?
Mhm, I'd argue its rather unlikely. This would imply that the data frame is reaching O2 faster the OGM. I don't think its very likely that it overtakes it, and even if it happens, its probably a very tight race.
Also, I'd think that nodes typically just use or don't use a VLAN, and don't create VLANs all the time ...
Even if that happens, the race is resolved with one originator interval.
Yeah, I think we can live with this "limitation".
Cheers,
b.a.t.m.a.n@lists.open-mesh.org