Hello David,
here you have our patchset for net/linux-4.2 which contains only patches that we think to be important (meaning they fix critical crashes/misbehaviours actually reported by some users).
Patch 1 (by me) is preventing DAT from injecting replies received from the mesh into the LAN which would confuse a L2 bridge.
Patch 2 introduces several NULL checks in order to prevent spurious kernel crashes due to NULL pointer deferences, by Marek Lindnder.
Patches 3, still by Marek, prevent accidental double deletions of tt_local_entry objects from their own lists which would lead to a kernel crash.
Patch 4 by Simon Wunderlich fixes a memory leak which is triggered by the missing initialization of the bandwidth_up/down fields of the bat-GW struct.
If possible, I'd recommend to consider all these patches for inclusion in the stable releases.
Please pull or let me know if anything else is wrong!
Thanks a lot David, Antonio
The following changes since commit 2475b22526d70234ecfe4a1ff88aed69badefba9:
xen-netback: Allocate fraglist early to avoid complex rollback (2015-08-03 22:23:03 -0700)
are available in the git repository at:
git://git.open-mesh.org/linux-merge.git tags/batman-adv-fix-for-davem
for you to fetch changes up to 27a4d5efd417b6ef3190e9af357715532d4617a3:
batman-adv: initialize up/down values when adding a gateway (2015-08-05 00:31:47 +0200)
---------------------------------------------------------------- Included changes: - prevent DAT from replying on behalf of local clients and confuse L2 bridges - fix crash on double list removal of TT objects (tt_local_entry) - fix crash due to missing NULL checks - initialize bw values for new GWs objects to prevent memory leak
---------------------------------------------------------------- Antonio Quartulli (1): batman-adv: avoid DAT to mess up LAN state
Marek Lindner (2): batman-adv: fix kernel crash due to missing NULL checks batman-adv: protect tt_local_entry from concurrent delete events
Simon Wunderlich (1): batman-adv: initialize up/down values when adding a gateway
net/batman-adv/distributed-arp-table.c | 18 +++++++++++++----- net/batman-adv/gateway_client.c | 2 ++ net/batman-adv/soft-interface.c | 3 +++ net/batman-adv/translation-table.c | 29 ++++++++++++++++++++++++----- 4 files changed, 42 insertions(+), 10 deletions(-)
When a node running DAT receives an ARP request from the LAN for the first time, it is likely that this node will request the ARP entry through the distributed ARP table (DAT) in the mesh.
Once a DAT reply is received the asking node must check if the MAC address for which the IP address has been asked is local. If it is, the node must drop the ARP reply bceause the client should have replied on its own locally.
Forwarding this reply means fooling any L2 bridge (e.g. Ethernet switches) lying between the batman-adv node and the LAN. This happens because the L2 bridge will think that the client sending the ARP reply lies somewhere in the mesh, while this node is sitting in the same LAN.
Reported-by: Simon Wunderlich sw@simonwunderlich.de Signed-off-by: Marek Lindner mareklindner@neomailbox.ch Signed-off-by: Antonio Quartulli antonio@meshcoding.com --- net/batman-adv/distributed-arp-table.c | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/net/batman-adv/distributed-arp-table.c b/net/batman-adv/distributed-arp-table.c index fb54e6a..6d0b471 100644 --- a/net/batman-adv/distributed-arp-table.c +++ b/net/batman-adv/distributed-arp-table.c @@ -1138,6 +1138,9 @@ void batadv_dat_snoop_outgoing_arp_reply(struct batadv_priv *bat_priv, * @bat_priv: the bat priv with all the soft interface information * @skb: packet to check * @hdr_size: size of the encapsulation header + * + * Returns true if the packet was snooped and consumed by DAT. False if the + * packet has to be delivered to the interface */ bool batadv_dat_snoop_incoming_arp_reply(struct batadv_priv *bat_priv, struct sk_buff *skb, int hdr_size) @@ -1145,7 +1148,7 @@ bool batadv_dat_snoop_incoming_arp_reply(struct batadv_priv *bat_priv, uint16_t type; __be32 ip_src, ip_dst; uint8_t *hw_src, *hw_dst; - bool ret = false; + bool dropped = false; unsigned short vid;
if (!atomic_read(&bat_priv->distributed_arp_table)) @@ -1174,12 +1177,17 @@ bool batadv_dat_snoop_incoming_arp_reply(struct batadv_priv *bat_priv, /* if this REPLY is directed to a client of mine, let's deliver the * packet to the interface */ - ret = !batadv_is_my_client(bat_priv, hw_dst, vid); + dropped = !batadv_is_my_client(bat_priv, hw_dst, vid); + + /* if this REPLY is sent on behalf of a client of mine, let's drop the + * packet because the client will reply by itself + */ + dropped |= batadv_is_my_client(bat_priv, hw_src, vid); out: - if (ret) + if (dropped) kfree_skb(skb); - /* if ret == false -> packet has to be delivered to the interface */ - return ret; + /* if dropped == false -> deliver to the interface */ + return dropped; }
/**
From: Marek Lindner mareklindner@neomailbox.ch
batadv_softif_vlan_get() may return NULL which has to be verified by the caller.
Fixes: 35df3b298fc8 ("batman-adv: fix TT VLAN inconsistency on VLAN re-add") Reported-by: Ryan Thompson ryan@eero.com Signed-off-by: Marek Lindner mareklindner@neomailbox.ch Signed-off-by: Antonio Quartulli antonio@meshcoding.com --- net/batman-adv/soft-interface.c | 3 +++ net/batman-adv/translation-table.c | 18 ++++++++++++++---- 2 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c index c002961..a2fc843 100644 --- a/net/batman-adv/soft-interface.c +++ b/net/batman-adv/soft-interface.c @@ -479,6 +479,9 @@ out: */ void batadv_softif_vlan_free_ref(struct batadv_softif_vlan *vlan) { + if (!vlan) + return; + if (atomic_dec_and_test(&vlan->refcount)) { spin_lock_bh(&vlan->bat_priv->softif_vlan_list_lock); hlist_del_rcu(&vlan->list); diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index b482495..38b83c5 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -594,6 +594,9 @@ bool batadv_tt_local_add(struct net_device *soft_iface, const uint8_t *addr,
/* increase the refcounter of the related vlan */ vlan = batadv_softif_vlan_get(bat_priv, vid); + if (WARN(!vlan, "adding TT local entry %pM to non-existent VLAN %d", + addr, BATADV_PRINT_VID(vid))) + goto out;
batadv_dbg(BATADV_DBG_TT, bat_priv, "Creating new local tt entry: %pM (vid: %d, ttvn: %d)\n", @@ -1066,6 +1069,9 @@ uint16_t batadv_tt_local_remove(struct batadv_priv *bat_priv,
/* decrease the reference held for this vlan */ vlan = batadv_softif_vlan_get(bat_priv, vid); + if (!vlan) + goto out; + batadv_softif_vlan_free_ref(vlan); batadv_softif_vlan_free_ref(vlan);
@@ -1166,8 +1172,10 @@ static void batadv_tt_local_table_free(struct batadv_priv *bat_priv) /* decrease the reference held for this vlan */ vlan = batadv_softif_vlan_get(bat_priv, tt_common_entry->vid); - batadv_softif_vlan_free_ref(vlan); - batadv_softif_vlan_free_ref(vlan); + if (vlan) { + batadv_softif_vlan_free_ref(vlan); + batadv_softif_vlan_free_ref(vlan); + }
batadv_tt_local_entry_free_ref(tt_local); } @@ -3207,8 +3215,10 @@ static void batadv_tt_local_purge_pending_clients(struct batadv_priv *bat_priv)
/* decrease the reference held for this vlan */ vlan = batadv_softif_vlan_get(bat_priv, tt_common->vid); - batadv_softif_vlan_free_ref(vlan); - batadv_softif_vlan_free_ref(vlan); + if (vlan) { + batadv_softif_vlan_free_ref(vlan); + batadv_softif_vlan_free_ref(vlan); + }
batadv_tt_local_entry_free_ref(tt_local); }
From: Of Antonio Quartulli
Sent: 05 August 2015 13:52 From: Marek Lindner mareklindner@neomailbox.ch
batadv_softif_vlan_get() may return NULL which has to be verified by the caller.
...
diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c index c002961..a2fc843 100644 --- a/net/batman-adv/soft-interface.c +++ b/net/batman-adv/soft-interface.c @@ -479,6 +479,9 @@ out: */ void batadv_softif_vlan_free_ref(struct batadv_softif_vlan *vlan) {
- if (!vlan)
return;
This bit doesn't look necessary. You've added checks to some callers, the others probably don't need the check.
@@ -1066,6 +1069,9 @@ uint16_t batadv_tt_local_remove(struct batadv_priv *bat_priv,
/* decrease the reference held for this vlan */ vlan = batadv_softif_vlan_get(bat_priv, vid);
- if (!vlan)
goto out;
- batadv_softif_vlan_free_ref(vlan); batadv_softif_vlan_free_ref(vlan);
That code is ringing alarm bells. If you expect to have a reference count the object better exist. If you can remove a reference count from a 'random' object then you can break the reference counting of objects.
So is this test just hiding anoter bug somewhere??
David
On 05/08/15 15:15, David Laight wrote:
So is this test just hiding anoter bug somewhere??
Hi David and thanks for your feedback.
The point is that we got several bug reports of kernel crashes due to NULL pointer deferences in these lines and fter having debugged this problem for quite a while we preferred to move on and propose this patch.
Still, I am personally debugging this part of the code to understand if we really have something wrong or if this NULL pointer is something we should expect (and therefore check).
For the time being we think this patch is better than having horrible kernel crashes, but I hope to come to a definitive conclusion soon.
Cheers,
From: Antonio Quartulli
Sent: 11 August 2015 17:43 On 05/08/15 15:15, David Laight wrote:
So is this test just hiding anoter bug somewhere??
Hi David and thanks for your feedback.
The point is that we got several bug reports of kernel crashes due to NULL pointer deferences in these lines and after having debugged this problem for quite a while we preferred to move on and propose this patch.
That is what I thought.
Still, I am personally debugging this part of the code to understand if we really have something wrong or if this NULL pointer is something we should expect (and therefore check).
For the time being we think this patch is better than having horrible kernel crashes, but I hope to come to a definitive conclusion soon.
If you don't know why you are seeing the NULL pointer then you are just papering over some cracks and it is likely that something is really badly wrong somewhere (ie missing locking). This could easily mean that there are some much harder to find bugs lurking there.
David
From: Marek Lindner mareklindner@neomailbox.ch
The tt_local_entry deletion performed in batadv_tt_local_remove() was neither protecting against simultaneous deletes nor checking whether the element was still part of the list before calling hlist_del_rcu().
Replacing the hlist_del_rcu() call with batadv_hash_remove() provides adequate protection via hash spinlocks as well as an is-element-still-in-hash check to avoid 'blind' hash removal.
Fixes: 068ee6e204e1 ("batman-adv: roaming handling mechanism redesign") Reported-by: alfonsname@web.de Signed-off-by: Marek Lindner mareklindner@neomailbox.ch Signed-off-by: Antonio Quartulli antonio@meshcoding.com --- net/batman-adv/translation-table.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index 38b83c5..5e953297 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -1037,6 +1037,7 @@ uint16_t batadv_tt_local_remove(struct batadv_priv *bat_priv, struct batadv_tt_local_entry *tt_local_entry; uint16_t flags, curr_flags = BATADV_NO_FLAGS; struct batadv_softif_vlan *vlan; + void *tt_entry_exists;
tt_local_entry = batadv_tt_local_hash_find(bat_priv, addr, vid); if (!tt_local_entry) @@ -1064,7 +1065,15 @@ uint16_t batadv_tt_local_remove(struct batadv_priv *bat_priv, * immediately purge it */ batadv_tt_local_event(bat_priv, tt_local_entry, BATADV_TT_CLIENT_DEL); - hlist_del_rcu(&tt_local_entry->common.hash_entry); + + tt_entry_exists = batadv_hash_remove(bat_priv->tt.local_hash, + batadv_compare_tt, + batadv_choose_tt, + &tt_local_entry->common); + if (!tt_entry_exists) + goto out; + + /* extra call to free the local tt entry */ batadv_tt_local_entry_free_ref(tt_local_entry);
/* decrease the reference held for this vlan */
From: Simon Wunderlich simon@open-mesh.com
Without this initialization, gateways which actually announce up/down bandwidth of 0/0 could be added. If these nodes get purged via _batadv_purge_orig() later, the gw_node structure does not get removed since batadv_gw_node_delete() updates the gw_node with up/down bandwidth of 0/0, and the updating function then discards the change and does not free gw_node.
This results in leaking the gw_node structures, which references other structures: gw_node -> orig_node -> orig_node_ifinfo -> hardif. When removing the interface later, the open reference on the hardif may cause hangs with the infamous "unregister_netdevice: waiting for mesh1 to become free. Usage count = 1" message.
Signed-off-by: Simon Wunderlich simon@open-mesh.com Signed-off-by: Marek Lindner mareklindner@neomailbox.ch Signed-off-by: Antonio Quartulli antonio@meshcoding.com --- net/batman-adv/gateway_client.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/net/batman-adv/gateway_client.c b/net/batman-adv/gateway_client.c index bb015862..cffa92d 100644 --- a/net/batman-adv/gateway_client.c +++ b/net/batman-adv/gateway_client.c @@ -439,6 +439,8 @@ static void batadv_gw_node_add(struct batadv_priv *bat_priv,
INIT_HLIST_NODE(&gw_node->list); gw_node->orig_node = orig_node; + gw_node->bandwidth_down = ntohl(gateway->bandwidth_down); + gw_node->bandwidth_up = ntohl(gateway->bandwidth_up); atomic_set(&gw_node->refcount, 1);
spin_lock_bh(&bat_priv->gw.list_lock);
From: Antonio Quartulli antonio@meshcoding.com Date: Wed, 5 Aug 2015 14:51:43 +0200
git://git.open-mesh.org/linux-merge.git tags/batman-adv-fix-for-davem
Pulled and queued up for -stable, thanks.
b.a.t.m.a.n@lists.open-mesh.org