From: Simon Wunderlich simon@open-mesh.com
This series includes a couple of patches which fix some issues and prepare for the upcoming BATMAN V code.
Simon Wunderlich (3): batman-adv: fix lockdep splat when doing mcast_free batman-adv: add kerneldoc for batadv_iv_ogm_aggr_packet batman-adv: add seqno maximum age and protection start flag parameters
net/batman-adv/bat_iv_ogm.c | 12 ++++++++++-- net/batman-adv/main.h | 3 +++ net/batman-adv/multicast.c | 2 ++ net/batman-adv/routing.c | 13 ++++++++++--- net/batman-adv/routing.h | 3 ++- 5 files changed, 27 insertions(+), 6 deletions(-)
From: Simon Wunderlich simon@open-mesh.com
Signed-off-by: Simon Wunderlich simon@open-mesh.com --- net/batman-adv/multicast.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/net/batman-adv/multicast.c b/net/batman-adv/multicast.c index 8abf488..d984eee 100644 --- a/net/batman-adv/multicast.c +++ b/net/batman-adv/multicast.c @@ -801,7 +801,9 @@ void batadv_mcast_free(struct batadv_priv *bat_priv) batadv_tvlv_container_unregister(bat_priv, BATADV_TVLV_MCAST, 1); batadv_tvlv_handler_unregister(bat_priv, BATADV_TVLV_MCAST, 1);
+ spin_lock_bh(&bat_priv->tt.commit_lock); batadv_mcast_mla_tt_retract(bat_priv, NULL); + spin_unlock_bh(&bat_priv->tt.commit_lock); }
/**
Simon Wunderlich:
From: Simon Wunderlich simon@open-mesh.com
Signed-off-by: Simon Wunderlich simon@open-mesh.com
net/batman-adv/multicast.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/net/batman-adv/multicast.c b/net/batman-adv/multicast.c index 8abf488..d984eee 100644 --- a/net/batman-adv/multicast.c +++ b/net/batman-adv/multicast.c @@ -801,7 +801,9 @@ void batadv_mcast_free(struct batadv_priv *bat_priv) batadv_tvlv_container_unregister(bat_priv, BATADV_TVLV_MCAST, 1); batadv_tvlv_handler_unregister(bat_priv, BATADV_TVLV_MCAST, 1);
- spin_lock_bh(&bat_priv->tt.commit_lock); batadv_mcast_mla_tt_retract(bat_priv, NULL);
- spin_unlock_bh(&bat_priv->tt.commit_lock);
Linus,
can you please comment as of why batadv_mcast_mla_tt_retract() requires to hold the tt.commit_lock ?
I see it calls batadv_tt_local_remove() but this does not really requires the lock. Maybe you wanted to perform *all* the removes before TT could do a commit ? Or is there any other reason?
Cheers,
On Saturday 28 November 2015 10:49:59 Antonio Quartulli wrote:
Simon Wunderlich:
From: Simon Wunderlich simon@open-mesh.com
Signed-off-by: Simon Wunderlich simon@open-mesh.com
net/batman-adv/multicast.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/net/batman-adv/multicast.c b/net/batman-adv/multicast.c index 8abf488..d984eee 100644 --- a/net/batman-adv/multicast.c +++ b/net/batman-adv/multicast.c @@ -801,7 +801,9 @@ void batadv_mcast_free(struct batadv_priv *bat_priv)
batadv_tvlv_container_unregister(bat_priv, BATADV_TVLV_MCAST, 1); batadv_tvlv_handler_unregister(bat_priv, BATADV_TVLV_MCAST, 1);
spin_lock_bh(&bat_priv->tt.commit_lock);
batadv_mcast_mla_tt_retract(bat_priv, NULL);
spin_unlock_bh(&bat_priv->tt.commit_lock);
Linus,
can you please comment as of why batadv_mcast_mla_tt_retract() requires to hold the tt.commit_lock ?
Maybe the relevant parts of the Oops can be added to the commit message:
WARNING: CPU: 0 PID: 238 at net/batman-adv/multicast.c:142 batadv_mcast_mla_tt_retract+0x94/0x205 [batman_adv]() [...] Call Trace: [<ffffffff815fc597>] dump_stack+0x4b/0x64 [<ffffffff810b34dc>] warn_slowpath_common+0xbc/0x120 [<ffffffffa0024ec5>] ? batadv_mcast_mla_tt_retract+0x94/0x205 [batman_adv] [<ffffffff810b3705>] warn_slowpath_null+0x15/0x20 [<ffffffffa0024ec5>] batadv_mcast_mla_tt_retract+0x94/0x205 [batman_adv] [<ffffffffa00273fe>] batadv_mcast_free+0x36/0x39 [batman_adv] [<ffffffffa0020c77>] batadv_mesh_free+0x7d/0x13f [batman_adv] [<ffffffffa0036a6b>] batadv_softif_free+0x15/0x25 [batman_adv] [...]
I see it calls batadv_tt_local_remove() but this does not really requires the lock. Maybe you wanted to perform *all* the removes before TT could do a commit ? Or is there any other reason?
Cheers,
mcast.mla_list is protected by tt.commit_lock (see batadv_mcast_mla_tt_add, batadv_mcast_mla_list_free and batadv_mcast_mla_tt_retract).
Kind regards, Sven
Sven Eckelmann:
Maybe the relevant parts of the Oops can be added to the commit message:
WARNING: CPU: 0 PID: 238 at net/batman-adv/multicast.c:142 batadv_mcast_mla_tt_retract+0x94/0x205 [batman_adv]() [...] Call Trace: [<ffffffff815fc597>] dump_stack+0x4b/0x64 [<ffffffff810b34dc>] warn_slowpath_common+0xbc/0x120 [<ffffffffa0024ec5>] ? batadv_mcast_mla_tt_retract+0x94/0x205 [batman_adv] [<ffffffff810b3705>] warn_slowpath_null+0x15/0x20 [<ffffffffa0024ec5>] batadv_mcast_mla_tt_retract+0x94/0x205 [batman_adv] [<ffffffffa00273fe>] batadv_mcast_free+0x36/0x39 [batman_adv] [<ffffffffa0020c77>] batadv_mesh_free+0x7d/0x13f [batman_adv] [<ffffffffa0036a6b>] batadv_softif_free+0x15/0x25 [batman_adv] [...]
I agree
I see it calls batadv_tt_local_remove() but this does not really requires the lock. Maybe you wanted to perform *all* the removes before TT could do a commit ? Or is there any other reason?
Cheers,
mcast.mla_list is protected by tt.commit_lock (see batadv_mcast_mla_tt_add, batadv_mcast_mla_list_free and batadv_mcast_mla_tt_retract).
ok, this makes sense :)
Simon, I'd suggest you follow Sven's suggestion about adding the oops/stacktrace and then you can also append my
Acked-by: Antonio Quartulli antonio@meshcoding.com
note that this patch is a bugfix, hence it should target maint.
Cheers,
Hi,
Sorry for the late reply, I had also missed the Sven's lockdep-assert-patch back then.
On Saturday 28 November 2015 10:49:59 Antonio Quartulli wrote:
Linus,
can you please comment as of why batadv_mcast_mla_tt_retract() requires to hold the tt.commit_lock ?
I don't think it does. At least if you say that a call to batadv_tt_local_remove() as is does not need it (and it seems there are other places calling tt_local_remove() without this lock, too).
[...]
On Sat, Nov 28, 2015 at 09:21:02AM +0100, Sven Eckelmann wrote:
mcast.mla_list is protected by tt.commit_lock (see batadv_mcast_mla_tt_add, batadv_mcast_mla_list_free and batadv_mcast_mla_tt_retract).
mcast.mla_list changes should be protected by the non-parallel code flow: During runtime, batadv_mcast_mla_tt_update() is only called from the self-rearming OGM scheduler thread - batadv_mcast_mla_tt_update() will never run more than once at the same time.
The second place for mcast.mla_list changes, batadv_mcast_free(), is called only on shutdown after the OGM scheduling thread was stopped.
I don't think there should be such races regarding mcast.mla_list - was something like that observed in the wild which lead to inserting the lockdep-asserts?
Cheers, Linus
PS: But granted, even if we came to the conclusion, that a tt.commit_lock were unnecessary from the multicast code, that would not be very obvious. Something needs to be done either way.
On Monday 07 December 2015 23:12:42 Linus Lüssing wrote:
On Sat, Nov 28, 2015 at 09:21:02AM +0100, Sven Eckelmann wrote:
mcast.mla_list is protected by tt.commit_lock (see batadv_mcast_mla_tt_add, batadv_mcast_mla_list_free and batadv_mcast_mla_tt_retract).
mcast.mla_list changes should be protected by the non-parallel code flow: During runtime, batadv_mcast_mla_tt_update() is only called from the self-rearming OGM scheduler thread - batadv_mcast_mla_tt_update() will never run more than once at the same time.
The second place for mcast.mla_list changes, batadv_mcast_free(), is called only on shutdown after the OGM scheduling thread was stopped.
The two functions with the lockdep assert are
* batadv_mcast_mla_list_free * batadv_mcast_mla_tt_retract
(batadv_mcast_mla_tt_add looks basically like batadv_mcast_mla_list_free)
The call graphs are attached and these graphs have (pure) starting nodes which are not only batadv_exit and batadv_iv_ogm_schedule. Parts of them look like they are only protected because of tt.commit_lock.
I don't think there should be such races regarding mcast.mla_list
- was something like that observed in the wild which lead to inserting
the lockdep-asserts?
We had multiple races and crashes which resulted in these asserts and locks+checks around list_del. The list_add modifications are still missing. And there are still other problems which are still open [1].
Kind regards, Sven
[1] e.g. https://www.open-mesh.org/issues/223
On Mon, Dec 14, 2015 at 07:56:19PM +0100, Sven Eckelmann wrote:
On Monday 07 December 2015 23:12:42 Linus Lüssing wrote:
On Sat, Nov 28, 2015 at 09:21:02AM +0100, Sven Eckelmann wrote:
mcast.mla_list is protected by tt.commit_lock (see batadv_mcast_mla_tt_add, batadv_mcast_mla_list_free and batadv_mcast_mla_tt_retract).
mcast.mla_list changes should be protected by the non-parallel code flow: During runtime, batadv_mcast_mla_tt_update() is only called from the self-rearming OGM scheduler thread - batadv_mcast_mla_tt_update() will never run more than once at the same time.
The second place for mcast.mla_list changes, batadv_mcast_free(), is called only on shutdown after the OGM scheduling thread was stopped.
The two functions with the lockdep assert are
- batadv_mcast_mla_list_free
- batadv_mcast_mla_tt_retract
(batadv_mcast_mla_tt_add looks basically like batadv_mcast_mla_list_free)
The call graphs are attached and these graphs have (pure) starting nodes which are not only batadv_exit and batadv_iv_ogm_schedule. Parts of them look like they are only protected because of tt.commit_lock.
Thanks for these pictures! (btw. which tool did you use for that?)
The two non-colliding paths I had in mind were batadv_iv_ogm_schedule() and batadv_mcast_free(), which looks like:
batadv_mesh_free() -> batadv_purge_outstanding_packets() -> cancel_delayed_work_sync() ! [...] -> batadv_mcast_free()
Which ensures that no batadv_iv_ogm_schedule() is running before calling batadv_mcast_free().
But I missed the path from batadv_update_min_mtu()... However, it should not race with batadv_mcast_free() either, which is called once the last hard-iface gets disabled:
batadv_hardif_disable_interface() -> batadv_purge_outstanding_packets() -> cancel_delayed_work_sync() ! -> dev_put(soft_iface) [ if last hard-iface, then soft-iface is out of scope for any new batadv_update_mtu() and gets freed: ] -> batadv_softif_free() -> batadv_mesh_free() -> batadv_mcast_free()
But with yet another path it is getting even more, rediculously complicated... Just proving that trying to keep a lock-less update for mla_list here is a bad, unmaintainable approach :).
So I'm definitely in favor of having some spinlock to refer to for mcast.mla_list update, even if it isn't necessary for batadv_mcast_free(). But I don't see a race for the current mla_list updates either (otherwise I'd need a specific, more verbose example, I guess...).
The question is, do we want to have Simon's patch for maint to trickle down to 4.3 (where the lockdep patch got added) or back to 3.15 (where multicast.c got added)?
For stable kernels, we need a specific, precise, reproducable issue to point to, right? (stable_kernel_rules.txt: 'It must fix a real bug that bothers people (not a, "This could be a problem..." type thing).'
Regards, Linus
On Tuesday 15 December 2015 14:15:33 Linus Lüssing wrote:
On Mon, Dec 14, 2015 at 07:56:19PM +0100, Sven Eckelmann wrote:
On Monday 07 December 2015 23:12:42 Linus Lüssing wrote:
On Sat, Nov 28, 2015 at 09:21:02AM +0100, Sven Eckelmann wrote:
mcast.mla_list is protected by tt.commit_lock (see batadv_mcast_mla_tt_add, batadv_mcast_mla_list_free and batadv_mcast_mla_tt_retract).
mcast.mla_list changes should be protected by the non-parallel code flow: During runtime, batadv_mcast_mla_tt_update() is only called from the self-rearming OGM scheduler thread - batadv_mcast_mla_tt_update() will never run more than once at the same time.
The second place for mcast.mla_list changes, batadv_mcast_free(), is called only on shutdown after the OGM scheduling thread was stopped.
The two functions with the lockdep assert are
- batadv_mcast_mla_list_free
- batadv_mcast_mla_tt_retract
(batadv_mcast_mla_tt_add looks basically like batadv_mcast_mla_list_free)
The call graphs are attached and these graphs have (pure) starting nodes which are not only batadv_exit and batadv_iv_ogm_schedule. Parts of them look like they are only protected because of tt.commit_lock.
Thanks for these pictures! (btw. which tool did you use for that?)
It is just doxygen with the dot (graphviz) enabled for all (also undocumented + local/static functions).
The two non-colliding paths I had in mind were batadv_iv_ogm_schedule() and batadv_mcast_free(), which looks like:
batadv_mesh_free() -> batadv_purge_outstanding_packets() -> cancel_delayed_work_sync() ! [...] -> batadv_mcast_free()
Which ensures that no batadv_iv_ogm_schedule() is running before calling batadv_mcast_free().
That seems to be right.
But I missed the path from batadv_update_min_mtu()... However, it should not race with batadv_mcast_free() either, which is called once the last hard-iface gets disabled:
batadv_hardif_disable_interface() -> batadv_purge_outstanding_packets() -> cancel_delayed_work_sync() ! -> dev_put(soft_iface) [ if last hard-iface, then soft-iface is out of scope for any new batadv_update_mtu() and gets freed: ] -> batadv_softif_free() -> batadv_mesh_free() -> batadv_mcast_free()
But with yet another path it is getting even more, rediculously complicated... Just proving that trying to keep a lock-less update for mla_list here is a bad, unmaintainable approach :).
Yes, and I am currently not trusting the reference/rcu code in batman-adv anymore. It seems to have a lot of backpointers (yes, loops are very bad for reference counting and the workarounds chosen in the code for that don't look very sane either). I actually don't want to check this mess to find out if there is a situation where were there can be an (incorrectly counted) reference somewhere back to it. Let us just assume that everything is fine here... better for everyones sanity.
So I'm definitely in favor of having some spinlock to refer to for mcast.mla_list update, even if it isn't necessary for batadv_mcast_free(). But I don't see a race for the current mla_list updates either (otherwise I'd need a specific, more verbose example, I guess...).
The assert was more for having a common lock for all accesses to this list (ogm_schedule, mtu change event, hardif add/del, softif free/add). This lock already existed indirectly and protected the updates. The free functions should have the same protection (especially when changes are done to the code in the future which might call these functions in a different context). So right now the patch from Simon only works around the lockdep warning and prepares the code for the "future". So I don't think it is required for - stable - especially when you are sure that there is no race between _free and _retract.
Kind regards, Sven
From: Simon Wunderlich simon@open-mesh.com
Signed-off-by: Simon Wunderlich simon@open-mesh.com --- net/batman-adv/bat_iv_ogm.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/net/batman-adv/bat_iv_ogm.c b/net/batman-adv/bat_iv_ogm.c index ad678f0..a594496 100644 --- a/net/batman-adv/bat_iv_ogm.c +++ b/net/batman-adv/bat_iv_ogm.c @@ -394,7 +394,14 @@ static u8 batadv_hop_penalty(u8 tq, const struct batadv_priv *bat_priv) return new_tq; }
-/* is there another aggregated packet here? */ +/** + * batadv_iv_ogm_aggr_packet - checks if there is another OGM attached + * @buff_pos: current position in the skb + * @packet_len: total length of the skb + * @tvlv_len: tvlv length of the previously considered OGM + * + * Return: true if there is enough space for another OGM, false otherwise. + */ static bool batadv_iv_ogm_aggr_packet(int buff_pos, int packet_len, __be16 tvlv_len) {
On Monday, November 23, 2015 19:57:21 Simon Wunderlich wrote:
From: Simon Wunderlich simon@open-mesh.com
Signed-off-by: Simon Wunderlich simon@open-mesh.com
net/batman-adv/bat_iv_ogm.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
Applied in revision 879ffa2.
Thanks, Marek
From: Simon Wunderlich simon@open-mesh.com
To allow future use of the window protected function with different maximum sequence numbers, add a parameter to set this value which was previously hardcoded. Another parameter added for future use is a flag to return whether the protection window has started.
While at it, also fix the kerneldoc.
Signed-off-by: Simon Wunderlich simon@open-mesh.com --- net/batman-adv/bat_iv_ogm.c | 3 ++- net/batman-adv/main.h | 3 +++ net/batman-adv/routing.c | 13 ++++++++++--- net/batman-adv/routing.h | 3 ++- 4 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/net/batman-adv/bat_iv_ogm.c b/net/batman-adv/bat_iv_ogm.c index a594496..12bdffa 100644 --- a/net/batman-adv/bat_iv_ogm.c +++ b/net/batman-adv/bat_iv_ogm.c @@ -1313,7 +1313,8 @@ batadv_iv_ogm_update_seqnos(const struct ethhdr *ethhdr, /* signalize caller that the packet is to be dropped. */ if (!hlist_empty(&orig_node->neigh_list) && batadv_window_protected(bat_priv, seq_diff, - &orig_ifinfo->batman_seqno_reset)) { + BATADV_TQ_LOCAL_WINDOW_SIZE, + &orig_ifinfo->batman_seqno_reset, NULL)) { ret = BATADV_PROTECTED; goto out; } diff --git a/net/batman-adv/main.h b/net/batman-adv/main.h index 5b79814..cb53827 100644 --- a/net/batman-adv/main.h +++ b/net/batman-adv/main.h @@ -35,6 +35,9 @@ /* Time To Live of broadcast messages */ #define BATADV_TTL 50
+/* maximum sequence number age of broadcast messages */ +#define BATADV_BCAST_MAX_AGE 64 + /* purge originators after time in seconds if no valid packet comes in * -> TODO: check influence on BATADV_TQ_LOCAL_WINDOW_SIZE */ diff --git a/net/batman-adv/routing.c b/net/batman-adv/routing.c index 1fb1be3..0ba5993 100644 --- a/net/batman-adv/routing.c +++ b/net/batman-adv/routing.c @@ -146,23 +146,29 @@ out: * @bat_priv: the bat priv with all the soft interface information * @seq_num_diff: difference between the current/received sequence number and * the last sequence number + * @seq_old_max_diff: maximum age of sequence number not considered as restart * @last_reset: jiffies timestamp of the last reset, will be updated when reset * is detected + * @protection_started: is set to true if the protection window was started, + * doesn't change otherwise. * * Return: * 0 if the packet is to be accepted. * 1 if the packet is to be ignored. */ int batadv_window_protected(struct batadv_priv *bat_priv, s32 seq_num_diff, - unsigned long *last_reset) + s32 seq_old_max_diff, unsigned long *last_reset, + bool *protection_started) { - if (seq_num_diff <= -BATADV_TQ_LOCAL_WINDOW_SIZE || + if (seq_num_diff <= -seq_old_max_diff || seq_num_diff >= BATADV_EXPECTED_SEQNO_RANGE) { if (!batadv_has_timed_out(*last_reset, BATADV_RESET_PROTECTION_MS)) return 1;
*last_reset = jiffies; + if (protection_started) + *protection_started = true; batadv_dbg(BATADV_DBG_BATMAN, bat_priv, "old packet received, start protection\n"); } @@ -1073,7 +1079,8 @@ int batadv_recv_bcast_packet(struct sk_buff *skb,
/* check whether the packet is old and the host just restarted. */ if (batadv_window_protected(bat_priv, seq_diff, - &orig_node->bcast_seqno_reset)) + BATADV_BCAST_MAX_AGE, + &orig_node->bcast_seqno_reset, NULL)) goto spin_unlock;
/* mark broadcast in flood history, update window position diff --git a/net/batman-adv/routing.h b/net/batman-adv/routing.h index 204bbe4..5315fb3 100644 --- a/net/batman-adv/routing.h +++ b/net/batman-adv/routing.h @@ -52,6 +52,7 @@ batadv_find_router(struct batadv_priv *bat_priv, struct batadv_orig_node *orig_node, struct batadv_hard_iface *recv_if); int batadv_window_protected(struct batadv_priv *bat_priv, s32 seq_num_diff, - unsigned long *last_reset); + s32 seq_old_max_diff, unsigned long *last_reset, + bool *protection_started);
#endif /* _NET_BATMAN_ADV_ROUTING_H_ */
On Monday, November 23, 2015 19:57:22 Simon Wunderlich wrote:
From: Simon Wunderlich simon@open-mesh.com
To allow future use of the window protected function with different maximum sequence numbers, add a parameter to set this value which was previously hardcoded. Another parameter added for future use is a flag to return whether the protection window has started.
While at it, also fix the kerneldoc.
Signed-off-by: Simon Wunderlich simon@open-mesh.com
net/batman-adv/bat_iv_ogm.c | 3 ++- net/batman-adv/main.h | 3 +++ net/batman-adv/routing.c | 13 ++++++++++--- net/batman-adv/routing.h | 3 ++- 4 files changed, 17 insertions(+), 5 deletions(-)
Applied in revision 544f94c.
Thanks, Marek
b.a.t.m.a.n@lists.open-mesh.org