Hi Andrew,
On Fri, Jan 10, 2025 at 01:02:19AM -0800, Andrew Strohman wrote:
I would prefer when you would call cancel_work_sync when metric stuff should be stopped. I was expecting to see this somewhere around batadv_v_elp_iface_disable after the cancel_work_sync but it seems like it is missing there (or in a similar place)
I tried this:
diff --git a/net/batman-adv/bat_v_elp.c b/net/batman-adv/bat_v_elp.c index 1d704574..b35ded79 100644 --- a/net/batman-adv/bat_v_elp.c +++ b/net/batman-adv/bat_v_elp.c @@ -387,8 +387,20 @@ int batadv_v_elp_iface_enable(struct batadv_hard_iface *hard_iface) */ void batadv_v_elp_iface_disable(struct batadv_hard_iface *hard_iface) {
struct batadv_hardif_neigh_node *hardif_neigh;
cancel_delayed_work_sync(&hard_iface->bat_v.elp_wq);
rcu_read_lock();
hlist_for_each_entry_rcu(hardif_neigh,
&hard_iface->neigh_list, list) {
if (!kref_get_unless_zero(&hardif_neigh->refcount))
continue;
cancel_work_sync(&hardif_neigh->bat_v.metric_work);
batadv_hardif_neigh_put(hardif_neigh);
}
rcu_read_unlock();
dev_kfree_skb(hard_iface->bat_v.elp_skb); hard_iface->bat_v.elp_skb = NULL;
}
But it seems to cause a hang on reboot every once in a while. When the hang happens, I'm not able to trigger sysrq over serial.
Quickly looking at that I think that metric_work may need to sleep so calling cancel_work_sync() on this work does not seem safe while in rcu protected context.
I would try to frame the cancel_work_sync() call with rcu_read_unlock()/rcu_read_lock() as below:
rcu_read_unlock(); cancel_work_sync(...); rcu_read_lock(); batadv_hardif_neigh_put(...);
But be careful as batadv_hardif_neigh_put() could modify the list you are currently traversing. At first glance it seems safe to realease rcu read constraint to call cancel_work_sync() as long as you take it back before calling batadv_hardif_neigh_put(), but this could need more though on that.