Re: [PATCH] batman-adv: fix panic during interface removal

13 Jan 2025


      Hi Remi,
Thanks for your thoughts.
...
Quickly looking at that I think that metric_work may need to sleep so
calling cancel_work_sync() on this work does not seem safe while in rcu
protected context.
Thanks, you're right. I'm not supposed to call cancel_work_sync() unless
it's OK to sleep.
...
I would try to frame the cancel_work_sync() call with
rcu_read_unlock()/rcu_read_lock() as below:
rcu_read_unlock();
  cancel_work_sync(...);
  rcu_read_lock();
  batadv_hardif_neigh_put(...);
Despite the necessity to not sleep while in the RCU read side
critical section, this was not the cause of the hang.
I enabled hung task detection and was able see the call
stacks of all the tasks during the hang, which made the
issue clear.  batadv_v_elp_iface_disable() is called with
the RTNL lock held (via setlink). Calling cancel_work_sync()
will wait on batadv_v_elp_throughput_metric_update()
to complete, but batadv_v_elp_throughput_metric_update()
calls batadv_v_elp_get_throughput(), which acquires
the RTNL lock. So this is why I have deadlock: the work
cannot move forward until the RTNL is released, but the
caller of cancel_work_sync() holds the lock and won't release
it until the work finishes.
Sven, does this change your opinion about how this
issue should be handled? I'm not sure yet if it's safe
to temporarily release RTNL lock during the
duration of the cancel_work_sync() call.
...
But be careful as batadv_hardif_neigh_put() could modify the list you
are currently traversing.
Yes, batadv_hardif_neigh_put() can cause the reference count to go to zero.
If so, batadv_hardif_neigh_release() will be called, but it frees the
memory allocated for the neighbor with kfree_rcu(). Since the suggestion
was to acquire the RCU read lock before calling batadv_hardif_neigh_put,
I don't anticipate a use-after-free outcome.
...
At first glance it seems safe to realease
rcu read constraint to call cancel_work_sync() as long as you take it
back before calling batadv_hardif_neigh_put(), but this could need more
though on that.
Do you happen to know where I can find an example of this usage
pattern?
I wonder if unlocking and then re-locking will invalidate this [1]
guarantee:
"The reader is guaranteed to see all of the elements which were
added to the list before they acquired the rcu_read_lock() and are
still on the list when they drop the rcu_read_unlock()."
From testing, I don't see any issue when trying your suggestion,
but it would be reassuring to see an example of this elsewhere
in the code.
[1] https://docs.kernel.org/RCU/listRCU.html
Thanks,
Andy

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

Re: [PATCH] batman-adv: fix panic during interface removal