On Wed, 7 Jun 2023 17:55:15 +0200 Simon Wunderlich wrote:
The reason for these issues is the lack of synchronization. Delayed work (batadv_dat_purge) schedules new timer/work while the device is being deleted. As the result new timer/delayed work is set after cancel_delayed_work_sync() was called. So after the device is freed the timer list contains pointer to already freed memory.
I guess this is better than status quo but is the fix really complete? We're still not preventing the timer / work from getting scheduled and staying alive after the netdev has been freed, right?
-----Original Message----- From: Jakub Kicinski kuba@kernel.org Sent: Wednesday, June 7, 2023 10:01 PM To: Simon Wunderlich sw@simonwunderlich.de Cc: davem@davemloft.net; netdev@vger.kernel.org; b.a.t.m.a.n@lists.open- mesh.org; Vladislav Efanov VEfanov@ispras.ru; stable@kernel.org; Sven Eckelmann sven@narfation.org Subject: Re: [PATCH 1/1] batman-adv: Broken sync while rescheduling delayed work
On Wed, 7 Jun 2023 17:55:15 +0200 Simon Wunderlich wrote:
The reason for these issues is the lack of synchronization. Delayed work (batadv_dat_purge) schedules new timer/work while the device is being deleted. As the result new timer/delayed work is set after cancel_delayed_work_sync() was called. So after the device is freed the timer list contains pointer to already freed memory.
I guess this is better than status quo but is the fix really complete? We're still not preventing the timer / work from getting scheduled and staying alive after the netdev has been freed, right?
Yea, I would expect some synchronization mechanism to ensure that after cancel_delayed_work_sync() you can't queue the work again.
I know for timers there is recently timer_shutdown_sync() which can be used to guarantee a timer can't re-arm at all, and its intended for some situations where there is a cyclic dependency...
Thanks, Jake
As far as I found the synchronization is provided by delayed work subsystem. It is based on the WORK_STRUCT_PENDING_BIT in work->data field.
The cancel_delayed_work_sync() atomically sets this bit and queue_delayed_work() checks it before scheduling new delayed work.
The problem is caused by the INIT_DELAYED_WORK() call inside batadv_dat_start_timer(). This call happens before the queue_delayed_work() call and clears this bit.
Best regards,
Vlad
On 08.06.2023 08:24, Keller, Jacob E wrote:
-----Original Message----- From: Jakub Kicinski kuba@kernel.org Sent: Wednesday, June 7, 2023 10:01 PM To: Simon Wunderlich sw@simonwunderlich.de Cc: davem@davemloft.net; netdev@vger.kernel.org; b.a.t.m.a.n@lists.open- mesh.org; Vladislav Efanov VEfanov@ispras.ru; stable@kernel.org; Sven Eckelmann sven@narfation.org Subject: Re: [PATCH 1/1] batman-adv: Broken sync while rescheduling delayed work
On Wed, 7 Jun 2023 17:55:15 +0200 Simon Wunderlich wrote:
The reason for these issues is the lack of synchronization. Delayed work (batadv_dat_purge) schedules new timer/work while the device is being deleted. As the result new timer/delayed work is set after cancel_delayed_work_sync() was called. So after the device is freed the timer list contains pointer to already freed memory.
I guess this is better than status quo but is the fix really complete? We're still not preventing the timer / work from getting scheduled and staying alive after the netdev has been freed, right?
Yea, I would expect some synchronization mechanism to ensure that after cancel_delayed_work_sync() you can't queue the work again.
I know for timers there is recently timer_shutdown_sync() which can be used to guarantee a timer can't re-arm at all, and its intended for some situations where there is a cyclic dependency...
Thanks, Jake
On Wed, 2023-06-07 at 22:01 -0700, Jakub Kicinski wrote:
On Wed, 7 Jun 2023 17:55:15 +0200 Simon Wunderlich wrote:
The reason for these issues is the lack of synchronization. Delayed work (batadv_dat_purge) schedules new timer/work while the device is being deleted. As the result new timer/delayed work is set after cancel_delayed_work_sync() was called. So after the device is freed the timer list contains pointer to already freed memory.
I guess this is better than status quo but is the fix really complete? We're still not preventing the timer / work from getting scheduled and staying alive after the netdev has been freed, right?
I *think* this specific use case does not expose such problem, as the delayed work is (AFAICS) scheduled only at device creation time and by the work itself, it should never be re-scheduled after cancel_delayed_work_sync()
Cheers,
Paolo
On Thursday, 8 June 2023 11:27:31 CEST Paolo Abeni wrote: [...]
We're still not preventing the timer / work from getting scheduled and staying alive after the netdev has been freed, right?
I *think* this specific use case does not expose such problem, as the delayed work is (AFAICS) scheduled only at device creation time and by the work itself, it should never be re-scheduled after cancel_delayed_work_sync()
Correct.
* batadv_dat_start_timer is the only thing scheduling it * batadv_dat_start_timer is called by:
- batadv_dat_purge (the worker rearming itself) - batadv_dat_init (when the interface is created)
Kind regards, Sven
b.a.t.m.a.n@lists.open-mesh.org