KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult

List overview All Threads
Download

newer

older

[PATCH 1/2] batman-adv: Introduce...

[PATCH maint 1/2] batman-adv:...

syzbot

7 Oct 2019 7 Oct '19

10:01 a.m.

Hello,

syzbot found the following crash on:

HEAD commit: b4bd9343 x86, kcsan: Enable KCSAN for x86 git tree: https://github.com/google/ktsan.git kcsan console output: https://syzkaller.appspot.com/x/log.txt?x=11edb20d600000 kernel config: https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80 dashboard link: https://syzkaller.appspot.com/bug?extid=134336b86f728d6e55a0 compiler: gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com

================================================================== BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult

write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0: rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244 rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254 rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475 rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659 __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394 rcu_read_unlock include/linux/rcupdate.h:645 [inline] batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline] batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1: _find_next_bit lib/find_bit.c:39 [inline] find_next_bit+0x57/0xe0 lib/find_bit.c:70 sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375 sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439 rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline] wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 7251 Comm: kworker/1:4 Not tainted 5.3.0+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: rcu_gp wait_rcu_exp_gp ==================================================================

--- This bug is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

Show replies by date

Marco Elver

7 Oct 7 Oct

10:04 a.m.

+RCU maintainers This might be a data-race in RCU itself.

On Mon, 7 Oct 2019 at 12:01, syzbot syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com wrote:

...

Hello,

syzbot found the following crash on:

HEAD commit: b4bd9343 x86, kcsan: Enable KCSAN for x86 git tree: https://github.com/google/ktsan.git kcsan console output: https://syzkaller.appspot.com/x/log.txt?x=11edb20d600000 kernel config: https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80 dashboard link: https://syzkaller.appspot.com/bug?extid=134336b86f728d6e55a0 compiler: gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com

================================================================== BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult

write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0: rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244 rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254 rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475 rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659 __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394 rcu_read_unlock include/linux/rcupdate.h:645 [inline] batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline] batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1: _find_next_bit lib/find_bit.c:39 [inline] find_next_bit+0x57/0xe0 lib/find_bit.c:70 sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375 sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439 rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline] wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 7251 Comm: kworker/1:4 Not tainted 5.3.0+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: rcu_gp wait_rcu_exp_gp ==================================================================

This bug is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

Paul E. McKenney

1:22 p.m.

On Mon, Oct 07, 2019 at 12:04:16PM +0200, Marco Elver wrote:

...

+RCU maintainers This might be a data-race in RCU itself.

Quite possibly. I will take a look, but there will be delays due to this week being bootcamp and all.

Thanx, Paul

...

On Mon, 7 Oct 2019 at 12:01, syzbot syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com wrote:

...
Hello,

syzbot found the following crash on:

HEAD commit: b4bd9343 x86, kcsan: Enable KCSAN for x86 git tree: https://github.com/google/ktsan.git kcsan console output: https://syzkaller.appspot.com/x/log.txt?x=11edb20d600000 kernel config: https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80 dashboard link: https://syzkaller.appspot.com/bug?extid=134336b86f728d6e55a0 compiler: gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com

================================================================== BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult

write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0: rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244 rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254 rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475 rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659 __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394 rcu_read_unlock include/linux/rcupdate.h:645 [inline] batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline] batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1: _find_next_bit lib/find_bit.c:39 [inline] find_next_bit+0x57/0xe0 lib/find_bit.c:70 sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375 sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439 rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline] wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 7251 Comm: kworker/1:4 Not tainted 5.3.0+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: rcu_gp wait_rcu_exp_gp ==================================================================

This bug is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

Steven Rostedt

1:34 p.m.

On Mon, 7 Oct 2019 12:04:16 +0200 Marco Elver elver@google.com wrote:

...

+RCU maintainers This might be a data-race in RCU itself.

...
write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0: rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244

Here we have:

raw_spin_lock_irqsave_rcu_node(rnp, flags); if (!(rnp->expmask & mask)) { raw_spin_unlock_irqrestore_rcu_node(rnp, flags); return; } rnp->expmask &= ~mask; __rcu_report_exp_rnp(rnp, wake, flags); /* Releases rnp->lock. */

...

...
read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1: _find_next_bit lib/find_bit.c:39 [inline] find_next_bit+0x57/0xe0 lib/find_bit.c:70 sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375

and here we have:

raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

/* IPI the remaining CPUs for expedited quiescent state. */ for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {

The write to rnp->expmask is done under the rnp->lock, but on the read side, that lock is released before the for loop. Should we have something like:

unsigned long expmask; [...]

expmask = rnp->expmask; raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

/* IPI the remaining CPUs for expedited quiescent state. */ for_each_leaf_node_cpu_mask(rnp, cpu, expmask) {

-- Steve

Boqun Feng

1:43 p.m.

Hi Marco,

On Mon, Oct 07, 2019 at 12:04:16PM +0200, Marco Elver wrote:

...

+RCU maintainers This might be a data-race in RCU itself.

On Mon, 7 Oct 2019 at 12:01, syzbot syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com wrote:

...
Hello,

syzbot found the following crash on:

HEAD commit: b4bd9343 x86, kcsan: Enable KCSAN for x86 git tree: https://github.com/google/ktsan.git kcsan console output: https://syzkaller.appspot.com/x/log.txt?x=11edb20d600000 kernel config: https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80 dashboard link: https://syzkaller.appspot.com/bug?extid=134336b86f728d6e55a0 compiler: gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com

================================================================== BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult

write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0: rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244 rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254 rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475 rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659 __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394 rcu_read_unlock include/linux/rcupdate.h:645 [inline] batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline] batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1: _find_next_bit lib/find_bit.c:39 [inline] find_next_bit+0x57/0xe0 lib/find_bit.c:70 sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375

This is the second for_each_leaf_node_cpu_mask() loop in sync_rcu_exp_select_node_cpus(), the first loop is for collecting which CPU blocks current grace period (IOW, which CPU need to be sent an IPI to), and the second loop does the real work of sending IPI. The first loop is protected by proper lock (rcu node lock), so there is no race there. But the second one can't hold rcu node lock, because the IPI handler (rcu_exp_handler) needs to acquire the same lock, so rcu node lock has to be dropped before the second loop to avoid deadlock.

Now for the racy find_next_bit() on rnp->expmask:

1) if an extra bit appears: it's OK since there is checking on whether the bit exists in mask_ofl_ipi (the result of the first loop).

2) if a bit is missing: it will be problematic, because the second loop will skip the CPU, and the rest of the code will treat the CPU as offline but hasn't reported a quesient state, and the rcu_report_exp_cpu_mult() will report the qs for it, even though the CPU may currenlty run inside a RCU read-side critical section.

Note both "appears" and "missing" means some intermediate state of a plain unset for expmask contributed by compiler magic.

Please see below for a compile-test-only patch:

...

...
sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439 rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline] wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 7251 Comm: kworker/1:4 Not tainted 5.3.0+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: rcu_gp wait_rcu_exp_gp ==================================================================

Regards, Boqun

------------------->8 Subject: [PATCH] rcu: exp: Avoid race on lockless rcu_node::expmask loop

KCSAN reported an issue:

| BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult | | write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0: | rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244 | rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254 | rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475 | rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659 | __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394 | rcu_read_unlock include/linux/rcupdate.h:645 [inline] | batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline] | batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718 | process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 | worker_thread+0xa0/0x800 kernel/workqueue.c:2415 | kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 | ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 | | read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1: | _find_next_bit lib/find_bit.c:39 [inline] | find_next_bit+0x57/0xe0 lib/find_bit.c:70 | sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375 | sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439 | rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline] | wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589 | process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 | worker_thread+0xa0/0x800 kernel/workqueue.c:2415 | kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 | ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

The root cause of this is the second for_each_leaf_node_cpu_mask() loop in sync_rcu_exp_select_node_cpus() accesses the rcu_node::expmask without holding rcu_node's lock. This is by design, because the second loop may issue IPIs to other CPUs, and the IPI handler (rcu_exp_handler) may acquire the same rcu_node's lock. So the rcu_node's lock has to be dropped before the second loop.

The problem will occur when the normal unsetting of rcu_node::expmask results into some intermediate state (because it's a plain access), where an extra bit gets zeroed. The second loop will skip the corrensponding CPU, but treat it as offline and in quesient state. This will cause trouble because that CPU may be in a RCU read-side critical section.

To fix this, take a snapshot of mask_ofl_ipi, and make the second loop iterate on the snapshot's bits, as a result, the find_next_bit() of the second loop doesn't access any variables that may get changed in parallel, so the race is avoided.

Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com Signed-off-by: Boqun Feng boqun.feng@gmail.com --- kernel/rcu/tree_exp.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index af7e7b9c86af..7f3e19d0275e 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -335,6 +335,7 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) unsigned long flags; unsigned long mask_ofl_test; unsigned long mask_ofl_ipi; + unsigned long mask_ofl_ipi_snap; int ret; struct rcu_exp_work *rewp = container_of(wp, struct rcu_exp_work, rew_work); @@ -371,13 +372,12 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) rnp->exp_tasks = rnp->blkd_tasks.next; raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

+ mask_ofl_ipi_snap = mask_ofl_ipi; /* IPI the remaining CPUs for expedited quiescent state. */ - for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) { + for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi_snap) { unsigned long mask = leaf_node_cpu_bit(rnp, cpu); struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

- if (!(mask_ofl_ipi & mask)) - continue; retry_ipi: if (rcu_dynticks_in_eqs_since(rdp, rdp->exp_dynticks_snap)) { mask_ofl_test |= mask;

-- 2.23.0

Joel Fernandes

8 Oct 8 Oct

12:11 a.m.

On Mon, Oct 07, 2019 at 09:43:04PM +0800, Boqun Feng wrote:

...

Hi Marco,

Hi Boqun, Steve and Paul, fun times!

Marco, good catch ;-)

Some comments below:

...

On Mon, Oct 07, 2019 at 12:04:16PM +0200, Marco Elver wrote:

...
+RCU maintainers This might be a data-race in RCU itself.

On Mon, 7 Oct 2019 at 12:01, syzbot syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com wrote:

...
Hello,

syzbot found the following crash on:

HEAD commit: b4bd9343 x86, kcsan: Enable KCSAN for x86 git tree: https://github.com/google/ktsan.git kcsan console output: https://syzkaller.appspot.com/x/log.txt?x=11edb20d600000 kernel config: https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80 dashboard link: https://syzkaller.appspot.com/bug?extid=134336b86f728d6e55a0 compiler: gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com

================================================================== BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult

write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0: rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244 rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254 rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475 rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659 __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394 rcu_read_unlock include/linux/rcupdate.h:645 [inline] batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline] batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1: _find_next_bit lib/find_bit.c:39 [inline] find_next_bit+0x57/0xe0 lib/find_bit.c:70 sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375

This is the second for_each_leaf_node_cpu_mask() loop in sync_rcu_exp_select_node_cpus(), the first loop is for collecting which CPU blocks current grace period (IOW, which CPU need to be sent an IPI to), and the second loop does the real work of sending IPI. The first loop is protected by proper lock (rcu node lock), so there is no race there. But the second one can't hold rcu node lock, because the IPI handler (rcu_exp_handler) needs to acquire the same lock, so rcu node lock has to be dropped before the second loop to avoid deadlock.

Now for the racy find_next_bit() on rnp->expmask:

if an extra bit appears: it's OK since there is checking on whether

the bit exists in mask_ofl_ipi (the result of the first loop).

if a bit is missing: it will be problematic, because the second loop

will skip the CPU, and the rest of the code will treat the CPU as offline but hasn't reported a quesient state, and the rcu_report_exp_cpu_mult() will report the qs for it, even though the CPU may currenlty run inside a RCU read-side critical section.

Note both "appears" and "missing" means some intermediate state of a plain unset for expmask contributed by compiler magic.

Please see below for a compile-test-only patch:

...
...
sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439 rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline] wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 7251 Comm: kworker/1:4 Not tainted 5.3.0+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: rcu_gp wait_rcu_exp_gp ==================================================================

Regards, Boqun

------------------->8 Subject: [PATCH] rcu: exp: Avoid race on lockless rcu_node::expmask loop

KCSAN reported an issue:

| BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult | | write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0: | rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244 | rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254 | rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475 | rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659 | __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394 | rcu_read_unlock include/linux/rcupdate.h:645 [inline] | batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline] | batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718 | process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 | worker_thread+0xa0/0x800 kernel/workqueue.c:2415 | kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 | ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 | | read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1: | _find_next_bit lib/find_bit.c:39 [inline] | find_next_bit+0x57/0xe0 lib/find_bit.c:70 | sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375 | sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439 | rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline] | wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589 | process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 | worker_thread+0xa0/0x800 kernel/workqueue.c:2415 | kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 | ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

The root cause of this is the second for_each_leaf_node_cpu_mask() loop in sync_rcu_exp_select_node_cpus() accesses the rcu_node::expmask without holding rcu_node's lock. This is by design, because the second loop may issue IPIs to other CPUs, and the IPI handler (rcu_exp_handler) may acquire the same rcu_node's lock. So the rcu_node's lock has to be dropped before the second loop.

The problem will occur when the normal unsetting of rcu_node::expmask results into some intermediate state (because it's a plain access), where an extra bit gets zeroed. The second loop will skip the corrensponding CPU, but treat it as offline and in quesient state. This will cause trouble because that CPU may be in a RCU read-side critical section.

To fix this, take a snapshot of mask_ofl_ipi, and make the second loop iterate on the snapshot's bits, as a result, the find_next_bit() of the second loop doesn't access any variables that may get changed in parallel, so the race is avoided.

Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com Signed-off-by: Boqun Feng boqun.feng@gmail.com

kernel/rcu/tree_exp.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index af7e7b9c86af..7f3e19d0275e 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -335,6 +335,7 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) unsigned long flags; unsigned long mask_ofl_test; unsigned long mask_ofl_ipi;

unsigned long mask_ofl_ipi_snap; int ret; struct rcu_exp_work *rewp = container_of(wp, struct rcu_exp_work, rew_work);

@@ -371,13 +372,12 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) rnp->exp_tasks = rnp->blkd_tasks.next; raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

mask_ofl_ipi_snap = mask_ofl_ipi; /* IPI the remaining CPUs for expedited quiescent state. */

for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {

for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi_snap) {

This looks good to me. Just a nit, I prefer if the comment to IPI the remaining CPUs is before the assignment to mask_ofl_ipi_snap since the new assignment is done for consumption by the for_each..(..) loop itself.

Steve's patch looks good as well and I was thinking along the same lines but Boqun's patch is slightly better because he doesn't need to snapshot exp_mask inside the locked section.

Reviewed-by: Joel Fernandes (Google) joel@joelfernandes.org

thanks,

- Joel

...

unsigned long mask = leaf_node_cpu_bit(rnp, cpu);
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
if (!(mask_ofl_ipi & mask))
	continue;
retry_ipi: if (rcu_dynticks_in_eqs_since(rdp, rdp->exp_dynticks_snap)) { mask_ofl_test |= mask; -- 2.23.0

Paul E. McKenney

2:12 a.m.

On Mon, Oct 07, 2019 at 08:11:31PM -0400, Joel Fernandes wrote:

...

On Mon, Oct 07, 2019 at 09:43:04PM +0800, Boqun Feng wrote:

...
Hi Marco,

Hi Boqun, Steve and Paul, fun times!

Marco, good catch ;-)

Indeed! ;-)

...

Some comments below:

...
On Mon, Oct 07, 2019 at 12:04:16PM +0200, Marco Elver wrote:

...
+RCU maintainers This might be a data-race in RCU itself.

On Mon, 7 Oct 2019 at 12:01, syzbot syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com wrote:

...
Hello,

syzbot found the following crash on:

HEAD commit: b4bd9343 x86, kcsan: Enable KCSAN for x86 git tree: https://github.com/google/ktsan.git kcsan console output: https://syzkaller.appspot.com/x/log.txt?x=11edb20d600000 kernel config: https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80 dashboard link: https://syzkaller.appspot.com/bug?extid=134336b86f728d6e55a0 compiler: gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com

================================================================== BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult

write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0: rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244 rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254 rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475 rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659 __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394 rcu_read_unlock include/linux/rcupdate.h:645 [inline] batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline] batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1: _find_next_bit lib/find_bit.c:39 [inline] find_next_bit+0x57/0xe0 lib/find_bit.c:70 sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375

This is the second for_each_leaf_node_cpu_mask() loop in sync_rcu_exp_select_node_cpus(), the first loop is for collecting which CPU blocks current grace period (IOW, which CPU need to be sent an IPI to), and the second loop does the real work of sending IPI. The first loop is protected by proper lock (rcu node lock), so there is no race there. But the second one can't hold rcu node lock, because the IPI handler (rcu_exp_handler) needs to acquire the same lock, so rcu node lock has to be dropped before the second loop to avoid deadlock.

Now for the racy find_next_bit() on rnp->expmask:

if an extra bit appears: it's OK since there is checking on whether

the bit exists in mask_ofl_ipi (the result of the first loop).

if a bit is missing: it will be problematic, because the second loop

will skip the CPU, and the rest of the code will treat the CPU as offline but hasn't reported a quesient state, and the rcu_report_exp_cpu_mult() will report the qs for it, even though the CPU may currenlty run inside a RCU read-side critical section.

Note both "appears" and "missing" means some intermediate state of a plain unset for expmask contributed by compiler magic.

Please see below for a compile-test-only patch:

...
...
sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439 rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline] wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 7251 Comm: kworker/1:4 Not tainted 5.3.0+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: rcu_gp wait_rcu_exp_gp ==================================================================

Regards, Boqun

------------------->8 Subject: [PATCH] rcu: exp: Avoid race on lockless rcu_node::expmask loop

KCSAN reported an issue:

| BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult | | write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0: | rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244 | rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254 | rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475 | rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659 | __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394 | rcu_read_unlock include/linux/rcupdate.h:645 [inline] | batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline] | batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718 | process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 | worker_thread+0xa0/0x800 kernel/workqueue.c:2415 | kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 | ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 | | read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1: | _find_next_bit lib/find_bit.c:39 [inline] | find_next_bit+0x57/0xe0 lib/find_bit.c:70 | sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375 | sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439 | rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline] | wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589 | process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 | worker_thread+0xa0/0x800 kernel/workqueue.c:2415 | kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 | ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

The root cause of this is the second for_each_leaf_node_cpu_mask() loop in sync_rcu_exp_select_node_cpus() accesses the rcu_node::expmask without holding rcu_node's lock. This is by design, because the second loop may issue IPIs to other CPUs, and the IPI handler (rcu_exp_handler) may acquire the same rcu_node's lock. So the rcu_node's lock has to be dropped before the second loop.

The problem will occur when the normal unsetting of rcu_node::expmask results into some intermediate state (because it's a plain access), where an extra bit gets zeroed. The second loop will skip the corrensponding CPU, but treat it as offline and in quesient state. This will cause trouble because that CPU may be in a RCU read-side critical section.

To fix this, take a snapshot of mask_ofl_ipi, and make the second loop iterate on the snapshot's bits, as a result, the find_next_bit() of the second loop doesn't access any variables that may get changed in parallel, so the race is avoided.

Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com Signed-off-by: Boqun Feng boqun.feng@gmail.com

kernel/rcu/tree_exp.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index af7e7b9c86af..7f3e19d0275e 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -335,6 +335,7 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) unsigned long flags; unsigned long mask_ofl_test; unsigned long mask_ofl_ipi;

unsigned long mask_ofl_ipi_snap; int ret; struct rcu_exp_work *rewp = container_of(wp, struct rcu_exp_work, rew_work);

@@ -371,13 +372,12 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) rnp->exp_tasks = rnp->blkd_tasks.next; raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

mask_ofl_ipi_snap = mask_ofl_ipi; /* IPI the remaining CPUs for expedited quiescent state. */

for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {

for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi_snap) {

Why can't we just use mask_ofl_ipi? The bits removed are only those bits just now looked at, right? Also, the test of mask_ofl_ipi can be dropped, since that branch will never be taken, correct?

...

This looks good to me. Just a nit, I prefer if the comment to IPI the remaining CPUs is before the assignment to mask_ofl_ipi_snap since the new assignment is done for consumption by the for_each..(..) loop itself.

Steve's patch looks good as well and I was thinking along the same lines but Boqun's patch is slightly better because he doesn't need to snapshot exp_mask inside the locked section.

There are also similar lockless accesses to ->expmask in the stall-warning code.

...

Reviewed-by: Joel Fernandes (Google) joel@joelfernandes.org

But thank all three of you for looking this over! My original patch was overly ornate. ;-)

Thanx, Paul

...

thanks,

Joel

...
unsigned long mask = leaf_node_cpu_bit(rnp, cpu);
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
if (!(mask_ofl_ipi & mask))
	continue;
retry_ipi: if (rcu_dynticks_in_eqs_since(rdp, rdp->exp_dynticks_snap)) { mask_ofl_test |= mask; -- 2.23.0

Boqun Feng

2:50 a.m.

On Mon, Oct 07, 2019 at 07:12:33PM -0700, Paul E. McKenney wrote:

...

On Mon, Oct 07, 2019 at 08:11:31PM -0400, Joel Fernandes wrote:

...
On Mon, Oct 07, 2019 at 09:43:04PM +0800, Boqun Feng wrote:

...
Hi Marco,

Hi Boqun, Steve and Paul, fun times!

Marco, good catch ;-)

Indeed! ;-)

[...]

...

...
...

mask_ofl_ipi_snap = mask_ofl_ipi; /* IPI the remaining CPUs for expedited quiescent state. */

for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {

for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi_snap) {

Why can't we just use mask_ofl_ipi? The bits removed are only those bits just now looked at, right? Also, the test of mask_ofl_ipi can be dropped, since that branch will never be taken, correct?

You're correct. But I think we can further simplify the code a little bit so that we won't need to modify the mask_ofl_ipi:

In the second loop:

1) if the target CPU is online and response the IPI we do nothing.

2) if the target CPU is offline but it doesn't block current GP, we do nothing.

3) otherwise, the target CPU is offline and blocks current GP, we add the corresponding bit in mask_ofl_test.

Thoughts?

Please see the end of email for a patch.

...

...
This looks good to me. Just a nit, I prefer if the comment to IPI the remaining CPUs is before the assignment to mask_ofl_ipi_snap since the new assignment is done for consumption by the for_each..(..) loop itself.

Steve's patch looks good as well and I was thinking along the same lines but Boqun's patch is slightly better because he doesn't need to snapshot exp_mask inside the locked section.

There are also similar lockless accesses to ->expmask in the stall-warning code.

...
Reviewed-by: Joel Fernandes (Google) joel@joelfernandes.org

But thank all three of you for looking this over! My original patch was overly ornate. ;-)
					Thanx, Paul

-------------------->8 Subject: [PATCH v2] rcu: exp: Avoid race on lockless rcu_node::expmask loop

KCSAN reported an issue:

To fix this, make the second loop iterate on mask_ofl_ipi, as a result, the find_next_bit() of the second loop doesn't access any variables that may get changed in parallel, so the race is avoided. While we are at it, remove the unset of mask_ofl_ipi to improve the readiblity, because we can always use mask_ofl_test to record which CPU's QS should be reported.

Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com Signed-off-by: Boqun Feng boqun.feng@gmail.com --- kernel/rcu/tree_exp.h | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index af7e7b9c86af..fb51752ac9a6 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -372,12 +372,10 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

/* IPI the remaining CPUs for expedited quiescent state. */ - for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) { + for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi) { unsigned long mask = leaf_node_cpu_bit(rnp, cpu); struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

- if (!(mask_ofl_ipi & mask)) - continue; retry_ipi: if (rcu_dynticks_in_eqs_since(rdp, rdp->exp_dynticks_snap)) { mask_ofl_test |= mask; @@ -389,10 +387,10 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) } ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0); put_cpu(); - if (!ret) { - mask_ofl_ipi &= ~mask; + /* the CPU responses the IPI, and it will report QS itself */ + if (!ret) continue; - } + /* Failed, raced with CPU hotplug operation. */ raw_spin_lock_irqsave_rcu_node(rnp, flags); if ((rnp->qsmaskinitnext & mask) && @@ -403,13 +401,12 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) schedule_timeout_uninterruptible(1); goto retry_ipi; } - /* CPU really is offline, so we can ignore it. */ - if (!(rnp->expmask & mask)) - mask_ofl_ipi &= ~mask; + /* CPU really is offline, and we need its QS. */ + if (rnp->expmask & mask) + mask_ofl_test |= mask; raw_spin_unlock_irqrestore_rcu_node(rnp, flags); } /* Report quiescent states for those that went offline. */ - mask_ofl_test |= mask_ofl_ipi; if (mask_ofl_test) rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false); }

-- 2.23.0

Paul E. McKenney

3:33 a.m.

On Tue, Oct 08, 2019 at 10:50:56AM +0800, Boqun Feng wrote:

...

On Mon, Oct 07, 2019 at 07:12:33PM -0700, Paul E. McKenney wrote:

...
On Mon, Oct 07, 2019 at 08:11:31PM -0400, Joel Fernandes wrote:

...
On Mon, Oct 07, 2019 at 09:43:04PM +0800, Boqun Feng wrote:

...
Hi Marco,

Hi Boqun, Steve and Paul, fun times!

Marco, good catch ;-)

Indeed! ;-)

[...]

...
...
...

mask_ofl_ipi_snap = mask_ofl_ipi; /* IPI the remaining CPUs for expedited quiescent state. */

for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {

for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi_snap) {

Why can't we just use mask_ofl_ipi? The bits removed are only those bits just now looked at, right? Also, the test of mask_ofl_ipi can be dropped, since that branch will never be taken, correct?

You're correct. But I think we can further simplify the code a little bit so that we won't need to modify the mask_ofl_ipi:

In the second loop:

if the target CPU is online and response the IPI we do nothing.

if the target CPU is offline but it doesn't block current GP, we do

nothing.

otherwise, the target CPU is offline and blocks current GP, we add

the corresponding bit in mask_ofl_test.

Thoughts?

Please see the end of email for a patch.

...
...
This looks good to me. Just a nit, I prefer if the comment to IPI the remaining CPUs is before the assignment to mask_ofl_ipi_snap since the new assignment is done for consumption by the for_each..(..) loop itself.

Steve's patch looks good as well and I was thinking along the same lines but Boqun's patch is slightly better because he doesn't need to snapshot exp_mask inside the locked section.

There are also similar lockless accesses to ->expmask in the stall-warning code.

...
Reviewed-by: Joel Fernandes (Google) joel@joelfernandes.org

But thank all three of you for looking this over! My original patch was overly ornate. ;-)
					Thanx, Paul
-------------------->8 Subject: [PATCH v2] rcu: exp: Avoid race on lockless rcu_node::expmask loop

KCSAN reported an issue:

| BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult | | write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0: | rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244 | rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254 | rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475 | rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659 | __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394 | rcu_read_unlock include/linux/rcupdate.h:645 [inline] | batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline] | batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718 | process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 | worker_thread+0xa0/0x800 kernel/workqueue.c:2415 | kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 | ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 | | read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1: | _find_next_bit lib/find_bit.c:39 [inline] | find_next_bit+0x57/0xe0 lib/find_bit.c:70 | sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375 | sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439 | rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline] | wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589 | process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 | worker_thread+0xa0/0x800 kernel/workqueue.c:2415 | kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 | ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

The root cause of this is the second for_each_leaf_node_cpu_mask() loop in sync_rcu_exp_select_node_cpus() accesses the rcu_node::expmask without holding rcu_node's lock. This is by design, because the second loop may issue IPIs to other CPUs, and the IPI handler (rcu_exp_handler) may acquire the same rcu_node's lock. So the rcu_node's lock has to be dropped before the second loop.

The problem will occur when the normal unsetting of rcu_node::expmask results into some intermediate state (because it's a plain access), where an extra bit gets zeroed. The second loop will skip the corrensponding CPU, but treat it as offline and in quesient state. This will cause trouble because that CPU may be in a RCU read-side critical section.

To fix this, make the second loop iterate on mask_ofl_ipi, as a result, the find_next_bit() of the second loop doesn't access any variables that may get changed in parallel, so the race is avoided. While we are at it, remove the unset of mask_ofl_ipi to improve the readiblity, because we can always use mask_ofl_test to record which CPU's QS should be reported.

Good point on the second loop!

...

Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com Signed-off-by: Boqun Feng boqun.feng@gmail.com

kernel/rcu/tree_exp.h | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index af7e7b9c86af..fb51752ac9a6 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -372,12 +372,10 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

/* IPI the remaining CPUs for expedited quiescent state. */

for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {

for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi) { unsigned long mask = leaf_node_cpu_bit(rnp, cpu); struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
if (!(mask_ofl_ipi & mask))
	continue;
retry_ipi: if (rcu_dynticks_in_eqs_since(rdp, rdp->exp_dynticks_snap)) { mask_ofl_test |= mask;

This part I have already on -rcu branch "dev".

...

@@ -389,10 +387,10 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) } ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0); put_cpu();
if (!ret) {
	mask_ofl_ipi &= ~mask;
/* the CPU responses the IPI, and it will report QS itself */
if (!ret)
continue;
}
/* Failed, raced with CPU hotplug operation. */ raw_spin_lock_irqsave_rcu_node(rnp, flags); if ((rnp->qsmaskinitnext & mask) &&

@@ -403,13 +401,12 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) schedule_timeout_uninterruptible(1); goto retry_ipi; }
/* CPU really is offline, so we can ignore it. */
if (!(rnp->expmask & mask))
	mask_ofl_ipi &= ~mask;
/* CPU really is offline, and we need its QS. */
if (rnp->expmask & mask)
	mask_ofl_test |= mask;
raw_spin_unlock_irqrestore_rcu_node(rnp, flags); } /* Report quiescent states for those that went offline. */
mask_ofl_test |= mask_ofl_ipi; if (mask_ofl_test) rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);

}

Would you be willing to port this optimization on top of current -rcu branch "dev" with an suitably modified commit message?

Thanx, Paul

Boqun Feng

3:58 a.m.

On Mon, Oct 07, 2019 at 08:33:53PM -0700, Paul E. McKenney wrote: [...]

...

...
kernel/rcu/tree_exp.h | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index af7e7b9c86af..fb51752ac9a6 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -372,12 +372,10 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

/* IPI the remaining CPUs for expedited quiescent state. */

for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {

for_each_leaf_node_cpu_mask(rnp, cpu, mask_ofl_ipi) { unsigned long mask = leaf_node_cpu_bit(rnp, cpu); struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
if (!(mask_ofl_ipi & mask))
	continue;
retry_ipi: if (rcu_dynticks_in_eqs_since(rdp, rdp->exp_dynticks_snap)) { mask_ofl_test |= mask;
This part I have already on -rcu branch "dev".

...
@@ -389,10 +387,10 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) } ret = smp_call_function_single(cpu, rcu_exp_handler, NULL, 0); put_cpu();
if (!ret) {
	mask_ofl_ipi &= ~mask;
/* the CPU responses the IPI, and it will report QS itself */
if (!ret)
continue;
}
/* Failed, raced with CPU hotplug operation. */ raw_spin_lock_irqsave_rcu_node(rnp, flags); if ((rnp->qsmaskinitnext & mask) &&

@@ -403,13 +401,12 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) schedule_timeout_uninterruptible(1); goto retry_ipi; }
/* CPU really is offline, so we can ignore it. */
if (!(rnp->expmask & mask))
	mask_ofl_ipi &= ~mask;
/* CPU really is offline, and we need its QS. */
if (rnp->expmask & mask)
	mask_ofl_test |= mask;
raw_spin_unlock_irqrestore_rcu_node(rnp, flags); } /* Report quiescent states for those that went offline. */
mask_ofl_test |= mask_ofl_ipi; if (mask_ofl_test) rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);

}
Would you be willing to port this optimization on top of current -rcu branch "dev" with an suitably modified commit message?

Sure, will do ;-)

Regards, Boqun

...

					Thanx, Paul

Paul E. McKenney

1:47 a.m.

On Mon, Oct 07, 2019 at 12:04:16PM +0200, Marco Elver wrote:

...

+RCU maintainers This might be a data-race in RCU itself.

On Mon, 7 Oct 2019 at 12:01, syzbot syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com wrote:

...
Hello,

syzbot found the following crash on:

HEAD commit: b4bd9343 x86, kcsan: Enable KCSAN for x86 git tree: https://github.com/google/ktsan.git kcsan console output: https://syzkaller.appspot.com/x/log.txt?x=11edb20d600000 kernel config: https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80 dashboard link: https://syzkaller.appspot.com/bug?extid=134336b86f728d6e55a0 compiler: gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com

================================================================== BUG: KCSAN: data-race in find_next_bit / rcu_report_exp_cpu_mult

write to 0xffffffff85a7f140 of 8 bytes by task 7 on cpu 0: rcu_report_exp_cpu_mult+0x4f/0xa0 kernel/rcu/tree_exp.h:244 rcu_report_exp_rdp+0x6c/0x90 kernel/rcu/tree_exp.h:254 rcu_preempt_deferred_qs_irqrestore+0x3bb/0x580 kernel/rcu/tree_plugin.h:475 rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659 __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394 rcu_read_unlock include/linux/rcupdate.h:645 [inline] batadv_nc_purge_orig_hash net/batman-adv/network-coding.c:411 [inline] batadv_nc_worker+0x13a/0x390 net/batman-adv/network-coding.c:718 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

read to 0xffffffff85a7f140 of 8 bytes by task 7251 on cpu 1: _find_next_bit lib/find_bit.c:39 [inline] find_next_bit+0x57/0xe0 lib/find_bit.c:70 sync_rcu_exp_select_node_cpus+0x28e/0x510 kernel/rcu/tree_exp.h:375 sync_rcu_exp_select_cpus+0x30c/0x590 kernel/rcu/tree_exp.h:439 rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:575 [inline] wait_rcu_exp_gp+0x25/0x40 kernel/rcu/tree_exp.h:589 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 7251 Comm: kworker/1:4 Not tainted 5.3.0+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: rcu_gp wait_rcu_exp_gp ==================================================================

This bug is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

And yes, ->expmask is accessed locklessly in a few places without READ_ONCE(). So does the following (untested) patch help?

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index d632cd0..bcd2a79 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -134,7 +134,7 @@ static void __maybe_unused sync_exp_reset_tree(void) rcu_for_each_node_breadth_first(rnp) { raw_spin_lock_irqsave_rcu_node(rnp, flags); WARN_ON_ONCE(rnp->expmask); - rnp->expmask = rnp->expmaskinit; + WRITE_ONCE(rnp->expmask, rnp->expmaskinit); raw_spin_unlock_irqrestore_rcu_node(rnp, flags); } } @@ -211,7 +211,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp, rnp = rnp->parent; raw_spin_lock_rcu_node(rnp); /* irqs already disabled */ WARN_ON_ONCE(!(rnp->expmask & mask)); - rnp->expmask &= ~mask; + WRITE_ONCE(rnp->expmask, rnp->expmask & ~mask); } }

@@ -241,7 +241,7 @@ static void rcu_report_exp_cpu_mult(struct rcu_node *rnp, raw_spin_unlock_irqrestore_rcu_node(rnp, flags); return; } - rnp->expmask &= ~mask; + WRITE_ONCE(rnp->expmask, rnp->expmask & ~mask); __rcu_report_exp_rnp(rnp, wake, flags); /* Releases rnp->lock. */ }

@@ -332,6 +332,7 @@ static bool exp_funnel_lock(unsigned long s) static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) { int cpu; + unsigned long expmask_snap; unsigned long flags; unsigned long mask_ofl_test; unsigned long mask_ofl_ipi; @@ -369,10 +370,11 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) */ if (rcu_preempt_has_tasks(rnp)) rnp->exp_tasks = rnp->blkd_tasks.next; + expmask_snap = rnp->expmask; raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

/* IPI the remaining CPUs for expedited quiescent state. */ - for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) { + for_each_leaf_node_cpu_mask(rnp, cpu, expmask_snap) { unsigned long mask = leaf_node_cpu_bit(rnp, cpu); struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

@@ -491,7 +493,7 @@ static void synchronize_sched_expedited_wait(void) struct rcu_data *rdp;

mask = leaf_node_cpu_bit(rnp, cpu); - if (!(rnp->expmask & mask)) + if (!(READ_ONCE(rnp->expmask) & mask)) continue; ndetected++; rdp = per_cpu_ptr(&rcu_data, cpu); @@ -503,7 +505,8 @@ static void synchronize_sched_expedited_wait(void) } pr_cont(" } %lu jiffies s: %lu root: %#lx/%c\n", jiffies - jiffies_start, rcu_state.expedited_sequence, - rnp_root->expmask, ".T"[!!rnp_root->exp_tasks]); + READ_ONCE(rnp_root->expmask), + ".T"[!!rnp_root->exp_tasks]); if (ndetected) { pr_err("blocking rcu_node structures:"); rcu_for_each_node_breadth_first(rnp) { @@ -513,7 +516,7 @@ static void synchronize_sched_expedited_wait(void) continue; pr_cont(" l=%u:%d-%d:%#lx/%c", rnp->level, rnp->grplo, rnp->grphi, - rnp->expmask, + READ_ONCE(rnp->expmask), ".T"[!!rnp->exp_tasks]); } pr_cont("\n"); @@ -521,7 +524,7 @@ static void synchronize_sched_expedited_wait(void) rcu_for_each_leaf_node(rnp) { for_each_leaf_node_possible_cpu(rnp, cpu) { mask = leaf_node_cpu_bit(rnp, cpu); - if (!(rnp->expmask & mask)) + if (!(READ_ONCE(rnp->expmask) & mask)) continue; dump_cpu_task(cpu); }

1903

Age (days ago)

1904

Last active (days ago)

b.a.t.m.a.n@lists.open-mesh.org

10 comments

6 participants

tags (0)

participants (6)

Boqun Feng
Joel Fernandes
Marco Elver
Paul E. McKenney
Steven Rostedt
syzbot