Andrew Lunn wrote:
On Thu, Dec 03, 2009 at 02:31:22AM +0100, Linus L??ssing wrote:
Hi Simon,
I just gave your patch a try on my laptop and could successfully, reproduceably crash my kernel in the following way:
This looks like a deadlock.
Simon: Did you try lockdep on this new code?
http://lwn.net/Articles/185666/
This should hopefully show you what locks are being taken in the wrong order. This is something i keep intending on running, but never get around to it. Now we have a probably deadlock, it might be enough reason for me to actually do it.
I am currently playing a little bit around and it seems that this tools will be extreme helpful. Output of the new code:
[ 330.076502] batman-adv:B.A.T.M.A.N. advanced 0.2.1-beta (compatibility version 8) loaded [ 345.187520] batman-adv:Adding interface: eth0 [ 345.187520] batman-adv:Interface activated: eth0 [ 401.149958] [ 401.149961] ================================= [ 401.151567] [ INFO: inconsistent lock state ] [ 401.153477] 2.6.32 #2 [ 401.153477] --------------------------------- [ 401.153477] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. [ 401.153477] bat_events/5703 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 401.153477] (forw_bcast_list_lock){+.?...}, at: [<f8a02af2>] send_outstanding_bcast_packet+0x22/0x194 [batman_adv] [ 401.153477] {IN-SOFTIRQ-W} state was registered at: [ 401.153477] [<c108249d>] __lock_acquire+0x6aa/0x1aa9 [ 401.153477] [<c108392c>] lock_acquire+0x90/0xc3 [ 401.153477] [<c1a5ec60>] _spin_lock+0x3a/0xda [ 401.153477] [<f8a01cbb>] _add_bcast_packet_to_list+0x2e/0xfb [batman_adv] [ 401.153477] [<f8a01ec4>] add_bcast_packet_to_list+0x13c/0x150 [batman_adv] [ 401.153477] [<f8a05da1>] interface_tx+0x12f/0x316 [batman_adv] [ 401.153477] [<c189f6be>] dev_hard_start_xmit+0x439/0x593 [ 401.153477] [<c18bc193>] sch_direct_xmit+0xa0/0x2c2 [ 401.153477] [<c189ff38>] dev_queue_xmit+0x536/0x847 [ 401.153477] [<c18ab9e9>] neigh_resolve_output+0x3ff/0x4b3 [ 401.153477] [<c194409e>] ip6_output_finish+0x12d/0x1af [ 401.153477] [<c19473d8>] ip6_output2+0x312/0x336 [ 401.153477] [<c1948463>] ip6_output+0x1067/0x1081 [ 401.153477] [<c196fc56>] mld_sendpack+0x25b/0x400 [ 401.153477] [<c197144f>] mld_ifc_timer_expire+0x41a/0x4a4 [ 401.153477] [<c105332a>] run_timer_softirq+0x274/0x355 [ 401.153477] [<c104bf63>] __do_softirq+0xc0/0x1f8 [ 401.153477] irq event stamp: 3945 [ 401.153477] hardirqs last enabled at (3945): [<c1a5e82b>] _spin_unlock_irq+0x5a/0x7a [ 401.153477] hardirqs last disabled at (3944): [<c1a5ee16>] _spin_lock_irq+0x29/0xfc [ 401.153477] softirqs last enabled at (3922): [<c104c085>] __do_softirq+0x1e2/0x1f8 [ 401.153477] softirqs last disabled at (3913): [<c100654e>] do_softirq+0xad/0x1cf [ 401.153477] [ 401.153477] other info that might help us debug this: [ 401.153477] 2 locks held by bat_events/5703: [ 401.153477] #0: (bat_events){+.+...}, at: [<c105fc73>] worker_thread+0x256/0x4a9 [ 401.153477] #1: (&(&forw_packet->delayed_work)->work){+.+...}, at: [<c105fc93>] worker_thread+0x276/0x4a9 [ 401.153477] [ 401.153477] stack backtrace: [ 401.153477] Pid: 5703, comm: bat_events Not tainted 2.6.32 #2 [ 401.153477] Call Trace: [ 401.153477] [<c1a5a8cd>] ? printk+0x2b/0x4e [ 401.153477] [<c107feab>] valid_state+0x252/0x290 [ 401.153477] [<c10800b6>] mark_lock+0x1cd/0x473 [ 401.153477] [<c1081323>] ? check_usage_backwards+0x0/0xcc [ 401.153477] [<c10825da>] __lock_acquire+0x7e7/0x1aa9 [ 401.153477] [<c107ee0d>] ? save_trace+0x45/0x128 [ 401.153477] [<c107efc9>] ? add_lock_to_list+0xd9/0x133 [ 401.153477] [<c10835cb>] ? __lock_acquire+0x17d8/0x1aa9 [ 401.153477] [<c1083764>] ? __lock_acquire+0x1971/0x1aa9 [ 401.153477] [<c105fc93>] ? worker_thread+0x276/0x4a9 [ 401.153477] [<c108392c>] lock_acquire+0x90/0xc3 [ 401.153477] [<f8a02af2>] ? send_outstanding_bcast_packet+0x22/0x194 [batman_adv] [ 401.153477] [<c1a5ec60>] _spin_lock+0x3a/0xda [ 401.153477] [<f8a02af2>] ? send_outstanding_bcast_packet+0x22/0x194 [batman_adv] [ 401.153477] [<f8a02af2>] send_outstanding_bcast_packet+0x22/0x194 [batman_adv] [ 401.153477] [<c105fcce>] worker_thread+0x2b1/0x4a9 [ 401.153477] [<c105fc93>] ? worker_thread+0x276/0x4a9 [ 401.153477] [<f8a02ad0>] ? send_outstanding_bcast_packet+0x0/0x194 [batman_adv] [ 401.153477] [<c1065cbf>] ? autoremove_wake_function+0x0/0x5d [ 401.153477] [<c105fa1d>] ? worker_thread+0x0/0x4a9 [ 401.153477] [<c1065924>] kthread+0xa6/0xb9 [ 401.153477] [<c106587e>] ? kthread+0x0/0xb9 [ 401.153477] [<c1004c07>] kernel_thread_helper+0x7/0x10 [ 411.264212] bat0: no IPv6 routers present
And after a while (adding and removing some nodes from my virtual network we also get something like Linus problem - but with worker thread included and without free_orig).
[ 1903.748085] BUG: soft lockup - CPU#0 stuck for 61s! [bat_events:5703] [ 1903.748109] Modules linked in: batman_adv [ 1903.748109] irq event stamp: 3945 [ 1903.748109] hardirqs last enabled at (3945): [<c1a5e82b>] _spin_unlock_irq+0x5a/0x7a [ 1903.748109] hardirqs last disabled at (3944): [<c1a5ee16>] _spin_lock_irq+0x29/0xfc [ 1903.748109] softirqs last enabled at (3922): [<c104c085>] __do_softirq+0x1e2/0x1f8 [ 1903.748109] softirqs last disabled at (3913): [<c100654e>] do_softirq+0xad/0x1cf [ 1903.748109] [ 1903.748109] Pid: 5703, comm: bat_events Not tainted (2.6.32 #2) [ 1903.748109] EIP: 0060:[<c12be8ab>] EFLAGS: 00000246 CPU: 0 [ 1903.748109] EIP is at delay_tsc+0x17/0xb1 [ 1903.748109] EAX: 00000001 EBX: 00000001 ECX: 67d98e54 EDX: 00001f2d [ 1903.748109] ESI: 00000000 EDI: f6cf50b4 EBP: f206be84 ESP: f206be80 [ 1903.748109] DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068 [ 1903.748109] CR0: 8005003b CR2: 080ef480 CR3: 37bcc000 CR4: 000006f0 [ 1903.748109] DR0: c1041443 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 1903.748109] DR6: ffff0ff0 DR7: 00000400 [ 1903.748109] Call Trace: [ 1903.748109] [<c12be979>] __delay+0x17/0x27 [ 1903.748109] [<c12dc3e6>] _raw_spin_lock+0x164/0x1fc [ 1903.748109] [<c1a5ecc5>] _spin_lock+0x9f/0xda [ 1903.748109] [<f8a0bf44>] purge_orig+0x73/0x3c7 [batman_adv] [ 1903.748109] [<c1080bae>] ? trace_hardirqs_on+0x27/0x37 [ 1903.748109] [<c1a5e82b>] ? _spin_unlock_irq+0x5a/0x7a [ 1903.748109] [<c1050032>] ? __register_sysctl_paths+0x1c1/0x428 [ 1903.748109] [<c105fcce>] worker_thread+0x2b1/0x4a9 [ 1903.748109] [<c105fc93>] ? worker_thread+0x276/0x4a9 [ 1903.748109] [<f8a0bed1>] ? purge_orig+0x0/0x3c7 [batman_adv] [ 1903.748109] [<c1065cbf>] ? autoremove_wake_function+0x0/0x5d [ 1903.748109] [<c105fa1d>] ? worker_thread+0x0/0x4a9 [ 1903.748109] [<c1065924>] kthread+0xa6/0xb9 [ 1903.748109] [<c106587e>] ? kthread+0x0/0xb9 [ 1903.748109] [<c1004c07>] kernel_thread_helper+0x7/0x10 [ 1969.244089] BUG: soft lockup - CPU#0 stuck for 61s! [bat_events:5703] [ 1969.244089] Modules linked in: batman_adv [ 1969.244089] irq event stamp: 3945 [ 1969.244089] hardirqs last enabled at (3945): [<c1a5e82b>] _spin_unlock_irq+0x5a/0x7a [ 1969.244089] hardirqs last disabled at (3944): [<c1a5ee16>] _spin_lock_irq+0x29/0xfc [ 1969.244089] softirqs last enabled at (3922): [<c104c085>] __do_softirq+0x1e2/0x1f8 [ 1969.244089] softirqs last disabled at (3913): [<c100654e>] do_softirq+0xad/0x1cf [ 1969.244089] [ 1969.244089] Pid: 5703, comm: bat_events Not tainted (2.6.32 #2) [ 1969.244089] EIP: 0060:[<c12be8df>] EFLAGS: 00000246 CPU: 0 [ 1969.244089] EIP is at delay_tsc+0x4b/0xb1 [ 1969.244089] EAX: 15363869 EBX: 00000001 ECX: 15363869 EDX: 00001f4f [ 1969.244089] ESI: 00000000 EDI: f6cf50b4 EBP: f206be84 ESP: f206be80 [ 1969.244089] DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068 [ 1969.244089] CR0: 8005003b CR2: 080ef480 CR3: 37bcc000 CR4: 000006f0 [ 1969.244089] DR0: c1041443 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 1969.244089] DR6: ffff0ff0 DR7: 00000400 [ 1969.244089] Call Trace: [ 1969.244089] [<c12be979>] __delay+0x17/0x27 [ 1969.244089] [<c12dc3e6>] _raw_spin_lock+0x164/0x1fc [ 1969.244089] [<c1a5ecc5>] _spin_lock+0x9f/0xda [ 1969.244089] [<f8a0bf44>] purge_orig+0x73/0x3c7 [batman_adv] [ 1969.244089] [<c1080bae>] ? trace_hardirqs_on+0x27/0x37 [ 1969.244089] [<c1a5e82b>] ? _spin_unlock_irq+0x5a/0x7a [ 1969.244089] [<c1050032>] ? __register_sysctl_paths+0x1c1/0x428 [ 1969.244089] [<c105fcce>] worker_thread+0x2b1/0x4a9 [ 1969.244089] [<c105fc93>] ? worker_thread+0x276/0x4a9 [ 1969.244089] [<f8a0bed1>] ? purge_orig+0x0/0x3c7 [batman_adv] [ 1969.244089] [<c1065cbf>] ? autoremove_wake_function+0x0/0x5d [ 1969.244089] [<c105fa1d>] ? worker_thread+0x0/0x4a9 [ 1969.244089] [<c1065924>] kthread+0xa6/0xb9 [ 1969.244089] [<c106587e>] ? kthread+0x0/0xb9 [ 1969.244089] [<c1004c07>] kernel_thread_helper+0x7/0x10 [ 1972.181954] BUG: spinlock lockup on CPU#0, bat_events/5703, f6cf50b4 [ 1972.183041] Pid: 5703, comm: bat_events Not tainted 2.6.32 #2 [ 1972.184006] Call Trace: [ 1972.184627] [<c1a5a8cd>] ? printk+0x2b/0x4e [ 1972.185362] [<c12dc458>] _raw_spin_lock+0x1d6/0x1fc [ 1972.186208] [<c1a5ecc5>] _spin_lock+0x9f/0xda [ 1972.186973] [<f8a0bf44>] purge_orig+0x73/0x3c7 [batman_adv] [ 1972.187934] [<c1080bae>] ? trace_hardirqs_on+0x27/0x37 [ 1972.189019] [<c1a5e82b>] ? _spin_unlock_irq+0x5a/0x7a [ 1972.189914] [<c1050032>] ? __register_sysctl_paths+0x1c1/0x428 [ 1972.190917] [<c105fcce>] worker_thread+0x2b1/0x4a9 [ 1972.191746] [<c105fc93>] ? worker_thread+0x276/0x4a9 [ 1972.192810] [<f8a0bed1>] ? purge_orig+0x0/0x3c7 [batman_adv] [ 1972.193798] [<c1065cbf>] ? autoremove_wake_function+0x0/0x5d [ 1972.194773] [<c105fa1d>] ? worker_thread+0x0/0x4a9 [ 1972.195595] [<c1065924>] kthread+0xa6/0xb9 [ 1972.196494] [<c106587e>] ? kthread+0x0/0xb9 [ 1972.197323] [<c1004c07>] kernel_thread_helper+0x7/0x10
And here some recursive locking which was found by Linus before:
[ 1848.122378] ============================================= [ 1848.123961] [ INFO: possible recursive locking detected ] [ 1848.124440] 2.6.32 #2 [ 1848.124440] --------------------------------------------- [ 1848.124440] bat_events/5424 is trying to acquire lock: [ 1848.124440] (&orig_node->lock){+.+...}, at: [<f8a0bd4a>] free_orig_node+0x3c/0xeb [batman_adv] [ 1848.124440] [ 1848.124440] but task is already holding lock: [ 1848.124440] (&orig_node->lock){+.+...}, at: [<f8a0c002>] purge_orig+0xa7/0x4a5 [batman_adv] [ 1848.124440] [ 1848.124440] other info that might help us debug this: [ 1848.124440] 3 locks held by bat_events/5424: [ 1848.124440] #0: (bat_events){+.+...}, at: [<c105fc73>] worker_thread+0x256/0x4a9 [ 1848.124440] #1: ((purge_orig_wq).work){+.+...}, at: [<c105fc93>] worker_thread+0x276/0x4a9 [ 1848.124440] #2: (&orig_node->lock){+.+...}, at: [<f8a0c002>] purge_orig+0xa7/0x4a5 [batman_adv] [ 1848.124440] [ 1848.124440] stack backtrace: [ 1848.124440] Pid: 5424, comm: bat_events Not tainted 2.6.32 #2 [ 1848.124440] Call Trace: [ 1848.124440] [<c1a5a8cd>] ? printk+0x2b/0x4e [ 1848.124440] [<c108313c>] __lock_acquire+0x1349/0x1aa9 [ 1848.124440] [<c1043245>] ? release_console_sem+0x30b/0x359 [ 1848.124440] [<c108392c>] lock_acquire+0x90/0xc3 [ 1848.124440] [<f8a0bd4a>] ? free_orig_node+0x3c/0xeb [batman_adv] [ 1848.124440] [<c1a5ec60>] _spin_lock+0x3a/0xda [ 1848.124440] [<f8a0bd4a>] ? free_orig_node+0x3c/0xeb [batman_adv] [ 1848.124440] [<f8a0bd4a>] free_orig_node+0x3c/0xeb [batman_adv] [ 1848.124440] [<f8a0c325>] purge_orig+0x3ca/0x4a5 [batman_adv] [ 1848.124440] [<c106e623>] ? sched_clock_cpu+0x19e/0x1bf [ 1848.124440] [<c107e325>] ? trace_hardirqs_off+0x27/0x37 [ 1848.124440] [<c106e6c6>] ? cpu_clock+0x82/0xcf [ 1848.124440] [<c1050032>] ? __register_sysctl_paths+0x1c1/0x428 [ 1848.124440] [<c105fcce>] worker_thread+0x2b1/0x4a9 [ 1848.124440] [<c105fc93>] ? worker_thread+0x276/0x4a9 [ 1848.124440] [<f8a0bf5b>] ? purge_orig+0x0/0x4a5 [batman_adv] [ 1848.124440] [<c1065cbf>] ? autoremove_wake_function+0x0/0x5d [ 1848.124440] [<c105fa1d>] ? worker_thread+0x0/0x4a9 [ 1848.124440] [<c1065924>] kthread+0xa6/0xb9 [ 1848.124440] [<c106587e>] ? kthread+0x0/0xb9 [ 1848.124440] [<c1004c07>] kernel_thread_helper+0x7/0x10
I created a kernel which is mostly compatible with my uml image. So if anyone wants to try it, I am using a i386 kernel with kvm and use
$ kvm -m 1024 -kernel linux-2.6.32.qemu -drive file=root.qemu,if=virtio -serial stdio -net nic,macaddr=02:ca:ff:ee:ba:43,model=virtio,vlan=3 -net tap,ifname=tap3,vlan=3,script=no
to test it (all virtual devices are managed inside a bridge managed by me). I am a little bit confused why the spinlock deadlocks because it should be an unicore processor build - but maybe it has to do with lockdep.
Best regards, Sven