Hey Sven,
thanks for your analysis!
On Mon, Sep 08, 2008 at 11:18:42PM +0200, Sven Eckelmann wrote:
OK, I got the /proc/modules file now. The current situation is as follows: it crashes inside the batman module at position 0x00000aa4:
 a60: 3c020000 lui v0,0x0
 a64: 8c500024 lw s0,36(v0)
 a68: 24420024 addiu v0,v0,36
 a6c: 12020014 beq s0,v0,ac0 <cleanup_module+0x610>
 a70: 3c040000 lui a0,0x0
 a74: 3c050000 lui a1,0x0
 a78: 3c020000 lui v0,0x0
 a7c: 24840000 addiu a0,a0,0
 a80: 24a50088 addiu a1,a1,136
 a84: 24420000 addiu v0,v0,0
 a88: 0040f809 jalr v0
 a8c: 24060283 li a2,643
 a90: 8e040004 lw a0,4(s0)
 a94: 8e030000 lw v1,0(s0)
 a98: 3c020010 lui v0,0x10
 a9c: 34420100 ori v0,v0,0x100
 aa0: 8e110008 lw s1,8(s0)
 aa4: ac830000 sw v1,0(a0)
 aa8: ae020000 sw v0,0(s0)
 aac: 3c020020 lui v0,0x20
 ab0: 34420200 ori v0,v0,0x200
 ab4: ac640004 sw a0,4(v1)
This is part of the compiled version of packet_recv_thread. Due to the optimizations I cannot say exactly where the problem lies.
I think the code of get_ip_addr() got inlined into packet_recv_thread, and we need to search for the crash inside it at list_del(&entry->list); I would also say that the real crash is inside __list_del, where prev and next are set. To check this, look at LIST_POISON1 and LIST_POISON2 in poison.h of the current Linux kernel. You will notice that the values are 0x00100100 and 0x00200200 == address of the failed paging request. The list poisoning is done in list_del after calling __list_del (it is the sequence lui, ori, sw in the asm snippet). So could it be that we have a poisoned entry inside the list? This could happen, for example, when we get scheduled out (please note that the optimizer reordered many instructions) while another part of the program is deleting entries. I haven't checked the rest of the code to see whether that really could happen, but that is my current idea.
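To make the poison argument concrete, here is a minimal userspace sketch of how list_del leaves an entry behind. It is only a model: the model_* names are hypothetical, the struct is re-declared locally instead of including linux/list.h, and the poison values are the plain 32-bit ones quoted above. It also shows how a lui/ori pair composes the 32-bit constant seen in the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local model of the kernel's doubly linked list; not the kernel header. */
struct list_head {
    struct list_head *next, *prev;
};

/* Poison values as quoted in the mail (32-bit, no pointer delta). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

static void model_init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head, like list_add(). */
static void model_list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink, then poison both pointers, like list_del(). */
static void model_list_del(struct list_head *entry)
{
    entry->next->prev = entry->prev;   /* the __list_del part */
    entry->prev->next = entry->next;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}

/* "lui rt,hi" followed by "ori rt,rt,lo" yields (hi << 16) | lo. */
static uint32_t model_lui_ori(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Address that the store "prev->next = next" would hit if list_del()
 * ran on this entry. Computed only, never dereferenced. */
static uintptr_t model_prev_store_target(const struct list_head *entry)
{
    return (uintptr_t)entry->prev + offsetof(struct list_head, next);
}

/* Delete an entry once, then report where a second delete would write. */
static uintptr_t model_second_delete_target(void)
{
    struct list_head head, entry;

    model_init_list_head(&head);
    model_list_add(&entry, &head);
    model_list_del(&entry);            /* entry is now poisoned */

    assert(head.next == &head && head.prev == &head);
    assert(entry.next == LIST_POISON1 && entry.prev == LIST_POISON2);
    return model_prev_store_target(&entry);
}
```

In this model, model_second_delete_target() evaluates to 0x00200200, because next sits at offset 0 of the struct. That would match the "sw v1,0(a0)" at aa4 faulting when a0 (the prev pointer loaded from 4(s0)) holds LIST_POISON2, but it does not show which code path deleted the entry first.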
Hmm, as far as I have looked into the issue, free_client_list is accessed at the following points:
init_module() - INIT_LIST_HEAD(): * called on startup
get_ip_addr() - list_del(): * "secured" with a hash_lock spinlock
cleanup_module() - list_del(): * only called when unloading the module
batgat_ioctl() - list_del(): * from IOCREMDEV. This is called when batman shuts down.
packet_recv_thread - list_add(): * also secured with a hash_lock spinlock
So it seems there should be no concurrency without user interaction (module or batman shutdown). But I don't have a good idea yet where the problem comes from ... :/
best regards, Simon