-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On Tuesday 09 September 2008 13:26:47 Simon Wunderlich wrote:
Hey Sven,
thanks for you analysis!!
[...]
Mhm, as far as i looked into the issue, there are the following points where free_client_list is accessed: [...] So it seems there should be no concurrency without user interaction (module or batman shutdown). But i don't have a good idea yet where the problem comes from ... :/
Yes, the idea of the race condition was stupid. So what is the real problem? Maybe a compiler bug? Let's check the assembler stuff:
The stuff around list_del in get_ip_addr: a90: lw a0,4(s0) /* a0 gets our prev pointer */ a94: lw v1,0(s0) /* v1 gets our next pointer */ a98: lui v0,0x10 /* load 0x100100 in v0 */ a9c: ori v0,v0,0x100 aa0: lw s1,8(s0) /* load pointer to gw_client in s1 */ aa4: sw v1,0(a0) /* store our next pointer in the next pointer of prev ****crash**** because 0x200200 or 0x0 was in our next pointer - why do we have a poisened next pointer when we are probably the first entry of the list -> is the list not correctly initialised or aren't we added correctly? */ aa8: sw v0,0(s0) /* store poison in next pointer */ aac: lui v0,0x20 /* load 0x200200 in v0 */ ab0: ori v0,v0,0x200 ab4: sw a0,4(v1) /* store our prev pointer in in the prev pointer of next */
The initialisation of the list is done in init_module. prev and next should be set to the list address. So let's search for it:
/* "zero" means here the position were our module was loaded to. */ 1374: lui a1,0x0 /* set a1 to "zero" */ 1378: addiu v1,a1,36 /* 36 is the position of the structure free_client_list, so we set v1 to it */ 137c: lui a0,0x0 /* set a0 to "zero" */ 1380: sw v0,28(a0) /* store pointer to wp_hash */ 1384: sw v1,36(a1) /* store free_client_list.next as free_client_list */ 1388: sw v1,4(v1) /* store free_client_list.prev as free_client_list */
So this looks good too. So when we have a entry, we must have added it somewhere. Lets take a look at packet_recv_thread again where the list_add is
/* v0 and v1 holds pointer to new allocated struct */ /* a3 holds "zero" - like t0 */ 9a8: sw s1,8(v0) /* store pointer to client_data in new data buffer */ 9ac: lw v0,36(a3) /* v0 gets free_client_list.next -> lets call it next_element */ /* shouldn't be another instruction between load and usage of the register? - like a nop */ 9b0: sw v0,0(v1) /* store next_element in tmp_entry.list.next */ 9b4: sw v1,4(v0) /* store pointer to tmp_entry next_element.prev */ 9b8: sw v1,36(a3) /* store pointer to tmp_entry in freeclient_list.next */ 9bc: j 9d4 <cleanup_module+0x524> 9c0: sw t0,4(v1) /* saves freeclient_list in tmp_entry.prev << if this would not be executed in parallel, we would get wrong data here, but because we are using mips it must be executed */
So I cannot see anything special - only these two instructions. Maybe we should create a version with debug output after list_add with next and prev pointer of tmp_entry and free_client_list. The same with entry before calling list_del in get_ip_addr. This should be compiled for the nightwing and send to Outback Dingo so he can test it and send the kernel log after a crash to us. The problem is that I added some printks, checked the resulting output and noticed that this changed the output (so the interesting parts aren't there anymore). So if it will run without problems and prints some "use free client from list" in the kernel log then we should have fixed it by resorting some instructions. If not... we should try to find it by using the extra debugging output. If it runs without problems, please remove the printks around the list_add and if it still runs and prints some "use free client from list" now... do it the other way around (keep the printks around list_add and remove it before list_del).
...just my ideas to find the problem.
Best regards Sven Eckelmann