see pastebin
http://www.pastebin.ca/1194874
pertinent info dmesg | grep 'batgat loaded' batgat: [init_module:96] batgat loaded rv1025 uname -a Linux nightwing 2.6.23.16 #16 Tue Apr 22 20:00:17 ART 2008 mips unknown root@nightwing:~# batmand -v WARNING: You are using the unstable batman branch. If you are interested in *using* batman get the latest stable release ! B.A.T.M.A.N. 0.3-beta (compatibility version 5) lsmod
Module Size Used by Tainted: P sch_htb 14048 2 ath_ahb 103616 0 wlan_xauth 480 0 wlan_wep 4000 0 wlan_tkip 9856 0 wlan_ccmp 5440 2 wlan_acl 1920 0 ath_rate_minstrel 8352 1 ath_hal 136832 3 ath_ahb,ath_rate_minstrel wlan_scan_sta 8768 1 wlan_scan_ap 6656 0 wlan 152464 10 ath_ahb,wlan_xauth,wlan_wep,wlan_tkip,wlan_ccmp,wlan_acl,ath_rate_minstrel,wlan_scan_sta,wlan_scan_ap batgat 10944 1 ipt_iprange 672 0 ipt_TOS 832 0 ipt_TTL 928 0 xt_MARK 960 3 ipt_ECN 1472 0 xt_CLASSIFY 640 0 ipt_ttl 704 0 ipt_tos 544 0 ipt_time 1568 0 xt_tcpmss 1088 0 xt_statistic 832 0 xt_mark 672 7 xt_mac 736 3 xt_length 736 0 ipt_ecn 1024 0 xt_DSCP 1056 0 xt_dscp 832 0 imq 2096 0 ipt_IMQ 672 2 xt_string 896 0 xt_layer7 9840 0 ipt_ipp2p 6784 0 ipt_LOG 4640 0 xt_CHAOS 1792 0 xt_DELUDE 2624 1 xt_TARPIT 2816 1 xt_quota 800 0 xt_portscan 2016 0 xt_pkttype 704 0 xt_physdev 1488 0 ipt_owner 800 0 iptable_raw 832 0 xt_NOTRACK 832 0 xt_CONNMARK 1088 0 ipt_recent 4992 0 xt_helper 992 0 xt_conntrack 1312 0 xt_connmark 832 0 xt_connbytes 1312 0 tun 6592 0
On Friday 05 September 2008 17:01:24 Outback Dingo wrote:
[...] http://www.pastebin.ca/1194874
pertinent info dmesg | grep 'batgat loaded' batgat: [init_module:96] batgat loaded rv1025 uname -a Linux nightwing 2.6.23.16 #16 Tue Apr 22 20:00:17 ART 2008 mips unknown root@nightwing:~# batmand -v WARNING: You are using the unstable batman branch. If you are interested in *using* batman get the latest stable release ! B.A.T.M.A.N. 0.3-beta (compatibility version 5) [..]
Thx for your report. I looked a little bit at the the call stack but it is really hard to guess the functions when you don't have the symbol table. It looks a little bit like most of the functions are kernel thread/scheduling related. Only one function <c00c8650> is from a module. Can you please send the /proc/modules file so we can find in which module it is so we can check if it could be a batgat related. Further investigations could be done by checking the symbols of batgat.ko with nm - but this would not help very much because it is stripped down. Maybe someone else has a good idea.
Now to something you said on the nightwing mailing list:
i think the version of batgat being used has a bug as confirmed by someone in the batman irc room, note only gateways have this issue, clients do not crash at all.
If you mean me (Lazhur on irc) then I have to say that I am not a developer nor a spokesman of b.a.t.m.a.n. and I never confirmed that it is batgat related - only that something crashed/lock up. I cannot find any crash related bug fixes in the batgat trunk directory - so it doesn't seem to be a known problem (if it is a batgat problem at all).
Best regards Sven Eckelmann
Ok, I got the /proc/modules file now. Current situation is following: it crashes inside the the batman module add position 0x00000aa4
a60: 3c020000 lui v0,0x0 a64: 8c500024 lw s0,36(v0) a68: 24420024 addiu v0,v0,36 a6c: 12020014 beq s0,v0,ac0 <cleanup_module+0x610> a70: 3c040000 lui a0,0x0 a74: 3c050000 lui a1,0x0 a78: 3c020000 lui v0,0x0 a7c: 24840000 addiu a0,a0,0 a80: 24a50088 addiu a1,a1,136 a84: 24420000 addiu v0,v0,0 a88: 0040f809 jalr v0 a8c: 24060283 li a2,643 a90: 8e040004 lw a0,4(s0) a94: 8e030000 lw v1,0(s0) a98: 3c020010 lui v0,0x10 a9c: 34420100 ori v0,v0,0x100 aa0: 8e110008 lw s1,8(s0) aa4: ac830000 sw v1,0(a0) aa8: ae020000 sw v0,0(s0) aac: 3c020020 lui v0,0x20 ab0: 34420200 ori v0,v0,0x200 ab4: ac640004 sw a0,4(v1)
This is part of the compiled version of packet_recv_thread. Due the optimizations done I cannot say were exactly the problem lies.
I think the code of get_ip_addr() got inlined in packet_recv_thread and we need to search for the crash inside of it at list_del(&entry->list); I would also say that the really crash is inside __list_del where prev and next will be set. To check it, look at LIST_POISON1 and LIST_POISON1 inside of poison.h of the current linux kernel. You will notice that the values are 0x00100100 and 0x00200200 == address of the failed paging request. The list poison stuff will be done in in list_del after calling __list_del (it is the sequence lui, ori, sw in the asm snipped). So could it be that we have a poisened entry inside the list? This could for example happen when we get scheduled (please notice that the optimizer exchanged many instrictions) while another part of the program is deleting entries. I haven't checked the rest of the code if that really could happen, but that is my current idea.
So for better readability the callstack: - packet_recv_thread - get_ip_addr from gateway.c:401 - list_del from gateway.c:645 - __list_del
Best regards Sven Eckelmann
On Monday 08 September 2008 23:18:42 Sven Eckelmann wrote:
Ok, I got the /proc/modules file now. Current situation is following: it crashes inside the the batman module add position 0x00000aa4 [..]
I got the System.map right now. So we can convert the kernel oops into something more readable. It doesn't help that much now, but... just for sake of completeness
END_OF_CODE+3fe1d020 is the batgat.ko when we search for c00c8650 inside /proc/modules
Best regards Sven Eckelmann
On Monday 08 September 2008 23:45:55 Sven Eckelmann wrote:
I got the System.map right now. So we can convert the kernel oops into something more readable. It doesn't help that much now, but... just for sake of completeness
Sry, forgot the second oops with the interesting address of the paging failure.
Best regards Sven Eckelmann
Hey Sven,
thanks for you analysis!!
On Mon, Sep 08, 2008 at 11:18:42PM +0200, Sven Eckelmann wrote:
Ok, I got the /proc/modules file now. Current situation is following: it crashes inside the the batman module add position 0x00000aa4
a60: 3c020000 lui v0,0x0 a64: 8c500024 lw s0,36(v0) a68: 24420024 addiu v0,v0,36 a6c: 12020014 beq s0,v0,ac0 <cleanup_module+0x610> a70: 3c040000 lui a0,0x0 a74: 3c050000 lui a1,0x0 a78: 3c020000 lui v0,0x0 a7c: 24840000 addiu a0,a0,0 a80: 24a50088 addiu a1,a1,136 a84: 24420000 addiu v0,v0,0 a88: 0040f809 jalr v0 a8c: 24060283 li a2,643 a90: 8e040004 lw a0,4(s0) a94: 8e030000 lw v1,0(s0) a98: 3c020010 lui v0,0x10 a9c: 34420100 ori v0,v0,0x100 aa0: 8e110008 lw s1,8(s0) aa4: ac830000 sw v1,0(a0) aa8: ae020000 sw v0,0(s0) aac: 3c020020 lui v0,0x20 ab0: 34420200 ori v0,v0,0x200 ab4: ac640004 sw a0,4(v1)
This is part of the compiled version of packet_recv_thread. Due the optimizations done I cannot say were exactly the problem lies.
I think the code of get_ip_addr() got inlined in packet_recv_thread and we need to search for the crash inside of it at list_del(&entry->list); I would also say that the really crash is inside __list_del where prev and next will be set. To check it, look at LIST_POISON1 and LIST_POISON1 inside of poison.h of the current linux kernel. You will notice that the values are 0x00100100 and 0x00200200 == address of the failed paging request. The list poison stuff will be done in in list_del after calling __list_del (it is the sequence lui, ori, sw in the asm snipped). So could it be that we have a poisened entry inside the list? This could for example happen when we get scheduled (please notice that the optimizer exchanged many instrictions) while another part of the program is deleting entries. I haven't checked the rest of the code if that really could happen, but that is my current idea.
Mhm, as far as i looked into the issue, there are the following points where free_client_list is accessed:
init_module() - INIT_LIST_HEAD() * called on startup
get_ip_addr() - list_del(): * "secured" with a hash_lock spinlock
cleanup_module() - list_del(): * only called when unloading the module
batgat_ioctl() - list_del() * from IOCREMDEV. This is called when batman shuts down.
packet_recv_thread - list_add(): * also secured in a hash_lock spinlock.
So it seems there should be no concurrency without user interaction (module or batman shutdown). But i don't have a good idea yet where the problem comes from ... :/
best regards, Simon
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On Tuesday 09 September 2008 13:26:47 Simon Wunderlich wrote:
Hey Sven,
thanks for you analysis!!
[...]
Mhm, as far as i looked into the issue, there are the following points where free_client_list is accessed: [...] So it seems there should be no concurrency without user interaction (module or batman shutdown). But i don't have a good idea yet where the problem comes from ... :/
Yes, the idea of the race condition was stupid. So what is the real problem? Maybe a compiler bug? Let's check the assembler stuff:
The stuff around list_del in get_ip_addr: a90: lw a0,4(s0) /* a0 gets our prev pointer */ a94: lw v1,0(s0) /* v1 gets our next pointer */ a98: lui v0,0x10 /* load 0x100100 in v0 */ a9c: ori v0,v0,0x100 aa0: lw s1,8(s0) /* load pointer to gw_client in s1 */ aa4: sw v1,0(a0) /* store our next pointer in the next pointer of prev ****crash**** because 0x200200 or 0x0 was in our next pointer - why do we have a poisened next pointer when we are probably the first entry of the list -> is the list not correctly initialised or aren't we added correctly? */ aa8: sw v0,0(s0) /* store poison in next pointer */ aac: lui v0,0x20 /* load 0x200200 in v0 */ ab0: ori v0,v0,0x200 ab4: sw a0,4(v1) /* store our prev pointer in in the prev pointer of next */
The initialisation of the list is done in init_module. prev and next should be set to the list address. So let's search for it:
/* "zero" means here the position were our module was loaded to. */ 1374: lui a1,0x0 /* set a1 to "zero" */ 1378: addiu v1,a1,36 /* 36 is the position of the structure free_client_list, so we set v1 to it */ 137c: lui a0,0x0 /* set a0 to "zero" */ 1380: sw v0,28(a0) /* store pointer to wp_hash */ 1384: sw v1,36(a1) /* store free_client_list.next as free_client_list */ 1388: sw v1,4(v1) /* store free_client_list.prev as free_client_list */
So this looks good too. So when we have a entry, we must have added it somewhere. Lets take a look at packet_recv_thread again where the list_add is
/* v0 and v1 holds pointer to new allocated struct */ /* a3 holds "zero" - like t0 */ 9a8: sw s1,8(v0) /* store pointer to client_data in new data buffer */ 9ac: lw v0,36(a3) /* v0 gets free_client_list.next -> lets call it next_element */ /* shouldn't be another instruction between load and usage of the register? - like a nop */ 9b0: sw v0,0(v1) /* store next_element in tmp_entry.list.next */ 9b4: sw v1,4(v0) /* store pointer to tmp_entry next_element.prev */ 9b8: sw v1,36(a3) /* store pointer to tmp_entry in freeclient_list.next */ 9bc: j 9d4 <cleanup_module+0x524> 9c0: sw t0,4(v1) /* saves freeclient_list in tmp_entry.prev << if this would not be executed in parallel, we would get wrong data here, but because we are using mips it must be executed */
So I cannot see anything special - only these two instructions. Maybe we should create a version with debug output after list_add with next and prev pointer of tmp_entry and free_client_list. The same with entry before calling list_del in get_ip_addr. This should be compiled for the nightwing and send to Outback Dingo so he can test it and send the kernel log after a crash to us. The problem is that I added some printks, checked the resulting output and noticed that this changed the output (so the interesting parts aren't there anymore). So if it will run without problems and prints some "use free client from list" in the kernel log then we should have fixed it by resorting some instructions. If not... we should try to find it by using the extra debugging output. If it runs without problems, please remove the printks around the list_add and if it still runs and prints some "use free client from list" now... do it the other way around (keep the printks around list_add and remove it before list_del).
...just my ideas to find the problem.
Best regards Sven Eckelmann
On Wednesday 10 September 2008 00:45:07 Sven Eckelmann wrote:
[...] /* v0 and v1 holds pointer to new allocated struct */ /* a3 holds "zero" - like t0 */ 9a8: sw s1,8(v0) /* store pointer to client_data in new data buffer */ 9ac: lw v0,36(a3) /* v0 gets free_client_list.next -> lets call it next_element */ /* shouldn't be another instruction between load and usage of the register? - like a nop */ 9b0: sw v0,0(v1) /* store next_element in tmp_entry.list.next */ 9b4: sw v1,4(v0) /* store pointer to tmp_entry next_element.prev */ 9b8: sw v1,36(a3) /* store pointer to tmp_entry in freeclient_list.next */ 9bc: j 9d4 <cleanup_module+0x524> 9c0: sw t0,4(v1) /* saves freeclient_list in tmp_entry.prev << if this would not be executed in parallel, we would get wrong data here, but because we are using mips it must be executed */ [...]
Ok, searched inside the official mips32 instruction set reference and both questionable instructions are defined in a hard way without any unpredictable or undefined remarks. So they should be working fine.
Best regards Sven Eckelmann
Hello Dingo,
i know it is a long time ago, but can you please verify if the problem present in Ticket 121 [1] still affects you? Or is it already solved? The entries are 10 months old and the issue is probably already fixed.
Thank you very much, Simon
[1] https://www.open-mesh.net/ticket/121
On Fri, Sep 05, 2008 at 10:01:24PM +0700, Outback Dingo wrote:
see pastebin
http://www.pastebin.ca/1194874
pertinent info dmesg | grep 'batgat loaded' batgat: [init_module:96] batgat loaded rv1025 uname -a Linux nightwing 2.6.23.16 #16 Tue Apr 22 20:00:17 ART 2008 mips unknown root@nightwing:~# batmand -v WARNING: You are using the unstable batman branch. If you are interested in *using* batman get the latest stable release ! B.A.T.M.A.N. 0.3-beta (compatibility version 5) lsmod
Module Size Used by Tainted: P sch_htb 14048 2 ath_ahb 103616 0 wlan_xauth 480 0 wlan_wep 4000 0 wlan_tkip 9856 0 wlan_ccmp 5440 2 wlan_acl 1920 0 ath_rate_minstrel 8352 1 ath_hal 136832 3 ath_ahb,ath_rate_minstrel wlan_scan_sta 8768 1 wlan_scan_ap 6656 0 wlan 152464 10 ath_ahb,wlan_xauth,wlan_wep,wlan_tkip,wlan_ccmp,wlan_acl,ath_rate_minstrel,wlan_scan_sta,wlan_scan_ap batgat 10944 1 ipt_iprange 672 0 ipt_TOS 832 0 ipt_TTL 928 0 xt_MARK 960 3 ipt_ECN 1472 0 xt_CLASSIFY 640 0 ipt_ttl 704 0 ipt_tos 544 0 ipt_time 1568 0 xt_tcpmss 1088 0 xt_statistic 832 0 xt_mark 672 7 xt_mac 736 3 xt_length 736 0 ipt_ecn 1024 0 xt_DSCP 1056 0 xt_dscp 832 0 imq 2096 0 ipt_IMQ 672 2 xt_string 896 0 xt_layer7 9840 0 ipt_ipp2p 6784 0 ipt_LOG 4640 0 xt_CHAOS 1792 0 xt_DELUDE 2624 1 xt_TARPIT 2816 1 xt_quota 800 0 xt_portscan 2016 0 xt_pkttype 704 0 xt_physdev 1488 0 ipt_owner 800 0 iptable_raw 832 0 xt_NOTRACK 832 0 xt_CONNMARK 1088 0 ipt_recent 4992 0 xt_helper 992 0 xt_conntrack 1312 0 xt_connmark 832 0 xt_connbytes 1312 0 tun 6592 0
B.A.T.M.A.N mailing list B.A.T.M.A.N@open-mesh.net https://list.open-mesh.net/mm/listinfo/b.a.t.m.a.n
b.a.t.m.a.n@lists.open-mesh.org