The current MTU computation always returns a value smaller than 1500 bytes even if the real interfaces have an MTU large enough to compensate for the batman-adv overhead.
Fix the computation by properly returning the highest admitted value.
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
---
This patch is missing a Reported-by clause because I did not have "russell"'s email address at hand.
Will be added later before being merged.
Cheers,
 hard-interface.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/hard-interface.c b/hard-interface.c
index 6792e03..0eb0b3b 100644
--- a/hard-interface.c
+++ b/hard-interface.c
@@ -244,7 +244,7 @@ int batadv_hardif_min_mtu(struct net_device *soft_iface)
 {
 	struct batadv_priv *bat_priv = netdev_priv(soft_iface);
 	const struct batadv_hard_iface *hard_iface;
-	int min_mtu = ETH_DATA_LEN;
+	int min_mtu = INT_MAX;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(hard_iface, &batadv_hardif_list, list) {
@@ -259,8 +259,6 @@ int batadv_hardif_min_mtu(struct net_device *soft_iface)
 	}
 	rcu_read_unlock();
 
-	atomic_set(&bat_priv->packet_size_max, min_mtu);
-
 	if (atomic_read(&bat_priv->fragmentation) == 0)
 		goto out;
 
@@ -271,13 +269,21 @@ int batadv_hardif_min_mtu(struct net_device *soft_iface)
 	min_mtu = min_t(int, min_mtu, BATADV_FRAG_MAX_FRAG_SIZE);
 	min_mtu -= sizeof(struct batadv_frag_packet);
 	min_mtu *= BATADV_FRAG_MAX_FRAGMENTS;
-	atomic_set(&bat_priv->packet_size_max, min_mtu);
-
-	/* with fragmentation enabled we can fragment external packets easily */
-	min_mtu = min_t(int, min_mtu, ETH_DATA_LEN);
 
 out:
-	return min_mtu - batadv_max_header_len();
+	/* report to the other components the maximum amount of bytes that
+	 * batman-adv can send over the wire (without considering the payload
+	 * overhead). For example, this value is used by TT to compute the
+	 * maximum local table table size
+	 */
+	atomic_set(&bat_priv->packet_size_max, min_mtu);
+
+	/* the real soft-interface MTU is computed by removing the payload
+	 * overhead from the maximum amount of bytes that was just computed.
+	 *
+	 * However batman-adv does not support MTUs bigger than ETH_DATA_LEN
+	 */
+	return min_t(int, min_mtu - batadv_max_header_len(), ETH_DATA_LEN);
 }
 
 /* adjusts the MTU if a new interface with a smaller MTU appeared. */
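For readers who want to play with the numbers, here is a minimal userspace sketch of the computation as it looks after the patch. It is not the kernel code: softif_mtu(), struct mtu_params and min_int() are hypothetical names, the per-interface filtering done inside the RCU loop is left out, and the overhead and fragmentation values are passed in as parameters because the real ones come from batadv_max_header_len(), sizeof(struct batadv_frag_packet) and the BATADV_FRAG_* constants.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define ETH_DATA_LEN 1500 /* standard Ethernet payload size */

/* Hypothetical parameter bundle standing in for the kernel constants. */
struct mtu_params {
	int header_len;    /* batman-adv payload overhead */
	int frag_hdr_len;  /* size of the fragment header */
	int max_frag_size; /* largest single fragment */
	int max_fragments; /* fragments per packet */
};

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Model of the patched logic: returns the soft-interface MTU and stores
 * the value that would be written to packet_size_max.
 */
static int softif_mtu(const int *hard_mtus, size_t n, int fragmentation,
		      const struct mtu_params *p, int *packet_size_max)
{
	int min_mtu = INT_MAX; /* start high so any hard MTU lowers it */
	size_t i;

	assert(p->max_fragments > 0);

	for (i = 0; i < n; i++)
		min_mtu = min_int(min_mtu, hard_mtus[i]);

	if (fragmentation) {
		min_mtu = min_int(min_mtu, p->max_frag_size);
		min_mtu -= p->frag_hdr_len;
		min_mtu *= p->max_fragments;
	}

	/* bytes that can go over the wire, before removing the overhead */
	*packet_size_max = min_mtu;

	/* the soft-interface MTU is capped at ETH_DATA_LEN */
	return min_int(min_mtu - p->header_len, ETH_DATA_LEN);
}
```

As an illustration only: with a hard-interface MTU of 1400, an overhead of 32 bytes (inferred from the 1400 to 1368 drop reported later in this thread) and placeholder fragmentation values of 20, 1400 and 16, the sketch gives 1368 with fragmentation off and 1500 with fragmentation on.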
On 21/01/14 11:22, Antonio Quartulli wrote:
The current MTU computation always returns a value smaller than 1500 bytes even if the real interfaces have an MTU large enough to compensate for the batman-adv overhead.
Fix the computation by properly returning the highest admitted value.
Introduced by f7f2fe494388fca828094a4ebdab918a7b2d64f8 ("batman-adv: limit local translation table max size")
Signed-off-by: Antonio Quartulli antonio@meshcoding.com
On Tuesday 21 January 2014 11:31:08 Antonio Quartulli wrote:
On 21/01/14 11:22, Antonio Quartulli wrote:
The current MTU computation always returns a value smaller than 1500 bytes even if the real interfaces have an MTU large enough to compensate for the batman-adv overhead.
Fix the computation by properly returning the highest admitted value.
Introduced by f7f2fe494388fca828094a4ebdab918a7b2d64f8 ("batman-adv: limit local translation table max size")
Signed-off-by: Antonio Quartulli antonio@meshcoding.com
Applied in revision 2b108cc.
Thanks, Marek
"Antonio" == Antonio Quartulli antonio@meshcoding.com writes:
Antonio> The current MTU computation always returns a value smaller Antonio> than 1500bytes even if the real interfaces have an MTU large Antonio> enough to compensate the batman-adv overhead.
Antonio> Fix the computation by properly returning the highest Antonio> admitted value.
Antonio> Signed-off-by: Antonio Quartulli antonio@meshcoding.com ---
This seems to fix the bat0-MTU-unnecessarily-small problem I observed last night and reported on the IRC channel. I haven't actually passed any traffic over it yet, but the interface is up with the expected MTU value with the patch.
Antonio> This patch is missing a Reported-by clause because I did not Antonio> have "russell"'s email address at hand.
Antonio> Will be added later before being merged.
Reported-by: Russell Senior russell@personaltelco.net
On 21/01/14 19:43, Russell Senior wrote:
"Antonio" == Antonio Quartulli antonio@meshcoding.com writes:
Antonio> The current MTU computation always returns a value smaller Antonio> than 1500bytes even if the real interfaces have an MTU large Antonio> enough to compensate the batman-adv overhead.
Antonio> Fix the computation by properly returning the highest Antonio> admitted value.
Antonio> Signed-off-by: Antonio Quartulli antonio@meshcoding.com ---
This seems to fix the bat0-MTU-unnecessarily-small problem I observed last night and reported on the IRC channel. I haven't actually passed any traffic over it yet, but the interface is up with the expected MTU value with the patch.
Just to be sure the fix is not introducing any misbehaviour: have you tried setting smaller MTUs to your hard interface? In that case have you seen the bat0 reducing its MTU?
Antonio> This patch is missing a Reported-by clause because I did not Antonio> have "russell"'s email address at hand.
Antonio> Will be added later before being merged.
Reported-by: Russell Senior russell@personaltelco.net
I'd also add Tested-by ;)
Thanks a lot!
"Russell" == Russell Senior russell@personaltelco.net writes:
"Antonio" == Antonio Quartulli antonio@meshcoding.com writes:
Antonio> The current MTU computation always returns a value smaller Antonio> than 1500bytes even if the real interfaces have an MTU large Antonio> enough to compensate the batman-adv overhead.
Antonio> Fix the computation by properly returning the highest Antonio> admitted value.
Antonio> Signed-off-by: Antonio Quartulli antonio@meshcoding.com ---
Russell> This seems to fix the bat0-MTU-unnecessarily-small problem I Russell> observed last night and reported on the IRC channel. I Russell> haven't actually passed any traffic over it yet, but the Russell> interface is up with the expected MTU value with the patch.
Antonio> This patch is missing a Reported-by clause because I did not Antonio> have "russell"'s email address at hand.
Russell> Reported-by: Russell Senior russell@personaltelco.net
Followup, as requested, I tried setting a smaller MTU (1400) on the adhoc0 interface. When fragmentation was enabled, this resulted in no change to MTU (still 1500) for bat0. When I disabled fragmentation, the bat0 MTU dropped, as expected, to 1368. Interestingly, the MTU on the bridge that bat0 was a member of remained 1500 despite the lower bat0 MTU. Should that be?
Also, for testing actual traffic over the batman-adv link, I built OpenWrt r39354 with the patch on a Soekris net4526, so that there were two nodes with the same revision (different architecture): ubnt-bullet-m with ath9k; net4826 with ath5k. I first noticed that I was losing about 100k of memory every couple of seconds, and pretty soon (within 20 minutes) the net4826 started oopsing on out-of-memory.
I removed the patch, rev'd OpenWrt to r39365 and confirmed that the net4826 build was also leaking at a substantial rate.
I am seeing a similar, though possibly slower, leak on the ubiquiti bullet m2hp. Right before rebooting, top shows kworker/u2:$N (where $N is 0 or 3) chewing up some cpu cycles.
Has anybody else seen this memory leak? Leads on where it's coming from? Not a runaway process, at least not that top shows up. Just a gradual disappearance from MemFree that /proc/sys/vm/drop_caches doesn't fix. It isn't adhoc mode, and I can associate the two devices over adhoc and move a bunch of data with no memory lost, but turning on batman-adv seems to sink it.
On 22/01/14 07:04, Russell Senior wrote:
"Russell" == Russell Senior russell@personaltelco.net writes:
"Antonio" == Antonio Quartulli antonio@meshcoding.com writes:
Antonio> The current MTU computation always returns a value smaller Antonio> than 1500bytes even if the real interfaces have an MTU large Antonio> enough to compensate the batman-adv overhead.
Antonio> Fix the computation by properly returning the highest Antonio> admitted value.
Antonio> Signed-off-by: Antonio Quartulli antonio@meshcoding.com ---
Russell> This seems to fix the bat0-MTU-unnecessarily-small problem I Russell> observed last night and reported on the IRC channel. I Russell> haven't actually passed any traffic over it yet, but the Russell> interface is up with the expected MTU value with the patch.
Antonio> This patch is missing a Reported-by clause because I did not Antonio> have "russell"'s email address at hand.
Russell> Reported-by: Russell Senior russell@personaltelco.net
Followup, as requested, I tried setting a smaller MTU (1400) on the adhoc0 interface. When fragmentation was enabled, this resulted in no change to MTU (still 1500) for bat0. When I disabled fragmentation, the bat0 MTU dropped, as expected, to 1368. Interestingly, the MTU on the bridge that bat0 was a member of remained 1500 despite the lower bat0 MTU. Should that be?
I don't really know how the bridge code behaves. As far as I remember it should adapt to the smallest MTU.
But thanks for testing! This shows that the patch is working fine ;)
On 22/01/14 07:04, Russell Senior wrote:
Also, for testing actual traffic over the batman-adv link, I built OpenWrt r39354 with the patch on a Soekris net4526, so that there were two nodes with the same revision (different architecture): ubnt-bullet-m with ath9k; net4826 with ath5k. I first noticed that I was losing about 100k of memory every couple of seconds, and pretty soon (within 20 minutes) the net4826 started oopsing on out-of-memory.
mh..does this happen with or without fragmentation enabled? Does this happen even if you don't generate traffic on the interface?
I removed the patch, rev'd OpenWrt to r39365 and confirmed that the net4826 build was also leaking at a substantial rate.
I am seeing a similar, though possibly slower, leak on the ubiquiti bullet m2hp. Right before rebooting, top shows kworker/u2:$N (where $N is 0 or 3) chewing up some cpu cycles.
Has anybody else seen this memory leak? Leads on where it's coming from? Not a runaway process, at least not that top shows up. Just a gradual disappearance from MemFree that /proc/sys/vm/drop_caches doesn't fix. It isn't adhoc mode, and I can associate the two devices over adhoc and move a bunch of data with no memory lost, but turning on batman-adv seems to sink it.
Thanks for reporting!
On 01/22/2014 07:04 AM, Russell Senior wrote:
Has anybody else seen this memory leak? Leads on where it's coming from? Not a runaway process, at least not that top shows up. Just a gradual disappearance from MemFree that /proc/sys/vm/drop_caches doesn't fix. It isn't adhoc mode, and I can associate the two devices over adhoc and move a bunch of data with no memory lost, but turning on batman-adv seems to sink it.
Yes, and I tested (compile-time selected) with and without network coding, and (at run-time) with and without fragmentation (as I also bumped into the MTU calculation problem later fixed by the patch on this list) -- any 32MB RAM device reboots after roughly 30 minutes due to OOM without substantial traffic; if there is traffic, then apparently even faster...
"Daniel" == Daniel daniel@makrotopia.org writes:
Daniel> On 01/22/2014 07:04 AM, Russell Senior wrote:
Has anybody else seen this memory leak? Leads on where it's coming from? Not a runaway process, at least not that top shows up. Just a gradual disappearance from MemFree that /proc/sys/vm/drop_caches doesn't fix. It isn't adhoc mode, and I can associate the two devices over adhoc and move a bunch of data with no memory lost, but turning on batman-adv seems to sink it.
Daniel> Yes, and I tested (compile-time selected) with and without Daniel> network coding, and (at run-time) with and without Daniel> fragmentation (as I also bumped into the MTU calculation Daniel> problem later fixed by the patch on this list) -- any 32MB RAM Daniel> devices reboots after roughly 30 minutes due to OOM without Daniel> substantial traffic, if there is traffic then apparently even Daniel> faster...
The memory leak I see seems to commence as soon as a batman-adv neighbor (same version, in this case 15) appears and stops when the neighbor goes away.
I am going to try enabling kmemleak and see if that tells me anything.
On 22/01/14 18:45, Russell Senior wrote:
"Daniel" == Daniel daniel@makrotopia.org writes:
Daniel> On 01/22/2014 07:04 AM, Russell Senior wrote:
Has anybody else seen this memory leak? Leads on where it's coming from? Not a runaway process, at least not that top shows up. Just a gradual disappearance from MemFree that /proc/sys/vm/drop_caches doesn't fix. It isn't adhoc mode, and I can associate the two devices over adhoc and move a bunch of data with no memory lost, but turning on batman-adv seems to sink it.
Daniel> Yes, and I tested (compile-time selected) with and without Daniel> network coding, and (at run-time) with and without Daniel> fragmentation (as I also bumped into the MTU calculation Daniel> problem later fixed by the patch on this list) -- any 32MB RAM Daniel> devices reboots after roughly 30 minutes due to OOM without Daniel> substantial traffic, if there is traffic then apparently even Daniel> faster...
The memory leak I see seems to commence as soon as a batman-adv neighbor (same version, in this case 15) appears and stops when the neighbor goes away.
Thank you very much for the hint Russell! Today I tried with one node only, but kmemleak did not report anything...
I am going to try enabling kmemleak and see if that tells me anything.
Thanks! Keep us informed!
Cheers,
"Antonio" == Antonio Quartulli antonio@meshcoding.com writes:
Russell> Has anybody else seen this memory leak? Leads on where it's Russell> coming from? Not a runaway process, at least not that top Russell> shows up. Just a gradual disappearance from MemFree that Russell> /proc/sys/vm/drop_caches doesn't fix. It isn't adhoc mode, Russell> and I can associate the two devices over adhoc and move a Russell> bunch of data with no memory lost, but turning on batman-adv Russell> seems to sink it.
Russell> The memory leak I see seems to commence as soon as a Russell> batman-adv neighbor (same version, in this case 15) appears Russell> and stops when the neighbor goes away.
Antonio> Thank you very much for the hint Russell! Today I tried with Antonio> one node only, but kmemleak did not report anything...
Russell> I am going to try enabling kmemleak and see if that tells me Russell> anything.
Antonio> Thanks! Keep us informed!
Here is a bootlog in which I spit out a bunch of kmemleak stuff into a console (captured by /usr/bin/screen, sorry for the extraneous line feed silliness).
https://personaltelco.net/~russell/kmemleak-batman-from-boot.log
If I count instances, it looks like batadv_orig_node_vlan_new (and the things that are calling it) may be implicated.
Hope that helps!
I had the same problem which caused reboots but after the last batman-adv update i am not seeing it. all my devices are have mb ram i am using network coding and 1560 MTU
On 01/22/2014 12:46 PM, Antonio Quartulli wrote:
On 22/01/14 18:45, Russell Senior wrote:
> "Daniel" == Daniel daniel@makrotopia.org writes:
Daniel> On 01/22/2014 07:04 AM, Russell Senior wrote:
Has anybody else seen this memory leak? Leads on where it's coming from? Not a runaway process, at least not that top shows up. Just a gradual disappearance from MemFree that /proc/sys/vm/drop_caches doesn't fix. It isn't adhoc mode, and I can associate the two devices over adhoc and move a bunch of data with no memory lost, but turning on batman-adv seems to sink it.
Daniel> Yes, and I tested (compile-time selected) with and without Daniel> network coding, and (at run-time) with and without Daniel> fragmentation (as I also bumped into the MTU calculation Daniel> problem later fixed by the patch on this list) -- any 32MB RAM Daniel> devices reboots after roughly 30 minutes due to OOM without Daniel> substantial traffic, if there is traffic then apparently even Daniel> faster...
The memory leak I see seems to commence as soon as a batman-adv neighbor (same version, in this case 15) appears and stops when the neighbor goes away.
Thank you very much for the hint Russell! Today I tried with one node only, but kmemleak did not report anything...
I am going to try enabling kmemleak and see if that tells me anything.
Thanks! Keep us informed!
Cheers,
"cmsv" == cmsv cmsv@wirelesspt.net writes:
cmsv> I had the same problem which caused reboots but after the last cmsv> batman-adv update i am not seeing it. all my devices are have cmsv> mb ram i am using network coding and 1560 MTU
Which version are you running?
On 01/22/2014 06:57 PM, Russell Senior wrote:
"cmsv" == cmsv cmsv@wirelesspt.net writes:
cmsv> I had the same problem which caused reboots but after the last cmsv> batman-adv update i am not seeing it. all my devices are have cmsv> mb ram i am using network coding and 1560 MTU
Which version are you running?
Right now with openwrt AA DISTRIB_REVISION="r39154" and batctl 2014.0.0 [batman-adv: 2014.0.0]
Routers are dir 601, dir 615* and tl wr703n, and it is not happening. I synced the feed less than 48h ago and recompiled.
what version are you using ?
On 01/23/2014 01:10 AM, cmsv wrote:
On 01/22/2014 06:57 PM, Russell Senior wrote:
> "cmsv" == cmsv cmsv@wirelesspt.net writes:
cmsv> I had the same problem which caused reboots but after the last cmsv> batman-adv update i am not seeing it. all my devices are have cmsv> mb ram i am using network coding and 1560 MTU
Which version are you running?
Right now with openwrt AA DISTRIB_REVISION="r39154" and batctl 2014.0.0 [batman-adv: 2014.0.0]
Routers are dir 601, dir 615* and tl wr703n, and it is not happening. I synced the feed less than 48h ago and recompiled.
what version are you using ?
out-of-memory every 20 minutes or so on OpenWrt trunk/BB r39365 on tl-wr841nd-v8 with batman-adv 2014.0.0 from openwrt's routing feed with "batman-adv: fix batman-adv header overhead calculation" and "batman-adv: fix soft-interface MTU computation" on top. A sample node is (occasionally) reachable via DN42 at 104.61.99.104 (feel free to ask for ssh or any kind of logs, serial access, remote gdb or whatever)
I now built OpenWrt trunk/BB r39365 with batman-adv 2013.4.0 instead of 2014.0.0, tried with all possible settings, no memory leak whatsoever, happy uptimes of more than a day by now :) All other system components and settings are exactly identical to my previous setup with batman-adv 2014.0.0.
On 01/23/2014 04:35 AM, Daniel wrote:
On 01/23/2014 01:10 AM, cmsv wrote:
what version are you using ?
out-of-memory every 20 minutes or so on OpenWrt trunk/BB r39365 on tl-wr841nd-v8 with batman-adv 2014.0.0 from openwrt's routing feed with "batman-adv: fix batman-adv header overhead calculation" and "batman-adv: fix soft-interface MTU computation" on top. A sample node is (occasionally) reachable via DN42 at 104.61.99.104 (feel free to ask for ssh or any kind of logs, serial access, remote gdb or whatever)
On 26/01/14 13:57, Daniel wrote:
I now built OpenWrt trunk/BB r39365 with batman-adv 2013.4.0 instead of 2014.0.0, tried with all possible settings, no memory leak whatsoever, happy uptimes of more than a day by now :) All other system components and settings are exactly identical to my previous setup with batman-adv 2014.0.0.
On 01/23/2014 04:35 AM, Daniel wrote:
On 01/23/2014 01:10 AM, cmsv wrote:
what version are you using ?
out-of-memory every 20 minutes or so on OpenWrt trunk/BB r39365 on tl-wr841nd-v8 with batman-adv 2014.0.0 from openwrt's routing feed with "batman-adv: fix batman-adv header overhead calculation" and "batman-adv: fix soft-interface MTU computation" on top. A sample node is (occasionally) reachable via DN42 at 104.61.99.104 (feel free to ask for ssh or any kind of logs, serial access, remote gdb or whatever)
Thanks for testing guys!
I found something wrong in the code and I am going to send a patch soon. I'd really appreciate if somebody could test it!
Thanks!
There is a refcounter unbalance in the CRC checking routine invoked on OGM reception. A vlan object is retrieved (thus its refcounter is increased by one) but it is never properly released. This leads to a memleak because the vlan object will never be free'd.
Fix this by releasing the vlan object after having read the CRC.
Reported-by: Russell Senior <russell@personaltelco.net>
Reported-by: Daniel <daniel@makrotopia.org>
Reported-by: cmsv <cmsv@wirelesspt.net>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
---
 translation-table.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/translation-table.c b/translation-table.c
index 3fca99d..097ca01 100644
--- a/translation-table.c
+++ b/translation-table.c
@@ -2248,6 +2248,7 @@ static bool batadv_tt_global_check_crc(struct batadv_orig_node *orig_node,
 {
 	struct batadv_tvlv_tt_vlan_data *tt_vlan_tmp;
 	struct batadv_orig_node_vlan *vlan;
+	uint32_t crc;
 	int i;
 
 	/* check if each received CRC matches the locally stored one */
@@ -2267,7 +2268,10 @@ static bool batadv_tt_global_check_crc(struct batadv_orig_node *orig_node,
 		if (!vlan)
 			return false;
 
-		if (vlan->tt.crc != ntohl(tt_vlan_tmp->crc))
+		crc = vlan->tt.crc;
+		batadv_orig_node_vlan_free_ref(vlan);
+
+		if (crc != ntohl(tt_vlan_tmp->crc))
 			return false;
 	}
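To make the unbalanced refcounter easier to see outside the kernel, here is a small userspace sketch of the same get/put pattern. Everything in it is hypothetical (struct vlan_obj, vlan_get(), vlan_put() and the two check functions are illustration names, not batman-adv API), and a plain int stands in for the kernel's atomic refcounter. The leaky variant mirrors the old code path, the fixed variant copies the CRC and drops the reference before comparing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a refcounted vlan object. */
struct vlan_obj {
	int refcount;  /* plain int here; the kernel uses an atomic counter */
	uint32_t crc;
	bool freed;    /* set instead of freeing, so the state can be inspected */
};

/* Take a reference, as a lookup helper would do before returning. */
static struct vlan_obj *vlan_get(struct vlan_obj *v)
{
	if (!v || v->freed)
		return NULL;
	v->refcount++;
	return v;
}

/* Drop a reference; the object is released when the count reaches zero. */
static void vlan_put(struct vlan_obj *v)
{
	assert(v->refcount > 0);
	if (--v->refcount == 0)
		v->freed = true;
}

/* Old behaviour: the reference taken by the lookup is never dropped. */
static bool check_crc_leaky(struct vlan_obj *stored, uint32_t received)
{
	struct vlan_obj *vlan = vlan_get(stored);

	if (!vlan)
		return false;
	return vlan->crc == received;
}

/* Patched behaviour: read the CRC, release the object, then compare. */
static bool check_crc_fixed(struct vlan_obj *stored, uint32_t received)
{
	struct vlan_obj *vlan = vlan_get(stored);
	uint32_t crc;

	if (!vlan)
		return false;
	crc = vlan->crc;
	vlan_put(vlan);
	return crc == received;
}
```

Each call to the leaky variant leaves the counter one higher than before, so the final put from the owner can never bring it to zero. Since the check runs on OGM reception, the extra references would accumulate with every OGM that is checked.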
"Antonio" == Antonio Quartulli antonio@meshcoding.com writes:
Antonio> There is a refcounter unbalance in the CRC checking routine Antonio> invoked on OGM reception. A vlan object is retrieved (thus Antonio> its refcounter is increased by one) but it is never properly Antonio> released. This leads to a memleak because the vlan object Antonio> will never be free'd.
Antonio> Fix this by releasing the vlan object after having read the Antonio> CRC.
Antonio> Reported-by: Russell Senior russell@personaltelco.net Antonio> [...]
I am still seeing a kernel memory leak, even with this patch.
My test configuration is a Ubiquiti AirRouter (ar71xx) on OpenWrt r39397 with this patch and the MTU patch on one end, and a Soekris net4826 (x86) on the same OpenWrt r39397 revision and patches on the other end (kmemleak also enabled on the x86 end, not on the ar71xx end).
The airrouter, with the bat0 interface disabled, has a /proc/meminfo that looks like this right after boot:
root@ar0:/# cat /proc/meminfo
MemTotal:          29044 kB
MemFree:            8220 kB
Buffers:            2164 kB
Cached:             6176 kB
SwapCached:            0 kB
Active:             4624 kB
Inactive:           5672 kB
Active(anon):       1984 kB
Inactive(anon):       20 kB
Active(file):       2640 kB
Inactive(file):     5652 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:          1972 kB
Mapped:             1872 kB
Shmem:                48 kB
Slab:               4928 kB
SReclaimable:        932 kB
SUnreclaim:         3996 kB
KernelStack:         272 kB
PageTables:          188 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:       14520 kB
Committed_AS:       3992 kB
VmallocTotal:    1048372 kB
VmallocUsed:        1476 kB
VmallocChunk:    1043448 kB
With bat0 enabled, however, the AirRouter starts to become unresponsive at around 7m15s after boot and then OOMs and reboots within 9 minutes of uptime. Using "watch -d cat /proc/meminfo", I see MemFree erode first, then it begins to chew into Buffers and Cached until supreme unhappiness ensues.
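In case anybody wants to log these counters from a small program instead of watching them by eye, here is a minimal sketch of a parser for /proc/meminfo-style text. meminfo_kb() is a hypothetical helper name, and it assumes the usual "Key: value kB" layout with one field per line; reading the file and sampling it periodically is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the value (in kB) of the field named `key` in /proc/meminfo-style
 * text, or -1 if the field is not present. The key must match at the start
 * of a line, so "Cached" does not match the "SwapCached" line.
 */
static long meminfo_kb(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	assert(klen > 0);

	while (line && *line) {
		long val;

		if (strncmp(line, key, klen) == 0 && line[klen] == ':' &&
		    sscanf(line + klen + 1, "%ld", &val) == 1)
			return val;

		/* advance to the character after the next newline, if any */
		line = strchr(line, '\n');
		if (line)
			line++;
	}

	return -1;
}
```

Calling it every few seconds for MemFree, Buffers and Cached and printing the differences would give roughly the same picture as the watch command above.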
On the x86 side, I see some activity from kmemleak, which substantially goes away when batman-adv does not have a peer. The unreferenced objects that show up in /sys/kernel/debug/kmemleak look like this (only a sample):
unreferenced object 0xc2eab000 (size 184):
  comm "softirq", pid 0, jiffies 33396 (age 692.960s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 3c c2 00 00 00 00 00 00 00 00  ......<.........
  backtrace:
    [<c10039b9>] print_context_stack+0x99/0xb0
    [<c108cf00>] kmem_cache_alloc+0xd0/0xf0
    [<c12eedc9>] build_skb+0x29/0x140
    [<c12eedc9>] build_skb+0x29/0x140
    [<c12f1a86>] __netdev_alloc_skb+0x56/0xd0
    [<c4c020b2>] ath_rxbuf_alloc+0x22/0x80 [ath]
    [<c1372619>] free_debug_processing+0x14c/0x190
    [<c4c42204>] ath5k_stop+0xd4/0xc90 [ath5k]
    [<c4c43773>] ath5k_beacon_update_timers+0x9b3/0x9d0 [ath5k]
    [<c4c02a8f>] ath_hw_cycle_counters_update+0xcf/0x130 [ath]
    [<c108c727>] kmem_cache_free+0xe7/0x100
    [<c105ccfa>] __rcu_process_callbacks+0x5a/0x70
    [<c102d62e>] tasklet_action+0x3e/0x70
    [<c102dac5>] __do_softirq+0x85/0x140
    [<c102da40>] __do_softirq+0x0/0x140
    [<c102dc42>] irq_exit+0x32/0x50
unreferenced object 0xc0c1c600 (size 184):
  comm "softirq", pid 0, jiffies 33490 (age 692.020s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 3c c2 00 00 00 00 00 00 00 00  ......<.........
  backtrace:
    [<c10039b9>] print_context_stack+0x99/0xb0
    [<c108cf00>] kmem_cache_alloc+0xd0/0xf0
    [<c12eedc9>] build_skb+0x29/0x140
    [<c12eedc9>] build_skb+0x29/0x140
    [<c12f1a86>] __netdev_alloc_skb+0x56/0xd0
    [<c4c020b2>] ath_rxbuf_alloc+0x22/0x80 [ath]
    [<c1372619>] free_debug_processing+0x14c/0x190
    [<c4c42204>] ath5k_stop+0xd4/0xc90 [ath5k]
    [<c4c43773>] ath5k_beacon_update_timers+0x9b3/0x9d0 [ath5k]
    [<c108c727>] kmem_cache_free+0xe7/0x100
    [<c105ccfa>] __rcu_process_callbacks+0x5a/0x70
    [<c102d62e>] tasklet_action+0x3e/0x70
    [<c102dac5>] __do_softirq+0x85/0x140
    [<c102da40>] __do_softirq+0x0/0x140
    [<c102dc42>] irq_exit+0x32/0x50
    [<c1002b4d>] do_IRQ+0x8d/0xb0
Memory is disappearing at a faster rate than is accounted for here (size 184 bytes at a time). The OOM-reboot rate indicates memory is being consumed at a rate on the order of 30-40 kB per second.
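As a rough cross-check of that estimate, the arithmetic can be sketched as below. The function names are hypothetical, and treating MemFree + Buffers + Cached from the boot-time snapshot as the pool that gets eaten before the OOM is my own simplifying assumption; integer division is used, so the results are rounded down.

```c
#include <assert.h>

/* Average consumption in kB/s, assuming the whole pool of free, buffer and
 * cache memory is used up by the time the device runs out of memory.
 */
static long leak_rate_kb_per_s(long mem_free_kb, long buffers_kb,
			       long cached_kb, long seconds_to_oom)
{
	assert(seconds_to_oom > 0);
	return (mem_free_kb + buffers_kb + cached_kb) / seconds_to_oom;
}

/* Number of objects of the given size that would have to be leaked per
 * second to explain a given rate.
 */
static long objects_per_s(long rate_kb_per_s, long object_bytes)
{
	assert(object_bytes > 0);
	return rate_kb_per_s * 1024 / object_bytes;
}
```

With the snapshot above (8220 + 2164 + 6176 kB) and 9 minutes (540 s) to the reboot, this gives about 30 kB/s, in line with the 30-40 kB/s figure. At 184 bytes per object that would need on the order of 166 leaked objects per second, which is far more than kmemleak is showing.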
My OpenWrt diffconfig is as follows:
CONFIG_TARGET_x86=y
CONFIG_TARGET_x86_generic=y
CONFIG_TARGET_x86_generic_Soekris48xx=y
CONFIG_DEVEL=y
CONFIG_ALFRED_NEEDS_lua=y
CONFIG_BUILD_LOG=y
CONFIG_DOWNLOAD_FOLDER="/usr/portage/distfiles"
CONFIG_KMOD_BATMAN_ADV_BATCTL=y
CONFIG_KMOD_BATMAN_ADV_BLA=y
CONFIG_KMOD_BATMAN_ADV_DAT=y
CONFIG_KMOD_BATMAN_ADV_DEBUG_LOG=y
CONFIG_LIBCURL_FILE=y
CONFIG_LIBCURL_FTP=y
CONFIG_LIBCURL_HTTP=y
CONFIG_LIBCURL_POLARSSL=y
CONFIG_OPENSSL_WITH_EC=y
CONFIG_PACKAGE_ALFRED_BATHOSTS=y
CONFIG_PACKAGE_ALFRED_VIS=y
CONFIG_PACKAGE_MAC80211_DEBUGFS=y
CONFIG_PACKAGE_MAC80211_MESH=y
CONFIG_PACKAGE_alfred=y
CONFIG_PACKAGE_bridge=y
CONFIG_PACKAGE_crda=y
CONFIG_PACKAGE_curl=y
CONFIG_PACKAGE_diffutils=y
# CONFIG_PACKAGE_dnsmasq is not set
# CONFIG_PACKAGE_firewall is not set
CONFIG_PACKAGE_horst=y
CONFIG_PACKAGE_hostapd-common=y
CONFIG_PACKAGE_iftop=y
CONFIG_PACKAGE_ip=y
CONFIG_PACKAGE_iw=y
CONFIG_PACKAGE_iwinfo=y
CONFIG_PACKAGE_kmod-ath=y
CONFIG_PACKAGE_kmod-ath5k=y
CONFIG_PACKAGE_kmod-batman-adv=y
CONFIG_PACKAGE_kmod-bridge=y
CONFIG_PACKAGE_kmod-cfg80211=y
CONFIG_PACKAGE_kmod-crypto-aes=y
CONFIG_PACKAGE_kmod-crypto-arc4=y
CONFIG_PACKAGE_kmod-crypto-core=y
CONFIG_PACKAGE_kmod-crypto-crc32c=y
CONFIG_PACKAGE_kmod-crypto-hash=y
# CONFIG_PACKAGE_kmod-lib-crc-ccitt is not set
CONFIG_PACKAGE_kmod-lib-crc16=y
CONFIG_PACKAGE_kmod-lib-crc32c=y
CONFIG_PACKAGE_kmod-llc=y
CONFIG_PACKAGE_kmod-mac80211=y
# CONFIG_PACKAGE_kmod-ppp is not set
CONFIG_PACKAGE_kmod-stp=y
CONFIG_PACKAGE_libcurl=y
CONFIG_PACKAGE_libelf1=y
CONFIG_PACKAGE_libiwinfo=y
CONFIG_PACKAGE_liblua=y
CONFIG_PACKAGE_libncurses=y
CONFIG_PACKAGE_libnetsnmp=y
CONFIG_PACKAGE_libopenssl=y
CONFIG_PACKAGE_libpcap=y
CONFIG_PACKAGE_libpcre=y
CONFIG_PACKAGE_libpolarssl=y
CONFIG_PACKAGE_libpopt=y
CONFIG_PACKAGE_libpthread=y
CONFIG_PACKAGE_librt=y
CONFIG_PACKAGE_lua=y
CONFIG_PACKAGE_missnet-node=y
# CONFIG_PACKAGE_odhcp6c is not set
# CONFIG_PACKAGE_ppp is not set
CONFIG_PACKAGE_procps=y
CONFIG_PACKAGE_procps-free=y
CONFIG_PACKAGE_procps-pgrep=y
CONFIG_PACKAGE_procps-pkill=y
CONFIG_PACKAGE_procps-pmap=y
CONFIG_PACKAGE_procps-ps=y
CONFIG_PACKAGE_procps-pwdx=y
CONFIG_PACKAGE_procps-skill=y
CONFIG_PACKAGE_procps-slabtop=y
CONFIG_PACKAGE_procps-snice=y
CONFIG_PACKAGE_procps-tload=y
CONFIG_PACKAGE_procps-top=y
CONFIG_PACKAGE_procps-vmstat=y
CONFIG_PACKAGE_procps-w=y
CONFIG_PACKAGE_procps-watch=y
CONFIG_PACKAGE_rsync=y
CONFIG_PACKAGE_snmpd=y
CONFIG_PACKAGE_tcpdump=y
CONFIG_PACKAGE_terminfo=y
CONFIG_PACKAGE_traceroute6=y
CONFIG_PACKAGE_wget=y
CONFIG_PACKAGE_wireless-tools=y
CONFIG_PACKAGE_wpad-mini=y
CONFIG_PACKAGE_zlib=y
# CONFIG_TARGET_ROOTFS_EXT4FS is not set
# CONFIG_TARGET_ROOTFS_TARGZ is not set
Anything you are interested in that I left out, I'm happy to provide on request.
Inline:
On 01/26/2014 09:21 AM, Antonio Quartulli wrote:
On 26/01/14 13:57, Daniel wrote:
I now built OpenWrt trunk/BB r39365 with batman-adv 2013.4.0 instead of 2014.0.0, tried with all possible settings, no memory leak whatsoever, happy uptimes of more than a day by now :) All other system components and settings are exactly identical to my previous setup with batman-adv 2014.0.0.
On 01/23/2014 04:35 AM, Daniel wrote:
On 01/23/2014 01:10 AM, cmsv wrote:
what version are you using ?
out-of-memory every 20 minutes or so on OpenWrt trunk/BB r39365 on tl-wr841nd-v8 with batman-adv 2014.0.0 from openwrt's routing feed with "batman-adv: fix batman-adv header overhead calculation" and "batman-adv: fix soft-interface MTU computation" on top. A sample node is (occasionally) reachable via DN42 at 104.61.99.104 (feel free to ask for ssh or any kind of logs, serial access, remote gdb or whatever)
Thanks for testing guys!
I found something wrong in the code and I am going to send a patch soon. I'd really appreciate if somebody could test it!
Will this patch also be relevant to Attitude Adjustment? The reason I ask is that with AA right now I do not experience the reboots. I should also add that I am using mac80211 r39150 and hostapd r39155 on top of the latest AA.
Can you explain what exactly your code findings have an impact on?
Thanks!
On 26/01/14 17:05, cmsv wrote:
Will this patch also be relevant to Attitude Adjustment? The reason I ask is that with AA right now I do not experience the reboots. I should also add that I am using mac80211 r39150 and hostapd r39155 on top of the latest AA.
Can you explain what exactly your code findings have an impact on?
This is a patch to fix the memleak we were discussing about. This bug appeared with and it is meant to be applied on batman-adv-2014.0.0 (regardless of the openwrt revision).
Cheers,
On 26/01/14 17:07, Antonio Quartulli wrote:Can you explain in what
This is a patch to fix the memleak we were discussing about. This bug appeared with and it is meant to be applied on batman-adv-2014.0.0 (regardless of the openwrt revision).
sorry, bad copy/paste.
The patch is for batman-adv-2014.0.0 (I don't know what version you have in AA). It fixes the memleak bug that we were discussing.
Here is an update of some tests i ran in the past 24h with the following build:
routers used: dlink dir 601a and tplink wr703n in "ng" mode. (atheros)
My current AA DISTRIB_REVISION="r39154" mac80211 r39150 from openwrt trunk hostapd r39155 from trunk
From batman-adv i am using the following patches:
ls feeds/routing/batman-adv/patches/
0001-batman-adv-fix-batman-adv-header-overhead-calculatio.patch

From d72756b97529b3c6afa08933216aaa912bb16ce6 Mon Sep 17 00:00:00 2001
From: Marek Lindner <mareklindner@neomailbox.ch>
Date: Wed, 15 Jan 2014 20:31:18 +0800
Subject: [PATCH] batman-adv: fix batman-adv header overhead calculation

batman-adv/Makefile
# $Id: Makefile 5624 2006-11-23 00:29:07Z nbd $

include $(TOPDIR)/rules.mk

PKG_NAME:=batman-adv

PKG_VERSION:=2014.0.0
BATCTL_VERSION:=2014.0.0
PKG_RELEASE:=1
PKG_MD5SUM:=8d58ecaede17dc05aab1b549dc09fa7d
BATCTL_MD5SUM:=b0bcf29fef80ddcc33769e13f5937d0a
I tried to find any memory leaks that could be causing reboots and i was unable to find any after having compiled the build with batman-adv-header-overhead-calculatio.patch. Before this patch i did get reboots caused by the leak.
I keep monitoring memory usage with top, htop, ps and /proc/meminfo since i was not able to install valgrind due to lack of available flash memory given the size of the valgrind package.
Got some tips from here: http://blog.thewebsitepeople.org/2011/03/linux-memory-leak-detection
Additionally I ran iperf tests on both routers against each other to keep them under heavy load for 24h: iperf -c <ip> -t 99999 -i 5
The MTU is 1560 on the ad-hoc interface.
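To illustrate why a hard-interface MTU above 1500 matters here, the following is a rough Python sketch of the soft-interface MTU rule from the patch at the top of this thread, covering only the path with fragmentation disabled. The 32-byte header overhead used in the example is an assumed value for illustration; the real figure is whatever batadv_max_header_len() returns in the running module.

```python
ETH_DATA_LEN = 1500  # standard Ethernet payload size, as in the kernel headers


def soft_iface_mtu(hard_mtus, max_header_len):
    """Soft-interface MTU with fragmentation disabled (sketch of the patched logic).

    hard_mtus: MTUs of the active hard interfaces attached to the soft interface.
    max_header_len: batman-adv encapsulation overhead in bytes (assumed value
    in the examples; not taken from this thread).
    """
    # The patch starts from INT_MAX and takes the minimum over the interfaces.
    min_mtu = min(hard_mtus, default=float("inf"))
    # Remove the overhead, but never report more than ETH_DATA_LEN.
    return min(min_mtu - max_header_len, ETH_DATA_LEN)
```

Under that assumed overhead, soft_iface_mtu([1560], 32) yields 1500, while a hard interface left at 1500 would give soft_iface_mtu([1500], 32) == 1468.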
After 24h I still had 6 MB of RAM or more free on both routers. Once I stopped the tests, the free RAM increased.
dmesg and logread showed no errors or anything else wrong.
No reboots happened during this time, which leads me to conclude that the problem might not come entirely from the batman-adv side, or perhaps not from it at all, or that it only happens in combination with something very specific.
I would like to run a few more tests to be more certain about possible leaks. Are there any other tools that someone might recommend?
@daniel: What did you use to find the leak, and how did you troubleshoot it?
"cmsv" == cmsv cmsv@wirelesspt.net writes:
cmsv> Here is an update of some tests i ran in the past 24h with the
cmsv> following build:
cmsv> routers used: dlink dir 601a and tplink wr703n in "ng"
cmsv> mode. (atheros)
cmsv> My current AA DISTRIB_REVISION="r39154" mac80211 r39150 from
cmsv> openwrt trunk hostapd r39155 from trunk
I just went to try to set up an AA build environment from:
git://git.openwrt.org/12.09/openwrt.git
in order to replicate. The default feeds.conf from that tree seems to point at a 'for-12.09.x' branch of the routing feed, and the batman-adv Makefile there seems to use 2013.4.0, not 2014.0.0.
Can you paste your feeds.conf file?
inline:
On 01/27/2014 08:21 PM, Russell Senior wrote:
Can you paste your feeds.conf file?
Of course:
for AA and batman-adv 2014.0.0 in feeds.default.conf
src-svn packages svn://svn.openwrt.org/openwrt/branches/packages_12.09
src-git routing git://github.com/openwrt-routing/packages.git
For the hostapd and mac80211 revisions mentioned above you will need to clone:
git clone git://git.openwrt.org/12.09/openwrt.git
Then obtain the specific revisions and use them to replace the original hostapd and mac80211 from AA.
"cmsv" == cmsv cmsv@wirelesspt.net writes:
Can you paste your feeds.conf file?
cmsv> Of course:
cmsv> for AA and batman-adv 2014.0.0 in feeds.default.conf
cmsv> src-svn packages svn://svn.openwrt.org/openwrt/branches/packages_12.09
cmsv> src-git routing git://github.com/openwrt-routing/packages.git
cmsv> For the hostapd and mentioned mac80211 you will need to clone
cmsv> git clone git://git.openwrt.org/12.09/openwrt.git
cmsv> Then obtain the specific revisions and replace the original
cmsv> hostapd and mac80211 from AA.
I am not following exactly. Do you know which change in particular makes the memory leak come and go? AA implies an older kernel, 3.3.8 or something.
Also, obtain specific revisions from trunk? and then copy them into the AA tree?
package/kernel/mac80211 r39150 = commit 886b3c876b71122ed9523834488f373908224663
package/network/services/hostapd r39155 = commit 64820db4b264472e03acb9ea6b5536fa7633a8ca
Is that right? Do those mac80211/hostapd revisions come from bisection (i.e. the last "good" rev) or happenstance?
Thanks for clarification!
inline reply:
On 01/29/2014 03:10 AM, Russell Senior wrote:
> I am not following exactly. Do you know which change in particular makes the memory leak come and go?
I do not know exactly what causes the leak, because I don't have the leak in my builds and have not found a better way than the ones mentioned before to look for its cause.
> AA implies an older kernel, 3.3.8 or something.
Yes, 3.3.8.
> Also, obtain specific revisions from trunk? and then copy them into the AA tree?
Not from trunk. I posted the wrong git URL before. The right one is: git clone git://nbd.name/aa-mac80211.git
> package/kernel/mac80211 r39150 = commit 886b3c876b71122ed9523834488f373908224663
> package/network/services/hostapd r39155 = commit 64820db4b264472e03acb9ea6b5536fa7633a8ca
> Is that right? Do those mac80211/hostapd revisions come from bisection (i.e. the last "good" rev) or happenstance?
You have to ask the maintainer. To me they sit between AA and trunk in terms of stability.
> Thanks for clarification!
I have an update regarding this matter, and I have CC'ed Felix Fietkau from openwrt (athk) here too, since I am using nbd.name/aa-mac80211.git
I decided to compile new images with the latest batman-adv stable patches. While testing the new image, as well as the old one I thought to be stable, I got the routers to reboot. This time I tested with more routers in the mesh and was able to replicate it.
The routers reboot when the gateway disappears, either by doing batctl gw client/off or by rebooting the gw router. This then causes the others to reboot with: Kernel panic - not syncing: Fatal exception in interrupt.
Rebooting the gw router while keeping gw off did not seem to reboot the other routers. For me the problem is easy to replicate: when the router that is acting as gateway for the clients disappears, its disappearance causes the clients to reboot.
Here is the reboot log:
[ 239.410000] CPU 0 Unable to handle kernel paging request at virtual address 0000000c, epc == 80ea7914, ra == 80ea7910
[ 239.420000] Oops[#1]:
[ 239.420000] Cpu 0
[ 239.420000] $ 0 : 00000000 00000001 00000000 00000000
[ 239.420000] $ 4 : 81b12380 80f7fb00 00000000 00000000
[ 239.420000] $ 8 : 00000037 00000000 00000000 00000000
[ 239.420000] $12 : 00000000 0000015f 80e82540 00000000
[ 239.420000] $16 : 81adbc00 00000000 81b12380 80f3e802
[ 239.420000] $20 : 80f7fb00 00000000 00000189 00000000
[ 239.420000] $24 : 00000002 80e365f0
[ 239.420000] $28 : 80fe6000 80fe7ae8 00000043 80ea7910
[ 239.420000] Hi : 000001d5
[ 239.420000] Lo : 0011e189
[ 239.420000] epc : 80ea7914 0x80ea7914
[ 239.420000] Tainted: G O
[ 239.420000] ra : 80ea7910 0x80ea7910
[ 239.420000] Status: 1000f403 KERNEL EXL IE
[ 239.420000] Cause : 00800008
[ 239.420000] BadVA : 0000000c
[ 239.420000] PrId : 00019374 (MIPS 24Kc)
[ 239.420000] Modules linked in: ath79_wdt batman_adv(O) nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp ipt_MASQUERADE iptable_nat nf_nat xt_conntrack xt_CT xt_NOTRACK iptable_raw xt_state nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack ipt_REJECT xt_TCPMSS ipt_LOG xt_comment xt_multiport xt_mac xt_limit iptable_mangle iptable_filter ip_tables xt_tcpudp x_tables ath9k(O) ath9k_common(O) ath9k_hw(O) ath(O) mac80211(O) libcrc32c crc16 cfg80211(O) compat(O) arc4 aes_generic crc32c crypto_hash crypto_algapi gpio_button_hotplug(O)
[ 239.420000] Process udhcpc (pid: 1267, threadinfo=80fe6000, task=81af8850, tls=77929440)
[ 239.420000] Stack : 00000000 00000000 00000000 00000000 0000002a 81adbc00 00000000 81adbc00
[ 239.420000] 81b12000 80f3e802 81b12380 00000000 00000189 80eb1fbc 81b12000 00000000
[ 239.420000] 80e8bd00 80eb86c0 00000000 00000000 00000000 801e98ac 81adbc00 00000000
[ 239.420000] 81b12000 00000000 80e8bd00 80eb86c0 00000000 801ec874 00000000 80dae000
[ 239.420000] 00000000 00000014 80fb7ca8 0200bc00 00000001 00000001 802e0000 81adbc00
[ 239.420000] ...
[ 239.420000] Call Trace:[<80eb1fbc>] 0x80eb1fbc
[ 239.420000] [<801e98ac>] 0x801e98ac
[ 239.420000] [<801ec874>] 0x801ec874
[ 239.420000] [<801ecd5c>] 0x801ecd5c
[ 239.420000] [<8026a388>] 0x8026a388
[ 239.420000] [<80218750>] 0x80218750
[ 239.420000] [<802689a4>] 0x802689a4
[ 239.420000] [<801dbf88>] 0x801dbf88
[ 239.420000] [<80218750>] 0x80218750
[ 239.420000] [<801ec874>] 0x801ec874
[ 239.420000] [<80216c50>] 0x80216c50
[ 239.420000] [<80218750>] 0x80218750
[ 239.420000] [<801ecd5c>] 0x801ecd5c
[ 239.420000] [<80216c50>] 0x80216c50
[ 239.420000] [<802689b4>] 0x802689b4
[ 239.420000] [<80219eb0>] 0x80219eb0
[ 239.420000] [<80237bb8>] 0x80237bb8
[ 239.420000] [<80239734>] 0x80239734
[ 239.420000] [<8024f668>] 0x8024f668
[ 239.420000] [<801101d4>] 0x801101d4
[ 239.420000] [<8020e3dc>] 0x8020e3dc
[ 239.420000] [<801fd38c>] 0x801fd38c
[ 239.420000] [<802179f8>] 0x802179f8
[ 239.420000] [<8020ff04>] 0x8020ff04
[ 239.420000] [<801d8154>] 0x801d8154
[ 239.420000] [<80211184>] 0x80211184
[ 239.420000] [<800d8890>] 0x800d8890
[ 239.420000] [<800ec6f0>] 0x800ec6f0
[ 239.420000] [<801d9f58>] 0x801d9f58
[ 239.420000] [<801d93dc>] 0x801d93dc
[ 239.420000] [<800d9114>] 0x800d9114
[ 239.420000] [<800d93dc>] 0x800d93dc
[ 239.420000] [<801d9a70>] 0x801d9a70
[ 239.420000] [<8006a284>] 0x8006a284
[ 239.420000]
[ 239.420000]
[ 239.420000] Code: 0c3a9ac3 00402821 0040a821 <8c42000c> 54400052 00008021 8e050054 10a00005 8fb10010
[ 239.730000] ---[ end trace 7d873dc004108502 ]---
[ 239.740000] Kernel panic - not syncing: Fatal exception in interrupt
[ 239.740000] Rebooting in 3 seconds..
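Even without resolved symbols, the oops above carries a little information. The following is a minimal Python sketch, assuming that the word shown in angle brackets on the Code: line is the instruction at epc (the usual convention in MIPS oops output). It splits that word into the standard MIPS I-type fields and recomputes the address the load would touch from the register dump. The helper names are made up for this illustration.

```python
LW_OPCODE = 0x23  # MIPS32 "load word" opcode


def decode_mips_itype(word):
    """Split a 32-bit MIPS I-type instruction into (opcode, rs, rt, imm)."""
    opcode = (word >> 26) & 0x3F
    rs = (word >> 21) & 0x1F      # base register
    rt = (word >> 16) & 0x1F      # destination register
    imm = word & 0xFFFF
    if imm & 0x8000:              # the 16-bit offset is sign-extended
        imm -= 0x10000
    return opcode, rs, rt, imm


def lw_effective_address(word, regs):
    """Address accessed by an lw instruction, given a {reg_number: value} dict."""
    opcode, rs, _rt, imm = decode_mips_itype(word)
    if opcode != LW_OPCODE:
        raise ValueError("not an lw instruction")
    return (regs[rs] + imm) & 0xFFFFFFFF
```

For the word 8c42000c this decodes to lw $2, 12($2). The register dump shows $2 as 00000000, so the computed address is 0x0000000c, which matches BadVA. That is consistent with a NULL pointer being dereferenced at offset 12, but it does not say which function is involved; that still needs a log with symbols.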
Routers used: dir 601a & 615c1 tplink wr703n
aa: DISTRIB_REVISION="r39154" hostapd and mac80211 from git://nbd.name/aa-mac80211.git
hostapd: sync with trunk (as of r39155) mac80211: sync with openwrt trunk (as of r39150)
I can confirm that this problem does not happen with batman-adv 2013.4.0, but it does happen with 2014.0.0 and is easy to replicate. Currently my batman-adv 2014.0.0 package has the following patches:
$ ls feeds/routing/batman-adv/patches/
0001-batman-adv-fix-batman-adv-header-overhead-calculatio.patch
0003-batman-adv-fix-soft-interface-MTU-computation.patch
0005-batman-adv-release-vlan-object-after-checking-the-CR.patch
0002-batman-adv-fix-potential-kernel-paging-error-for-uni.patch
0004-batman-adv-fix-TT-TVLV-parsing-on-OGM-reception.patch
0007-batman-adv-use-vlan_-eth_hdr-instead-of-skb-data-in-.patch
On 2014-02-08 04:08, cmsv wrote:
[...] Just a quick note about logs like this: They're completely worthless unless you enable CONFIG_KERNEL_KALLSYMS in your .config. Without that option, the kernel does not resolve function names, and the addresses shown with a custom build usually do not match the addresses of other builds.
- Felix
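As a rough illustration of what symbol resolution adds, here is a small Python sketch that maps a raw address to "name+offset" using a System.map-style table. The symbol names in the usage note are invented placeholders, not real kernel symbols, and the table would have to come from the exact same build as the crashing kernel. Addresses inside loadable modules are not listed in System.map, so this only covers the core kernel.

```python
import bisect


def load_symbols(system_map_text):
    """Parse lines of the form '<hex address> <type> <name>' into a sorted list."""
    symbols = []
    for line in system_map_text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return 'name+0xoffset' for the closest symbol at or below addr, else None."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, name = symbols[i]
    return f"{name}+{addr - start:#x}"
```

Given a table containing a placeholder entry "80218700 T example_b", resolve(symbols, 0x80218750) returns "example_b+0x50". The same address looked up against the System.map of a different build would most likely land in an unrelated function, which is why raw addresses from a custom image are of little use to anyone else.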
On 08/02/14 04:08, cmsv wrote:
> [ 239.420000] [<8020ff04>] 0x8020ff04
> [ 239.420000] [<801d8154>] 0x801d8154
> [ 239.420000] [<80211184>] 0x80211184
> [ 239.420000] [<800d8890>] 0x800d8890
> [ 239.420000] [<800ec6f0>] 0x800ec6f0
> [ 239.420000] [<801d9f58>] 0x801d9f58
> [ 239.420000] [<801d93dc>] 0x801d93dc
> [ 239.420000] [<800d9114>] 0x800d9114
> [ 239.420000] [<800d93dc>] 0x800d93dc
> [ 239.420000] [<801d9a70>] 0x801d9a70
> [ 239.420000] [<8006a284>] 0x8006a284
> [ 239.420000]
> [ 239.420000]
> [ 239.420000] Code: 0c3a9ac3 00402821 0040a821 <8c42000c> 54400052 00008021 8e050054 10a00005 8fb10010
> [ 239.730000] ---[ end trace 7d873dc004108502 ]---
> [ 239.740000] Kernel panic - not syncing: Fatal exception in interrupt
> [ 239.740000] Rebooting in 3 seconds..
Hi!
Have you been able to run a test with kernel symbols enabled?? That would be a great help ;)
Cheers,
inline
On 02/12/2014 02:23 AM, Antonio Quartulli wrote:
Have you been able to run a test with kernel symbols enabled?? That would be a great help ;)
I have tried to compile images with kernel symbols enabled, but no matter how much I strip the build of non-essential features, I am unable to create images that fit in the 4 MB flash of the routers I have, which are mostly dlink routers. Given the shortage of time I have at the moment, I will have to postpone this testing and stick with batman-adv 2013.4.0 for now, since 2014 is not giving me the same stability.
Last night I tried 2014 again with a different router as the gateway and noticed that the reboot was only happening on 1 router instead of 2. Replicating it is easy as long as I make the gateway disappear in some way.
Cheers,
On 12/02/14 11:40, cmsv wrote:
Last night i tried 2014 again and changed the router that was going to be the gateway and noticed that the reboot was only happening in 1 router instead of 2. Replicating is easy as long as i make the gateway disappear in some way.
You should perform the same test now with the new patches that I just sent to the ml.
Maybe your problem was merely a consequence of the bug we just fixed.
Cheers,
I have noticed a few patches being sent, but unless I am missing something they are all for the development branch. Next week I will be deploying new firmware and creating new access points, and I cannot afford "testing" in a production environment. I will return to the 2014 branch after my trip and will try to debug the issue once and for all, and then report my findings.
On 13/02/14 01:55, cmsv wrote:
> I have noticed a few patches being sent but unless i missing something they are all for the development branch.
No, most of them are for the maint branch (thus the 2014.0.0 branch).
> Next week i will be deploying new firmware and create new access points and cannot afford "testing" on production environment.
I understand, but if you could give it a try before leaving it would be nice! :)
Thanks a lot anyway!
b.a.t.m.a.n@lists.open-mesh.org