Hi,
we are using batman-adv in our Freifunk network for meshing over several interfaces. The nodes are connected to VPN servers via fastd, and the VPN servers are connected to each other via tincd or gretap.
When we add the fastd interface and the tinc (or gretap) interface to bat0, the kernel crashes after some time. With 2013.4, no crashes occurred, but since the update to 2014.3 the issue appears sometimes after 5 minutes, sometimes after 3 days. Upgrading to 2015.1 did not solve the problem; if anything, the crashes seem to occur more often now.
Unfortunately we have no logs at the moment, because after a crash SSH access to the servers is no longer possible.
nc / mm / bl are disabled and seem to have no effect on this issue.
Regards Bjoern
* Bjoern Franke bjo@nord-west.org [18.08.2015 09:47]:
Unfortunately we have no logs at the moment, because after a crash SSH access to the servers is no longer possible.
Try setting:
/sbin/sysctl -w kernel.panic_on_oops=1
/sbin/sysctl -w kernel.panic=10
/sbin/sysctl -w vm.panic_on_oom=2
If the device crashes (it will now reboot afterwards), you will find the crash log in this file:
/sys/kernel/debug/crashlog
bye, bastian - happy crashing!
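For readers following along, here is a minimal sketch of how the three settings suggested above could be checked on a running system. kernel.panic_on_oops=1 turns an oops into a panic, kernel.panic=10 reboots 10 seconds after a panic, and vm.panic_on_oom=2 panics on any out-of-memory condition. The helper names sysctl_path and report_settings are made up for this illustration and are not part of any tool. Note that values set with sysctl -w do not survive a reboot unless they are also written to /etc/sysctl.conf or a file under /etc/sysctl.d/.

```shell
#!/bin/sh
# Sketch: compare the suggested panic settings with the values currently
# active on this machine. Reads /proc/sys directly, so no root is needed.

# Keys and values taken from the suggestion in the message above.
SETTINGS="kernel.panic_on_oops=1
kernel.panic=10
vm.panic_on_oom=2"

# Hypothetical helper: map a dotted sysctl key to its /proc/sys path.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Hypothetical helper: print one "key: current=... wanted=..." line per setting.
report_settings() {
    printf '%s\n' "$SETTINGS" | while IFS='=' read -r key wanted; do
        path=$(sysctl_path "$key")
        if [ -r "$path" ]; then
            current=$(cat "$path")
        else
            current="unavailable"
        fi
        printf '%s: current=%s wanted=%s\n' "$key" "$current" "$wanted"
    done
}

report_settings
```

The same three key=value lines from SETTINGS are already in the format that sysctl.conf expects, so they could be copied into such a file as they are.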
On Tuesday, 18.08.2015, 10:08 +0200, Bastian Bittorf wrote:
* Bjoern Franke bjo@nord-west.org [18.08.2015 09:47]:
Unfortunately we have no logs at the moment, because after a crash SSH access to the servers is no longer possible.
Try setting:
/sbin/sysctl -w kernel.panic_on_oops=1
/sbin/sysctl -w kernel.panic=10
/sbin/sysctl -w vm.panic_on_oom=2
If the device crashes (it will now reboot afterwards), you will find the crash log in this file:
/sys/kernel/debug/crashlog
Thanks for the hint. It did not work on the Debian machines, but I got the systems running with crashkernel enabled. Now we have captured the first crash: https://p.rrbone.net/paste/nnNHrIJI#oHfBMOs2
Regards Bjoern
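The message above does not say how crashkernel was enabled. As a rough sketch, and assuming a kdump-style setup (on Debian this is usually done with the kdump-tools package, which is an assumption here, not something stated in the thread), the two preconditions can be checked like this: the crashkernel= parameter must be on the kernel command line, and a capture kernel must be loaded. The function name has_crashkernel is made up for this example.

```shell
#!/bin/sh
# Sketch: check whether this machine is prepared to capture a kernel crash.

# Hypothetical helper: succeed if the given kernel command line contains
# a crashkernel= parameter (which reserves memory for the capture kernel).
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

cmdline=$(cat /proc/cmdline 2>/dev/null)
if has_crashkernel "$cmdline"; then
    echo "crashkernel= is present on the kernel command line"
else
    echo "crashkernel= is missing from the kernel command line"
fi

# A value of 1 means a capture kernel is loaded and will take over on panic.
if [ -r /sys/kernel/kexec_crash_loaded ]; then
    echo "kexec_crash_loaded: $(cat /sys/kernel/kexec_crash_loaded)"
else
    echo "kexec_crash_loaded: unavailable"
fi
```

If both checks pass, a panic should boot the capture kernel, which can then save the log of the crashed kernel even when SSH access is gone.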
Hi,
On 2015-08-18 18:35, Bjoern Franke wrote:
Thanks for the hint. It did not work on the Debian machines, but I got the systems running with crashkernel enabled. Now we have captured the first crash: https://p.rrbone.net/paste/nnNHrIJI#oHfBMOs2
We've seen these on our Goettingen Freifunk gateways, too. There, too, batadv_frag_purge_orig was the smoking gun. However, I didn't report it, because:
- first and foremost, we were using the outdated legacy 2013.4 version
- it was most probably an issue with RCU lists
- and either disabling SMP or using a much more current kernel fixed it.
So I blamed a buggy RCU implementation in older kernels, plus maybe some ill behaviour in the old batman-adv codebase. The crashing kernel was the old debian-wheezy one - pretty old, I'd say.
-hwh
On 08/20/2015 12:08 PM, Hans-Werner Hilse wrote:
We've seen these on our Goettingen Freifunk gateways, too. There, too, batadv_frag_purge_orig was the smoking gun. However, I didn't report it, because:
- first and foremost, we were using the outdated legacy 2013.4 version
- it was most probably an issue with RCU lists
- and either disabling SMP or using a much more current kernel fixed it.
So I blamed a buggy RCU implementation in older kernels, plus maybe some ill behaviour in the old batman-adv codebase. The crashing kernel was the old debian-wheezy one - pretty old, I'd say.
-hwh
This is an independent bug (2013.4 uses a completely different fragmentation implementation) that has been reported in https://github.com/freifunk-gluon/batman-adv-legacy/issues/1 . Please don't bother the upstream BATMAN developers with batman-adv-legacy bugs.
Matthias
Hi,
We've seen these on our Goettingen Freifunk gateways, too. There, too, batadv_frag_purge_orig was the smoking gun. However, I didn't report it, because:
- first and foremost, we were using the outdated legacy 2013.4 version
We also had some other issues with older versions, so we upgraded to 2015.1, hoping the gateways would run stably.
- it was most probably an issue with RCU lists
- and either disabling SMP or using a much more current kernel fixed it.
Did you build your own kernels with SMP disabled?
So I blamed a buggy RCU implementation in older kernels, plus maybe some ill behaviour in the old batman-adv codebase. The crashing kernel was the old debian-wheezy one - pretty old, I'd say.
We are running 3.16 and have partially upgraded to 4.1 for testing purposes. But we also see some "general protection fault: 0000 [#1] SMP" crashes on the gateways that are unrelated to batman, on different hardware. For the record, a KVM gateway does not crash (the other ones are dedicated servers without virtualization).
Regards Bjoern
Hi Bjoern,
thanks a lot for reporting this issue - it looks like there are some problems in the cleanup of OGMs when fragmentation is used.
For now, I've created a ticket here:
https://www.open-mesh.org/issues/223
Thanks, Simon
b.a.t.m.a.n@lists.open-mesh.org