Hi guys,
I think I've seen this bug a couple of times but I've never been able to reproduce it. Now I added a little patch to slow down the activate_module() procedure and the bug occures every time now. My question is, did I make a race condition apparent or did I introduce a bug with this patch?
Cheers, Linus
Okay, I could narrow it down a little further: There is a problem with the num_ifs variable. When activate_module() gets called in proc_interfaces_write() and an ogm of a neighbour arrives after this for the first time but before we've set 'num_ifs = if_num + 1;', then we're not allocating enough space in get_orig_node(), leading to a kernel panic.
num_ifs is just getting used in those two functions, locking this variable seemed an easy choice for fixing this. But nevertheless, I'm unsure if this might be enough, as quite a lot of copies of num_ifs are being stored/modified in a lot of other functions (if_num for instance) which gave me some headaches today :). Therefore I'm doubting the simple locking of num_ifs might be enough. Any ideas how this problem could be dealt with instead?
The problem can be easily reproduced by adding a "ssleep(3)" for instance in front of "num_ifs = if_num + 1;" in proc_interfaces_write(). Then insmod, connect a running batman-adv node to the other end of the interface being used and set those interfaces up. Adding the interface to batman-adv then causes the kernel panic within those 3 seconds then. Putting the ssleep behind num_ifs = ... does not cause any kernel panics on my vm here.
Cheers, Linus
On Mon, Feb 08, 2010 at 08:38:48PM +0100, Linus Lüssing wrote:
Hi guys,
I think I've seen this bug a couple of times but I've never been able to reproduce it. Now I added a little patch to slow down the activate_module() procedure and the bug occures every time now. My question is, did I make a race condition apparent or did I introduce a bug with this patch?
Cheers, Linus
Hi,
I think I've seen this bug a couple of times but I've never been able to reproduce it. Now I added a little patch to slow down the activate_module() procedure and the bug occures every time now. My question is, did I make a race condition apparent or did I introduce a bug with this patch?
the race condition existed before - you just make it more visible. No matter how slow the code is being processed it should not lead to a crash.
Okay, I could narrow it down a little further: There is a problem with the num_ifs variable. When activate_module() gets called in proc_interfaces_write() and an ogm of a neighbour arrives after this for the first time but before we've set 'num_ifs = if_num + 1;', then we're not allocating enough space in get_orig_node(), leading to a kernel panic.
I think you managed to uncover 2 race conditions: * receiving a packet before the module is fully initialized * concurrent activate_module() calls
Better than introducing some locking code which would need to halt the whole module we should make sure that batman-adv does not process packets before its initialization is complete.
Regards, Marek
On Thursday 11 February 2010 17:20:16 Marek Lindner wrote:
Better than introducing some locking code which would need to halt the whole module we should make sure that batman-adv does not process packets before its initialization is complete.
I made a proof-of-concept patch (no testing yet) but it should illustrate what I was talking about. Please try it.
Cheers, Marek
Works fine and really seems to make a difference when trying with the ssleep() stuff from linus (which also crashes in my qemu testbed). After applying your patch it works fine. I have committed it along with some other sanity checks in r1573.
On Fri, Feb 12, 2010 at 07:06:46PM +0800, Marek Lindner wrote:
On Thursday 11 February 2010 17:20:16 Marek Lindner wrote:
Better than introducing some locking code which would need to halt the whole module we should make sure that batman-adv does not process packets before its initialization is complete.
I made a proof-of-concept patch (no testing yet) but it should illustrate what I was talking about. Please try it.
Cheers, Marek
diff --git a/batman-adv-kernelland/proc.c b/batman-adv-kernelland/proc.c index 3f5744d..76dd4d2 100644 --- a/batman-adv-kernelland/proc.c +++ b/batman-adv-kernelland/proc.c @@ -115,10 +115,6 @@ static ssize_t proc_interfaces_write(struct file *instance,
hardif_add_interface(if_string, if_num);
- if ((atomic_read(&module_state) == MODULE_INACTIVE) &&
(hardif_get_active_if_num() > 0))
activate_module();
- rcu_read_lock(); if (list_empty(&if_list)) { rcu_read_unlock();
@@ -127,6 +123,11 @@ static ssize_t proc_interfaces_write(struct file *instance, rcu_read_unlock();
num_ifs = if_num + 1;
- if ((atomic_read(&module_state) == MODULE_INACTIVE) &&
(hardif_get_active_if_num() > 0))
activate_module();
- return count;
end: @@ -453,7 +454,7 @@ static ssize_t proc_aggr_write(struct file *file, const char __user *buffer, atomic_read(&aggregation_enabled), (aggregation_enabled_tmp == 1 ? "enabled" : "disabled"), aggregation_enabled_tmp);
atomic_set(&aggregation_enabled,
}atomic_set(&aggregation_enabled, (unsigned)aggregation_enabled_tmp);
b.a.t.m.a.n@lists.open-mesh.org