Okay, I could narrow it down a little further: there is a problem with the num_ifs variable. If activate_module() gets called in proc_interfaces_write() and an OGM from a neighbour arrives for the first time after that, but before we've set 'num_ifs = if_num + 1;', then we don't allocate enough space in get_orig_node(), which leads to a kernel panic.
num_ifs is only used in those two functions, so locking this variable seemed like an easy fix. Nevertheless, I'm unsure whether that is enough, as quite a lot of copies of num_ifs are being stored/modified in a lot of other functions (if_num for instance), which gave me some headaches today :). Any ideas how this problem could be dealt with instead?
The problem can be easily reproduced by adding, for instance, an "ssleep(3)" in front of "num_ifs = if_num + 1;" in proc_interfaces_write(). Then insmod the module, connect a running batman-adv node to the other end of the interface being used and set those interfaces up. Adding the interface to batman-adv then causes the kernel panic within those 3 seconds. Putting the ssleep after "num_ifs = ..." does not cause any kernel panics on my VM here.
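To make the ordering issue easier to discuss, here is a minimal userspace sketch of it. It is not the actual batman-adv code: the struct, its fields and all model_* names are hypothetical stand-ins, and I am only assuming that get_orig_node() sizes its per-interface data from num_ifs, as described above. The sketch replays both orderings (new node allocated before or after num_ifs is updated) and checks whether the slot for the new interface would be in bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the module-wide interface counter. */
static int num_ifs;

/* Hypothetical model of an originator node with per-interface data. */
struct orig_node_model {
	size_t slots;               /* per-interface slots actually allocated */
	unsigned char *per_if_data; /* one byte per interface in this model */
};

/* Models get_orig_node(): sizes the per-interface data from num_ifs. */
static struct orig_node_model *model_get_orig_node(void)
{
	struct orig_node_model *n = malloc(sizeof(*n));

	if (!n)
		return NULL;
	n->slots = (size_t)num_ifs;
	/* Allocate at least one byte so the pointer is always valid. */
	n->per_if_data = calloc(n->slots ? n->slots : 1, 1);
	if (!n->per_if_data) {
		free(n);
		return NULL;
	}
	return n;
}

/* Returns 1 if accessing the slot for if_num would stay in bounds. */
static int model_slot_in_bounds(const struct orig_node_model *n, int if_num)
{
	assert(n != NULL);
	return if_num >= 0 && (size_t)if_num < n->slots;
}

static void model_free(struct orig_node_model *n)
{
	free(n->per_if_data);
	free(n);
}

/*
 * Replays one interleaving for a newly added interface with index if_num.
 * update_first != 0: num_ifs is raised before the first OGM creates a node.
 * update_first == 0: the OGM arrives first (the suspected race window).
 * Returns 1 if the new interface's slot is in bounds, 0 if not, -1 on OOM.
 */
static int model_race_in_bounds(int if_num, int update_first)
{
	struct orig_node_model *n;
	int ok;

	num_ifs = if_num;               /* interfaces 0..if_num-1 exist so far */
	if (update_first)
		num_ifs = if_num + 1;
	n = model_get_orig_node();      /* first OGM from the neighbour */
	if (!n)
		return -1;
	if (!update_first)
		num_ifs = if_num + 1;   /* too late for the node above */
	ok = model_slot_in_bounds(n, if_num);
	model_free(n);
	return ok;
}
```

With this model, model_race_in_bounds(0, 0) reports the slot as out of bounds, while model_race_in_bounds(0, 1) reports it as in bounds, which matches the observation that the ssleep only hurts when placed in front of the assignment. The model says nothing about the other copies of num_ifs, so it does not answer whether locking alone would be sufficient.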
Cheers, Linus
On Mon, Feb 08, 2010 at 08:38:48PM +0100, Linus Lüssing wrote:
Hi guys,
I think I've seen this bug a couple of times but I've never been able to reproduce it. Now I added a little patch to slow down the activate_module() procedure and the bug occurs every time. My question is: did I make a race condition apparent, or did I introduce a bug with this patch?
Cheers, Linus