Hi,
I think I've seen this bug a couple of times but I've never been able to reproduce it. I have now added a little patch to slow down the activate_module() procedure, and the bug occurs every time. My question is: did I make an existing race condition apparent, or did I introduce a bug with this patch?
The race condition existed before; you just made it more visible. No matter how slowly the code is processed, it should not lead to a crash.
Okay, I could narrow it down a little further: there is a problem with the num_ifs variable. If activate_module() gets called in proc_interfaces_write() and an OGM from a neighbour arrives for the first time after that call but before we have set 'num_ifs = if_num + 1;', then we do not allocate enough space in get_orig_node(), which leads to a kernel panic.
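To illustrate the ordering problem, here is a minimal user-space sketch. It is not the batman-adv code: struct orig_sketch, orig_alloc(), slot_is_valid() and orig_free() are hypothetical stand-ins, and the only assumption taken from the description above is that the allocation is sized by whatever num_ifs holds at the time of the call.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-originator state that
 * get_orig_node() allocates; not the real batman-adv structure. */
struct orig_sketch {
	size_t n_slots;          /* number of per-interface slots allocated */
	unsigned long *per_if;   /* one slot per interface */
};

/* Allocate one slot per interface, sized by the value of num_ifs
 * that the caller sees right now (which may be stale). */
struct orig_sketch *orig_alloc(size_t num_ifs_now)
{
	struct orig_sketch *o = malloc(sizeof(*o));

	if (!o)
		return NULL;

	o->n_slots = num_ifs_now;
	/* Always request at least one element so the buffer pointer is
	 * valid even when the slot count is zero. */
	o->per_if = calloc(num_ifs_now ? num_ifs_now : 1, sizeof(*o->per_if));
	if (!o->per_if) {
		free(o);
		return NULL;
	}
	return o;
}

/* True only if the slot for interface if_num was actually allocated. */
bool slot_is_valid(const struct orig_sketch *o, size_t if_num)
{
	return if_num < o->n_slots;
}

void orig_free(struct orig_sketch *o)
{
	if (o) {
		free(o->per_if);
		free(o);
	}
}
```

With if_num = 0 and num_ifs still at its old value of 0, orig_alloc(0) yields an object for which slot_is_valid(o, 0) is false, so any write to slot 0 would be out of bounds. Once 'num_ifs = if_num + 1;' has run, orig_alloc(1) gives a valid slot 0.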
I think you managed to uncover 2 race conditions:
* receiving a packet before the module is fully initialized
* concurrent activate_module() calls
Rather than introducing locking code that would need to halt the whole module, we should make sure that batman-adv does not process packets before its initialization is complete.
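A rough user-space sketch of that idea, using C11 atomics, could look like the following. The names mod_state, MOD_INACTIVE, MOD_ACTIVE, activate_sketch() and handle_packet_sketch() are made up for illustration and are not meant to match the actual batman-adv symbols; in the kernel one would presumably use atomic_t and the kernel's own barriers instead.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical module states; illustrative names only. */
enum mod_state_sketch { MOD_INACTIVE = 0, MOD_ACTIVE = 1 };

static atomic_int mod_state = MOD_INACTIVE;
static size_t num_ifs;

/* Finish all initialization first and publish the ACTIVE state last.
 * The release store makes the num_ifs update visible to any reader
 * that observes MOD_ACTIVE with an acquire load. */
void activate_sketch(size_t if_num)
{
	num_ifs = if_num + 1;
	atomic_store_explicit(&mod_state, MOD_ACTIVE, memory_order_release);
}

/* Receive path: returns false (packet dropped) while the module is
 * not fully initialized, true once it is safe to process packets. */
bool handle_packet_sketch(void)
{
	if (atomic_load_explicit(&mod_state, memory_order_acquire) != MOD_ACTIVE)
		return false;

	/* From here on num_ifs holds the value set during activation. */
	return num_ifs > 0;
}
```

A packet arriving before activate_sketch() has finished is simply dropped, so the receive path never sees a half-initialized num_ifs. This sketch only covers the first race; concurrent activation calls would still need to be serialized in some way.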
Regards, Marek