This patch fixes a potential memory leak which can occur once an originator times out. On timeout the according global translation table entry might not get purged correctly. Furthermore, the non purged TT entry will cause its orig-node to leak, too. Which additionally can lead to the new multicast optimization feature not kicking in because of a therefore bogus counter.
In the wild with larger mesh networks we saw this leak quite regularly, resulting in routers to reboot or killed processes. This was because of a combination of two bugs: The bug fixed by commit "batman-adv: fix delayed foreign originator recognition" (8a2ad5204674) amplified this memory leak heavily. Since that commit I'd expect it to happen rarely, probably only in paused and resumed VMs and devices previously in stand-by.
The issue this patch fixes is caused by batadv_orig_node_free_rcu() never being called because of not yet released references to the orig-node. References which were supposed to be released through batadv_orig_node_free_rcu()->batadv_tt_global_del_orig().
Fixing the issue by moving batadv_tt_global_del_orig() out of the rcu callback.
Signed-off-by: Linus Lüssing linus.luessing@c0d3.blue --- originator.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/originator.c b/originator.c index 648bdba..bea8198 100644 --- a/originator.c +++ b/originator.c @@ -570,9 +570,6 @@ static void batadv_orig_node_free_rcu(struct rcu_head *rcu)
batadv_frag_purge_orig(orig_node, NULL);
- batadv_tt_global_del_orig(orig_node->bat_priv, orig_node, -1, - "originator timed out"); - if (orig_node->bat_priv->bat_algo_ops->bat_orig_free) orig_node->bat_priv->bat_algo_ops->bat_orig_free(orig_node);
@@ -978,6 +975,9 @@ static void _batadv_purge_orig(struct batadv_priv *bat_priv) if (batadv_purge_orig_node(bat_priv, orig_node)) { batadv_gw_node_delete(bat_priv, orig_node); hlist_del_rcu(&orig_node->hash_entry); + batadv_tt_global_del_orig(orig_node->bat_priv, + orig_node, -1, + "originator timed out"); batadv_orig_node_free_ref(orig_node); continue; }
On Saturday 13 December 2014 23:32:15 Linus Lüssing wrote:
This patch fixes a potential memory leak which can occur once an originator times out. On timeout the according global translation table entry might not get purged correctly. Furthermore, the non purged TT entry will cause its orig-node to leak, too. Which additionally can lead to the new multicast optimization feature not kicking in because of a therefore bogus counter.
So far, I am with you ..
In the wild with larger mesh networks we saw this leak quite regularly, resulting in routers to reboot or killed processes. This was because of a combination of two bugs: The bug fixed by commit "batman-adv: fix delayed foreign originator recognition" (8a2ad5204674) amplified this memory leak heavily. Since that commit I'd expect it to happen rarely, probably only in paused and resumed VMs and devices previously in stand-by.
This section shouldn't be part of the official commit message. It is hardly relevant to the reviewer how often a memleak occurs and whether or not you need a VM to trigger it. The provided commit id isn't valid in the Linux tree.
The issue this patch fixes is caused by batadv_orig_node_free_rcu() never being called because of not yet released references to the orig-node. References which were supposed to be released through batadv_orig_node_free_rcu()->batadv_tt_global_del_orig().
Could you please provide addition insight as to which references are still held ? I did look around but nothing obvious jumped at me.
Generally, it wouldn't be bad if the commit message went into deeper detail describing the nature of the bug instead of the middle section above to make it easy to understand what is being fixed.
Cheers, Marek
On Mon, Dec 29, 2014 at 11:52:53AM +0800, Marek Lindner wrote:
The issue this patch fixes is caused by batadv_orig_node_free_rcu() never being called because of not yet released references to the orig-node. References which were supposed to be released through batadv_orig_node_free_rcu()->batadv_tt_global_del_orig().
Could you please provide addition insight as to which references are still held ? I did look around but nothing obvious jumped at me.
The batadv_tt_global_entry->orig_list holds the reference to the orig-node. Usually this reference is released after BATADV_PURGE_TIMEOUT through: _batadv_purge_orig()-> batadv_purge_orig_node()->batadv_update_route()->_batadv_update_route()-> batadv_tt_global_del_orig() which purges this global tt entry and releases the reference to the orig-node.
However, if between two batadv_purge_orig_node() calls the orig-node timeout grew to 2*BATADV_PURGE_TIMEOUT then this call path isn't reached (*). Instead the according orig-node is removed from the originator hash in _batadv_purge_orig(), the batadv_update_route() part is skipped and won't be reached anymore.
It seems that in that case batadv_orig_node_free_rcu()->batadv_tt_global_del_orig() is supposed to purge the global tt entry and to release the orig-node reference but it's not called because batadv_orig_node_free_rcu() is only called once all references are freed: A chicken 'n' egg situation.
Generally, it wouldn't be bad if the commit message went into deeper detail describing the nature of the bug instead of the middle section above to make it easy to understand what is being fixed.
Hm, hm, my intention was to somehow/somewhere document how these two, small and rare bugs together created this severe bug. The issue fixed by 8a2ad5204674 was the main trigger for (*) in previous releases.
But we can also skip that middle section if you want. Then I'll just add a note to the ticket on redmine.
Cheers, Marek
Cheers, Linus
On 29/12/14 04:52, Marek Lindner wrote:
On Saturday 13 December 2014 23:32:15 Linus Lüssing wrote:
This patch fixes a potential memory leak which can occur once an originator times out. On timeout the according global translation table entry might not get purged correctly. Furthermore, the non purged TT entry will cause its orig-node to leak, too. Which additionally can lead to the new multicast optimization feature not kicking in because of a therefore bogus counter.
So far, I am with you ..
In the wild with larger mesh networks we saw this leak quite regularly, resulting in routers to reboot or killed processes. This was because of a combination of two bugs: The bug fixed by commit "batman-adv: fix delayed foreign originator recognition" (8a2ad5204674) amplified this memory leak heavily. Since that commit I'd expect it to happen rarely, probably only in paused and resumed VMs and devices previously in stand-by.
This section shouldn't be part of the official commit message. It is hardly relevant to the reviewer how often a memleak occurs and whether or not you need a VM to trigger it. The provided commit id isn't valid in the Linux tree.
The issue this patch fixes is caused by batadv_orig_node_free_rcu() never being called because of not yet released references to the orig-node. References which were supposed to be released through batadv_orig_node_free_rcu()->batadv_tt_global_del_orig().
Could you please provide addition insight as to which references are still held ? I did look around but nothing obvious jumped at me.
Generally, it wouldn't be bad if the commit message went into deeper detail describing the nature of the bug instead of the middle section above to make it easy to understand what is being fixed.
Hi Linus and thanks for fixing this TT bug!
I agree with Marek about extending the commit message with a better explanation of the actual bug, but at the same time I think it is good to keep the commit message and ID of the other involved patches.
Moreover, please keep the commit ID of our batman-adv.git tree - I'll then take care of converting them when sending the patches upstream.
Cheers,
On 04/01/15 17:05, Antonio Quartulli wrote:
Hi Linus and thanks for fixing this TT bug!
I agree with Marek about extending the commit message with a better explanation of the actual bug, but at the same time I think it is good to keep the commit message and ID of the other involved patches.
Moreover, please keep the commit ID of our batman-adv.git tree - I'll then take care of converting them when sending the patches upstream.
I forgot..for the patch:
Acked-by: Antonio Quartulli antonio@meshcoding.com
On Sunday 04 January 2015 17:11:43 Antonio Quartulli wrote:
On 04/01/15 17:05, Antonio Quartulli wrote:
Hi Linus and thanks for fixing this TT bug!
I agree with Marek about extending the commit message with a better explanation of the actual bug, but at the same time I think it is good to keep the commit message and ID of the other involved patches.
Moreover, please keep the commit ID of our batman-adv.git tree - I'll then take care of converting them when sending the patches upstream.
I forgot..for the patch:
Acked-by: Antonio Quartulli antonio@meshcoding.com
Applied in revision 21fa264 with commit message beautifications.
Thanks, Marek
b.a.t.m.a.n@lists.open-mesh.org