Hi,
I was trying to implement a multicast-to-multi-unicast conversion in batman-adv with the following patch:
https://patchwork.open-mesh.org/patch/17729/
However, on OpenWrt with a 4.9.146 kernel I get a "Kernel bug detected [...] nf_ct_del_from_dying_or_unconfirmed_list".
This only happens upon sending a SIGTERM to the network manager "netifd" (so upon network shutdown). And only if the node is connected to mesh of reasonable size, so if there is a certain amount of multicast traffic for the multicast-to-multi-unicast patch to work on.
Upon normal operation, no such crash seems to occur.
The crash itself is triggered by the:
BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
in here:
https://elixir.bootlin.com/linux/v4.9.146/source/net/netfilter/nf_conntrack_...
What confuses me a bit is, that the multicast-to-multi-unicast conversion uses the same/similar, simple skb_copy() approach like the "classic broadcast flooding" approach in batman-adv so far. The latter too transmits three redundant frames via skb_copy() to increase reliability for Wifi broadcast packets.
One difference is that the broadcast flooding adds a bit of delay between each transmission. Which the multicast-to-multi-unicast doesn't.
Looking at "git log net/netfilter/nf_conntrack_core.c" I noticed "netfilter: nfnetlink_queue: resolve clash for unconfirmed conntracks" (368982cd7). Which says:
"In nfqueue, two consecutive skbuffs may race to create the conntrack entry. Hence, the one that loses the race gets dropped due to clash in the insertion into the hashes from the nf_conntrack_confirm() path."
This patch is only part of >= 4.18, so not part of the firmware we use yet. Could this issue somehow be related?
Other than that I was wondering whether we might be missing to reset something after skb_copy()-ing. We do a "skb->protocol = htons(ETH_P_BATMAN)" right before the dev_queue_xmit(skb) call in batman-adv which sends the encapsulated frame into the mesh. And we do a nf_reset(skb) after decapsulating a frame received from the mesh. But maybe that is not enough?
Ticket this issue was reported at:
https://github.com/freifunk-gluon/gluon/issues/1468
Regards, Linus
Linus Lüssing linus.luessing@c0d3.blue wrote:
This only happens upon sending a SIGTERM to the network manager "netifd" (so upon network shutdown). And only if the node is connected to mesh of reasonable size, so if there is a certain amount of multicast traffic for the multicast-to-multi-unicast patch to work on.
Does this still trigger when you do
nf_reset(newskb);
after skb_copy()?
One difference is that the broadcast flooding adds a bit of delay between each transmission. Which the multicast-to-multi-unicast doesn't.
Are those transmits done asynchronously?
conntrack assumes exclusive access to skb->nfct if the conntrack entry isn't in main hash table.
(i.e, when nf_ct_is_confirmed returns false).
"In nfqueue, two consecutive skbuffs may race to create the conntrack entry. Hence, the one that loses the race gets dropped due to clash in the insertion into the hashes from the nf_conntrack_confirm() path."
This patch is only part of >= 4.18, so not part of the firmware we use yet. Could this issue somehow be related?
Possible, but I don't think its likely. In the nfquee case there is asynchronous processing, but no skb can share the same conntrack entry unless the entry is already in the conntrack hash table.
Other than that I was wondering whether we might be missing to reset something after skb_copy()-ing. We do a "skb->protocol = htons(ETH_P_BATMAN)" right before the dev_queue_xmit(skb) call in batman-adv which sends the encapsulated frame into the mesh. And we do a nf_reset(skb) after decapsulating a frame received from the mesh. But maybe that is not enough?
I suggest nf_reset() on xmit, if you can be sure that the xmit won't occur back-to-self (netns case is fine, as skb scrubbing resets skb nfct anyway) and the skb isn't on a rexmit list somewhere. (clone is fine, only shared skb would break).
I think this is the same issue as this one.
http://patchwork.ozlabs.org/patch/995825/
Florian Westphal fw@strlen.de 於 2019年1月28日 週一 上午6:51寫道:
Linus Lüssing linus.luessing@c0d3.blue wrote:
This only happens upon sending a SIGTERM to the network manager "netifd" (so upon network shutdown). And only if the node is connected to mesh of reasonable size, so if there is a certain amount of multicast traffic for the multicast-to-multi-unicast patch to work on.
Does this still trigger when you do
nf_reset(newskb);
after skb_copy()?
One difference is that the broadcast flooding adds a bit of delay between each transmission. Which the multicast-to-multi-unicast doesn't.
Are those transmits done asynchronously?
conntrack assumes exclusive access to skb->nfct if the conntrack entry isn't in main hash table.
(i.e, when nf_ct_is_confirmed returns false).
"In nfqueue, two consecutive skbuffs may race to create the conntrack entry. Hence, the one that loses the race gets dropped due to clash in the insertion into the hashes from the nf_conntrack_confirm() path."
This patch is only part of >= 4.18, so not part of the firmware we use yet. Could this issue somehow be related?
Possible, but I don't think its likely. In the nfquee case there is asynchronous processing, but no skb can share the same conntrack entry unless the entry is already in the conntrack hash table.
Other than that I was wondering whether we might be missing to reset something after skb_copy()-ing. We do a "skb->protocol = htons(ETH_P_BATMAN)" right before the dev_queue_xmit(skb) call in batman-adv which sends the encapsulated frame into the mesh. And we do a nf_reset(skb) after decapsulating a frame received from the mesh. But maybe that is not enough?
I suggest nf_reset() on xmit, if you can be sure that the xmit won't occur back-to-self (netns case is fine, as skb scrubbing resets skb nfct anyway) and the skb isn't on a rexmit list somewhere. (clone is fine, only shared skb would break).
On Mon, Jan 28, 2019 at 02:39:40PM +0100, Florian Westphal wrote:
Chieh-Min Wang chiehmin18@gmail.com wrote:
I think this is the same issue as this one.
Yes, likely.
I see.
I don't think letting the packet go through is a good idea. Not sure NAT will work fine, packets would go through being unmangled? I think we should still drop the packet until we fix this.
Pablo Neira Ayuso pablo@netfilter.org wrote:
I don't think letting the packet go through is a good idea. Not sure NAT will work fine, packets would go through being unmangled? I think we should still drop the packet until we fix this.
Unfortuntely this is still a band-aid solution, nfqueue + bridge doesn't work when mcast/flood is involved.
Problematic cases are NAT (several bindings on same conntrack simultaneously) and extension realloction. They are not a problem in most cases due to prealloced space and because extensions are commonly added before bridge starts to clone for flooding.
For NAT, the race window is small and iirc we changed nat core to just warn in case the nat bit is already set.
I think it will work fine in most cases with this patch (i.e., witch accept verdict) though; it is better than what we do now.
I think 71d8c47fc653711c4(netfilter: conntrack: introduce clash resolution on insertion race) is doing the same logic for resolving conntrack clashing.
The first packet who win the race should handle the NAT stuff on the conntrack right?
Pablo Neira Ayuso pablo@netfilter.org 於 2019年1月28日 週一 下午9:50寫道:
On Mon, Jan 28, 2019 at 02:39:40PM +0100, Florian Westphal wrote:
Chieh-Min Wang chiehmin18@gmail.com wrote:
I think this is the same issue as this one.
Yes, likely.
I see.
I don't think letting the packet go through is a good idea. Not sure NAT will work fine, packets would go through being unmangled? I think we should still drop the packet until we fix this.
Chieh-Min Wang chiehmin18@gmail.com wrote:
I think 71d8c47fc653711c4(netfilter: conntrack: introduce clash resolution on insertion race) is doing the same logic for resolving conntrack clashing.
No, that commit dealsl with the case where two skbs have different conntrack objects but where tuples are the same.
In nfqueue+bridge flood case the skbs point to the same conntrack object.
Maybe one way to fix this would be to let nfqueue perform a deep copy of skb->_nfct in case conntrack is unconfirmed and skb_shared() is true.
But that would of course cause drop for l4 protocols that do not support clash resolution.
Yes, I understand the cases are different.
However, the result after resolving clashing will cause two skb point to the same conntrack. The first skb with confirmed conntrack may not pass through the NAT when the second skb resolve clashing?
Florian Westphal fw@strlen.de 於 2019年1月28日 週一 下午10:13寫道:
Chieh-Min Wang chiehmin18@gmail.com wrote:
I think 71d8c47fc653711c4(netfilter: conntrack: introduce clash resolution on insertion race) is doing the same logic for resolving conntrack clashing.
No, that commit dealsl with the case where two skbs have different conntrack objects but where tuples are the same.
In nfqueue+bridge flood case the skbs point to the same conntrack object.
Maybe one way to fix this would be to let nfqueue perform a deep copy of skb->_nfct in case conntrack is unconfirmed and skb_shared() is true.
But that would of course cause drop for l4 protocols that do not support clash resolution.
The point I am clarifying is that with NFQUEUE working on the system.
After resolving clashing using 71d8c47fc6537, it may result to two skb holding the same conntrack on different core traveling in the netfilter just like the nfqueue + multicast case.
Chieh-Min Wang chiehmin18@gmail.com 於 2019年1月28日 週一 下午10:16寫道:
Yes, I understand the cases are different.
However, the result after resolving clashing will cause two skb point to the same conntrack. The first skb with confirmed conntrack may not pass through the NAT when the second skb resolve clashing?
Florian Westphal fw@strlen.de 於 2019年1月28日 週一 下午10:13寫道:
Chieh-Min Wang chiehmin18@gmail.com wrote:
I think 71d8c47fc653711c4(netfilter: conntrack: introduce clash resolution on insertion race) is doing the same logic for resolving conntrack clashing.
No, that commit dealsl with the case where two skbs have different conntrack objects but where tuples are the same.
In nfqueue+bridge flood case the skbs point to the same conntrack object.
Maybe one way to fix this would be to let nfqueue perform a deep copy of skb->_nfct in case conntrack is unconfirmed and skb_shared() is true.
But that would of course cause drop for l4 protocols that do not support clash resolution.
On Sun, Jan 27, 2019 at 10:47:08PM +0100, Linus Lüssing wrote: [...]
The crash itself is triggered by the:
BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
in here:
https://elixir.bootlin.com/linux/v4.9.146/source/net/netfilter/nf_conntrack_...
I had tried the nf_reset()s and Wang's patch but with no success.
Skimming through the code I noticed that there aren't that many opportunities for the hnnode to become zero. There are several hlist_nulls_del_rcu(), but no hlist_nulls_del_init_rcu()s for instance.
That started to make me wonder whether something from "outside" might be setting the hnnode to zero - and yeah...
I missed that batadv_send_skb_unicast() always frees/consumes the skb... and I was freeing the skb myself if that call returned !NET_XMIT_SUCCESS. So a double kfree_skb()... I'm a bit surprised that things did not crash more often...
Sorry for the noise :-(. But thanks for all the help and quick responses!
b.a.t.m.a.n@lists.open-mesh.org