I think this is the same issue as this one.
http://patchwork.ozlabs.org/patch/995825/
On Mon, Jan 28, 2019 at 6:51 AM, Florian Westphal fw@strlen.de wrote:
Linus Lüssing linus.luessing@c0d3.blue wrote:
This only happens upon sending a SIGTERM to the network manager "netifd" (so upon network shutdown). And only if the node is connected to a mesh of reasonable size, i.e. if there is a certain amount of multicast traffic for the multicast-to-multi-unicast patch to work on.
Does this still trigger when you do
nf_reset(newskb);
after skb_copy()?
One difference is that the broadcast flooding adds a bit of delay between each transmission, which the multicast-to-multi-unicast code doesn't.
Are those transmits done asynchronously?
conntrack assumes exclusive access to skb->nfct if the conntrack entry isn't in the main hash table
(i.e., when nf_ct_is_confirmed returns false).
"In nfqueue, two consecutive skbuffs may race to create the conntrack entry. Hence, the one that loses the race gets dropped due to clash in the insertion into the hashes from the nf_conntrack_confirm() path."
This patch is only part of >= 4.18, so not part of the firmware we use yet. Could this issue somehow be related?
Possible, but I don't think it's likely. In the nfqueue case there is asynchronous processing, but no skb can share the same conntrack entry unless the entry is already in the conntrack hash table.
Other than that I was wondering whether we might be failing to reset something after skb_copy()-ing. We do a "skb->protocol = htons(ETH_P_BATMAN)" right before the dev_queue_xmit(skb) call in batman-adv which sends the encapsulated frame into the mesh. And we do a nf_reset(skb) after decapsulating a frame received from the mesh. But maybe that is not enough?
I suggest nf_reset() on xmit, if you can be sure that the xmit won't occur back-to-self (netns case is fine, as skb scrubbing resets skb nfct anyway) and the skb isn't on a rexmit list somewhere. (clone is fine, only shared skb would break).
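To make the reasoning above concrete, here is a small userspace toy model, not kernel code. It assumes that skb_copy() carries the conntrack reference over to the copy, so several copies can end up pointing at one unconfirmed entry, and that nf_reset() on the copy detaches it again. All toy_* names are hypothetical and exist only for this sketch; the real structures (struct sk_buff, struct nf_conn) are far richer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a conntrack entry; NOT the kernel's struct nf_conn. */
struct toy_conn {
    int  refcnt;    /* number of packets currently pointing at this entry */
    bool confirmed; /* true once inserted into the (imaginary) main hash table */
};

/* Toy stand-in for struct sk_buff: only the conntrack pointer is modelled. */
struct toy_skb {
    struct toy_conn *nfct;
};

/* Assumed side effect of skb_copy(): the copy inherits the conntrack
 * pointer and takes an extra reference on the entry. */
static struct toy_skb toy_skb_copy(const struct toy_skb *skb)
{
    struct toy_skb copy = { .nfct = skb->nfct };

    if (copy.nfct)
        copy.nfct->refcnt++;
    return copy;
}

/* Models nf_reset(): drop the reference and detach the packet from
 * its conntrack entry. */
static void toy_nf_reset(struct toy_skb *skb)
{
    if (skb->nfct) {
        skb->nfct->refcnt--;
        skb->nfct = NULL;
    }
}

/* True when an unconfirmed entry is referenced by more than one packet,
 * which is the situation conntrack does not expect. */
static bool toy_exclusive_access_violated(const struct toy_conn *ct)
{
    return !ct->confirmed && ct->refcnt > 1;
}
```

In this model, calling toy_skb_copy() on a packet with an unconfirmed entry makes toy_exclusive_access_violated() return true, and calling toy_nf_reset() on the copy before it is "transmitted" makes it return false again. Whether the real crash follows this pattern still needs to be confirmed by testing nf_reset(newskb) after skb_copy() as asked above.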