Linus Lüssing <linus.luessing(a)c0d3.blue> wrote:
This only happens upon sending a SIGTERM to the network manager
"netifd" (so upon network shutdown). And only if the node is connected
to mesh of reasonable size, so if there is a certain amount of
multicast traffic for the multicast-to-multi-unicast patch to work on.
Does this still trigger when you do
nf_reset(newskb);
after skb_copy()?
One difference is that the broadcast flooding adds a bit of
delay between each transmission. Which the multicast-to-multi-unicast
doesn't.
Are those transmits done asynchronously?
conntrack assumes exclusive access to skb->nfct if the conntrack
entry isn't in main hash table.
(i.e, when nf_ct_is_confirmed returns false).
"In nfqueue, two consecutive skbuffs may race to create the
conntrack
entry. Hence, the one that loses the race gets dropped due to clash in
the insertion into the hashes from the nf_conntrack_confirm() path."
This patch is only part of >= 4.18, so not part of the firmware we use
yet. Could this issue somehow be related?
Possible, but I don't think its likely.
In the nfquee case there is asynchronous processing, but
no skb can share the same conntrack entry unless the entry is already
in the conntrack hash table.
Other than that I was wondering whether we might be missing to
reset something after skb_copy()-ing. We do a "skb->protocol =
htons(ETH_P_BATMAN)" right before the dev_queue_xmit(skb) call in
batman-adv which sends the encapsulated frame into the
mesh. And we do a nf_reset(skb) after decapsulating a frame
received from the mesh. But maybe that is not enough?
I suggest nf_reset() on xmit, if you can be sure that the xmit
won't occur back-to-self (netns case is fine, as skb scrubbing
resets skb nfct anyway) and the skb isn't on a rexmit list somewhere.
(clone is fine, only shared skb would break).