Hi,
I was trying to implement a multicast-to-multi-unicast conversion in batman-adv with the following patch:
https://patchwork.open-mesh.org/patch/17729/
However, on OpenWrt with a 4.9.146 kernel I get a "Kernel bug detected [...] nf_ct_del_from_dying_or_unconfirmed_list".
This only happens upon sending a SIGTERM to the network manager "netifd" (so upon network shutdown), and only if the node is connected to a mesh of reasonable size, so that there is a certain amount of multicast traffic for the multicast-to-multi-unicast patch to work on.
During normal operation, no such crash seems to occur.
The crash itself is triggered by the:
BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
in here:
https://elixir.bootlin.com/linux/v4.9.146/source/net/netfilter/nf_conntrack_...
What confuses me a bit is that the multicast-to-multi-unicast conversion uses the same, or at least a similar, simple skb_copy() approach as the "classic broadcast flooding" in batman-adv so far. The latter also transmits three redundant frames via skb_copy() to increase reliability for Wifi broadcast packets.
One difference is that the broadcast flooding adds a bit of delay between each transmission, which the multicast-to-multi-unicast conversion doesn't.
Looking at "git log net/netfilter/nf_conntrack_core.c" I noticed "netfilter: nfnetlink_queue: resolve clash for unconfirmed conntracks" (368982cd7), which says:
"In nfqueue, two consecutive skbuffs may race to create the conntrack entry. Hence, the one that loses the race gets dropped due to clash in the insertion into the hashes from the nf_conntrack_confirm() path."
This patch is only part of kernels >= 4.18, so it is not in the firmware we use yet. Could this issue somehow be related?
Other than that, I was wondering whether we might be failing to reset something after skb_copy()-ing. We do a "skb->protocol = htons(ETH_P_BATMAN)" right before the dev_queue_xmit(skb) call in batman-adv which sends the encapsulated frame into the mesh. And we do a nf_reset(skb) after decapsulating a frame received from the mesh. But maybe that is not enough?
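To make that suspicion a bit more concrete, here is a minimal userspace model (not kernel code, and not taken from the patch). It assumes, as far as I can tell from the 4.9 sources, that skb_copy() also copies the conntrack pointer and takes a reference on it, so every copy ends up pointing at the same, possibly still unconfirmed, conntrack entry until nf_reset() is called on it. All toy_* names are hypothetical stand-ins for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for struct nf_conn: just a refcount and a "confirmed" flag.
 * The entry is owned by the caller and never freed by this model. */
struct toy_conn {
	int refcnt;
	int confirmed;
};

/* Toy stand-in for struct sk_buff: only the conntrack pointer is modelled. */
struct toy_skb {
	struct toy_conn *ct;
};

/* Models the conntrack side of skb_copy() (assumption): the copy shares
 * the original's conntrack pointer and takes an extra reference. */
static struct toy_skb *toy_skb_copy(const struct toy_skb *old)
{
	struct toy_skb *n = malloc(sizeof(*n));

	if (!n)
		return NULL;

	n->ct = old->ct;
	if (n->ct)
		n->ct->refcnt++;

	return n;
}

/* Models nf_reset(): drop the reference and detach the conntrack entry. */
static void toy_nf_reset(struct toy_skb *skb)
{
	if (skb->ct) {
		skb->ct->refcnt--;
		skb->ct = NULL;
	}
}

/* Frees a copy, dropping its conntrack reference first (if any). */
static void toy_skb_free(struct toy_skb *skb)
{
	toy_nf_reset(skb);
	free(skb);
}
```

In this model, two back-to-back toy_skb_copy() calls on a frame with an unconfirmed entry leave three holders of the same entry; calling toy_nf_reset() on a copy before it is sent would detach it. Whether the real transmit path needs such a reset is exactly what I am unsure about.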
The ticket where this issue was reported:
https://github.com/freifunk-gluon/gluon/issues/1468
Regards, Linus