Hi
I'm sometimes getting a crash after removing a hard interface, when batadv_send_outstanding_bat_ogm_packet() is called in a work queue. It calls

    static void batadv_iv_ogm_aggregate_new(const unsigned char *packet_buff,
                                            int packet_len, unsigned long send_time,
                                            bool direct_link,
                                            struct batadv_hard_iface *if_incoming,
                                            struct batadv_hard_iface *if_outgoing,
                                            int own_packet)
    {
            struct batadv_priv *bat_priv = netdev_priv(if_incoming->soft_iface);
            struct batadv_forw_packet *forw_packet_aggr;
            unsigned char *skb_buff;
            unsigned int skb_size;

            if (!kref_get_unless_zero(&if_incoming->refcount))
                    return;

            if (!kref_get_unless_zero(&if_outgoing->refcount))
                    goto out_free_incoming;

Given that we have:

    static inline void batadv_hardif_put(struct batadv_hard_iface *hard_iface)
    {
            kref_put(&hard_iface->refcount, batadv_hardif_release);
    }

does using kref_get_unless_zero() make sense? If it is zero, hasn't it been freed by the kref_put that set it to zero?
Thanks Andrew
On Fri, Mar 04, 2016 at 04:21:32PM +0100, Andrew Lunn wrote:
> Hi
>
> I'm sometimes getting a crash after removing a hard interface, when
> batadv_send_outstanding_bat_ogm_packet() is called in a work queue. It calls
>
> static void batadv_iv_ogm_aggregate_new(const unsigned char *packet_buff,
>                                         int packet_len, unsigned long send_time,
>                                         bool direct_link,
>                                         struct batadv_hard_iface *if_incoming,
>                                         struct batadv_hard_iface *if_outgoing,
>                                         int own_packet)
> {
>         struct batadv_priv *bat_priv = netdev_priv(if_incoming->soft_iface);
>         struct batadv_forw_packet *forw_packet_aggr;
>         unsigned char *skb_buff;
>         unsigned int skb_size;
>
>         if (!kref_get_unless_zero(&if_incoming->refcount))
>                 return;
>
>         if (!kref_get_unless_zero(&if_outgoing->refcount))
>                 goto out_free_incoming;
>
> Given that we have:
>
> static inline void batadv_hardif_put(struct batadv_hard_iface *hard_iface)
> {
>         kref_put(&hard_iface->refcount, batadv_hardif_release);
> }
>
> does using kref_get_unless_zero() make sense? If it is zero, hasn't it been
> freed by the kref_put that set it to zero?

Not sure if this is the case, but what if batadv_iv_ogm_aggregate_new() is called within an rcu_read_lock() protected context, concurrently with the kref_put() that sets the refcount to zero?
If I am not wrong, in this case if_incoming/if_outgoing will still be valid (until the rcu_read_unlock()), but the refcount will be 0.
Does that make sense?
Cheers,
On Friday 04 March 2016 16:21:32 Andrew Lunn wrote:
> Hi
>
> I'm sometimes getting a crash after removing a hard interface, when
> batadv_send_outstanding_bat_ogm_packet() is called in a work queue. It calls
>
> static void batadv_iv_ogm_aggregate_new(const unsigned char *packet_buff,
>                                         int packet_len, unsigned long send_time,
>                                         bool direct_link,
>                                         struct batadv_hard_iface *if_incoming,
>                                         struct batadv_hard_iface *if_outgoing,
>                                         int own_packet)
> {
>         struct batadv_priv *bat_priv = netdev_priv(if_incoming->soft_iface);
>         struct batadv_forw_packet *forw_packet_aggr;
>         unsigned char *skb_buff;
>         unsigned int skb_size;
>
>         if (!kref_get_unless_zero(&if_incoming->refcount))
>                 return;
>
>         if (!kref_get_unless_zero(&if_outgoing->refcount))
>                 goto out_free_incoming;
>
> Given that we have:
>
> static inline void batadv_hardif_put(struct batadv_hard_iface *hard_iface)
> {
>         kref_put(&hard_iface->refcount, batadv_hardif_release);
> }
>
> does using kref_get_unless_zero() make sense? If it is zero, hasn't it been
> freed by the kref_put that set it to zero?

At least it makes sense for the outgoing interface, because it is only protected by an rcu_read_lock() section in batadv_iv_ogm_schedule (batadv_iv_ogm_queue_add -> batadv_iv_ogm_aggregate_new). The batadv_hardif_list is traversed with list_for_each_entry_rcu, and it is expected that an entry may get dropped from the list in the meantime. batadv_hardif_release only queues the actual free of the memory (kfree_rcu), so every function which wants to get a reference has to increase the counter with kref_get_unless_zero to check that the object is not already in the waiting-to-be-freed phase.
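
The get-unless-zero rule described here can be modelled in user space. The following is only a minimal sketch using C11 atomics, not the kernel's kref implementation: `struct ref`, `ref_init()`, `ref_get_unless_zero()` and `ref_put()` are hypothetical stand-ins for `struct kref`, `kref_init()`, `kref_get_unless_zero()` and `kref_put()`, and the RCU grace period and the release callback are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for struct kref. */
struct ref {
	atomic_uint count;
};

/* A new object starts with one reference, held by its creator. */
static void ref_init(struct ref *r)
{
	atomic_init(&r->count, 1);
}

/* Modelled on kref_get_unless_zero(): take a reference only if the
 * counter has not already dropped to zero.
 */
static bool ref_get_unless_zero(struct ref *r)
{
	unsigned int old = atomic_load(&r->count);

	while (old != 0) {
		/* on failure, old is updated to the current counter value */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}

	return false;
}

/* Modelled on kref_put(): returns true when the caller dropped the last
 * reference. This is the point where a release callback would run.
 */
static bool ref_put(struct ref *r)
{
	return atomic_fetch_sub(&r->count, 1) == 1;
}
```

Once `ref_put()` has returned true, every later `ref_get_unless_zero()` fails, so a reader that still sees the object inside its read-side section cannot resurrect it. The sketch says nothing about how long the memory stays valid; in the thread that guarantee comes from rcu_read_lock() and kfree_rcu.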
But you have something which needs to be fixed (you see a crash). The question is what is causing the crash and what can be done about it. I am currently wondering how the if_incoming interface is being protected. It is not fetched from a list via an RCU list access primitive and it is not protected via rcu_read_lock. I also cannot see where the reference for forw_packet->if_incoming is increased. It is just accessed in batadv_send_outstanding_bat_ogm_packet (and later sent to the mentioned function via batadv_schedule_bat_ogm). Also, batadv_add_bcast_packet_to_list doesn't increase the reference counter for if_incoming before adding it to the forward packet. So I would just say that the reference counting for batadv_hard_iface is broken.
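
The ownership rule being checked here is that an object stored in a queued packet should carry its own reference for as long as the packet exists. A minimal single-threaded sketch of that rule follows. It is not the batman-adv code: `struct iface`, `struct fwd_packet` and all helper names are hypothetical, the counter is a plain integer rather than a kref, and "released" merely records that the last reference was dropped.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, single-threaded model of a refcounted interface. */
struct iface {
	unsigned int refcount;
	bool released;	/* set when the last reference is dropped */
};

/* Hypothetical model of a packet waiting in a work queue. */
struct fwd_packet {
	struct iface *if_incoming;
};

static bool iface_get_unless_zero(struct iface *i)
{
	if (i->refcount == 0)
		return false;

	i->refcount++;
	return true;
}

static void iface_put(struct iface *i)
{
	if (--i->refcount == 0)
		i->released = true;
}

/* The packet takes its own reference before it stores the pointer. */
static struct fwd_packet *fwd_packet_alloc(struct iface *if_incoming)
{
	struct fwd_packet *p;

	if (!iface_get_unless_zero(if_incoming))
		return NULL;

	p = malloc(sizeof(*p));
	if (!p) {
		iface_put(if_incoming);
		return NULL;
	}

	p->if_incoming = if_incoming;
	return p;
}

/* The reference is dropped only when the packet itself goes away. */
static void fwd_packet_free(struct fwd_packet *p)
{
	iface_put(p->if_incoming);
	free(p);
}
```

With this pairing, removing the interface while a packet is still queued only drops the owner's reference; the interface is released when the queued packet is freed, not before.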
Kind regards, Sven
On Friday 04 March 2016 16:50:43 Sven Eckelmann wrote:
> batadv_send_outstanding_bat_ogm_packet (and later sent to the mentioned
> function via batadv_schedule_bat_ogm). Also, batadv_add_bcast_packet_to_list
> doesn't increase the reference counter for if_incoming before adding it to
> the forward packet. So I would just say that the reference counting for
> batadv_hard_iface is broken.

Ah, just saw that batadv_primary_if_get_selected already increased the reference counter. So it is not as easy as I said.
Kind regards, Sven
On Friday 04 March 2016 16:50:43 Sven Eckelmann wrote:
> On Friday 04 March 2016 16:21:32 Andrew Lunn wrote:
> > Hi
> >
> > I'm sometimes getting a crash after removing a hard interface, when
> > batadv_send_outstanding_bat_ogm_packet() is called in a work queue. It calls
> >
> > static void batadv_iv_ogm_aggregate_new(const unsigned char *packet_buff,
> >                                         int packet_len, unsigned long send_time,
> >                                         bool direct_link,
> >                                         struct batadv_hard_iface *if_incoming,
> >                                         struct batadv_hard_iface *if_outgoing,
> >                                         int own_packet)
> > {
> >         struct batadv_priv *bat_priv = netdev_priv(if_incoming->soft_iface);
> >         struct batadv_forw_packet *forw_packet_aggr;
> >         unsigned char *skb_buff;
> >         unsigned int skb_size;
> >
> >         if (!kref_get_unless_zero(&if_incoming->refcount))
> >                 return;
> >
> >         if (!kref_get_unless_zero(&if_outgoing->refcount))
> >                 goto out_free_incoming;
> >
> > Given that we have:
> >
> > static inline void batadv_hardif_put(struct batadv_hard_iface *hard_iface)
> > {
> >         kref_put(&hard_iface->refcount, batadv_hardif_release);
> > }
> >
> > does using kref_get_unless_zero() make sense? If it is zero, hasn't it been
> > freed by the kref_put that set it to zero?

Maybe it would be easier to understand if this were replaced with kref_get and the if_outgoing loop in batadv_iv_ogm_schedule were replaced with:

    rcu_read_lock();
    list_for_each_entry_rcu(tmp_hard_iface, &batadv_hardif_list, list) {
            if (tmp_hard_iface->soft_iface != hard_iface->soft_iface)
                    continue;

            /* make sure only still valid interfaces are used in queue */
            if (!kref_get_unless_zero(&tmp_hard_iface->refcount))
                    continue;

            batadv_iv_ogm_queue_add(bat_priv, *ogm_buff, *ogm_buff_len,
                                    hard_iface, tmp_hard_iface, 1, send_time);
            batadv_hardif_put(tmp_hard_iface);
    }
    rcu_read_unlock();

Sorry for being in noisy-mail mode. I will stop sending mails for today.
Kind regards, Sven
> But you have something which needs to be fixed (you see a crash). Question
> is what is causing the crash and what can be done against it.

First off, this is 2016.0. There have been some changes in this area after that release.
The crash itself is happening in batadv_iv_ogm_queue_add(). I don't have an exactly matching .lst file for the binary, but:

    00000a70 <batadv_iv_ogm_queue_add>:
                                        unsigned char *packet_buff,
                                        int packet_len,
                                        struct batadv_hard_iface *if_incoming,
                                        struct batadv_hard_iface *if_outgoing,
                                        int own_packet, unsigned long send_time)
    {
         a70:   55                      push   %ebp
         a71:   89 e5                   mov    %esp,%ebp
         a73:   57                      push   %edi
         a74:   56                      push   %esi
         a75:   89 c6                   mov    %eax,%esi
         a77:   53                      push   %ebx
         a78:   83 ec 34                sub    $0x34,%esp
         a7b:   89 55 e8                mov    %edx,-0x18(%ebp)
         a7e:   89 4d e4                mov    %ecx,-0x1c(%ebp)
            struct batadv_ogm_packet *batadv_ogm_packet;
            bool direct_link;
            unsigned long max_aggregation_jiffies;

            batadv_ogm_packet = (struct batadv_ogm_packet *)packet_buff;
            direct_link = batadv_ogm_packet->flags & BATADV_DIRECTLINK ? 1 : 0;

I think it is this dereference of batadv_ogm_packet->flags which is going wrong. I also don't have a good oops dump. I'm on an Intel machine without a serial port, just VGA, and it is a recursive fault, so the first oops has scrolled off the top...
> I am currently wondering how the if_incoming interface is being protected.

I do think it is the if_incoming. The call stack is

    batadv_send_outstanding_bat_ogm_packet()
    batadv_schedule_bat_ogm()
    batadv_tvlv_container_ogm_append()
    batadv_iv_ogm_queue_add()

so we have

    void batadv_send_outstanding_bat_ogm_packet(struct work_struct *work)
    {
    ...
            /* we have to have at least one packet in the queue to determine the
             * queues wake up time unless we are shutting down.
             *
             * only re-schedule if this is the "original" copy, e.g. the OGM of the
             * primary interface should only be rescheduled once per period, but
             * this function will be called for the forw_packet instances of the
             * other secondary interfaces as well.
             */
            if (forw_packet->own &&
                forw_packet->if_incoming == forw_packet->if_outgoing)
                    batadv_schedule_bat_ogm(forw_packet->if_incoming);

I will try to reproduce this with the latest code.
Andrew
b.a.t.m.a.n@lists.open-mesh.org