This patchset changes the context of MLD module. Before this patchset, MLD functions are atomic context so it couldn't use sleepable functions and flags.
There are several reasons why MLD functions are under atomic context. 1. It uses timer API. Timer expiration functions are executed in the atomic context. 2. atomic locks MLD functions use rwlock and spinlock to protect their own resources.
So, in order to switch context, this patchset converts resources to use RCU and removes atomic locks and timer API.
1. The first patch convert from the timer API to delayed work. Timer API is used for delaying some works. MLD protocol has a delay mechanism, which is used for replying to a query. If a listener receives a query from a router, it should send a response after some delay. But because of timer expire function is executed in the atomic context, this patch convert from timer API to the delayed work.
2. The fourth patch deletes inet6_dev->mc_lock. The mc_lock has protected inet6_dev->mc_tomb pointer. But this pointer is already protected by RTNL and it isn't be used by datapath. So, it isn't be needed and because of this, many atomic context critical sections are deleted.
3. The fifth patch convert ip6_sf_socklist to RCU. ip6_sf_socklist has been protected by ipv6_mc_socklist->sflock(rwlock). But this is already protected by RTNL So if it is converted to use RCU in order to be used in the datapath, the sflock is no more needed. So, its control path context can be switched to sleepable.
4. The sixth patch convert ip6_sf_list to RCU. The reason for this patch is the same as the previous patch.
5. The seventh patch convert ifmcaddr6 to RCU. The reason for this patch is the same as the previous patch.
6. Add new workqueues for processing query/report event. By this patch, query and report events are processed by workqueue So context is sleepable, not atomic. While this logic, it acquires RTNL.
7. Add new mc_lock. The purpose of this lock is to protect per-interface mld data. Per-interface mld data is usually used by query/report event handler. So, query/report event workers need only this lock instead of RTNL. Therefore, it could reduce bottleneck.
Changelog: v2 -> v3: 1. Do not use msecs_to_jiffies(). (by Cong Wang) 2. Do not add unnecessary rtnl_lock() and rtnl_unlock(). (by Cong Wang) 3. Fix sparse warnings because of rcu annotation. (by kernel test robot) - Remove some rcu_assign_pointer(), which was used for non-rcu pointer. - Add union for rcu pointer. - Use rcu API in mld_clear_zeros(). - Remove remained rcu_read_unlock(). - Use rcu API for tomb resources. 4. withdraw prevopus 2nd and 3rd patch. - "separate two flags from ifmcaddr6->mca_flags" - "add a new delayed_work, mc_delrec_work" 5. Add 6th and 7th patch.
v1 -> v2: 1. Withdraw unnecessary refactoring patches. (by Cong Wang, Eric Dumazet, David Ahern) a) convert from array to list. b) function rename. 2. Separate big one patch into small several patches. 3. Do not rename 'ifmcaddr6->mca_lock'. In the v1 patch, this variable was changed to 'ifmcaddr6->mca_work_lock'. But this is actually not needed. 4. Do not use atomic_t for 'ifmcaddr6->mca_sfcount' and 'ipv6_mc_socklist'->sf_count'. 5. Do not add mld_check_leave_group() function. 6. Do not add ip6_mc_del_src_bulk() function. 7. Do not add ip6_mc_add_src_bulk() function. 8. Do not use rcu_read_lock() in the qeth_l3_add_mcast_rtnl(). (by Julian Wiedmann)
Taehee Yoo (7): mld: convert from timer to delayed work mld: get rid of inet6_dev->mc_lock mld: convert ipv6_mc_socklist->sflist to RCU mld: convert ip6_sf_list to RCU mld: convert ifmcaddr6 to RCU mld: add new workqueues for process mld events mld: add mc_lock for protecting per-interface mld data
drivers/s390/net/qeth_l3_main.c | 6 +- include/net/if_inet6.h | 37 +- include/net/mld.h | 3 + net/batman-adv/multicast.c | 6 +- net/ipv6/addrconf.c | 9 +- net/ipv6/addrconf_core.c | 2 +- net/ipv6/af_inet6.c | 2 +- net/ipv6/icmp.c | 4 +- net/ipv6/mcast.c | 1080 ++++++++++++++++++------------- 9 files changed, 678 insertions(+), 471 deletions(-)
mcast.c has several timers for delaying works. Timer's expire handler is working under atomic context so it can't use sleepable things such as GFP_KERNEL, mutex, etc. In order to use sleepable APIs, it converts from timers to delayed work. But there are some critical sections, which is used by both process and BH context. So that it still uses spin_lock_bh() and rwlock.
Suggested-by: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: Taehee Yoo ap420073@gmail.com --- v2 -> v3: - Do not use msecs_to_jiffies() - Do not add unnecessary rtnl_lock and rtnl_unlock v1 -> v2: - Separated from previous big one patch.
include/net/if_inet6.h | 8 +-- net/ipv6/mcast.c | 140 +++++++++++++++++++++++------------------ 2 files changed, 83 insertions(+), 65 deletions(-)
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h index 8bf5906073bc..af5244c9ca5c 100644 --- a/include/net/if_inet6.h +++ b/include/net/if_inet6.h @@ -120,7 +120,7 @@ struct ifmcaddr6 { unsigned int mca_sfmode; unsigned char mca_crcount; unsigned long mca_sfcount[2]; - struct timer_list mca_timer; + struct delayed_work mca_work; unsigned int mca_flags; int mca_users; refcount_t mca_refcnt; @@ -179,9 +179,9 @@ struct inet6_dev { unsigned long mc_qri; /* Query Response Interval */ unsigned long mc_maxdelay;
- struct timer_list mc_gq_timer; /* general query timer */ - struct timer_list mc_ifc_timer; /* interface change timer */ - struct timer_list mc_dad_timer; /* dad complete mc timer */ + struct delayed_work mc_gq_work; /* general query work */ + struct delayed_work mc_ifc_work; /* interface change work */ + struct delayed_work mc_dad_work; /* dad complete mc work */
struct ifacaddr6 *ac_list; rwlock_t lock; diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c index 6c8604390266..692a6dec8959 100644 --- a/net/ipv6/mcast.c +++ b/net/ipv6/mcast.c @@ -29,7 +29,6 @@ #include <linux/socket.h> #include <linux/sockios.h> #include <linux/jiffies.h> -#include <linux/times.h> #include <linux/net.h> #include <linux/in.h> #include <linux/in6.h> @@ -42,6 +41,7 @@ #include <linux/slab.h> #include <linux/pkt_sched.h> #include <net/mld.h> +#include <linux/workqueue.h>
#include <linux/netfilter.h> #include <linux/netfilter_ipv6.h> @@ -67,14 +67,13 @@ static int __mld2_query_bugs[] __attribute__((__unused__)) = { BUILD_BUG_ON_ZERO(offsetof(struct mld2_grec, grec_mca) % 4) };
+static struct workqueue_struct *mld_wq; static struct in6_addr mld2_all_mcr = MLD2_ALL_MCR_INIT;
static void igmp6_join_group(struct ifmcaddr6 *ma); static void igmp6_leave_group(struct ifmcaddr6 *ma); -static void igmp6_timer_handler(struct timer_list *t); +static void mld_mca_work(struct work_struct *work);
-static void mld_gq_timer_expire(struct timer_list *t); -static void mld_ifc_timer_expire(struct timer_list *t); static void mld_ifc_event(struct inet6_dev *idev); static void mld_add_delrec(struct inet6_dev *idev, struct ifmcaddr6 *pmc); static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *pmc); @@ -713,7 +712,7 @@ static void igmp6_group_dropped(struct ifmcaddr6 *mc) igmp6_leave_group(mc);
spin_lock_bh(&mc->mca_lock); - if (del_timer(&mc->mca_timer)) + if (cancel_delayed_work(&mc->mca_work)) refcount_dec(&mc->mca_refcnt); spin_unlock_bh(&mc->mca_lock); } @@ -854,7 +853,7 @@ static struct ifmcaddr6 *mca_alloc(struct inet6_dev *idev, if (!mc) return NULL;
- timer_setup(&mc->mca_timer, igmp6_timer_handler, 0); + INIT_DELAYED_WORK(&mc->mca_work, mld_mca_work);
mc->mca_addr = *addr; mc->idev = idev; /* reference taken by caller */ @@ -1027,48 +1026,48 @@ bool ipv6_chk_mcast_addr(struct net_device *dev, const struct in6_addr *group, return rv; }
-static void mld_gq_start_timer(struct inet6_dev *idev) +static void mld_gq_start_work(struct inet6_dev *idev) { unsigned long tv = prandom_u32() % idev->mc_maxdelay;
idev->mc_gq_running = 1; - if (!mod_timer(&idev->mc_gq_timer, jiffies+tv+2)) + if (!mod_delayed_work(mld_wq, &idev->mc_gq_work, tv + 2)) in6_dev_hold(idev); }
-static void mld_gq_stop_timer(struct inet6_dev *idev) +static void mld_gq_stop_work(struct inet6_dev *idev) { idev->mc_gq_running = 0; - if (del_timer(&idev->mc_gq_timer)) + if (cancel_delayed_work(&idev->mc_gq_work)) __in6_dev_put(idev); }
-static void mld_ifc_start_timer(struct inet6_dev *idev, unsigned long delay) +static void mld_ifc_start_work(struct inet6_dev *idev, unsigned long delay) { unsigned long tv = prandom_u32() % delay;
- if (!mod_timer(&idev->mc_ifc_timer, jiffies+tv+2)) + if (!mod_delayed_work(mld_wq, &idev->mc_ifc_work, tv + 2)) in6_dev_hold(idev); }
-static void mld_ifc_stop_timer(struct inet6_dev *idev) +static void mld_ifc_stop_work(struct inet6_dev *idev) { idev->mc_ifc_count = 0; - if (del_timer(&idev->mc_ifc_timer)) + if (cancel_delayed_work(&idev->mc_ifc_work)) __in6_dev_put(idev); }
-static void mld_dad_start_timer(struct inet6_dev *idev, unsigned long delay) +static void mld_dad_start_work(struct inet6_dev *idev, unsigned long delay) { unsigned long tv = prandom_u32() % delay;
- if (!mod_timer(&idev->mc_dad_timer, jiffies+tv+2)) + if (!mod_delayed_work(mld_wq, &idev->mc_dad_work, tv + 2)) in6_dev_hold(idev); }
-static void mld_dad_stop_timer(struct inet6_dev *idev) +static void mld_dad_stop_work(struct inet6_dev *idev) { - if (del_timer(&idev->mc_dad_timer)) + if (cancel_delayed_work(&idev->mc_dad_work)) __in6_dev_put(idev); }
@@ -1080,21 +1079,20 @@ static void igmp6_group_queried(struct ifmcaddr6 *ma, unsigned long resptime) { unsigned long delay = resptime;
- /* Do not start timer for these addresses */ + /* Do not start work for these addresses */ if (ipv6_addr_is_ll_all_nodes(&ma->mca_addr) || IPV6_ADDR_MC_SCOPE(&ma->mca_addr) < IPV6_ADDR_SCOPE_LINKLOCAL) return;
- if (del_timer(&ma->mca_timer)) { + if (cancel_delayed_work(&ma->mca_work)) { refcount_dec(&ma->mca_refcnt); - delay = ma->mca_timer.expires - jiffies; + delay = ma->mca_work.timer.expires - jiffies; }
if (delay >= resptime) delay = prandom_u32() % resptime;
- ma->mca_timer.expires = jiffies + delay; - if (!mod_timer(&ma->mca_timer, jiffies + delay)) + if (!mod_delayed_work(mld_wq, &ma->mca_work, delay)) refcount_inc(&ma->mca_refcnt); ma->mca_flags |= MAF_TIMER_RUNNING; } @@ -1305,10 +1303,10 @@ static int mld_process_v1(struct inet6_dev *idev, struct mld_msg *mld, if (v1_query) mld_set_v1_mode(idev);
- /* cancel MLDv2 report timer */ - mld_gq_stop_timer(idev); - /* cancel the interface change timer */ - mld_ifc_stop_timer(idev); + /* cancel MLDv2 report work */ + mld_gq_stop_work(idev); + /* cancel the interface change work */ + mld_ifc_stop_work(idev); /* clear deleted report items */ mld_clear_delrec(idev);
@@ -1398,7 +1396,7 @@ int igmp6_event_query(struct sk_buff *skb) if (mlh2->mld2q_nsrcs) return -EINVAL; /* no sources allowed */
- mld_gq_start_timer(idev); + mld_gq_start_work(idev); return 0; } /* mark sources to include, if group & source-specific */ @@ -1482,14 +1480,14 @@ int igmp6_event_report(struct sk_buff *skb) return -ENODEV;
/* - * Cancel the timer for this group + * Cancel the work for this group */
read_lock_bh(&idev->lock); for (ma = idev->mc_list; ma; ma = ma->next) { if (ipv6_addr_equal(&ma->mca_addr, &mld->mld_mca)) { spin_lock(&ma->mca_lock); - if (del_timer(&ma->mca_timer)) + if (cancel_delayed_work(&ma->mca_work)) refcount_dec(&ma->mca_refcnt); ma->mca_flags &= ~(MAF_LAST_REPORTER|MAF_TIMER_RUNNING); spin_unlock(&ma->mca_lock); @@ -2103,21 +2101,23 @@ void ipv6_mc_dad_complete(struct inet6_dev *idev) mld_send_initial_cr(idev); idev->mc_dad_count--; if (idev->mc_dad_count) - mld_dad_start_timer(idev, - unsolicited_report_interval(idev)); + mld_dad_start_work(idev, + unsolicited_report_interval(idev)); } }
-static void mld_dad_timer_expire(struct timer_list *t) +static void mld_dad_work(struct work_struct *work) { - struct inet6_dev *idev = from_timer(idev, t, mc_dad_timer); + struct inet6_dev *idev = container_of(to_delayed_work(work), + struct inet6_dev, + mc_dad_work);
mld_send_initial_cr(idev); if (idev->mc_dad_count) { idev->mc_dad_count--; if (idev->mc_dad_count) - mld_dad_start_timer(idev, - unsolicited_report_interval(idev)); + mld_dad_start_work(idev, + unsolicited_report_interval(idev)); } in6_dev_put(idev); } @@ -2416,12 +2416,12 @@ static void igmp6_join_group(struct ifmcaddr6 *ma) delay = prandom_u32() % unsolicited_report_interval(ma->idev);
spin_lock_bh(&ma->mca_lock); - if (del_timer(&ma->mca_timer)) { + if (cancel_delayed_work(&ma->mca_work)) { refcount_dec(&ma->mca_refcnt); - delay = ma->mca_timer.expires - jiffies; + delay = ma->mca_work.timer.expires - jiffies; }
- if (!mod_timer(&ma->mca_timer, jiffies + delay)) + if (!mod_delayed_work(mld_wq, &ma->mca_work, delay)) refcount_inc(&ma->mca_refcnt); ma->mca_flags |= MAF_TIMER_RUNNING | MAF_LAST_REPORTER; spin_unlock_bh(&ma->mca_lock); @@ -2458,25 +2458,29 @@ static void igmp6_leave_group(struct ifmcaddr6 *ma) } }
-static void mld_gq_timer_expire(struct timer_list *t) +static void mld_gq_work(struct work_struct *work) { - struct inet6_dev *idev = from_timer(idev, t, mc_gq_timer); + struct inet6_dev *idev = container_of(to_delayed_work(work), + struct inet6_dev, + mc_gq_work);
idev->mc_gq_running = 0; mld_send_report(idev, NULL); in6_dev_put(idev); }
-static void mld_ifc_timer_expire(struct timer_list *t) +static void mld_ifc_work(struct work_struct *work) { - struct inet6_dev *idev = from_timer(idev, t, mc_ifc_timer); + struct inet6_dev *idev = container_of(to_delayed_work(work), + struct inet6_dev, + mc_ifc_work);
mld_send_cr(idev); if (idev->mc_ifc_count) { idev->mc_ifc_count--; if (idev->mc_ifc_count) - mld_ifc_start_timer(idev, - unsolicited_report_interval(idev)); + mld_ifc_start_work(idev, + unsolicited_report_interval(idev)); } in6_dev_put(idev); } @@ -2486,22 +2490,23 @@ static void mld_ifc_event(struct inet6_dev *idev) if (mld_in_v1_mode(idev)) return; idev->mc_ifc_count = idev->mc_qrv; - mld_ifc_start_timer(idev, 1); + mld_ifc_start_work(idev, 1); }
-static void igmp6_timer_handler(struct timer_list *t) +static void mld_mca_work(struct work_struct *work) { - struct ifmcaddr6 *ma = from_timer(ma, t, mca_timer); + struct ifmcaddr6 *ma = container_of(to_delayed_work(work), + struct ifmcaddr6, mca_work);
if (mld_in_v1_mode(ma->idev)) igmp6_send(&ma->mca_addr, ma->idev->dev, ICMPV6_MGM_REPORT); else mld_send_report(ma->idev, ma);
- spin_lock(&ma->mca_lock); + spin_lock_bh(&ma->mca_lock); ma->mca_flags |= MAF_LAST_REPORTER; ma->mca_flags &= ~MAF_TIMER_RUNNING; - spin_unlock(&ma->mca_lock); + spin_unlock_bh(&ma->mca_lock); ma_put(ma); }
@@ -2537,12 +2542,12 @@ void ipv6_mc_down(struct inet6_dev *idev) for (i = idev->mc_list; i; i = i->next) igmp6_group_dropped(i);
- /* Should stop timer after group drop. or we will - * start timer again in mld_ifc_event() + /* Should stop work after group drop. or we will + * start work again in mld_ifc_event() */ - mld_ifc_stop_timer(idev); - mld_gq_stop_timer(idev); - mld_dad_stop_timer(idev); + mld_ifc_stop_work(idev); + mld_gq_stop_work(idev); + mld_dad_stop_work(idev); read_unlock_bh(&idev->lock); }
@@ -2579,11 +2584,11 @@ void ipv6_mc_init_dev(struct inet6_dev *idev) write_lock_bh(&idev->lock); spin_lock_init(&idev->mc_lock); idev->mc_gq_running = 0; - timer_setup(&idev->mc_gq_timer, mld_gq_timer_expire, 0); + INIT_DELAYED_WORK(&idev->mc_gq_work, mld_gq_work); idev->mc_tomb = NULL; idev->mc_ifc_count = 0; - timer_setup(&idev->mc_ifc_timer, mld_ifc_timer_expire, 0); - timer_setup(&idev->mc_dad_timer, mld_dad_timer_expire, 0); + INIT_DELAYED_WORK(&idev->mc_ifc_work, mld_ifc_work); + INIT_DELAYED_WORK(&idev->mc_dad_work, mld_dad_work); ipv6_mc_reset(idev); write_unlock_bh(&idev->lock); } @@ -2596,7 +2601,7 @@ void ipv6_mc_destroy_dev(struct inet6_dev *idev) { struct ifmcaddr6 *i;
- /* Deactivate timers */ + /* Deactivate works */ ipv6_mc_down(idev); mld_clear_delrec(idev);
@@ -2763,7 +2768,7 @@ static int igmp6_mc_seq_show(struct seq_file *seq, void *v) &im->mca_addr, im->mca_users, im->mca_flags, (im->mca_flags&MAF_TIMER_RUNNING) ? - jiffies_to_clock_t(im->mca_timer.expires-jiffies) : 0); + jiffies_to_clock_t(im->mca_work.timer.expires - jiffies) : 0); return 0; }
@@ -3002,7 +3007,19 @@ static struct pernet_operations igmp6_net_ops = {
int __init igmp6_init(void) { - return register_pernet_subsys(&igmp6_net_ops); + int err; + + err = register_pernet_subsys(&igmp6_net_ops); + if (err) + return err; + + mld_wq = create_workqueue("mld"); + if (!mld_wq) { + unregister_pernet_subsys(&igmp6_net_ops); + return -ENOMEM; + } + + return err; }
int __init igmp6_late_init(void) @@ -3013,6 +3030,7 @@ int __init igmp6_late_init(void) void igmp6_cleanup(void) { unregister_pernet_subsys(&igmp6_net_ops); + destroy_workqueue(mld_wq); }
void igmp6_late_cleanup(void)
The purpose of mc_lock is to protect inet6_dev->mc_tomb. But mc_tomb is already protected by RTNL and all functions, which manipulate mc_tomb are called under RTNL. So, mc_lock is not needed. Furthermore, it is spinlock so the critical section is atomic. In order to reduce atomic context, it should be removed.
Suggested-by: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: Taehee Yoo ap420073@gmail.com --- v2 -> v3: - No change v1 -> v2: - Separated from previous big one patch.
include/net/if_inet6.h | 1 - net/ipv6/mcast.c | 9 --------- 2 files changed, 10 deletions(-)
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h index af5244c9ca5c..1080d2248304 100644 --- a/include/net/if_inet6.h +++ b/include/net/if_inet6.h @@ -167,7 +167,6 @@ struct inet6_dev {
struct ifmcaddr6 *mc_list; struct ifmcaddr6 *mc_tomb; - spinlock_t mc_lock;
unsigned char mc_qrv; /* Query Robustness Variable */ unsigned char mc_gq_running; diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c index 692a6dec8959..35962aa3cc22 100644 --- a/net/ipv6/mcast.c +++ b/net/ipv6/mcast.c @@ -752,10 +752,8 @@ static void mld_add_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) } spin_unlock_bh(&im->mca_lock);
- spin_lock_bh(&idev->mc_lock); pmc->next = idev->mc_tomb; idev->mc_tomb = pmc; - spin_unlock_bh(&idev->mc_lock); }
static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) @@ -764,7 +762,6 @@ static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) struct ip6_sf_list *psf; struct in6_addr *pmca = &im->mca_addr;
- spin_lock_bh(&idev->mc_lock); pmc_prev = NULL; for (pmc = idev->mc_tomb; pmc; pmc = pmc->next) { if (ipv6_addr_equal(&pmc->mca_addr, pmca)) @@ -777,7 +774,6 @@ static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) else idev->mc_tomb = pmc->next; } - spin_unlock_bh(&idev->mc_lock);
spin_lock_bh(&im->mca_lock); if (pmc) { @@ -801,10 +797,8 @@ static void mld_clear_delrec(struct inet6_dev *idev) { struct ifmcaddr6 *pmc, *nextpmc;
- spin_lock_bh(&idev->mc_lock); pmc = idev->mc_tomb; idev->mc_tomb = NULL; - spin_unlock_bh(&idev->mc_lock);
for (; pmc; pmc = nextpmc) { nextpmc = pmc->next; @@ -1907,7 +1901,6 @@ static void mld_send_cr(struct inet6_dev *idev) int type, dtype;
read_lock_bh(&idev->lock); - spin_lock(&idev->mc_lock);
/* deleted MCA's */ pmc_prev = NULL; @@ -1941,7 +1934,6 @@ static void mld_send_cr(struct inet6_dev *idev) } else pmc_prev = pmc; } - spin_unlock(&idev->mc_lock);
/* change recs */ for (pmc = idev->mc_list; pmc; pmc = pmc->next) { @@ -2582,7 +2574,6 @@ void ipv6_mc_up(struct inet6_dev *idev) void ipv6_mc_init_dev(struct inet6_dev *idev) { write_lock_bh(&idev->lock); - spin_lock_init(&idev->mc_lock); idev->mc_gq_running = 0; INIT_DELAYED_WORK(&idev->mc_gq_work, mld_gq_work); idev->mc_tomb = NULL;
The sflist has been protected by rwlock so that the critical section is atomic context. In order to switch this context, changing locking is needed. The sflist actually already protected by RTNL So if it's converted to use RCU, its control path context can be switched to sleepable.
Suggested-by: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: Taehee Yoo ap420073@gmail.com --- v2 -> v3: - Fix sparse warnings because of rcu annotation v1 -> v2: - Separated from previous big one patch.
include/net/if_inet6.h | 4 ++-- net/ipv6/mcast.c | 52 ++++++++++++++++++------------------------ 2 files changed, 24 insertions(+), 32 deletions(-)
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h index 1080d2248304..062294aeeb6d 100644 --- a/include/net/if_inet6.h +++ b/include/net/if_inet6.h @@ -78,6 +78,7 @@ struct inet6_ifaddr { struct ip6_sf_socklist { unsigned int sl_max; unsigned int sl_count; + struct rcu_head rcu; struct in6_addr sl_addr[]; };
@@ -91,8 +92,7 @@ struct ipv6_mc_socklist { int ifindex; unsigned int sfmode; /* MCAST_{INCLUDE,EXCLUDE} */ struct ipv6_mc_socklist __rcu *next; - rwlock_t sflock; - struct ip6_sf_socklist *sflist; + struct ip6_sf_socklist __rcu *sflist; struct rcu_head rcu; };
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c index 35962aa3cc22..9da55d23a13c 100644 --- a/net/ipv6/mcast.c +++ b/net/ipv6/mcast.c @@ -178,8 +178,7 @@ static int __ipv6_sock_mc_join(struct sock *sk, int ifindex,
mc_lst->ifindex = dev->ifindex; mc_lst->sfmode = mode; - rwlock_init(&mc_lst->sflock); - mc_lst->sflist = NULL; + RCU_INIT_POINTER(mc_lst->sflist, NULL);
/* * now add/increase the group membership on the device @@ -335,7 +334,6 @@ int ip6_mc_source(int add, int omode, struct sock *sk, struct net *net = sock_net(sk); int i, j, rv; int leavegroup = 0; - int pmclocked = 0; int err;
source = &((struct sockaddr_in6 *)&pgsr->gsr_source)->sin6_addr; @@ -364,7 +362,7 @@ int ip6_mc_source(int add, int omode, struct sock *sk, goto done; } /* if a source filter was set, must be the same mode as before */ - if (pmc->sflist) { + if (rcu_access_pointer(pmc->sflist)) { if (pmc->sfmode != omode) { err = -EINVAL; goto done; @@ -376,10 +374,7 @@ int ip6_mc_source(int add, int omode, struct sock *sk, pmc->sfmode = omode; }
- write_lock(&pmc->sflock); - pmclocked = 1; - - psl = pmc->sflist; + psl = rtnl_dereference(pmc->sflist); if (!add) { if (!psl) goto done; /* err = -EADDRNOTAVAIL */ @@ -429,9 +424,11 @@ int ip6_mc_source(int add, int omode, struct sock *sk, if (psl) { for (i = 0; i < psl->sl_count; i++) newpsl->sl_addr[i] = psl->sl_addr[i]; - sock_kfree_s(sk, psl, IP6_SFLSIZE(psl->sl_max)); + atomic_sub(IP6_SFLSIZE(psl->sl_max), &sk->sk_omem_alloc); + kfree_rcu(psl, rcu); } - pmc->sflist = psl = newpsl; + psl = newpsl; + rcu_assign_pointer(pmc->sflist, psl); } rv = 1; /* > 0 for insert logic below if sl_count is 0 */ for (i = 0; i < psl->sl_count; i++) { @@ -447,8 +444,6 @@ int ip6_mc_source(int add, int omode, struct sock *sk, /* update the interface list */ ip6_mc_add_src(idev, group, omode, 1, source, 1); done: - if (pmclocked) - write_unlock(&pmc->sflock); read_unlock_bh(&idev->lock); rcu_read_unlock(); if (leavegroup) @@ -526,17 +521,16 @@ int ip6_mc_msfilter(struct sock *sk, struct group_filter *gsf, (void) ip6_mc_add_src(idev, group, gsf->gf_fmode, 0, NULL, 0); }
- write_lock(&pmc->sflock); - psl = pmc->sflist; + psl = rtnl_dereference(pmc->sflist); if (psl) { (void) ip6_mc_del_src(idev, group, pmc->sfmode, psl->sl_count, psl->sl_addr, 0); - sock_kfree_s(sk, psl, IP6_SFLSIZE(psl->sl_max)); + atomic_sub(IP6_SFLSIZE(psl->sl_max), &sk->sk_omem_alloc); + kfree_rcu(psl, rcu); } else (void) ip6_mc_del_src(idev, group, pmc->sfmode, 0, NULL, 0); - pmc->sflist = newpsl; + rcu_assign_pointer(pmc->sflist, newpsl); pmc->sfmode = gsf->gf_fmode; - write_unlock(&pmc->sflock); err = 0; done: read_unlock_bh(&idev->lock); @@ -585,16 +579,14 @@ int ip6_mc_msfget(struct sock *sk, struct group_filter *gsf, if (!pmc) /* must have a prior join */ goto done; gsf->gf_fmode = pmc->sfmode; - psl = pmc->sflist; + psl = rtnl_dereference(pmc->sflist); count = psl ? psl->sl_count : 0; read_unlock_bh(&idev->lock); rcu_read_unlock();
copycount = count < gsf->gf_numsrc ? count : gsf->gf_numsrc; gsf->gf_numsrc = count; - /* changes to psl require the socket lock, and a write lock - * on pmc->sflock. We have the socket lock so reading here is safe. - */ + for (i = 0; i < copycount; i++, p++) { struct sockaddr_in6 *psin6; struct sockaddr_storage ss; @@ -630,8 +622,7 @@ bool inet6_mc_check(struct sock *sk, const struct in6_addr *mc_addr, rcu_read_unlock(); return np->mc_all; } - read_lock(&mc->sflock); - psl = mc->sflist; + psl = rcu_dereference(mc->sflist); if (!psl) { rv = mc->sfmode == MCAST_EXCLUDE; } else { @@ -646,7 +637,6 @@ bool inet6_mc_check(struct sock *sk, const struct in6_addr *mc_addr, if (mc->sfmode == MCAST_EXCLUDE && i < psl->sl_count) rv = false; } - read_unlock(&mc->sflock); rcu_read_unlock();
return rv; @@ -2422,19 +2412,21 @@ static void igmp6_join_group(struct ifmcaddr6 *ma) static int ip6_mc_leave_src(struct sock *sk, struct ipv6_mc_socklist *iml, struct inet6_dev *idev) { + struct ip6_sf_socklist *psl; int err;
- write_lock_bh(&iml->sflock); - if (!iml->sflist) { + psl = rtnl_dereference(iml->sflist); + + if (!psl) { /* any-source empty exclude case */ err = ip6_mc_del_src(idev, &iml->addr, iml->sfmode, 0, NULL, 0); } else { err = ip6_mc_del_src(idev, &iml->addr, iml->sfmode, - iml->sflist->sl_count, iml->sflist->sl_addr, 0); - sock_kfree_s(sk, iml->sflist, IP6_SFLSIZE(iml->sflist->sl_max)); - iml->sflist = NULL; + psl->sl_count, psl->sl_addr, 0); + RCU_INIT_POINTER(iml->sflist, NULL); + atomic_sub(IP6_SFLSIZE(psl->sl_max), &sk->sk_omem_alloc); + kfree_rcu(psl, rcu); } - write_unlock_bh(&iml->sflock); return err; }
The ip6_sf_list has been protected by mca_lock(spin_lock) so that the critical section is atomic context. In order to switch this context, changing locking is needed. The ip6_sf_list actually already protected by RTNL So if it's converted to use RCU, its control path context can be switched to sleepable. But It doesn't remove mca_lock yet because ifmcaddr6 isn't converted to RCU yet. So, It's not fully converted to the sleepable context.
Suggested-by: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: Taehee Yoo ap420073@gmail.com --- v2 -> v3: - Fix sparse warnings because of rcu annotation v1 -> v2: - Separated from previous big one patch.
include/net/if_inet6.h | 7 +- net/ipv6/mcast.c | 200 ++++++++++++++++++++++++++--------------- 2 files changed, 130 insertions(+), 77 deletions(-)
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h index 062294aeeb6d..7875a3208426 100644 --- a/include/net/if_inet6.h +++ b/include/net/if_inet6.h @@ -97,12 +97,13 @@ struct ipv6_mc_socklist { };
struct ip6_sf_list { - struct ip6_sf_list *sf_next; + struct ip6_sf_list __rcu *sf_next; struct in6_addr sf_addr; unsigned long sf_count[2]; /* include/exclude counts */ unsigned char sf_gsresp; /* include in g & s response? */ unsigned char sf_oldin; /* change state */ unsigned char sf_crcount; /* retrans. left to send */ + struct rcu_head rcu; };
#define MAF_TIMER_RUNNING 0x01 @@ -115,8 +116,8 @@ struct ifmcaddr6 { struct in6_addr mca_addr; struct inet6_dev *idev; struct ifmcaddr6 *next; - struct ip6_sf_list *mca_sources; - struct ip6_sf_list *mca_tomb; + struct ip6_sf_list __rcu *mca_sources; + struct ip6_sf_list __rcu *mca_tomb; unsigned int mca_sfmode; unsigned char mca_crcount; unsigned long mca_sfcount[2]; diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c index 9da55d23a13c..bc0fb4815c97 100644 --- a/net/ipv6/mcast.c +++ b/net/ipv6/mcast.c @@ -113,10 +113,25 @@ int sysctl_mld_qrv __read_mostly = MLD_QRV_DEFAULT; */
#define for_each_pmc_rcu(np, pmc) \ - for (pmc = rcu_dereference(np->ipv6_mc_list); \ - pmc != NULL; \ + for (pmc = rcu_dereference((np)->ipv6_mc_list); \ + pmc; \ pmc = rcu_dereference(pmc->next))
+#define for_each_psf_rtnl(mc, psf) \ + for (psf = rtnl_dereference((mc)->mca_sources); \ + psf; \ + psf = rtnl_dereference(psf->sf_next)) + +#define for_each_psf_rcu(mc, psf) \ + for (psf = rcu_dereference((mc)->mca_sources); \ + psf; \ + psf = rcu_dereference(psf->sf_next)) + +#define for_each_psf_tomb(mc, psf) \ + for (psf = rtnl_dereference((mc)->mca_tomb); \ + psf; \ + psf = rtnl_dereference(psf->sf_next)) + static int unsolicited_report_interval(struct inet6_dev *idev) { int iv; @@ -734,10 +749,14 @@ static void mld_add_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) if (pmc->mca_sfmode == MCAST_INCLUDE) { struct ip6_sf_list *psf;
- pmc->mca_tomb = im->mca_tomb; - pmc->mca_sources = im->mca_sources; - im->mca_tomb = im->mca_sources = NULL; - for (psf = pmc->mca_sources; psf; psf = psf->sf_next) + rcu_assign_pointer(pmc->mca_tomb, + rtnl_dereference(im->mca_tomb)); + rcu_assign_pointer(pmc->mca_sources, + rtnl_dereference(im->mca_sources)); + RCU_INIT_POINTER(im->mca_tomb, NULL); + RCU_INIT_POINTER(im->mca_sources, NULL); + + for_each_psf_rtnl(pmc, psf) psf->sf_crcount = pmc->mca_crcount; } spin_unlock_bh(&im->mca_lock); @@ -748,9 +767,9 @@ static void mld_add_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im)
static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) { - struct ifmcaddr6 *pmc, *pmc_prev; - struct ip6_sf_list *psf; + struct ip6_sf_list *psf, *sources, *tomb; struct in6_addr *pmca = &im->mca_addr; + struct ifmcaddr6 *pmc, *pmc_prev;
pmc_prev = NULL; for (pmc = idev->mc_tomb; pmc; pmc = pmc->next) { @@ -769,9 +788,16 @@ static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) if (pmc) { im->idev = pmc->idev; if (im->mca_sfmode == MCAST_INCLUDE) { - swap(im->mca_tomb, pmc->mca_tomb); - swap(im->mca_sources, pmc->mca_sources); - for (psf = im->mca_sources; psf; psf = psf->sf_next) + tomb = rcu_replace_pointer(im->mca_tomb, + rtnl_dereference(pmc->mca_tomb), + lockdep_rtnl_is_held()); + rcu_assign_pointer(pmc->mca_tomb, tomb); + + sources = rcu_replace_pointer(im->mca_sources, + rtnl_dereference(pmc->mca_sources), + lockdep_rtnl_is_held()); + rcu_assign_pointer(pmc->mca_sources, sources); + for_each_psf_rtnl(im, psf) psf->sf_crcount = idev->mc_qrv; } else { im->mca_crcount = idev->mc_qrv; @@ -803,12 +829,12 @@ static void mld_clear_delrec(struct inet6_dev *idev) struct ip6_sf_list *psf, *psf_next;
spin_lock_bh(&pmc->mca_lock); - psf = pmc->mca_tomb; - pmc->mca_tomb = NULL; + psf = rtnl_dereference(pmc->mca_tomb); + RCU_INIT_POINTER(pmc->mca_tomb, NULL); spin_unlock_bh(&pmc->mca_lock); for (; psf; psf = psf_next) { - psf_next = psf->sf_next; - kfree(psf); + psf_next = rtnl_dereference(psf->sf_next); + kfree_rcu(psf, rcu); } } read_unlock_bh(&idev->lock); @@ -990,7 +1016,7 @@ bool ipv6_chk_mcast_addr(struct net_device *dev, const struct in6_addr *group, struct ip6_sf_list *psf;
spin_lock_bh(&mc->mca_lock); - for (psf = mc->mca_sources; psf; psf = psf->sf_next) { + for_each_psf_rcu(mc, psf) { if (ipv6_addr_equal(&psf->sf_addr, src_addr)) break; } @@ -1089,7 +1115,7 @@ static bool mld_xmarksources(struct ifmcaddr6 *pmc, int nsrcs, int i, scount;
scount = 0; - for (psf = pmc->mca_sources; psf; psf = psf->sf_next) { + for_each_psf_rcu(pmc, psf) { if (scount == nsrcs) break; for (i = 0; i < nsrcs; i++) { @@ -1122,7 +1148,7 @@ static bool mld_marksources(struct ifmcaddr6 *pmc, int nsrcs, /* mark INCLUDE-mode sources */
scount = 0; - for (psf = pmc->mca_sources; psf; psf = psf->sf_next) { + for_each_psf_rcu(pmc, psf) { if (scount == nsrcs) break; for (i = 0; i < nsrcs; i++) { @@ -1532,7 +1558,7 @@ mld_scount(struct ifmcaddr6 *pmc, int type, int gdeleted, int sdeleted) struct ip6_sf_list *psf; int scount = 0;
- for (psf = pmc->mca_sources; psf; psf = psf->sf_next) { + for_each_psf_rtnl(pmc, psf) { if (!is_in(pmc, psf, type, gdeleted, sdeleted)) continue; scount++; @@ -1707,14 +1733,16 @@ static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc, #define AVAILABLE(skb) ((skb) ? skb_availroom(skb) : 0)
static struct sk_buff *add_grec(struct sk_buff *skb, struct ifmcaddr6 *pmc, - int type, int gdeleted, int sdeleted, int crsend) + int type, int gdeleted, int sdeleted, + int crsend) { + struct ip6_sf_list *psf, *psf_prev, *psf_next; + int scount, stotal, first, isquery, truncate; + struct ip6_sf_list __rcu **psf_list; struct inet6_dev *idev = pmc->idev; struct net_device *dev = idev->dev; - struct mld2_report *pmr; struct mld2_grec *pgr = NULL; - struct ip6_sf_list *psf, *psf_next, *psf_prev, **psf_list; - int scount, stotal, first, isquery, truncate; + struct mld2_report *pmr; unsigned int mtu;
if (pmc->mca_flags & MAF_NOREPORT) @@ -1733,7 +1761,7 @@ static struct sk_buff *add_grec(struct sk_buff *skb, struct ifmcaddr6 *pmc,
psf_list = sdeleted ? &pmc->mca_tomb : &pmc->mca_sources;
- if (!*psf_list) + if (!rcu_access_pointer(*psf_list)) goto empty_source;
pmr = skb ? (struct mld2_report *)skb_transport_header(skb) : NULL; @@ -1749,10 +1777,12 @@ static struct sk_buff *add_grec(struct sk_buff *skb, struct ifmcaddr6 *pmc, } first = 1; psf_prev = NULL; - for (psf = *psf_list; psf; psf = psf_next) { + for (psf = rtnl_dereference(*psf_list); + psf; + psf = psf_next) { struct in6_addr *psrc;
- psf_next = psf->sf_next; + psf_next = rtnl_dereference(psf->sf_next);
if (!is_in(pmc, psf, type, gdeleted, sdeleted) && !crsend) { psf_prev = psf; @@ -1799,10 +1829,12 @@ static struct sk_buff *add_grec(struct sk_buff *skb, struct ifmcaddr6 *pmc, psf->sf_crcount--; if ((sdeleted || gdeleted) && psf->sf_crcount == 0) { if (psf_prev) - psf_prev->sf_next = psf->sf_next; + rcu_assign_pointer(psf_prev->sf_next, + rtnl_dereference(psf->sf_next)); else - *psf_list = psf->sf_next; - kfree(psf); + rcu_assign_pointer(*psf_list, + rtnl_dereference(psf->sf_next)); + kfree_rcu(psf, rcu); continue; } } @@ -1866,21 +1898,26 @@ static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc) /* * remove zero-count source records from a source filter list */ -static void mld_clear_zeros(struct ip6_sf_list **ppsf) +static void mld_clear_zeros(struct ip6_sf_list __rcu **ppsf) { struct ip6_sf_list *psf_prev, *psf_next, *psf;
psf_prev = NULL; - for (psf = *ppsf; psf; psf = psf_next) { - psf_next = psf->sf_next; + for (psf = rtnl_dereference(*ppsf); + psf; + psf = psf_next) { + psf_next = rtnl_dereference(psf->sf_next); if (psf->sf_crcount == 0) { if (psf_prev) - psf_prev->sf_next = psf->sf_next; + rcu_assign_pointer(psf_prev->sf_next, + rtnl_dereference(psf->sf_next)); else - *ppsf = psf->sf_next; - kfree(psf); - } else + rcu_assign_pointer(*ppsf, + rtnl_dereference(psf->sf_next)); + kfree_rcu(psf, rcu); + } else { psf_prev = psf; + } } }
@@ -1913,8 +1950,9 @@ static void mld_send_cr(struct inet6_dev *idev) mld_clear_zeros(&pmc->mca_sources); } } - if (pmc->mca_crcount == 0 && !pmc->mca_tomb && - !pmc->mca_sources) { + if (pmc->mca_crcount == 0 && + !rcu_access_pointer(pmc->mca_tomb) && + !rcu_access_pointer(pmc->mca_sources)) { if (pmc_prev) pmc_prev->next = pmc_next; else @@ -2111,7 +2149,7 @@ static int ip6_mc_del1_src(struct ifmcaddr6 *pmc, int sfmode, int rv = 0;
psf_prev = NULL; - for (psf = pmc->mca_sources; psf; psf = psf->sf_next) { + for_each_psf_rtnl(pmc, psf) { if (ipv6_addr_equal(&psf->sf_addr, psfsrc)) break; psf_prev = psf; @@ -2126,17 +2164,22 @@ static int ip6_mc_del1_src(struct ifmcaddr6 *pmc, int sfmode,
/* no more filters for this source */ if (psf_prev) - psf_prev->sf_next = psf->sf_next; + rcu_assign_pointer(psf_prev->sf_next, + rtnl_dereference(psf->sf_next)); else - pmc->mca_sources = psf->sf_next; + rcu_assign_pointer(pmc->mca_sources, + rtnl_dereference(psf->sf_next)); + if (psf->sf_oldin && !(pmc->mca_flags & MAF_NOREPORT) && !mld_in_v1_mode(idev)) { psf->sf_crcount = idev->mc_qrv; - psf->sf_next = pmc->mca_tomb; - pmc->mca_tomb = psf; + rcu_assign_pointer(psf->sf_next, + rtnl_dereference(pmc->mca_tomb)); + rcu_assign_pointer(pmc->mca_tomb, psf); rv = 1; - } else - kfree(psf); + } else { + kfree_rcu(psf, rcu); + } } return rv; } @@ -2188,7 +2231,7 @@ static int ip6_mc_del_src(struct inet6_dev *idev, const struct in6_addr *pmca, pmc->mca_sfmode = MCAST_INCLUDE; pmc->mca_crcount = idev->mc_qrv; idev->mc_ifc_count = pmc->mca_crcount; - for (psf = pmc->mca_sources; psf; psf = psf->sf_next) + for_each_psf_rtnl(pmc, psf) psf->sf_crcount = 0; mld_ifc_event(pmc->idev); } else if (sf_setstate(pmc) || changerec) @@ -2207,7 +2250,7 @@ static int ip6_mc_add1_src(struct ifmcaddr6 *pmc, int sfmode, struct ip6_sf_list *psf, *psf_prev;
psf_prev = NULL; - for (psf = pmc->mca_sources; psf; psf = psf->sf_next) { + for_each_psf_rtnl(pmc, psf) { if (ipv6_addr_equal(&psf->sf_addr, psfsrc)) break; psf_prev = psf; @@ -2219,9 +2262,10 @@ static int ip6_mc_add1_src(struct ifmcaddr6 *pmc, int sfmode,
psf->sf_addr = *psfsrc; if (psf_prev) { - psf_prev->sf_next = psf; - } else - pmc->mca_sources = psf; + rcu_assign_pointer(psf_prev->sf_next, psf); + } else { + rcu_assign_pointer(pmc->mca_sources, psf); + } } psf->sf_count[sfmode]++; return 0; @@ -2232,13 +2276,15 @@ static void sf_markstate(struct ifmcaddr6 *pmc) struct ip6_sf_list *psf; int mca_xcount = pmc->mca_sfcount[MCAST_EXCLUDE];
- for (psf = pmc->mca_sources; psf; psf = psf->sf_next) + for_each_psf_rtnl(pmc, psf) { if (pmc->mca_sfcount[MCAST_EXCLUDE]) { psf->sf_oldin = mca_xcount == psf->sf_count[MCAST_EXCLUDE] && !psf->sf_count[MCAST_INCLUDE]; - } else + } else { psf->sf_oldin = psf->sf_count[MCAST_INCLUDE] != 0; + } + } }
static int sf_setstate(struct ifmcaddr6 *pmc) @@ -2249,7 +2295,7 @@ static int sf_setstate(struct ifmcaddr6 *pmc) int new_in, rv;
rv = 0; - for (psf = pmc->mca_sources; psf; psf = psf->sf_next) { + for_each_psf_rtnl(pmc, psf) { if (pmc->mca_sfcount[MCAST_EXCLUDE]) { new_in = mca_xcount == psf->sf_count[MCAST_EXCLUDE] && !psf->sf_count[MCAST_INCLUDE]; @@ -2259,8 +2305,7 @@ static int sf_setstate(struct ifmcaddr6 *pmc) if (!psf->sf_oldin) { struct ip6_sf_list *prev = NULL;
- for (dpsf = pmc->mca_tomb; dpsf; - dpsf = dpsf->sf_next) { + for_each_psf_tomb(pmc, dpsf) { if (ipv6_addr_equal(&dpsf->sf_addr, &psf->sf_addr)) break; @@ -2268,10 +2313,12 @@ static int sf_setstate(struct ifmcaddr6 *pmc) } if (dpsf) { if (prev) - prev->sf_next = dpsf->sf_next; + rcu_assign_pointer(prev->sf_next, + rtnl_dereference(dpsf->sf_next)); else - pmc->mca_tomb = dpsf->sf_next; - kfree(dpsf); + rcu_assign_pointer(pmc->mca_tomb, + rtnl_dereference(dpsf->sf_next)); + kfree_rcu(dpsf, rcu); } psf->sf_crcount = qrv; rv++; @@ -2282,7 +2329,8 @@ static int sf_setstate(struct ifmcaddr6 *pmc) * add or update "delete" records if an active filter * is now inactive */ - for (dpsf = pmc->mca_tomb; dpsf; dpsf = dpsf->sf_next) + + for_each_psf_tomb(pmc, dpsf) if (ipv6_addr_equal(&dpsf->sf_addr, &psf->sf_addr)) break; @@ -2291,9 +2339,9 @@ static int sf_setstate(struct ifmcaddr6 *pmc) if (!dpsf) continue; *dpsf = *psf; - /* pmc->mca_lock held by callers */ - dpsf->sf_next = pmc->mca_tomb; - pmc->mca_tomb = dpsf; + rcu_assign_pointer(dpsf->sf_next, + rtnl_dereference(pmc->mca_tomb)); + rcu_assign_pointer(pmc->mca_tomb, dpsf); } dpsf->sf_crcount = qrv; rv++; @@ -2356,7 +2404,7 @@ static int ip6_mc_add_src(struct inet6_dev *idev, const struct in6_addr *pmca,
pmc->mca_crcount = idev->mc_qrv; idev->mc_ifc_count = pmc->mca_crcount; - for (psf = pmc->mca_sources; psf; psf = psf->sf_next) + for_each_psf_rtnl(pmc, psf) psf->sf_crcount = 0; mld_ifc_event(idev); } else if (sf_setstate(pmc)) @@ -2370,16 +2418,20 @@ static void ip6_mc_clear_src(struct ifmcaddr6 *pmc) { struct ip6_sf_list *psf, *nextpsf;
- for (psf = pmc->mca_tomb; psf; psf = nextpsf) { - nextpsf = psf->sf_next; - kfree(psf); + for (psf = rtnl_dereference(pmc->mca_tomb); + psf; + psf = nextpsf) { + nextpsf = rtnl_dereference(psf->sf_next); + kfree_rcu(psf, rcu); } - pmc->mca_tomb = NULL; - for (psf = pmc->mca_sources; psf; psf = nextpsf) { - nextpsf = psf->sf_next; - kfree(psf); + RCU_INIT_POINTER(pmc->mca_tomb, NULL); + for (psf = rtnl_dereference(pmc->mca_sources); + psf; + psf = nextpsf) { + nextpsf = rtnl_dereference(psf->sf_next); + kfree_rcu(psf, rcu); } - pmc->mca_sources = NULL; + RCU_INIT_POINTER(pmc->mca_sources, NULL); pmc->mca_sfmode = MCAST_EXCLUDE; pmc->mca_sfcount[MCAST_INCLUDE] = 0; pmc->mca_sfcount[MCAST_EXCLUDE] = 1; @@ -2789,7 +2841,7 @@ static inline struct ip6_sf_list *igmp6_mcf_get_first(struct seq_file *seq) im = idev->mc_list; if (likely(im)) { spin_lock_bh(&im->mca_lock); - psf = im->mca_sources; + psf = rcu_dereference(im->mca_sources); if (likely(psf)) { state->im = im; state->idev = idev; @@ -2806,7 +2858,7 @@ static struct ip6_sf_list *igmp6_mcf_get_next(struct seq_file *seq, struct ip6_s { struct igmp6_mcf_iter_state *state = igmp6_mcf_seq_private(seq);
- psf = psf->sf_next; + psf = rcu_dereference(psf->sf_next); while (!psf) { spin_unlock_bh(&state->im->mca_lock); state->im = state->im->next; @@ -2828,7 +2880,7 @@ static struct ip6_sf_list *igmp6_mcf_get_next(struct seq_file *seq, struct ip6_s if (!state->im) break; spin_lock_bh(&state->im->mca_lock); - psf = state->im->mca_sources; + psf = rcu_dereference(state->im->mca_sources); } out: return psf;
The ifmcaddr6 has been protected by inet6_dev->lock(rwlock) so that the critical section is atomic context. In order to switch this context, changing locking is needed. The ifmcaddr6 actually already protected by RTNL So if it's converted to use RCU, its control path context can be switched to sleepable.
Suggested-by: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: Taehee Yoo ap420073@gmail.com --- v2 -> v3: - Fix sparse warnings because of rcu annotation - Do not use mca_lock v1 -> v2: - Separated from previous big one patch.
drivers/s390/net/qeth_l3_main.c | 6 +- include/net/if_inet6.h | 7 +- net/batman-adv/multicast.c | 6 +- net/ipv6/addrconf.c | 9 +- net/ipv6/addrconf_core.c | 2 +- net/ipv6/af_inet6.c | 2 +- net/ipv6/mcast.c | 296 +++++++++++++------------------- 7 files changed, 140 insertions(+), 188 deletions(-)
diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c index 35b42275a06c..d308ff744a29 100644 --- a/drivers/s390/net/qeth_l3_main.c +++ b/drivers/s390/net/qeth_l3_main.c @@ -1098,8 +1098,9 @@ static int qeth_l3_add_mcast_rtnl(struct net_device *dev, int vid, void *arg) tmp.disp_flag = QETH_DISP_ADDR_ADD; tmp.is_multicast = 1;
- read_lock_bh(&in6_dev->lock); - for (im6 = in6_dev->mc_list; im6 != NULL; im6 = im6->next) { + for (im6 = rtnl_dereference(in6_dev->mc_list); + im6; + im6 = rtnl_dereference(im6->next)) { tmp.u.a6.addr = im6->mca_addr;
ipm = qeth_l3_find_addr_by_ip(card, &tmp); @@ -1117,7 +1118,6 @@ static int qeth_l3_add_mcast_rtnl(struct net_device *dev, int vid, void *arg) qeth_l3_ipaddr_hash(ipm));
} - read_unlock_bh(&in6_dev->lock);
out: return 0; diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h index 7875a3208426..521158e05c18 100644 --- a/include/net/if_inet6.h +++ b/include/net/if_inet6.h @@ -115,7 +115,7 @@ struct ip6_sf_list { struct ifmcaddr6 { struct in6_addr mca_addr; struct inet6_dev *idev; - struct ifmcaddr6 *next; + struct ifmcaddr6 __rcu *next; struct ip6_sf_list __rcu *mca_sources; struct ip6_sf_list __rcu *mca_tomb; unsigned int mca_sfmode; @@ -128,6 +128,7 @@ struct ifmcaddr6 { spinlock_t mca_lock; unsigned long mca_cstamp; unsigned long mca_tstamp; + struct rcu_head rcu; };
/* Anycast stuff */ @@ -166,8 +167,8 @@ struct inet6_dev {
struct list_head addr_list;
- struct ifmcaddr6 *mc_list; - struct ifmcaddr6 *mc_tomb; + struct ifmcaddr6 __rcu *mc_list; + struct ifmcaddr6 __rcu *mc_tomb;
unsigned char mc_qrv; /* Query Robustness Variable */ unsigned char mc_gq_running; diff --git a/net/batman-adv/multicast.c b/net/batman-adv/multicast.c index 28166402d30c..1d63c8cbbfe7 100644 --- a/net/batman-adv/multicast.c +++ b/net/batman-adv/multicast.c @@ -454,8 +454,9 @@ batadv_mcast_mla_softif_get_ipv6(struct net_device *dev, return 0; }
- read_lock_bh(&in6_dev->lock); - for (pmc6 = in6_dev->mc_list; pmc6; pmc6 = pmc6->next) { + for (pmc6 = rcu_dereference(in6_dev->mc_list); + pmc6; + pmc6 = rcu_dereference(pmc6->next)) { if (IPV6_ADDR_MC_SCOPE(&pmc6->mca_addr) < IPV6_ADDR_SCOPE_LINKLOCAL) continue; @@ -484,7 +485,6 @@ batadv_mcast_mla_softif_get_ipv6(struct net_device *dev, hlist_add_head(&new->list, mcast_list); ret++; } - read_unlock_bh(&in6_dev->lock); rcu_read_unlock();
return ret; diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index f2337fb756ac..b502f78d5091 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -5107,17 +5107,20 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb, break; } case MULTICAST_ADDR: + read_unlock_bh(&idev->lock); fillargs->event = RTM_GETMULTICAST;
/* multicast address */ - for (ifmca = idev->mc_list; ifmca; - ifmca = ifmca->next, ip_idx++) { + for (ifmca = rcu_dereference(idev->mc_list); + ifmca; + ifmca = rcu_dereference(ifmca->next), ip_idx++) { if (ip_idx < s_ip_idx) continue; err = inet6_fill_ifmcaddr(skb, ifmca, fillargs); if (err < 0) break; } + read_lock_bh(&idev->lock); break; case ANYCAST_ADDR: fillargs->event = RTM_GETANYCAST; @@ -6093,10 +6096,8 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
static void ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp) { - rcu_read_lock_bh(); if (likely(ifp->idev->dead == 0)) __ipv6_ifa_notify(event, ifp); - rcu_read_unlock_bh(); }
#ifdef CONFIG_SYSCTL diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c index c70c192bc91b..a36626afbc02 100644 --- a/net/ipv6/addrconf_core.c +++ b/net/ipv6/addrconf_core.c @@ -250,7 +250,7 @@ void in6_dev_finish_destroy(struct inet6_dev *idev) struct net_device *dev = idev->dev;
WARN_ON(!list_empty(&idev->addr_list)); - WARN_ON(idev->mc_list); + WARN_ON(rcu_access_pointer(idev->mc_list)); WARN_ON(timer_pending(&idev->rs_timer));
#ifdef NET_REFCNT_DEBUG diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index 802f5111805a..3c9bacffc9c3 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -222,7 +222,7 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol, inet->mc_loop = 1; inet->mc_ttl = 1; inet->mc_index = 0; - inet->mc_list = NULL; + RCU_INIT_POINTER(inet->mc_list, NULL); inet->rcv_tos = 0;
if (net->ipv4.sysctl_ip_no_pmtu_disc) diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c index bc0fb4815c97..75541cf53153 100644 --- a/net/ipv6/mcast.c +++ b/net/ipv6/mcast.c @@ -112,6 +112,11 @@ int sysctl_mld_qrv __read_mostly = MLD_QRV_DEFAULT; * socket join on multicast group */
+#define for_each_pmc_rtnl(np, pmc) \ + for (pmc = rtnl_dereference((np)->ipv6_mc_list); \ + pmc; \ + pmc = rtnl_dereference(pmc->next)) + #define for_each_pmc_rcu(np, pmc) \ for (pmc = rcu_dereference((np)->ipv6_mc_list); \ pmc; \ @@ -132,6 +137,21 @@ int sysctl_mld_qrv __read_mostly = MLD_QRV_DEFAULT; psf; \ psf = rtnl_dereference(psf->sf_next))
+#define for_each_mc_rtnl(idev, mc) \ + for (mc = rtnl_dereference((idev)->mc_list); \ + mc; \ + mc = rtnl_dereference(mc->next)) + +#define for_each_mc_rcu(idev, mc) \ + for (mc = rcu_dereference((idev)->mc_list); \ + mc; \ + mc = rcu_dereference(mc->next)) + +#define for_each_mc_tomb(idev, mc) \ + for (mc = rtnl_dereference((idev)->mc_tomb); \ + mc; \ + mc = rtnl_dereference(mc->next)) + static int unsolicited_report_interval(struct inet6_dev *idev) { int iv; @@ -158,15 +178,11 @@ static int __ipv6_sock_mc_join(struct sock *sk, int ifindex, if (!ipv6_addr_is_multicast(addr)) return -EINVAL;
- rcu_read_lock(); - for_each_pmc_rcu(np, mc_lst) { + for_each_pmc_rtnl(np, mc_lst) { if ((ifindex == 0 || mc_lst->ifindex == ifindex) && - ipv6_addr_equal(&mc_lst->addr, addr)) { - rcu_read_unlock(); + ipv6_addr_equal(&mc_lst->addr, addr)) return -EADDRINUSE; - } } - rcu_read_unlock();
mc_lst = sock_kmalloc(sk, sizeof(struct ipv6_mc_socklist), GFP_KERNEL);
@@ -268,10 +284,9 @@ int ipv6_sock_mc_drop(struct sock *sk, int ifindex, const struct in6_addr *addr) } EXPORT_SYMBOL(ipv6_sock_mc_drop);
-/* called with rcu_read_lock() */ -static struct inet6_dev *ip6_mc_find_dev_rcu(struct net *net, - const struct in6_addr *group, - int ifindex) +static struct inet6_dev *ip6_mc_find_dev_rtnl(struct net *net, + const struct in6_addr *group, + int ifindex) { struct net_device *dev = NULL; struct inet6_dev *idev = NULL; @@ -283,19 +298,17 @@ static struct inet6_dev *ip6_mc_find_dev_rcu(struct net *net, dev = rt->dst.dev; ip6_rt_put(rt); } - } else - dev = dev_get_by_index_rcu(net, ifindex); + } else { + dev = __dev_get_by_index(net, ifindex); + }
if (!dev) return NULL; idev = __in6_dev_get(dev); if (!idev) return NULL; - read_lock_bh(&idev->lock); - if (idev->dead) { - read_unlock_bh(&idev->lock); + if (idev->dead) return NULL; - } return idev; }
@@ -357,16 +370,13 @@ int ip6_mc_source(int add, int omode, struct sock *sk, if (!ipv6_addr_is_multicast(group)) return -EINVAL;
- rcu_read_lock(); - idev = ip6_mc_find_dev_rcu(net, group, pgsr->gsr_interface); - if (!idev) { - rcu_read_unlock(); + idev = ip6_mc_find_dev_rtnl(net, group, pgsr->gsr_interface); + if (!idev) return -ENODEV; - }
err = -EADDRNOTAVAIL;
- for_each_pmc_rcu(inet6, pmc) { + for_each_pmc_rtnl(inet6, pmc) { if (pgsr->gsr_interface && pmc->ifindex != pgsr->gsr_interface) continue; if (ipv6_addr_equal(&pmc->addr, group)) @@ -459,8 +469,6 @@ int ip6_mc_source(int add, int omode, struct sock *sk, /* update the interface list */ ip6_mc_add_src(idev, group, omode, 1, source, 1); done: - read_unlock_bh(&idev->lock); - rcu_read_unlock(); if (leavegroup) err = ipv6_sock_mc_drop(sk, pgsr->gsr_interface, group); return err; @@ -486,13 +494,9 @@ int ip6_mc_msfilter(struct sock *sk, struct group_filter *gsf, gsf->gf_fmode != MCAST_EXCLUDE) return -EINVAL;
- rcu_read_lock(); - idev = ip6_mc_find_dev_rcu(net, group, gsf->gf_interface); - - if (!idev) { - rcu_read_unlock(); + idev = ip6_mc_find_dev_rtnl(net, group, gsf->gf_interface); + if (!idev) return -ENODEV; - }
err = 0;
@@ -501,7 +505,7 @@ int ip6_mc_msfilter(struct sock *sk, struct group_filter *gsf, goto done; }
- for_each_pmc_rcu(inet6, pmc) { + for_each_pmc_rtnl(inet6, pmc) { if (pmc->ifindex != gsf->gf_interface) continue; if (ipv6_addr_equal(&pmc->addr, group)) @@ -548,8 +552,6 @@ int ip6_mc_msfilter(struct sock *sk, struct group_filter *gsf, pmc->sfmode = gsf->gf_fmode; err = 0; done: - read_unlock_bh(&idev->lock); - rcu_read_unlock(); if (leavegroup) err = ipv6_sock_mc_drop(sk, gsf->gf_interface, group); return err; @@ -571,13 +573,9 @@ int ip6_mc_msfget(struct sock *sk, struct group_filter *gsf, if (!ipv6_addr_is_multicast(group)) return -EINVAL;
- rcu_read_lock(); - idev = ip6_mc_find_dev_rcu(net, group, gsf->gf_interface); - - if (!idev) { - rcu_read_unlock(); + idev = ip6_mc_find_dev_rtnl(net, group, gsf->gf_interface); + if (!idev) return -ENODEV; - }
err = -EADDRNOTAVAIL; /* changes to the ipv6_mc_list require the socket lock and @@ -585,19 +583,18 @@ int ip6_mc_msfget(struct sock *sk, struct group_filter *gsf, * so reading the list is safe. */
- for_each_pmc_rcu(inet6, pmc) { + for_each_pmc_rtnl(inet6, pmc) { if (pmc->ifindex != gsf->gf_interface) continue; if (ipv6_addr_equal(group, &pmc->addr)) break; } if (!pmc) /* must have a prior join */ - goto done; + return err; + gsf->gf_fmode = pmc->sfmode; psl = rtnl_dereference(pmc->sflist); count = psl ? psl->sl_count : 0; - read_unlock_bh(&idev->lock); - rcu_read_unlock();
copycount = count < gsf->gf_numsrc ? count : gsf->gf_numsrc; gsf->gf_numsrc = count; @@ -614,10 +611,6 @@ int ip6_mc_msfget(struct sock *sk, struct group_filter *gsf, return -EFAULT; } return 0; -done: - read_unlock_bh(&idev->lock); - rcu_read_unlock(); - return err; }
bool inet6_mc_check(struct sock *sk, const struct in6_addr *mc_addr, @@ -761,8 +754,8 @@ static void mld_add_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) } spin_unlock_bh(&im->mca_lock);
- pmc->next = idev->mc_tomb; - idev->mc_tomb = pmc; + rcu_assign_pointer(pmc->next, idev->mc_tomb); + rcu_assign_pointer(idev->mc_tomb, pmc); }
static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) @@ -772,16 +765,16 @@ static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) struct ifmcaddr6 *pmc, *pmc_prev;
pmc_prev = NULL; - for (pmc = idev->mc_tomb; pmc; pmc = pmc->next) { + for_each_mc_tomb(idev, pmc) { if (ipv6_addr_equal(&pmc->mca_addr, pmca)) break; pmc_prev = pmc; } if (pmc) { if (pmc_prev) - pmc_prev->next = pmc->next; + rcu_assign_pointer(pmc_prev->next, pmc->next); else - idev->mc_tomb = pmc->next; + rcu_assign_pointer(idev->mc_tomb, pmc->next); }
spin_lock_bh(&im->mca_lock); @@ -804,7 +797,7 @@ static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) } in6_dev_put(pmc->idev); ip6_mc_clear_src(pmc); - kfree(pmc); + kfree_rcu(pmc, rcu); } spin_unlock_bh(&im->mca_lock); } @@ -813,19 +806,18 @@ static void mld_clear_delrec(struct inet6_dev *idev) { struct ifmcaddr6 *pmc, *nextpmc;
- pmc = idev->mc_tomb; - idev->mc_tomb = NULL; + pmc = rtnl_dereference(idev->mc_tomb); + RCU_INIT_POINTER(idev->mc_tomb, NULL);
for (; pmc; pmc = nextpmc) { - nextpmc = pmc->next; + nextpmc = rtnl_dereference(pmc->next); ip6_mc_clear_src(pmc); in6_dev_put(pmc->idev); - kfree(pmc); + kfree_rcu(pmc, rcu); }
/* clear dead sources, too */ - read_lock_bh(&idev->lock); - for (pmc = idev->mc_list; pmc; pmc = pmc->next) { + for_each_mc_rtnl(idev, pmc) { struct ip6_sf_list *psf, *psf_next;
spin_lock_bh(&pmc->mca_lock); @@ -837,7 +829,6 @@ static void mld_clear_delrec(struct inet6_dev *idev) kfree_rcu(psf, rcu); } } - read_unlock_bh(&idev->lock); }
static void mca_get(struct ifmcaddr6 *mc) @@ -849,7 +840,7 @@ static void ma_put(struct ifmcaddr6 *mc) { if (refcount_dec_and_test(&mc->mca_refcnt)) { in6_dev_put(mc->idev); - kfree(mc); + kfree_rcu(mc, rcu); } }
@@ -900,17 +891,14 @@ static int __ipv6_dev_mc_inc(struct net_device *dev, if (!idev) return -EINVAL;
- write_lock_bh(&idev->lock); if (idev->dead) { - write_unlock_bh(&idev->lock); in6_dev_put(idev); return -ENODEV; }
- for (mc = idev->mc_list; mc; mc = mc->next) { + for_each_mc_rtnl(idev, mc) { if (ipv6_addr_equal(&mc->mca_addr, addr)) { mc->mca_users++; - write_unlock_bh(&idev->lock); ip6_mc_add_src(idev, &mc->mca_addr, mode, 0, NULL, 0); in6_dev_put(idev); return 0; @@ -919,19 +907,14 @@ static int __ipv6_dev_mc_inc(struct net_device *dev,
mc = mca_alloc(idev, addr, mode); if (!mc) { - write_unlock_bh(&idev->lock); in6_dev_put(idev); return -ENOMEM; }
- mc->next = idev->mc_list; - idev->mc_list = mc; + rcu_assign_pointer(mc->next, idev->mc_list); + rcu_assign_pointer(idev->mc_list, mc);
- /* Hold this for the code below before we unlock, - * it is already exposed via idev->mc_list. - */ mca_get(mc); - write_unlock_bh(&idev->lock);
mld_del_delrec(idev, mc); igmp6_group_added(mc); @@ -950,16 +933,16 @@ EXPORT_SYMBOL(ipv6_dev_mc_inc); */ int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr) { - struct ifmcaddr6 *ma, **map; + struct ifmcaddr6 *ma, __rcu **map;
ASSERT_RTNL();
- write_lock_bh(&idev->lock); - for (map = &idev->mc_list; (ma = *map) != NULL; map = &ma->next) { + for (map = &idev->mc_list; + (ma = rtnl_dereference(*map)); + map = &ma->next) { if (ipv6_addr_equal(&ma->mca_addr, addr)) { if (--ma->mca_users == 0) { *map = ma->next; - write_unlock_bh(&idev->lock);
igmp6_group_dropped(ma); ip6_mc_clear_src(ma); @@ -967,11 +950,9 @@ int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr) ma_put(ma); return 0; } - write_unlock_bh(&idev->lock); return 0; } } - write_unlock_bh(&idev->lock);
return -ENOENT; } @@ -1006,8 +987,7 @@ bool ipv6_chk_mcast_addr(struct net_device *dev, const struct in6_addr *group, rcu_read_lock(); idev = __in6_dev_get(dev); if (idev) { - read_lock_bh(&idev->lock); - for (mc = idev->mc_list; mc; mc = mc->next) { + for_each_mc_rcu(idev, mc) { if (ipv6_addr_equal(&mc->mca_addr, group)) break; } @@ -1030,7 +1010,6 @@ bool ipv6_chk_mcast_addr(struct net_device *dev, const struct in6_addr *group, } else rv = true; /* don't filter unspecified source */ } - read_unlock_bh(&idev->lock); } rcu_read_unlock(); return rv; @@ -1082,9 +1061,8 @@ static void mld_dad_stop_work(struct inet6_dev *idev) }
/* - * IGMP handling (alias multicast ICMPv6 messages) + * IGMP handling (alias multicast ICMPv6 messages) */ - static void igmp6_group_queried(struct ifmcaddr6 *ma, unsigned long resptime) { unsigned long delay = resptime; @@ -1422,15 +1400,14 @@ int igmp6_event_query(struct sk_buff *skb) return -EINVAL; }
- read_lock_bh(&idev->lock); if (group_type == IPV6_ADDR_ANY) { - for (ma = idev->mc_list; ma; ma = ma->next) { + for_each_mc_rcu(idev, ma) { spin_lock_bh(&ma->mca_lock); igmp6_group_queried(ma, max_delay); spin_unlock_bh(&ma->mca_lock); } } else { - for (ma = idev->mc_list; ma; ma = ma->next) { + for_each_mc_rcu(idev, ma) { if (!ipv6_addr_equal(group, &ma->mca_addr)) continue; spin_lock_bh(&ma->mca_lock); @@ -1452,7 +1429,6 @@ int igmp6_event_query(struct sk_buff *skb) break; } } - read_unlock_bh(&idev->lock);
return 0; } @@ -1493,18 +1469,17 @@ int igmp6_event_report(struct sk_buff *skb) * Cancel the work for this group */
- read_lock_bh(&idev->lock); - for (ma = idev->mc_list; ma; ma = ma->next) { + for_each_mc_rcu(idev, ma) { if (ipv6_addr_equal(&ma->mca_addr, &mld->mld_mca)) { spin_lock(&ma->mca_lock); if (cancel_delayed_work(&ma->mca_work)) refcount_dec(&ma->mca_refcnt); - ma->mca_flags &= ~(MAF_LAST_REPORTER|MAF_TIMER_RUNNING); + ma->mca_flags &= ~(MAF_LAST_REPORTER | + MAF_TIMER_RUNNING); spin_unlock(&ma->mca_lock); break; } } - read_unlock_bh(&idev->lock); return 0; }
@@ -1868,9 +1843,8 @@ static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc) struct sk_buff *skb = NULL; int type;
- read_lock_bh(&idev->lock); if (!pmc) { - for (pmc = idev->mc_list; pmc; pmc = pmc->next) { + for_each_mc_rtnl(idev, pmc) { if (pmc->mca_flags & MAF_NOREPORT) continue; spin_lock_bh(&pmc->mca_lock); @@ -1890,7 +1864,6 @@ static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc) skb = add_grec(skb, pmc, type, 0, 0, 0); spin_unlock_bh(&pmc->mca_lock); } - read_unlock_bh(&idev->lock); if (skb) mld_sendpack(skb); } @@ -1927,12 +1900,12 @@ static void mld_send_cr(struct inet6_dev *idev) struct sk_buff *skb = NULL; int type, dtype;
- read_lock_bh(&idev->lock); - /* deleted MCA's */ pmc_prev = NULL; - for (pmc = idev->mc_tomb; pmc; pmc = pmc_next) { - pmc_next = pmc->next; + for (pmc = rtnl_dereference(idev->mc_tomb); + pmc; + pmc = pmc_next) { + pmc_next = rtnl_dereference(pmc->next); if (pmc->mca_sfmode == MCAST_INCLUDE) { type = MLD2_BLOCK_OLD_SOURCES; dtype = MLD2_BLOCK_OLD_SOURCES; @@ -1954,17 +1927,17 @@ static void mld_send_cr(struct inet6_dev *idev) !rcu_access_pointer(pmc->mca_tomb) && !rcu_access_pointer(pmc->mca_sources)) { if (pmc_prev) - pmc_prev->next = pmc_next; + rcu_assign_pointer(pmc_prev->next, pmc_next); else - idev->mc_tomb = pmc_next; + rcu_assign_pointer(idev->mc_tomb, pmc_next); in6_dev_put(pmc->idev); - kfree(pmc); + kfree_rcu(pmc, rcu); } else pmc_prev = pmc; }
/* change recs */ - for (pmc = idev->mc_list; pmc; pmc = pmc->next) { + for_each_mc_rtnl(idev, pmc) { spin_lock_bh(&pmc->mca_lock); if (pmc->mca_sfcount[MCAST_EXCLUDE]) { type = MLD2_BLOCK_OLD_SOURCES; @@ -1987,7 +1960,6 @@ static void mld_send_cr(struct inet6_dev *idev) } spin_unlock_bh(&pmc->mca_lock); } - read_unlock_bh(&idev->lock); if (!skb) return; (void) mld_sendpack(skb); @@ -2099,8 +2071,7 @@ static void mld_send_initial_cr(struct inet6_dev *idev) return;
skb = NULL; - read_lock_bh(&idev->lock); - for (pmc = idev->mc_list; pmc; pmc = pmc->next) { + for_each_mc_rtnl(idev, pmc) { spin_lock_bh(&pmc->mca_lock); if (pmc->mca_sfcount[MCAST_EXCLUDE]) type = MLD2_CHANGE_TO_EXCLUDE; @@ -2109,7 +2080,6 @@ static void mld_send_initial_cr(struct inet6_dev *idev) skb = add_grec(skb, pmc, type, 0, 0, 1); spin_unlock_bh(&pmc->mca_lock); } - read_unlock_bh(&idev->lock); if (skb) mld_sendpack(skb); } @@ -2132,7 +2102,9 @@ static void mld_dad_work(struct work_struct *work) struct inet6_dev, mc_dad_work);
+ rtnl_lock(); mld_send_initial_cr(idev); + rtnl_unlock(); if (idev->mc_dad_count) { idev->mc_dad_count--; if (idev->mc_dad_count) @@ -2194,24 +2166,22 @@ static int ip6_mc_del_src(struct inet6_dev *idev, const struct in6_addr *pmca,
if (!idev) return -ENODEV; - read_lock_bh(&idev->lock); - for (pmc = idev->mc_list; pmc; pmc = pmc->next) { + + for_each_mc_rtnl(idev, pmc) { if (ipv6_addr_equal(pmca, &pmc->mca_addr)) break; } - if (!pmc) { - /* MCA not found?? bug */ - read_unlock_bh(&idev->lock); + if (!pmc) return -ESRCH; - } spin_lock_bh(&pmc->mca_lock); + sf_markstate(pmc); if (!delta) { if (!pmc->mca_sfcount[sfmode]) { spin_unlock_bh(&pmc->mca_lock); - read_unlock_bh(&idev->lock); return -EINVAL; } + pmc->mca_sfcount[sfmode]--; } err = 0; @@ -2237,7 +2207,6 @@ static int ip6_mc_del_src(struct inet6_dev *idev, const struct in6_addr *pmca, } else if (sf_setstate(pmc) || changerec) mld_ifc_event(pmc->idev); spin_unlock_bh(&pmc->mca_lock); - read_unlock_bh(&idev->lock); return err; }
@@ -2363,16 +2332,13 @@ static int ip6_mc_add_src(struct inet6_dev *idev, const struct in6_addr *pmca,
if (!idev) return -ENODEV; - read_lock_bh(&idev->lock); - for (pmc = idev->mc_list; pmc; pmc = pmc->next) { + + for_each_mc_rtnl(idev, pmc) { if (ipv6_addr_equal(pmca, &pmc->mca_addr)) break; } - if (!pmc) { - /* MCA not found?? bug */ - read_unlock_bh(&idev->lock); + if (!pmc) return -ESRCH; - } spin_lock_bh(&pmc->mca_lock);
sf_markstate(pmc); @@ -2407,10 +2373,10 @@ static int ip6_mc_add_src(struct inet6_dev *idev, const struct in6_addr *pmca, for_each_psf_rtnl(pmc, psf) psf->sf_crcount = 0; mld_ifc_event(idev); - } else if (sf_setstate(pmc)) + } else if (sf_setstate(pmc)) { mld_ifc_event(idev); + } spin_unlock_bh(&pmc->mca_lock); - read_unlock_bh(&idev->lock); return err; }
@@ -2485,9 +2451,10 @@ static int ip6_mc_leave_src(struct sock *sk, struct ipv6_mc_socklist *iml, static void igmp6_leave_group(struct ifmcaddr6 *ma) { if (mld_in_v1_mode(ma->idev)) { - if (ma->mca_flags & MAF_LAST_REPORTER) + if (ma->mca_flags & MAF_LAST_REPORTER) { igmp6_send(&ma->mca_addr, ma->idev->dev, ICMPV6_MGM_REDUCTION); + } } else { mld_add_delrec(ma->idev, ma); mld_ifc_event(ma->idev); @@ -2500,8 +2467,12 @@ static void mld_gq_work(struct work_struct *work) struct inet6_dev, mc_gq_work);
- idev->mc_gq_running = 0; + rtnl_lock(); mld_send_report(idev, NULL); + rtnl_unlock(); + + idev->mc_gq_running = 0; + in6_dev_put(idev); }
@@ -2511,7 +2482,10 @@ static void mld_ifc_work(struct work_struct *work) struct inet6_dev, mc_ifc_work);
+ rtnl_lock(); mld_send_cr(idev); + rtnl_unlock(); + if (idev->mc_ifc_count) { idev->mc_ifc_count--; if (idev->mc_ifc_count) @@ -2525,6 +2499,7 @@ static void mld_ifc_event(struct inet6_dev *idev) { if (mld_in_v1_mode(idev)) return; + idev->mc_ifc_count = idev->mc_qrv; mld_ifc_start_work(idev, 1); } @@ -2534,10 +2509,12 @@ static void mld_mca_work(struct work_struct *work) struct ifmcaddr6 *ma = container_of(to_delayed_work(work), struct ifmcaddr6, mca_work);
+ rtnl_lock(); if (mld_in_v1_mode(ma->idev)) igmp6_send(&ma->mca_addr, ma->idev->dev, ICMPV6_MGM_REPORT); else mld_send_report(ma->idev, ma); + rtnl_unlock();
spin_lock_bh(&ma->mca_lock); ma->mca_flags |= MAF_LAST_REPORTER; @@ -2554,10 +2531,8 @@ void ipv6_mc_unmap(struct inet6_dev *idev)
/* Install multicast list, except for all-nodes (already installed) */
- read_lock_bh(&idev->lock); - for (i = idev->mc_list; i; i = i->next) + for_each_mc_rtnl(idev, i) igmp6_group_dropped(i); - read_unlock_bh(&idev->lock); }
void ipv6_mc_remap(struct inet6_dev *idev) @@ -2572,10 +2547,7 @@ void ipv6_mc_down(struct inet6_dev *idev) struct ifmcaddr6 *i;
/* Withdraw multicast list */ - - read_lock_bh(&idev->lock); - - for (i = idev->mc_list; i; i = i->next) + for_each_mc_rtnl(idev, i) igmp6_group_dropped(i);
/* Should stop work after group drop. or we will @@ -2584,7 +2556,6 @@ void ipv6_mc_down(struct inet6_dev *idev) mld_ifc_stop_work(idev); mld_gq_stop_work(idev); mld_dad_stop_work(idev); - read_unlock_bh(&idev->lock); }
static void ipv6_mc_reset(struct inet6_dev *idev) @@ -2604,28 +2575,24 @@ void ipv6_mc_up(struct inet6_dev *idev)
/* Install multicast list, except for all-nodes (already installed) */
- read_lock_bh(&idev->lock); ipv6_mc_reset(idev); - for (i = idev->mc_list; i; i = i->next) { + for_each_mc_rtnl(idev, i) { mld_del_delrec(idev, i); igmp6_group_added(i); } - read_unlock_bh(&idev->lock); }
/* IPv6 device initialization. */
void ipv6_mc_init_dev(struct inet6_dev *idev) { - write_lock_bh(&idev->lock); idev->mc_gq_running = 0; INIT_DELAYED_WORK(&idev->mc_gq_work, mld_gq_work); - idev->mc_tomb = NULL; + RCU_INIT_POINTER(idev->mc_tomb, NULL); idev->mc_ifc_count = 0; INIT_DELAYED_WORK(&idev->mc_ifc_work, mld_ifc_work); INIT_DELAYED_WORK(&idev->mc_dad_work, mld_dad_work); ipv6_mc_reset(idev); - write_unlock_bh(&idev->lock); }
/* @@ -2650,16 +2617,12 @@ void ipv6_mc_destroy_dev(struct inet6_dev *idev) if (idev->cnf.forwarding) __ipv6_dev_mc_dec(idev, &in6addr_linklocal_allrouters);
- write_lock_bh(&idev->lock); - while ((i = idev->mc_list) != NULL) { - idev->mc_list = i->next; + while ((i = rtnl_dereference(idev->mc_list))) { + rcu_assign_pointer(idev->mc_list, rtnl_dereference(i->next));
- write_unlock_bh(&idev->lock); ip6_mc_clear_src(i); ma_put(i); - write_lock_bh(&idev->lock); } - write_unlock_bh(&idev->lock); }
static void ipv6_mc_rejoin_groups(struct inet6_dev *idev) @@ -2669,12 +2632,11 @@ static void ipv6_mc_rejoin_groups(struct inet6_dev *idev) ASSERT_RTNL();
if (mld_in_v1_mode(idev)) { - read_lock_bh(&idev->lock); - for (pmc = idev->mc_list; pmc; pmc = pmc->next) + for_each_mc_rtnl(idev, pmc) igmp6_join_group(pmc); - read_unlock_bh(&idev->lock); - } else + } else { mld_send_report(idev, NULL); + } }
static int ipv6_mc_netdev_event(struct notifier_block *this, @@ -2721,13 +2683,12 @@ static inline struct ifmcaddr6 *igmp6_mc_get_first(struct seq_file *seq) idev = __in6_dev_get(state->dev); if (!idev) continue; - read_lock_bh(&idev->lock); - im = idev->mc_list; + + im = rcu_dereference(idev->mc_list); if (im) { state->idev = idev; break; } - read_unlock_bh(&idev->lock); } return im; } @@ -2736,11 +2697,8 @@ static struct ifmcaddr6 *igmp6_mc_get_next(struct seq_file *seq, struct ifmcaddr { struct igmp6_mc_iter_state *state = igmp6_mc_seq_private(seq);
- im = im->next; + im = rcu_dereference(im->next); while (!im) { - if (likely(state->idev)) - read_unlock_bh(&state->idev->lock); - state->dev = next_net_device_rcu(state->dev); if (!state->dev) { state->idev = NULL; @@ -2749,8 +2707,7 @@ static struct ifmcaddr6 *igmp6_mc_get_next(struct seq_file *seq, struct ifmcaddr state->idev = __in6_dev_get(state->dev); if (!state->idev) continue; - read_lock_bh(&state->idev->lock); - im = state->idev->mc_list; + im = rcu_dereference(state->idev->mc_list); } return im; } @@ -2784,10 +2741,8 @@ static void igmp6_mc_seq_stop(struct seq_file *seq, void *v) { struct igmp6_mc_iter_state *state = igmp6_mc_seq_private(seq);
- if (likely(state->idev)) { - read_unlock_bh(&state->idev->lock); + if (likely(state->idev)) state->idev = NULL; - } state->dev = NULL; rcu_read_unlock(); } @@ -2802,7 +2757,7 @@ static int igmp6_mc_seq_show(struct seq_file *seq, void *v) state->dev->ifindex, state->dev->name, &im->mca_addr, im->mca_users, im->mca_flags, - (im->mca_flags&MAF_TIMER_RUNNING) ? + (im->mca_flags & MAF_TIMER_RUNNING) ? jiffies_to_clock_t(im->mca_work.timer.expires - jiffies) : 0); return 0; } @@ -2837,8 +2792,8 @@ static inline struct ip6_sf_list *igmp6_mcf_get_first(struct seq_file *seq) idev = __in6_dev_get(state->dev); if (unlikely(idev == NULL)) continue; - read_lock_bh(&idev->lock); - im = idev->mc_list; + + im = rcu_dereference(idev->mc_list); if (likely(im)) { spin_lock_bh(&im->mca_lock); psf = rcu_dereference(im->mca_sources); @@ -2849,7 +2804,6 @@ static inline struct ip6_sf_list *igmp6_mcf_get_first(struct seq_file *seq) } spin_unlock_bh(&im->mca_lock); } - read_unlock_bh(&idev->lock); } return psf; } @@ -2861,11 +2815,8 @@ static struct ip6_sf_list *igmp6_mcf_get_next(struct seq_file *seq, struct ip6_s psf = rcu_dereference(psf->sf_next); while (!psf) { spin_unlock_bh(&state->im->mca_lock); - state->im = state->im->next; + state->im = rcu_dereference(state->im->next); while (!state->im) { - if (likely(state->idev)) - read_unlock_bh(&state->idev->lock); - state->dev = next_net_device_rcu(state->dev); if (!state->dev) { state->idev = NULL; @@ -2874,8 +2825,7 @@ static struct ip6_sf_list *igmp6_mcf_get_next(struct seq_file *seq, struct ip6_s state->idev = __in6_dev_get(state->dev); if (!state->idev) continue; - read_lock_bh(&state->idev->lock); - state->im = state->idev->mc_list; + state->im = rcu_dereference(state->idev->mc_list); } if (!state->im) break; @@ -2917,14 +2867,14 @@ static void igmp6_mcf_seq_stop(struct seq_file *seq, void *v) __releases(RCU) { struct igmp6_mcf_iter_state *state = igmp6_mcf_seq_private(seq); + if (likely(state->im)) { spin_unlock_bh(&state->im->mca_lock); state->im = NULL; } - if (likely(state->idev)) { - read_unlock_bh(&state->idev->lock); + if (likely(state->idev)) state->idev = NULL; - } + state->dev = NULL; rcu_read_unlock(); }
On 3/25/21 5:16 PM, Taehee Yoo wrote:
The ifmcaddr6 has been protected by inet6_dev->lock(rwlock) so that the critical section is atomic context. In order to switch this context, changing locking is needed. The ifmcaddr6 actually already protected by RTNL So if it's converted to use RCU, its control path context can be switched to sleepable.
I do not really understand the changelog.
You wanted to convert from RCU to RTNL, right ?
Also :
@@ -571,13 +573,9 @@ int ip6_mc_msfget(struct sock *sk, struct group_filter *gsf, if (!ipv6_addr_is_multicast(group)) return -EINVAL;
- rcu_read_lock();
- idev = ip6_mc_find_dev_rcu(net, group, gsf->gf_interface);
- if (!idev) {
rcu_read_unlock();
- idev = ip6_mc_find_dev_rtnl(net, group, gsf->gf_interface);
- if (!idev) return -ENODEV;
- }
I do not see RTNL being acquired before entering ip6_mc_msfget()
On 2021. 3. 30. ì˜¤ì „ 4:56, Eric Dumazet wrote:
Hi Eric, Thank you for the review!
On 3/25/21 5:16 PM, Taehee Yoo wrote:
The ifmcaddr6 has been protected by inet6_dev->lock(rwlock) so that the critical section is atomic context. In order to switch this context, changing locking is needed. The ifmcaddr6 actually already protected by RTNL So if it's converted to use RCU, its control path context can be switched to sleepable.
I do not really understand the changelog.
You wanted to convert from RCU to RTNL, right ?
The purpose of this is to use both RCU and RTNL. In the control path, ifmcaddr6 is protected by RTNL (setsockopt_needs_rtnl() in the do_ipv6_setsockopt()) And in the data path, ifmcaddr6 will be protected by RCU.
But ifmcaddr6 is already protected by RTNL in the control path. So, this patch is to convert ifmcaddr6 to RCU only for datapath. Therefore, by this patch, ifmcaddr6 will be protected by both RTNL and RCU.
I'm so sorry for this strange changelog.
Also :
@@ -571,13 +573,9 @@ int ip6_mc_msfget(struct sock *sk, struct
group_filter *gsf,
if (!ipv6_addr_is_multicast(group)) return -EINVAL;
- rcu_read_lock();
- idev = ip6_mc_find_dev_rcu(net, group, gsf->gf_interface);
- if (!idev) {
rcu_read_unlock();
- idev = ip6_mc_find_dev_rtnl(net, group, gsf->gf_interface);
- if (!idev) return -ENODEV;
- }
I do not see RTNL being acquired before entering ip6_mc_msfget()
Thank you so much for catching this. I will send a patch to fix this problem!
Thanks a lot! Taehee Yoo
When query/report packets are received, mld module processes them. But they are processed under BH context so it couldn't use sleepable functions. So, in order to switch context, the two workqueues are added which processes query and report event.
In the struct inet6_dev, mc_{query | report}_queue are added so it is per-interface queue. And mc_{query | report}_work are workqueue structure.
When the query or report event is received, skb is queued to proper queue and worker function is scheduled immediately. Workqueues and queues are protected by spinlock, which is mc_{query | report}_lock, and worker functions are protected by RTNL.
Suggested-by: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: Taehee Yoo ap420073@gmail.com --- v3: - Initial patch
include/net/if_inet6.h | 9 +- include/net/mld.h | 3 + net/ipv6/icmp.c | 4 +- net/ipv6/mcast.c | 280 +++++++++++++++++++++++++++++------------ 4 files changed, 210 insertions(+), 86 deletions(-)
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h index 521158e05c18..882e0f88756f 100644 --- a/include/net/if_inet6.h +++ b/include/net/if_inet6.h @@ -125,7 +125,6 @@ struct ifmcaddr6 { unsigned int mca_flags; int mca_users; refcount_t mca_refcnt; - spinlock_t mca_lock; unsigned long mca_cstamp; unsigned long mca_tstamp; struct rcu_head rcu; @@ -183,6 +182,14 @@ struct inet6_dev { struct delayed_work mc_gq_work; /* general query work */ struct delayed_work mc_ifc_work; /* interface change work */ struct delayed_work mc_dad_work; /* dad complete mc work */ + struct delayed_work mc_query_work; /* mld query work */ + struct delayed_work mc_report_work; /* mld report work */ + + struct sk_buff_head mc_query_queue; /* mld query queue */ + struct sk_buff_head mc_report_queue; /* mld report queue */ + + spinlock_t mc_query_lock; /* mld query queue lock */ + spinlock_t mc_report_lock; /* mld query report lock */
struct ifacaddr6 *ac_list; rwlock_t lock; diff --git a/include/net/mld.h b/include/net/mld.h index 496bddb59942..c07359808493 100644 --- a/include/net/mld.h +++ b/include/net/mld.h @@ -92,6 +92,9 @@ struct mld2_query { #define MLD_EXP_MIN_LIMIT 32768UL #define MLDV1_MRD_MAX_COMPAT (MLD_EXP_MIN_LIMIT - 1)
+#define MLD_MAX_QUEUE 8 +#define MLD_MAX_SKBS 32 + static inline unsigned long mldv2_mrc(const struct mld2_query *mlh2) { /* RFC3810, 5.1.3. Maximum Response Code */ diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c index fd1f896115c1..29d38d6b55fb 100644 --- a/net/ipv6/icmp.c +++ b/net/ipv6/icmp.c @@ -944,11 +944,11 @@ static int icmpv6_rcv(struct sk_buff *skb)
case ICMPV6_MGM_QUERY: igmp6_event_query(skb); - break; + return 0;
case ICMPV6_MGM_REPORT: igmp6_event_report(skb); - break; + return 0;
case ICMPV6_MGM_REDUCTION: case ICMPV6_NI_QUERY: diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c index 75541cf53153..3ad754388933 100644 --- a/net/ipv6/mcast.c +++ b/net/ipv6/mcast.c @@ -439,7 +439,7 @@ int ip6_mc_source(int add, int omode, struct sock *sk,
if (psl) count += psl->sl_max; - newpsl = sock_kmalloc(sk, IP6_SFLSIZE(count), GFP_ATOMIC); + newpsl = sock_kmalloc(sk, IP6_SFLSIZE(count), GFP_KERNEL); if (!newpsl) { err = -ENOBUFS; goto done; @@ -517,7 +517,7 @@ int ip6_mc_msfilter(struct sock *sk, struct group_filter *gsf, } if (gsf->gf_numsrc) { newpsl = sock_kmalloc(sk, IP6_SFLSIZE(gsf->gf_numsrc), - GFP_ATOMIC); + GFP_KERNEL); if (!newpsl) { err = -ENOBUFS; goto done; @@ -659,13 +659,11 @@ static void igmp6_group_added(struct ifmcaddr6 *mc) IPV6_ADDR_SCOPE_LINKLOCAL) return;
- spin_lock_bh(&mc->mca_lock); if (!(mc->mca_flags&MAF_LOADED)) { mc->mca_flags |= MAF_LOADED; if (ndisc_mc_map(&mc->mca_addr, buf, dev, 0) == 0) dev_mc_add(dev, buf); } - spin_unlock_bh(&mc->mca_lock);
if (!(dev->flags & IFF_UP) || (mc->mca_flags & MAF_NOREPORT)) return; @@ -695,24 +693,20 @@ static void igmp6_group_dropped(struct ifmcaddr6 *mc) IPV6_ADDR_SCOPE_LINKLOCAL) return;
- spin_lock_bh(&mc->mca_lock); if (mc->mca_flags&MAF_LOADED) { mc->mca_flags &= ~MAF_LOADED; if (ndisc_mc_map(&mc->mca_addr, buf, dev, 0) == 0) dev_mc_del(dev, buf); }
- spin_unlock_bh(&mc->mca_lock); if (mc->mca_flags & MAF_NOREPORT) return;
if (!mc->idev->dead) igmp6_leave_group(mc);
- spin_lock_bh(&mc->mca_lock); if (cancel_delayed_work(&mc->mca_work)) refcount_dec(&mc->mca_refcnt); - spin_unlock_bh(&mc->mca_lock); }
/* @@ -728,12 +722,10 @@ static void mld_add_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) * for deleted items allows change reports to use common code with * non-deleted or query-response MCA's. */ - pmc = kzalloc(sizeof(*pmc), GFP_ATOMIC); + pmc = kzalloc(sizeof(*pmc), GFP_KERNEL); if (!pmc) return;
- spin_lock_bh(&im->mca_lock); - spin_lock_init(&pmc->mca_lock); pmc->idev = im->idev; in6_dev_hold(idev); pmc->mca_addr = im->mca_addr; @@ -752,7 +744,6 @@ static void mld_add_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) for_each_psf_rtnl(pmc, psf) psf->sf_crcount = pmc->mca_crcount; } - spin_unlock_bh(&im->mca_lock);
rcu_assign_pointer(pmc->next, idev->mc_tomb); rcu_assign_pointer(idev->mc_tomb, pmc); @@ -777,7 +768,6 @@ static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) rcu_assign_pointer(idev->mc_tomb, pmc->next); }
- spin_lock_bh(&im->mca_lock); if (pmc) { im->idev = pmc->idev; if (im->mca_sfmode == MCAST_INCLUDE) { @@ -799,7 +789,6 @@ static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) ip6_mc_clear_src(pmc); kfree_rcu(pmc, rcu); } - spin_unlock_bh(&im->mca_lock); }
static void mld_clear_delrec(struct inet6_dev *idev) @@ -820,10 +809,8 @@ static void mld_clear_delrec(struct inet6_dev *idev) for_each_mc_rtnl(idev, pmc) { struct ip6_sf_list *psf, *psf_next;
- spin_lock_bh(&pmc->mca_lock); psf = rtnl_dereference(pmc->mca_tomb); RCU_INIT_POINTER(pmc->mca_tomb, NULL); - spin_unlock_bh(&pmc->mca_lock); for (; psf; psf = psf_next) { psf_next = rtnl_dereference(psf->sf_next); kfree_rcu(psf, rcu); @@ -831,6 +818,26 @@ static void mld_clear_delrec(struct inet6_dev *idev) } }
+static void mld_clear_query(struct inet6_dev *idev) +{ + struct sk_buff *skb; + + spin_lock_bh(&idev->mc_query_lock); + while ((skb = __skb_dequeue(&idev->mc_query_queue))) + kfree_skb(skb); + spin_unlock_bh(&idev->mc_query_lock); +} + +static void mld_clear_report(struct inet6_dev *idev) +{ + struct sk_buff *skb; + + spin_lock_bh(&idev->mc_report_lock); + while ((skb = __skb_dequeue(&idev->mc_report_queue))) + kfree_skb(skb); + spin_unlock_bh(&idev->mc_report_lock); +} + static void mca_get(struct ifmcaddr6 *mc) { refcount_inc(&mc->mca_refcnt); @@ -850,7 +857,7 @@ static struct ifmcaddr6 *mca_alloc(struct inet6_dev *idev, { struct ifmcaddr6 *mc;
- mc = kzalloc(sizeof(*mc), GFP_ATOMIC); + mc = kzalloc(sizeof(*mc), GFP_KERNEL); if (!mc) return NULL;
@@ -862,7 +869,6 @@ static struct ifmcaddr6 *mca_alloc(struct inet6_dev *idev, /* mca_stamp should be updated upon changes */ mc->mca_cstamp = mc->mca_tstamp = jiffies; refcount_set(&mc->mca_refcnt, 1); - spin_lock_init(&mc->mca_lock);
mc->mca_sfmode = mode; mc->mca_sfcount[mode] = 1; @@ -995,7 +1001,6 @@ bool ipv6_chk_mcast_addr(struct net_device *dev, const struct in6_addr *group, if (src_addr && !ipv6_addr_any(src_addr)) { struct ip6_sf_list *psf;
- spin_lock_bh(&mc->mca_lock); for_each_psf_rcu(mc, psf) { if (ipv6_addr_equal(&psf->sf_addr, src_addr)) break; @@ -1006,7 +1011,6 @@ bool ipv6_chk_mcast_addr(struct net_device *dev, const struct in6_addr *group, mc->mca_sfcount[MCAST_EXCLUDE]; else rv = mc->mca_sfcount[MCAST_EXCLUDE] != 0; - spin_unlock_bh(&mc->mca_lock); } else rv = true; /* don't filter unspecified source */ } @@ -1060,6 +1064,20 @@ static void mld_dad_stop_work(struct inet6_dev *idev) __in6_dev_put(idev); }
+static void mld_query_stop_work(struct inet6_dev *idev) +{ + spin_lock_bh(&idev->mc_query_lock); + if (cancel_delayed_work(&idev->mc_query_work)) + __in6_dev_put(idev); + spin_unlock_bh(&idev->mc_query_lock); +} + +static void mld_report_stop_work(struct inet6_dev *idev) +{ + if (cancel_delayed_work_sync(&idev->mc_report_work)) + __in6_dev_put(idev); +} + /* * IGMP handling (alias multicast ICMPv6 messages) */ @@ -1093,7 +1111,7 @@ static bool mld_xmarksources(struct ifmcaddr6 *pmc, int nsrcs, int i, scount;
scount = 0; - for_each_psf_rcu(pmc, psf) { + for_each_psf_rtnl(pmc, psf) { if (scount == nsrcs) break; for (i = 0; i < nsrcs; i++) { @@ -1126,7 +1144,7 @@ static bool mld_marksources(struct ifmcaddr6 *pmc, int nsrcs, /* mark INCLUDE-mode sources */
scount = 0; - for_each_psf_rcu(pmc, psf) { + for_each_psf_rtnl(pmc, psf) { if (scount == nsrcs) break; for (i = 0; i < nsrcs; i++) { @@ -1317,19 +1335,42 @@ static int mld_process_v2(struct inet6_dev *idev, struct mld2_query *mld,
/* called with rcu_read_lock() */ int igmp6_event_query(struct sk_buff *skb) +{ + struct inet6_dev *idev = __in6_dev_get(skb->dev); + + if (!idev) + return -EINVAL; + + if (idev->dead) { + kfree_skb(skb); + return -ENODEV; + } + + spin_lock_bh(&idev->mc_query_lock); + if (skb_queue_len(&idev->mc_query_queue) < MLD_MAX_SKBS) { + __skb_queue_tail(&idev->mc_query_queue, skb); + if (!mod_delayed_work(mld_wq, &idev->mc_query_work, 0)) + in6_dev_hold(idev); + } + spin_unlock_bh(&idev->mc_query_lock); + + return 0; +} + +static void __mld_query_work(struct sk_buff *skb) { struct mld2_query *mlh2 = NULL; - struct ifmcaddr6 *ma; const struct in6_addr *group; unsigned long max_delay; struct inet6_dev *idev; + struct ifmcaddr6 *ma; struct mld_msg *mld; int group_type; int mark = 0; int len, err;
if (!pskb_may_pull(skb, sizeof(struct in6_addr))) - return -EINVAL; + goto out;
/* compute payload length excluding extension headers */ len = ntohs(ipv6_hdr(skb)->payload_len) + sizeof(struct ipv6hdr); @@ -1346,11 +1387,11 @@ int igmp6_event_query(struct sk_buff *skb) ipv6_hdr(skb)->hop_limit != 1 || !(IP6CB(skb)->flags & IP6SKB_ROUTERALERT) || IP6CB(skb)->ra != htons(IPV6_OPT_ROUTERALERT_MLD)) - return -EINVAL; + goto out;
idev = __in6_dev_get(skb->dev); if (!idev) - return 0; + goto out;
mld = (struct mld_msg *)icmp6_hdr(skb); group = &mld->mld_mca; @@ -1358,59 +1399,56 @@ int igmp6_event_query(struct sk_buff *skb)
if (group_type != IPV6_ADDR_ANY && !(group_type&IPV6_ADDR_MULTICAST)) - return -EINVAL; + goto out;
if (len < MLD_V1_QUERY_LEN) { - return -EINVAL; + goto out; } else if (len == MLD_V1_QUERY_LEN || mld_in_v1_mode(idev)) { err = mld_process_v1(idev, mld, &max_delay, len == MLD_V1_QUERY_LEN); if (err < 0) - return err; + goto out; } else if (len >= MLD_V2_QUERY_LEN_MIN) { int srcs_offset = sizeof(struct mld2_query) - sizeof(struct icmp6hdr);
if (!pskb_may_pull(skb, srcs_offset)) - return -EINVAL; + goto out;
mlh2 = (struct mld2_query *)skb_transport_header(skb);
err = mld_process_v2(idev, mlh2, &max_delay); if (err < 0) - return err; + goto out;
if (group_type == IPV6_ADDR_ANY) { /* general query */ if (mlh2->mld2q_nsrcs) - return -EINVAL; /* no sources allowed */ + goto out; /* no sources allowed */
mld_gq_start_work(idev); - return 0; + goto out; } /* mark sources to include, if group & source-specific */ if (mlh2->mld2q_nsrcs != 0) { if (!pskb_may_pull(skb, srcs_offset + ntohs(mlh2->mld2q_nsrcs) * sizeof(struct in6_addr))) - return -EINVAL; + goto out;
mlh2 = (struct mld2_query *)skb_transport_header(skb); mark = 1; } } else { - return -EINVAL; + goto out; }
if (group_type == IPV6_ADDR_ANY) { - for_each_mc_rcu(idev, ma) { - spin_lock_bh(&ma->mca_lock); + for_each_mc_rtnl(idev, ma) { igmp6_group_queried(ma, max_delay); - spin_unlock_bh(&ma->mca_lock); } } else { - for_each_mc_rcu(idev, ma) { + for_each_mc_rtnl(idev, ma) { if (!ipv6_addr_equal(group, &ma->mca_addr)) continue; - spin_lock_bh(&ma->mca_lock); if (ma->mca_flags & MAF_TIMER_RUNNING) { /* gsquery <- gsquery && mark */ if (!mark) @@ -1425,16 +1463,72 @@ int igmp6_event_query(struct sk_buff *skb) if (!(ma->mca_flags & MAF_GSQUERY) || mld_marksources(ma, ntohs(mlh2->mld2q_nsrcs), mlh2->mld2q_srcs)) igmp6_group_queried(ma, max_delay); - spin_unlock_bh(&ma->mca_lock); break; } }
- return 0; +out: + consume_skb(skb); +} + +static void mld_query_work(struct work_struct *work) +{ + struct inet6_dev *idev = container_of(to_delayed_work(work), + struct inet6_dev, + mc_query_work); + struct sk_buff_head q; + struct sk_buff *skb; + bool rework = false; + int cnt = 0; + + skb_queue_head_init(&q); + + spin_lock_bh(&idev->mc_query_lock); + while ((skb = __skb_dequeue(&idev->mc_query_queue))) { + __skb_queue_tail(&q, skb); + + if (++cnt >= MLD_MAX_QUEUE) { + rework = true; + schedule_delayed_work(&idev->mc_query_work, 0); + break; + } + } + spin_unlock_bh(&idev->mc_query_lock); + + rtnl_lock(); + while ((skb = __skb_dequeue(&q))) + __mld_query_work(skb); + rtnl_unlock(); + + if (!rework) + in6_dev_put(idev); }
/* called with rcu_read_lock() */ int igmp6_event_report(struct sk_buff *skb) +{ + struct inet6_dev *idev = __in6_dev_get(skb->dev); + + if (!idev) + return -EINVAL; + + if (idev->dead) { + kfree_skb(skb); + return -ENODEV; + } + + spin_lock_bh(&idev->mc_report_lock); + if (skb_queue_len(&idev->mc_report_queue) < MLD_MAX_SKBS) { + __skb_queue_tail(&idev->mc_report_queue, skb); + if (!mod_delayed_work(mld_wq, &idev->mc_report_work, 0)) + in6_dev_hold(idev); + } + spin_unlock_bh(&idev->mc_report_lock); + + return 0; +} + +static void __mld_report_work(struct sk_buff *skb) { struct ifmcaddr6 *ma; struct inet6_dev *idev; @@ -1443,15 +1537,15 @@ int igmp6_event_report(struct sk_buff *skb)
/* Our own report looped back. Ignore it. */ if (skb->pkt_type == PACKET_LOOPBACK) - return 0; + goto out;
/* send our report if the MC router may not have heard this report */ if (skb->pkt_type != PACKET_MULTICAST && skb->pkt_type != PACKET_BROADCAST) - return 0; + goto out;
if (!pskb_may_pull(skb, sizeof(*mld) - sizeof(struct icmp6hdr))) - return -EINVAL; + goto out;
mld = (struct mld_msg *)icmp6_hdr(skb);
@@ -1459,28 +1553,60 @@ int igmp6_event_report(struct sk_buff *skb) addr_type = ipv6_addr_type(&ipv6_hdr(skb)->saddr); if (addr_type != IPV6_ADDR_ANY && !(addr_type&IPV6_ADDR_LINKLOCAL)) - return -EINVAL; + goto out;
idev = __in6_dev_get(skb->dev); if (!idev) - return -ENODEV; + goto out;
/* * Cancel the work for this group */
- for_each_mc_rcu(idev, ma) { + for_each_mc_rtnl(idev, ma) { if (ipv6_addr_equal(&ma->mca_addr, &mld->mld_mca)) { - spin_lock(&ma->mca_lock); if (cancel_delayed_work(&ma->mca_work)) refcount_dec(&ma->mca_refcnt); ma->mca_flags &= ~(MAF_LAST_REPORTER | MAF_TIMER_RUNNING); - spin_unlock(&ma->mca_lock); break; } } - return 0; + +out: + consume_skb(skb); +} + +static void mld_report_work(struct work_struct *work) +{ + struct inet6_dev *idev = container_of(to_delayed_work(work), + struct inet6_dev, + mc_report_work); + struct sk_buff_head q; + struct sk_buff *skb; + bool rework = false; + int cnt = 0; + + skb_queue_head_init(&q); + spin_lock_bh(&idev->mc_report_lock); + while ((skb = __skb_dequeue(&idev->mc_report_queue))) { + __skb_queue_tail(&q, skb); + + if (++cnt >= MLD_MAX_QUEUE) { + rework = true; + schedule_delayed_work(&idev->mc_report_work, 0); + break; + } + } + spin_unlock_bh(&idev->mc_report_lock); + + rtnl_lock(); + while ((skb = __skb_dequeue(&q))) + __mld_report_work(skb); + rtnl_unlock(); + + if (!rework) + in6_dev_put(idev); }
static bool is_in(struct ifmcaddr6 *pmc, struct ip6_sf_list *psf, int type, @@ -1847,22 +1973,18 @@ static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc) for_each_mc_rtnl(idev, pmc) { if (pmc->mca_flags & MAF_NOREPORT) continue; - spin_lock_bh(&pmc->mca_lock); if (pmc->mca_sfcount[MCAST_EXCLUDE]) type = MLD2_MODE_IS_EXCLUDE; else type = MLD2_MODE_IS_INCLUDE; skb = add_grec(skb, pmc, type, 0, 0, 0); - spin_unlock_bh(&pmc->mca_lock); } } else { - spin_lock_bh(&pmc->mca_lock); if (pmc->mca_sfcount[MCAST_EXCLUDE]) type = MLD2_MODE_IS_EXCLUDE; else type = MLD2_MODE_IS_INCLUDE; skb = add_grec(skb, pmc, type, 0, 0, 0); - spin_unlock_bh(&pmc->mca_lock); } if (skb) mld_sendpack(skb); @@ -1938,7 +2060,6 @@ static void mld_send_cr(struct inet6_dev *idev)
/* change recs */ for_each_mc_rtnl(idev, pmc) { - spin_lock_bh(&pmc->mca_lock); if (pmc->mca_sfcount[MCAST_EXCLUDE]) { type = MLD2_BLOCK_OLD_SOURCES; dtype = MLD2_ALLOW_NEW_SOURCES; @@ -1958,7 +2079,6 @@ static void mld_send_cr(struct inet6_dev *idev) skb = add_grec(skb, pmc, type, 0, 0, 0); pmc->mca_crcount--; } - spin_unlock_bh(&pmc->mca_lock); } if (!skb) return; @@ -2072,13 +2192,11 @@ static void mld_send_initial_cr(struct inet6_dev *idev)
skb = NULL; for_each_mc_rtnl(idev, pmc) { - spin_lock_bh(&pmc->mca_lock); if (pmc->mca_sfcount[MCAST_EXCLUDE]) type = MLD2_CHANGE_TO_EXCLUDE; else type = MLD2_ALLOW_NEW_SOURCES; skb = add_grec(skb, pmc, type, 0, 0, 1); - spin_unlock_bh(&pmc->mca_lock); } if (skb) mld_sendpack(skb); @@ -2104,13 +2222,13 @@ static void mld_dad_work(struct work_struct *work)
rtnl_lock(); mld_send_initial_cr(idev); - rtnl_unlock(); if (idev->mc_dad_count) { idev->mc_dad_count--; if (idev->mc_dad_count) mld_dad_start_work(idev, unsolicited_report_interval(idev)); } + rtnl_unlock(); in6_dev_put(idev); }
@@ -2173,12 +2291,10 @@ static int ip6_mc_del_src(struct inet6_dev *idev, const struct in6_addr *pmca, } if (!pmc) return -ESRCH; - spin_lock_bh(&pmc->mca_lock);
sf_markstate(pmc); if (!delta) { if (!pmc->mca_sfcount[sfmode]) { - spin_unlock_bh(&pmc->mca_lock); return -EINVAL; }
@@ -2206,7 +2322,6 @@ static int ip6_mc_del_src(struct inet6_dev *idev, const struct in6_addr *pmca, mld_ifc_event(pmc->idev); } else if (sf_setstate(pmc) || changerec) mld_ifc_event(pmc->idev); - spin_unlock_bh(&pmc->mca_lock); return err; }
@@ -2225,7 +2340,7 @@ static int ip6_mc_add1_src(struct ifmcaddr6 *pmc, int sfmode, psf_prev = psf; } if (!psf) { - psf = kzalloc(sizeof(*psf), GFP_ATOMIC); + psf = kzalloc(sizeof(*psf), GFP_KERNEL); if (!psf) return -ENOBUFS;
@@ -2304,7 +2419,7 @@ static int sf_setstate(struct ifmcaddr6 *pmc) &psf->sf_addr)) break; if (!dpsf) { - dpsf = kmalloc(sizeof(*dpsf), GFP_ATOMIC); + dpsf = kmalloc(sizeof(*dpsf), GFP_KERNEL); if (!dpsf) continue; *dpsf = *psf; @@ -2339,7 +2454,6 @@ static int ip6_mc_add_src(struct inet6_dev *idev, const struct in6_addr *pmca, } if (!pmc) return -ESRCH; - spin_lock_bh(&pmc->mca_lock);
sf_markstate(pmc); isexclude = pmc->mca_sfmode == MCAST_EXCLUDE; @@ -2376,7 +2490,6 @@ static int ip6_mc_add_src(struct inet6_dev *idev, const struct in6_addr *pmca, } else if (sf_setstate(pmc)) { mld_ifc_event(idev); } - spin_unlock_bh(&pmc->mca_lock); return err; }
@@ -2415,7 +2528,6 @@ static void igmp6_join_group(struct ifmcaddr6 *ma)
delay = prandom_u32() % unsolicited_report_interval(ma->idev);
- spin_lock_bh(&ma->mca_lock); if (cancel_delayed_work(&ma->mca_work)) { refcount_dec(&ma->mca_refcnt); delay = ma->mca_work.timer.expires - jiffies; @@ -2424,7 +2536,6 @@ static void igmp6_join_group(struct ifmcaddr6 *ma) if (!mod_delayed_work(mld_wq, &ma->mca_work, delay)) refcount_inc(&ma->mca_refcnt); ma->mca_flags |= MAF_TIMER_RUNNING | MAF_LAST_REPORTER; - spin_unlock_bh(&ma->mca_lock); }
static int ip6_mc_leave_src(struct sock *sk, struct ipv6_mc_socklist *iml, @@ -2469,9 +2580,8 @@ static void mld_gq_work(struct work_struct *work)
rtnl_lock(); mld_send_report(idev, NULL); - rtnl_unlock(); - idev->mc_gq_running = 0; + rtnl_unlock();
in6_dev_put(idev); } @@ -2484,7 +2594,6 @@ static void mld_ifc_work(struct work_struct *work)
rtnl_lock(); mld_send_cr(idev); - rtnl_unlock();
if (idev->mc_ifc_count) { idev->mc_ifc_count--; @@ -2492,6 +2601,7 @@ static void mld_ifc_work(struct work_struct *work) mld_ifc_start_work(idev, unsolicited_report_interval(idev)); } + rtnl_unlock(); in6_dev_put(idev); }
@@ -2514,12 +2624,10 @@ static void mld_mca_work(struct work_struct *work) igmp6_send(&ma->mca_addr, ma->idev->dev, ICMPV6_MGM_REPORT); else mld_send_report(ma->idev, ma); - rtnl_unlock(); - - spin_lock_bh(&ma->mca_lock); ma->mca_flags |= MAF_LAST_REPORTER; ma->mca_flags &= ~MAF_TIMER_RUNNING; - spin_unlock_bh(&ma->mca_lock); + rtnl_unlock(); + ma_put(ma); }
@@ -2553,6 +2661,9 @@ void ipv6_mc_down(struct inet6_dev *idev) /* Should stop work after group drop. or we will * start work again in mld_ifc_event() */ + synchronize_net(); + mld_query_stop_work(idev); + mld_report_stop_work(idev); mld_ifc_stop_work(idev); mld_gq_stop_work(idev); mld_dad_stop_work(idev); @@ -2592,6 +2703,12 @@ void ipv6_mc_init_dev(struct inet6_dev *idev) idev->mc_ifc_count = 0; INIT_DELAYED_WORK(&idev->mc_ifc_work, mld_ifc_work); INIT_DELAYED_WORK(&idev->mc_dad_work, mld_dad_work); + INIT_DELAYED_WORK(&idev->mc_query_work, mld_query_work); + INIT_DELAYED_WORK(&idev->mc_report_work, mld_report_work); + skb_queue_head_init(&idev->mc_query_queue); + skb_queue_head_init(&idev->mc_report_queue); + spin_lock_init(&idev->mc_query_lock); + spin_lock_init(&idev->mc_report_lock); ipv6_mc_reset(idev); }
@@ -2606,6 +2723,8 @@ void ipv6_mc_destroy_dev(struct inet6_dev *idev) /* Deactivate works */ ipv6_mc_down(idev); mld_clear_delrec(idev); + mld_clear_query(idev); + mld_clear_report(idev);
/* Delete all-nodes address. */ /* We cannot call ipv6_dev_mc_dec() directly, our caller in @@ -2795,14 +2914,12 @@ static inline struct ip6_sf_list *igmp6_mcf_get_first(struct seq_file *seq)
im = rcu_dereference(idev->mc_list); if (likely(im)) { - spin_lock_bh(&im->mca_lock); psf = rcu_dereference(im->mca_sources); if (likely(psf)) { state->im = im; state->idev = idev; break; } - spin_unlock_bh(&im->mca_lock); } } return psf; @@ -2814,7 +2931,6 @@ static struct ip6_sf_list *igmp6_mcf_get_next(struct seq_file *seq, struct ip6_s
psf = rcu_dereference(psf->sf_next); while (!psf) { - spin_unlock_bh(&state->im->mca_lock); state->im = rcu_dereference(state->im->next); while (!state->im) { state->dev = next_net_device_rcu(state->dev); @@ -2829,7 +2945,6 @@ static struct ip6_sf_list *igmp6_mcf_get_next(struct seq_file *seq, struct ip6_s } if (!state->im) break; - spin_lock_bh(&state->im->mca_lock); psf = rcu_dereference(state->im->mca_sources); } out: @@ -2868,10 +2983,8 @@ static void igmp6_mcf_seq_stop(struct seq_file *seq, void *v) { struct igmp6_mcf_iter_state *state = igmp6_mcf_seq_private(seq);
- if (likely(state->im)) { - spin_unlock_bh(&state->im->mca_lock); + if (likely(state->im)) state->im = NULL; - } if (likely(state->idev)) state->idev = NULL;
@@ -2955,6 +3068,7 @@ static int __net_init igmp6_net_init(struct net *net) }
inet6_sk(net->ipv6.igmp_sk)->hop_limit = 1; + net->ipv6.igmp_sk->sk_allocation = GFP_KERNEL;
err = inet_ctl_sock_create(&net->ipv6.mc_autojoin_sk, PF_INET6, SOCK_RAW, IPPROTO_ICMPV6, net);
The purpose of this lock is to avoid a bottleneck in the query/report event handler logic.
By previous patches, almost all mld data is protected by RTNL. So, the query and report event handler, which is data path logic acquires RTNL too. Therefore if a lot of query and report events are received, it uses RTNL for a long time. So it makes the control-plane bottleneck because of using RTNL. In order to avoid this bottleneck, mc_lock is added.
mc_lock protect only per-interface mld data and per-interface mld data is used in the query/report event handler logic. So, no longer rtnl_lock is needed in the query/report event handler logic. Therefore bottleneck will be disappeared by mc_lock.
Suggested-by: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: Taehee Yoo ap420073@gmail.com --- v3: - Initial patch
include/net/if_inet6.h | 1 + net/ipv6/mcast.c | 309 +++++++++++++++++++++++++---------------- 2 files changed, 194 insertions(+), 116 deletions(-)
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h index 882e0f88756f..71bb4cc4d05d 100644 --- a/include/net/if_inet6.h +++ b/include/net/if_inet6.h @@ -190,6 +190,7 @@ struct inet6_dev {
spinlock_t mc_query_lock; /* mld query queue lock */ spinlock_t mc_report_lock; /* mld query report lock */ + struct mutex mc_lock; /* mld global lock */
struct ifacaddr6 *ac_list; rwlock_t lock; diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c index 3ad754388933..49b0cebfdcdc 100644 --- a/net/ipv6/mcast.c +++ b/net/ipv6/mcast.c @@ -111,6 +111,8 @@ int sysctl_mld_qrv __read_mostly = MLD_QRV_DEFAULT; /* * socket join on multicast group */ +#define mc_dereference(e, idev) \ + rcu_dereference_protected(e, lockdep_is_held(&(idev)->mc_lock))
#define for_each_pmc_rtnl(np, pmc) \ for (pmc = rtnl_dereference((np)->ipv6_mc_list); \ @@ -122,10 +124,10 @@ int sysctl_mld_qrv __read_mostly = MLD_QRV_DEFAULT; pmc; \ pmc = rcu_dereference(pmc->next))
-#define for_each_psf_rtnl(mc, psf) \ - for (psf = rtnl_dereference((mc)->mca_sources); \ +#define for_each_psf_mclock(mc, psf) \ + for (psf = mc_dereference((mc)->mca_sources, mc->idev); \ psf; \ - psf = rtnl_dereference(psf->sf_next)) + psf = mc_dereference(psf->sf_next, mc->idev))
#define for_each_psf_rcu(mc, psf) \ for (psf = rcu_dereference((mc)->mca_sources); \ @@ -133,14 +135,14 @@ int sysctl_mld_qrv __read_mostly = MLD_QRV_DEFAULT; psf = rcu_dereference(psf->sf_next))
#define for_each_psf_tomb(mc, psf) \ - for (psf = rtnl_dereference((mc)->mca_tomb); \ + for (psf = mc_dereference((mc)->mca_tomb, mc->idev); \ psf; \ - psf = rtnl_dereference(psf->sf_next)) + psf = mc_dereference(psf->sf_next, mc->idev))
-#define for_each_mc_rtnl(idev, mc) \ - for (mc = rtnl_dereference((idev)->mc_list); \ +#define for_each_mc_mclock(idev, mc) \ + for (mc = mc_dereference((idev)->mc_list, idev); \ mc; \ - mc = rtnl_dereference(mc->next)) + mc = mc_dereference(mc->next, idev))
#define for_each_mc_rcu(idev, mc) \ for (mc = rcu_dereference((idev)->mc_list); \ @@ -148,9 +150,9 @@ int sysctl_mld_qrv __read_mostly = MLD_QRV_DEFAULT; mc = rcu_dereference(mc->next))
#define for_each_mc_tomb(idev, mc) \ - for (mc = rtnl_dereference((idev)->mc_tomb); \ + for (mc = mc_dereference((idev)->mc_tomb, idev); \ mc; \ - mc = rtnl_dereference(mc->next)) + mc = mc_dereference(mc->next, idev))
static int unsolicited_report_interval(struct inet6_dev *idev) { @@ -268,11 +270,12 @@ int ipv6_sock_mc_drop(struct sock *sk, int ifindex, const struct in6_addr *addr) if (dev) { struct inet6_dev *idev = __in6_dev_get(dev);
- (void) ip6_mc_leave_src(sk, mc_lst, idev); + ip6_mc_leave_src(sk, mc_lst, idev); if (idev) __ipv6_dev_mc_dec(idev, &mc_lst->addr); - } else - (void) ip6_mc_leave_src(sk, mc_lst, NULL); + } else { + ip6_mc_leave_src(sk, mc_lst, NULL); + }
atomic_sub(sizeof(*mc_lst), &sk->sk_omem_alloc); kfree_rcu(mc_lst, rcu); @@ -329,11 +332,12 @@ void __ipv6_sock_mc_close(struct sock *sk) if (dev) { struct inet6_dev *idev = __in6_dev_get(dev);
- (void) ip6_mc_leave_src(sk, mc_lst, idev); + ip6_mc_leave_src(sk, mc_lst, idev); if (idev) __ipv6_dev_mc_dec(idev, &mc_lst->addr); - } else - (void) ip6_mc_leave_src(sk, mc_lst, NULL); + } else { + ip6_mc_leave_src(sk, mc_lst, NULL); + }
atomic_sub(sizeof(*mc_lst), &sk->sk_omem_alloc); kfree_rcu(mc_lst, rcu); @@ -376,6 +380,7 @@ int ip6_mc_source(int add, int omode, struct sock *sk,
err = -EADDRNOTAVAIL;
+ mutex_lock(&idev->mc_lock); for_each_pmc_rtnl(inet6, pmc) { if (pgsr->gsr_interface && pmc->ifindex != pgsr->gsr_interface) continue; @@ -469,6 +474,7 @@ int ip6_mc_source(int add, int omode, struct sock *sk, /* update the interface list */ ip6_mc_add_src(idev, group, omode, 1, source, 1); done: + mutex_unlock(&idev->mc_lock); if (leavegroup) err = ipv6_sock_mc_drop(sk, pgsr->gsr_interface, group); return err; @@ -529,25 +535,33 @@ int ip6_mc_msfilter(struct sock *sk, struct group_filter *gsf, psin6 = (struct sockaddr_in6 *)list; newpsl->sl_addr[i] = psin6->sin6_addr; } + mutex_lock(&idev->mc_lock); err = ip6_mc_add_src(idev, group, gsf->gf_fmode, - newpsl->sl_count, newpsl->sl_addr, 0); + newpsl->sl_count, newpsl->sl_addr, 0); if (err) { + mutex_unlock(&idev->mc_lock); sock_kfree_s(sk, newpsl, IP6_SFLSIZE(newpsl->sl_max)); goto done; } + mutex_unlock(&idev->mc_lock); } else { newpsl = NULL; - (void) ip6_mc_add_src(idev, group, gsf->gf_fmode, 0, NULL, 0); + mutex_lock(&idev->mc_lock); + ip6_mc_add_src(idev, group, gsf->gf_fmode, 0, NULL, 0); + mutex_unlock(&idev->mc_lock); }
+ mutex_lock(&idev->mc_lock); psl = rtnl_dereference(pmc->sflist); if (psl) { - (void) ip6_mc_del_src(idev, group, pmc->sfmode, - psl->sl_count, psl->sl_addr, 0); + ip6_mc_del_src(idev, group, pmc->sfmode, + psl->sl_count, psl->sl_addr, 0); atomic_sub(IP6_SFLSIZE(psl->sl_max), &sk->sk_omem_alloc); kfree_rcu(psl, rcu); - } else - (void) ip6_mc_del_src(idev, group, pmc->sfmode, 0, NULL, 0); + } else { + ip6_mc_del_src(idev, group, pmc->sfmode, 0, NULL, 0); + } + mutex_unlock(&idev->mc_lock); rcu_assign_pointer(pmc->sflist, newpsl); pmc->sfmode = gsf->gf_fmode; err = 0; @@ -650,6 +664,7 @@ bool inet6_mc_check(struct sock *sk, const struct in6_addr *mc_addr, return rv; }
+/* called with mc_lock */ static void igmp6_group_added(struct ifmcaddr6 *mc) { struct net_device *dev = mc->idev->dev; @@ -684,6 +699,7 @@ static void igmp6_group_added(struct ifmcaddr6 *mc) mld_ifc_event(mc->idev); }
+/* called with mc_lock */ static void igmp6_group_dropped(struct ifmcaddr6 *mc) { struct net_device *dev = mc->idev->dev; @@ -711,6 +727,7 @@ static void igmp6_group_dropped(struct ifmcaddr6 *mc)
/* * deleted ifmcaddr6 manipulation + * called with mc_lock */ static void mld_add_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) { @@ -735,13 +752,13 @@ static void mld_add_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) struct ip6_sf_list *psf;
rcu_assign_pointer(pmc->mca_tomb, - rtnl_dereference(im->mca_tomb)); + mc_dereference(im->mca_tomb, idev)); rcu_assign_pointer(pmc->mca_sources, - rtnl_dereference(im->mca_sources)); + mc_dereference(im->mca_sources, idev)); RCU_INIT_POINTER(im->mca_tomb, NULL); RCU_INIT_POINTER(im->mca_sources, NULL);
- for_each_psf_rtnl(pmc, psf) + for_each_psf_mclock(pmc, psf) psf->sf_crcount = pmc->mca_crcount; }
@@ -749,6 +766,7 @@ static void mld_add_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) rcu_assign_pointer(idev->mc_tomb, pmc); }
+/* called with mc_lock */ static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) { struct ip6_sf_list *psf, *sources, *tomb; @@ -772,15 +790,15 @@ static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) im->idev = pmc->idev; if (im->mca_sfmode == MCAST_INCLUDE) { tomb = rcu_replace_pointer(im->mca_tomb, - rtnl_dereference(pmc->mca_tomb), - lockdep_rtnl_is_held()); + mc_dereference(pmc->mca_tomb, pmc->idev), + lockdep_is_held(&im->idev->mc_lock)); rcu_assign_pointer(pmc->mca_tomb, tomb);
sources = rcu_replace_pointer(im->mca_sources, - rtnl_dereference(pmc->mca_sources), - lockdep_rtnl_is_held()); + mc_dereference(pmc->mca_sources, pmc->idev), + lockdep_is_held(&im->idev->mc_lock)); rcu_assign_pointer(pmc->mca_sources, sources); - for_each_psf_rtnl(im, psf) + for_each_psf_mclock(im, psf) psf->sf_crcount = idev->mc_qrv; } else { im->mca_crcount = idev->mc_qrv; @@ -791,28 +809,29 @@ static void mld_del_delrec(struct inet6_dev *idev, struct ifmcaddr6 *im) } }
+/* called with mc_lock */ static void mld_clear_delrec(struct inet6_dev *idev) { struct ifmcaddr6 *pmc, *nextpmc;
- pmc = rtnl_dereference(idev->mc_tomb); + pmc = mc_dereference(idev->mc_tomb, idev); RCU_INIT_POINTER(idev->mc_tomb, NULL);
for (; pmc; pmc = nextpmc) { - nextpmc = rtnl_dereference(pmc->next); + nextpmc = mc_dereference(pmc->next, idev); ip6_mc_clear_src(pmc); in6_dev_put(pmc->idev); kfree_rcu(pmc, rcu); }
/* clear dead sources, too */ - for_each_mc_rtnl(idev, pmc) { + for_each_mc_mclock(idev, pmc) { struct ip6_sf_list *psf, *psf_next;
- psf = rtnl_dereference(pmc->mca_tomb); + psf = mc_dereference(pmc->mca_tomb, idev); RCU_INIT_POINTER(pmc->mca_tomb, NULL); for (; psf; psf = psf_next) { - psf_next = rtnl_dereference(psf->sf_next); + psf_next = mc_dereference(psf->sf_next, idev); kfree_rcu(psf, rcu); } } @@ -851,6 +870,7 @@ static void ma_put(struct ifmcaddr6 *mc) } }
+/* called with mc_lock */ static struct ifmcaddr6 *mca_alloc(struct inet6_dev *idev, const struct in6_addr *addr, unsigned int mode) @@ -902,10 +922,12 @@ static int __ipv6_dev_mc_inc(struct net_device *dev, return -ENODEV; }
- for_each_mc_rtnl(idev, mc) { + mutex_lock(&idev->mc_lock); + for_each_mc_mclock(idev, mc) { if (ipv6_addr_equal(&mc->mca_addr, addr)) { mc->mca_users++; ip6_mc_add_src(idev, &mc->mca_addr, mode, 0, NULL, 0); + mutex_unlock(&idev->mc_lock); in6_dev_put(idev); return 0; } @@ -913,6 +935,7 @@ static int __ipv6_dev_mc_inc(struct net_device *dev,
mc = mca_alloc(idev, addr, mode); if (!mc) { + mutex_unlock(&idev->mc_lock); in6_dev_put(idev); return -ENOMEM; } @@ -924,6 +947,7 @@ static int __ipv6_dev_mc_inc(struct net_device *dev,
mld_del_delrec(idev, mc); igmp6_group_added(mc); + mutex_unlock(&idev->mc_lock); ma_put(mc); return 0; } @@ -935,7 +959,7 @@ int ipv6_dev_mc_inc(struct net_device *dev, const struct in6_addr *addr) EXPORT_SYMBOL(ipv6_dev_mc_inc);
/* - * device multicast group del + * device multicast group del */ int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr) { @@ -943,8 +967,9 @@ int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr)
ASSERT_RTNL();
+ mutex_lock(&idev->mc_lock); for (map = &idev->mc_list; - (ma = rtnl_dereference(*map)); + (ma = mc_dereference(*map, idev)); map = &ma->next) { if (ipv6_addr_equal(&ma->mca_addr, addr)) { if (--ma->mca_users == 0) { @@ -952,14 +977,17 @@ int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr)
igmp6_group_dropped(ma); ip6_mc_clear_src(ma); + mutex_unlock(&idev->mc_lock);
ma_put(ma); return 0; } + mutex_unlock(&idev->mc_lock); return 0; } }
+ mutex_unlock(&idev->mc_lock); return -ENOENT; }
@@ -1019,6 +1047,7 @@ bool ipv6_chk_mcast_addr(struct net_device *dev, const struct in6_addr *group, return rv; }
+/* called with mc_lock */ static void mld_gq_start_work(struct inet6_dev *idev) { unsigned long tv = prandom_u32() % idev->mc_maxdelay; @@ -1028,6 +1057,7 @@ static void mld_gq_start_work(struct inet6_dev *idev) in6_dev_hold(idev); }
+/* called with mc_lock */ static void mld_gq_stop_work(struct inet6_dev *idev) { idev->mc_gq_running = 0; @@ -1035,6 +1065,7 @@ static void mld_gq_stop_work(struct inet6_dev *idev) __in6_dev_put(idev); }
+/* called with mc_lock */ static void mld_ifc_start_work(struct inet6_dev *idev, unsigned long delay) { unsigned long tv = prandom_u32() % delay; @@ -1043,6 +1074,7 @@ static void mld_ifc_start_work(struct inet6_dev *idev, unsigned long delay) in6_dev_hold(idev); }
+/* called with mc_lock */ static void mld_ifc_stop_work(struct inet6_dev *idev) { idev->mc_ifc_count = 0; @@ -1050,6 +1082,7 @@ static void mld_ifc_stop_work(struct inet6_dev *idev) __in6_dev_put(idev); }
+/* called with mc_lock */ static void mld_dad_start_work(struct inet6_dev *idev, unsigned long delay) { unsigned long tv = prandom_u32() % delay; @@ -1080,6 +1113,7 @@ static void mld_report_stop_work(struct inet6_dev *idev)
/* * IGMP handling (alias multicast ICMPv6 messages) + * called with mc_lock */ static void igmp6_group_queried(struct ifmcaddr6 *ma, unsigned long resptime) { @@ -1103,7 +1137,9 @@ static void igmp6_group_queried(struct ifmcaddr6 *ma, unsigned long resptime) ma->mca_flags |= MAF_TIMER_RUNNING; }
-/* mark EXCLUDE-mode sources */ +/* mark EXCLUDE-mode sources + * called with mc_lock + */ static bool mld_xmarksources(struct ifmcaddr6 *pmc, int nsrcs, const struct in6_addr *srcs) { @@ -1111,7 +1147,7 @@ static bool mld_xmarksources(struct ifmcaddr6 *pmc, int nsrcs, int i, scount;
scount = 0; - for_each_psf_rtnl(pmc, psf) { + for_each_psf_mclock(pmc, psf) { if (scount == nsrcs) break; for (i = 0; i < nsrcs; i++) { @@ -1132,6 +1168,7 @@ static bool mld_xmarksources(struct ifmcaddr6 *pmc, int nsrcs, return true; }
+/* called with mc_lock */ static bool mld_marksources(struct ifmcaddr6 *pmc, int nsrcs, const struct in6_addr *srcs) { @@ -1144,7 +1181,7 @@ static bool mld_marksources(struct ifmcaddr6 *pmc, int nsrcs, /* mark INCLUDE-mode sources */
scount = 0; - for_each_psf_rtnl(pmc, psf) { + for_each_psf_mclock(pmc, psf) { if (scount == nsrcs) break; for (i = 0; i < nsrcs; i++) { @@ -1370,7 +1407,7 @@ static void __mld_query_work(struct sk_buff *skb) int len, err;
if (!pskb_may_pull(skb, sizeof(struct in6_addr))) - goto out; + goto kfree_skb;
/* compute payload length excluding extension headers */ len = ntohs(ipv6_hdr(skb)->payload_len) + sizeof(struct ipv6hdr); @@ -1387,11 +1424,11 @@ static void __mld_query_work(struct sk_buff *skb) ipv6_hdr(skb)->hop_limit != 1 || !(IP6CB(skb)->flags & IP6SKB_ROUTERALERT) || IP6CB(skb)->ra != htons(IPV6_OPT_ROUTERALERT_MLD)) - goto out; + goto kfree_skb;
- idev = __in6_dev_get(skb->dev); + idev = in6_dev_get(skb->dev); if (!idev) - goto out; + goto kfree_skb;
mld = (struct mld_msg *)icmp6_hdr(skb); group = &mld->mld_mca; @@ -1442,11 +1479,11 @@ static void __mld_query_work(struct sk_buff *skb) }
if (group_type == IPV6_ADDR_ANY) { - for_each_mc_rtnl(idev, ma) { + for_each_mc_mclock(idev, ma) { igmp6_group_queried(ma, max_delay); } } else { - for_each_mc_rtnl(idev, ma) { + for_each_mc_mclock(idev, ma) { if (!ipv6_addr_equal(group, &ma->mca_addr)) continue; if (ma->mca_flags & MAF_TIMER_RUNNING) { @@ -1468,6 +1505,8 @@ static void __mld_query_work(struct sk_buff *skb) }
out: + in6_dev_put(idev); +kfree_skb: consume_skb(skb); }
@@ -1495,10 +1534,10 @@ static void mld_query_work(struct work_struct *work) } spin_unlock_bh(&idev->mc_query_lock);
- rtnl_lock(); + mutex_lock(&idev->mc_lock); while ((skb = __skb_dequeue(&q))) __mld_query_work(skb); - rtnl_unlock(); + mutex_unlock(&idev->mc_lock);
if (!rework) in6_dev_put(idev); @@ -1530,22 +1569,22 @@ int igmp6_event_report(struct sk_buff *skb)
static void __mld_report_work(struct sk_buff *skb) { - struct ifmcaddr6 *ma; struct inet6_dev *idev; + struct ifmcaddr6 *ma; struct mld_msg *mld; int addr_type;
/* Our own report looped back. Ignore it. */ if (skb->pkt_type == PACKET_LOOPBACK) - goto out; + goto kfree_skb;
/* send our report if the MC router may not have heard this report */ if (skb->pkt_type != PACKET_MULTICAST && skb->pkt_type != PACKET_BROADCAST) - goto out; + goto kfree_skb;
if (!pskb_may_pull(skb, sizeof(*mld) - sizeof(struct icmp6hdr))) - goto out; + goto kfree_skb;
mld = (struct mld_msg *)icmp6_hdr(skb);
@@ -1553,17 +1592,17 @@ static void __mld_report_work(struct sk_buff *skb) addr_type = ipv6_addr_type(&ipv6_hdr(skb)->saddr); if (addr_type != IPV6_ADDR_ANY && !(addr_type&IPV6_ADDR_LINKLOCAL)) - goto out; + goto kfree_skb;
- idev = __in6_dev_get(skb->dev); + idev = in6_dev_get(skb->dev); if (!idev) - goto out; + goto kfree_skb;
/* * Cancel the work for this group */
- for_each_mc_rtnl(idev, ma) { + for_each_mc_mclock(idev, ma) { if (ipv6_addr_equal(&ma->mca_addr, &mld->mld_mca)) { if (cancel_delayed_work(&ma->mca_work)) refcount_dec(&ma->mca_refcnt); @@ -1573,7 +1612,8 @@ static void __mld_report_work(struct sk_buff *skb) } }
-out: + in6_dev_put(idev); +kfree_skb: consume_skb(skb); }
@@ -1600,10 +1640,10 @@ static void mld_report_work(struct work_struct *work) } spin_unlock_bh(&idev->mc_report_lock);
- rtnl_lock(); + mutex_lock(&idev->mc_lock); while ((skb = __skb_dequeue(&q))) __mld_report_work(skb); - rtnl_unlock(); + mutex_unlock(&idev->mc_lock);
if (!rework) in6_dev_put(idev); @@ -1659,7 +1699,7 @@ mld_scount(struct ifmcaddr6 *pmc, int type, int gdeleted, int sdeleted) struct ip6_sf_list *psf; int scount = 0;
- for_each_psf_rtnl(pmc, psf) { + for_each_psf_mclock(pmc, psf) { if (!is_in(pmc, psf, type, gdeleted, sdeleted)) continue; scount++; @@ -1833,6 +1873,7 @@ static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc,
#define AVAILABLE(skb) ((skb) ? skb_availroom(skb) : 0)
+/* called with mc_lock */ static struct sk_buff *add_grec(struct sk_buff *skb, struct ifmcaddr6 *pmc, int type, int gdeleted, int sdeleted, int crsend) @@ -1878,12 +1919,12 @@ static struct sk_buff *add_grec(struct sk_buff *skb, struct ifmcaddr6 *pmc, } first = 1; psf_prev = NULL; - for (psf = rtnl_dereference(*psf_list); + for (psf = mc_dereference(*psf_list, idev); psf; psf = psf_next) { struct in6_addr *psrc;
- psf_next = rtnl_dereference(psf->sf_next); + psf_next = mc_dereference(psf->sf_next, idev);
if (!is_in(pmc, psf, type, gdeleted, sdeleted) && !crsend) { psf_prev = psf; @@ -1931,10 +1972,10 @@ static struct sk_buff *add_grec(struct sk_buff *skb, struct ifmcaddr6 *pmc, if ((sdeleted || gdeleted) && psf->sf_crcount == 0) { if (psf_prev) rcu_assign_pointer(psf_prev->sf_next, - rtnl_dereference(psf->sf_next)); + mc_dereference(psf->sf_next, idev)); else rcu_assign_pointer(*psf_list, - rtnl_dereference(psf->sf_next)); + mc_dereference(psf->sf_next, idev)); kfree_rcu(psf, rcu); continue; } @@ -1964,13 +2005,14 @@ static struct sk_buff *add_grec(struct sk_buff *skb, struct ifmcaddr6 *pmc, return skb; }
+/* called with mc_lock */ static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc) { struct sk_buff *skb = NULL; int type;
if (!pmc) { - for_each_mc_rtnl(idev, pmc) { + for_each_mc_mclock(idev, pmc) { if (pmc->mca_flags & MAF_NOREPORT) continue; if (pmc->mca_sfcount[MCAST_EXCLUDE]) @@ -1992,23 +2034,24 @@ static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc)
/* * remove zero-count source records from a source filter list + * called with mc_lock */ -static void mld_clear_zeros(struct ip6_sf_list __rcu **ppsf) +static void mld_clear_zeros(struct ip6_sf_list __rcu **ppsf, struct inet6_dev *idev) { struct ip6_sf_list *psf_prev, *psf_next, *psf;
psf_prev = NULL; - for (psf = rtnl_dereference(*ppsf); + for (psf = mc_dereference(*ppsf, idev); psf; psf = psf_next) { - psf_next = rtnl_dereference(psf->sf_next); + psf_next = mc_dereference(psf->sf_next, idev); if (psf->sf_crcount == 0) { if (psf_prev) rcu_assign_pointer(psf_prev->sf_next, - rtnl_dereference(psf->sf_next)); + mc_dereference(psf->sf_next, idev)); else rcu_assign_pointer(*ppsf, - rtnl_dereference(psf->sf_next)); + mc_dereference(psf->sf_next, idev)); kfree_rcu(psf, rcu); } else { psf_prev = psf; @@ -2016,6 +2059,7 @@ static void mld_clear_zeros(struct ip6_sf_list __rcu **ppsf) } }
+/* called with mc_lock */ static void mld_send_cr(struct inet6_dev *idev) { struct ifmcaddr6 *pmc, *pmc_prev, *pmc_next; @@ -2024,10 +2068,10 @@ static void mld_send_cr(struct inet6_dev *idev)
/* deleted MCA's */ pmc_prev = NULL; - for (pmc = rtnl_dereference(idev->mc_tomb); + for (pmc = mc_dereference(idev->mc_tomb, idev); pmc; pmc = pmc_next) { - pmc_next = rtnl_dereference(pmc->next); + pmc_next = mc_dereference(pmc->next, idev); if (pmc->mca_sfmode == MCAST_INCLUDE) { type = MLD2_BLOCK_OLD_SOURCES; dtype = MLD2_BLOCK_OLD_SOURCES; @@ -2041,8 +2085,8 @@ static void mld_send_cr(struct inet6_dev *idev) } pmc->mca_crcount--; if (pmc->mca_crcount == 0) { - mld_clear_zeros(&pmc->mca_tomb); - mld_clear_zeros(&pmc->mca_sources); + mld_clear_zeros(&pmc->mca_tomb, idev); + mld_clear_zeros(&pmc->mca_sources, idev); } } if (pmc->mca_crcount == 0 && @@ -2059,7 +2103,7 @@ static void mld_send_cr(struct inet6_dev *idev) }
/* change recs */ - for_each_mc_rtnl(idev, pmc) { + for_each_mc_mclock(idev, pmc) { if (pmc->mca_sfcount[MCAST_EXCLUDE]) { type = MLD2_BLOCK_OLD_SOURCES; dtype = MLD2_ALLOW_NEW_SOURCES; @@ -2181,6 +2225,7 @@ static void igmp6_send(struct in6_addr *addr, struct net_device *dev, int type) goto out; }
+/* called with mc_lock */ static void mld_send_initial_cr(struct inet6_dev *idev) { struct sk_buff *skb; @@ -2191,7 +2236,7 @@ static void mld_send_initial_cr(struct inet6_dev *idev) return;
skb = NULL; - for_each_mc_rtnl(idev, pmc) { + for_each_mc_mclock(idev, pmc) { if (pmc->mca_sfcount[MCAST_EXCLUDE]) type = MLD2_CHANGE_TO_EXCLUDE; else @@ -2204,6 +2249,7 @@ static void mld_send_initial_cr(struct inet6_dev *idev)
void ipv6_mc_dad_complete(struct inet6_dev *idev) { + mutex_lock(&idev->mc_lock); idev->mc_dad_count = idev->mc_qrv; if (idev->mc_dad_count) { mld_send_initial_cr(idev); @@ -2212,6 +2258,7 @@ void ipv6_mc_dad_complete(struct inet6_dev *idev) mld_dad_start_work(idev, unsolicited_report_interval(idev)); } + mutex_unlock(&idev->mc_lock); }
static void mld_dad_work(struct work_struct *work) @@ -2219,8 +2266,7 @@ static void mld_dad_work(struct work_struct *work) struct inet6_dev *idev = container_of(to_delayed_work(work), struct inet6_dev, mc_dad_work); - - rtnl_lock(); + mutex_lock(&idev->mc_lock); mld_send_initial_cr(idev); if (idev->mc_dad_count) { idev->mc_dad_count--; @@ -2228,10 +2274,11 @@ static void mld_dad_work(struct work_struct *work) mld_dad_start_work(idev, unsolicited_report_interval(idev)); } - rtnl_unlock(); + mutex_unlock(&idev->mc_lock); in6_dev_put(idev); }
+/* called with mc_lock */ static int ip6_mc_del1_src(struct ifmcaddr6 *pmc, int sfmode, const struct in6_addr *psfsrc) { @@ -2239,7 +2286,7 @@ static int ip6_mc_del1_src(struct ifmcaddr6 *pmc, int sfmode, int rv = 0;
psf_prev = NULL; - for_each_psf_rtnl(pmc, psf) { + for_each_psf_mclock(pmc, psf) { if (ipv6_addr_equal(&psf->sf_addr, psfsrc)) break; psf_prev = psf; @@ -2255,16 +2302,16 @@ static int ip6_mc_del1_src(struct ifmcaddr6 *pmc, int sfmode, /* no more filters for this source */ if (psf_prev) rcu_assign_pointer(psf_prev->sf_next, - rtnl_dereference(psf->sf_next)); + mc_dereference(psf->sf_next, idev)); else rcu_assign_pointer(pmc->mca_sources, - rtnl_dereference(psf->sf_next)); + mc_dereference(psf->sf_next, idev));
if (psf->sf_oldin && !(pmc->mca_flags & MAF_NOREPORT) && !mld_in_v1_mode(idev)) { psf->sf_crcount = idev->mc_qrv; rcu_assign_pointer(psf->sf_next, - rtnl_dereference(pmc->mca_tomb)); + mc_dereference(pmc->mca_tomb, idev)); rcu_assign_pointer(pmc->mca_tomb, psf); rv = 1; } else { @@ -2274,6 +2321,7 @@ static int ip6_mc_del1_src(struct ifmcaddr6 *pmc, int sfmode, return rv; }
+/* called with mc_lock */ static int ip6_mc_del_src(struct inet6_dev *idev, const struct in6_addr *pmca, int sfmode, int sfcount, const struct in6_addr *psfsrc, int delta) @@ -2285,7 +2333,7 @@ static int ip6_mc_del_src(struct inet6_dev *idev, const struct in6_addr *pmca, if (!idev) return -ENODEV;
- for_each_mc_rtnl(idev, pmc) { + for_each_mc_mclock(idev, pmc) { if (ipv6_addr_equal(pmca, &pmc->mca_addr)) break; } @@ -2294,9 +2342,8 @@ static int ip6_mc_del_src(struct inet6_dev *idev, const struct in6_addr *pmca,
sf_markstate(pmc); if (!delta) { - if (!pmc->mca_sfcount[sfmode]) { + if (!pmc->mca_sfcount[sfmode]) return -EINVAL; - }
pmc->mca_sfcount[sfmode]--; } @@ -2317,16 +2364,19 @@ static int ip6_mc_del_src(struct inet6_dev *idev, const struct in6_addr *pmca, pmc->mca_sfmode = MCAST_INCLUDE; pmc->mca_crcount = idev->mc_qrv; idev->mc_ifc_count = pmc->mca_crcount; - for_each_psf_rtnl(pmc, psf) + for_each_psf_mclock(pmc, psf) psf->sf_crcount = 0; mld_ifc_event(pmc->idev); - } else if (sf_setstate(pmc) || changerec) + } else if (sf_setstate(pmc) || changerec) { mld_ifc_event(pmc->idev); + } + return err; }
/* * Add multicast single-source filter to the interface list + * called with mc_lock */ static int ip6_mc_add1_src(struct ifmcaddr6 *pmc, int sfmode, const struct in6_addr *psfsrc) @@ -2334,7 +2384,7 @@ static int ip6_mc_add1_src(struct ifmcaddr6 *pmc, int sfmode, struct ip6_sf_list *psf, *psf_prev;
psf_prev = NULL; - for_each_psf_rtnl(pmc, psf) { + for_each_psf_mclock(pmc, psf) { if (ipv6_addr_equal(&psf->sf_addr, psfsrc)) break; psf_prev = psf; @@ -2355,12 +2405,13 @@ static int ip6_mc_add1_src(struct ifmcaddr6 *pmc, int sfmode, return 0; }
+/* called with mc_lock */ static void sf_markstate(struct ifmcaddr6 *pmc) { struct ip6_sf_list *psf; int mca_xcount = pmc->mca_sfcount[MCAST_EXCLUDE];
- for_each_psf_rtnl(pmc, psf) { + for_each_psf_mclock(pmc, psf) { if (pmc->mca_sfcount[MCAST_EXCLUDE]) { psf->sf_oldin = mca_xcount == psf->sf_count[MCAST_EXCLUDE] && @@ -2371,6 +2422,7 @@ static void sf_markstate(struct ifmcaddr6 *pmc) } }
+/* called with mc_lock */ static int sf_setstate(struct ifmcaddr6 *pmc) { struct ip6_sf_list *psf, *dpsf; @@ -2379,7 +2431,7 @@ static int sf_setstate(struct ifmcaddr6 *pmc) int new_in, rv;
rv = 0; - for_each_psf_rtnl(pmc, psf) { + for_each_psf_mclock(pmc, psf) { if (pmc->mca_sfcount[MCAST_EXCLUDE]) { new_in = mca_xcount == psf->sf_count[MCAST_EXCLUDE] && !psf->sf_count[MCAST_INCLUDE]; @@ -2398,10 +2450,12 @@ static int sf_setstate(struct ifmcaddr6 *pmc) if (dpsf) { if (prev) rcu_assign_pointer(prev->sf_next, - rtnl_dereference(dpsf->sf_next)); + mc_dereference(dpsf->sf_next, + pmc->idev)); else rcu_assign_pointer(pmc->mca_tomb, - rtnl_dereference(dpsf->sf_next)); + mc_dereference(dpsf->sf_next, + pmc->idev)); kfree_rcu(dpsf, rcu); } psf->sf_crcount = qrv; @@ -2424,7 +2478,7 @@ static int sf_setstate(struct ifmcaddr6 *pmc) continue; *dpsf = *psf; rcu_assign_pointer(dpsf->sf_next, - rtnl_dereference(pmc->mca_tomb)); + mc_dereference(pmc->mca_tomb, pmc->idev)); rcu_assign_pointer(pmc->mca_tomb, dpsf); } dpsf->sf_crcount = qrv; @@ -2436,6 +2490,7 @@ static int sf_setstate(struct ifmcaddr6 *pmc)
/* * Add multicast source filter list to the interface list + * called with mc_lock */ static int ip6_mc_add_src(struct inet6_dev *idev, const struct in6_addr *pmca, int sfmode, int sfcount, const struct in6_addr *psfsrc, @@ -2448,7 +2503,7 @@ static int ip6_mc_add_src(struct inet6_dev *idev, const struct in6_addr *pmca, if (!idev) return -ENODEV;
- for_each_mc_rtnl(idev, pmc) { + for_each_mc_mclock(idev, pmc) { if (ipv6_addr_equal(pmca, &pmc->mca_addr)) break; } @@ -2484,7 +2539,7 @@ static int ip6_mc_add_src(struct inet6_dev *idev, const struct in6_addr *pmca,
pmc->mca_crcount = idev->mc_qrv; idev->mc_ifc_count = pmc->mca_crcount; - for_each_psf_rtnl(pmc, psf) + for_each_psf_mclock(pmc, psf) psf->sf_crcount = 0; mld_ifc_event(idev); } else if (sf_setstate(pmc)) { @@ -2493,21 +2548,22 @@ static int ip6_mc_add_src(struct inet6_dev *idev, const struct in6_addr *pmca, return err; }
+/* called with mc_lock */ static void ip6_mc_clear_src(struct ifmcaddr6 *pmc) { struct ip6_sf_list *psf, *nextpsf;
- for (psf = rtnl_dereference(pmc->mca_tomb); + for (psf = mc_dereference(pmc->mca_tomb, pmc->idev); psf; psf = nextpsf) { - nextpsf = rtnl_dereference(psf->sf_next); + nextpsf = mc_dereference(psf->sf_next, pmc->idev); kfree_rcu(psf, rcu); } RCU_INIT_POINTER(pmc->mca_tomb, NULL); - for (psf = rtnl_dereference(pmc->mca_sources); + for (psf = mc_dereference(pmc->mca_sources, pmc->idev); psf; psf = nextpsf) { - nextpsf = rtnl_dereference(psf->sf_next); + nextpsf = mc_dereference(psf->sf_next, pmc->idev); kfree_rcu(psf, rcu); } RCU_INIT_POINTER(pmc->mca_sources, NULL); @@ -2516,7 +2572,7 @@ static void ip6_mc_clear_src(struct ifmcaddr6 *pmc) pmc->mca_sfcount[MCAST_EXCLUDE] = 1; }
- +/* called with mc_lock */ static void igmp6_join_group(struct ifmcaddr6 *ma) { unsigned long delay; @@ -2546,19 +2602,27 @@ static int ip6_mc_leave_src(struct sock *sk, struct ipv6_mc_socklist *iml,
psl = rtnl_dereference(iml->sflist);
+ if (idev) + mutex_lock(&idev->mc_lock); + if (!psl) { /* any-source empty exclude case */ err = ip6_mc_del_src(idev, &iml->addr, iml->sfmode, 0, NULL, 0); } else { err = ip6_mc_del_src(idev, &iml->addr, iml->sfmode, - psl->sl_count, psl->sl_addr, 0); + psl->sl_count, psl->sl_addr, 0); RCU_INIT_POINTER(iml->sflist, NULL); atomic_sub(IP6_SFLSIZE(psl->sl_max), &sk->sk_omem_alloc); kfree_rcu(psl, rcu); } + + if (idev) + mutex_unlock(&idev->mc_lock); + return err; }
+/* called with mc_lock */ static void igmp6_leave_group(struct ifmcaddr6 *ma) { if (mld_in_v1_mode(ma->idev)) { @@ -2578,10 +2642,10 @@ static void mld_gq_work(struct work_struct *work) struct inet6_dev, mc_gq_work);
- rtnl_lock(); + mutex_lock(&idev->mc_lock); mld_send_report(idev, NULL); idev->mc_gq_running = 0; - rtnl_unlock(); + mutex_unlock(&idev->mc_lock);
in6_dev_put(idev); } @@ -2592,7 +2656,7 @@ static void mld_ifc_work(struct work_struct *work) struct inet6_dev, mc_ifc_work);
- rtnl_lock(); + mutex_lock(&idev->mc_lock); mld_send_cr(idev);
if (idev->mc_ifc_count) { @@ -2601,10 +2665,11 @@ static void mld_ifc_work(struct work_struct *work) mld_ifc_start_work(idev, unsolicited_report_interval(idev)); } - rtnl_unlock(); + mutex_unlock(&idev->mc_lock); in6_dev_put(idev); }
+/* called with mc_lock */ static void mld_ifc_event(struct inet6_dev *idev) { if (mld_in_v1_mode(idev)) @@ -2619,14 +2684,14 @@ static void mld_mca_work(struct work_struct *work) struct ifmcaddr6 *ma = container_of(to_delayed_work(work), struct ifmcaddr6, mca_work);
- rtnl_lock(); + mutex_lock(&ma->idev->mc_lock); if (mld_in_v1_mode(ma->idev)) igmp6_send(&ma->mca_addr, ma->idev->dev, ICMPV6_MGM_REPORT); else mld_send_report(ma->idev, ma); ma->mca_flags |= MAF_LAST_REPORTER; ma->mca_flags &= ~MAF_TIMER_RUNNING; - rtnl_unlock(); + mutex_unlock(&ma->idev->mc_lock);
ma_put(ma); } @@ -2639,8 +2704,10 @@ void ipv6_mc_unmap(struct inet6_dev *idev)
/* Install multicast list, except for all-nodes (already installed) */
- for_each_mc_rtnl(idev, i) + mutex_lock(&idev->mc_lock); + for_each_mc_mclock(idev, i) igmp6_group_dropped(i); + mutex_unlock(&idev->mc_lock); }
void ipv6_mc_remap(struct inet6_dev *idev) @@ -2649,14 +2716,15 @@ void ipv6_mc_remap(struct inet6_dev *idev) }
/* Device going down */ - void ipv6_mc_down(struct inet6_dev *idev) { struct ifmcaddr6 *i;
+ mutex_lock(&idev->mc_lock); /* Withdraw multicast list */ - for_each_mc_rtnl(idev, i) + for_each_mc_mclock(idev, i) igmp6_group_dropped(i); + mutex_unlock(&idev->mc_lock);
/* Should stop work after group drop. or we will * start work again in mld_ifc_event() @@ -2687,10 +2755,12 @@ void ipv6_mc_up(struct inet6_dev *idev) /* Install multicast list, except for all-nodes (already installed) */
ipv6_mc_reset(idev); - for_each_mc_rtnl(idev, i) { + mutex_lock(&idev->mc_lock); + for_each_mc_mclock(idev, i) { mld_del_delrec(idev, i); igmp6_group_added(i); } + mutex_unlock(&idev->mc_lock); }
/* IPv6 device initialization. */ @@ -2709,6 +2779,7 @@ void ipv6_mc_init_dev(struct inet6_dev *idev) skb_queue_head_init(&idev->mc_report_queue); spin_lock_init(&idev->mc_query_lock); spin_lock_init(&idev->mc_report_lock); + mutex_init(&idev->mc_lock); ipv6_mc_reset(idev); }
@@ -2722,7 +2793,9 @@ void ipv6_mc_destroy_dev(struct inet6_dev *idev)
/* Deactivate works */ ipv6_mc_down(idev); + mutex_lock(&idev->mc_lock); mld_clear_delrec(idev); + mutex_unlock(&idev->mc_lock); mld_clear_query(idev); mld_clear_report(idev);
@@ -2736,12 +2809,14 @@ void ipv6_mc_destroy_dev(struct inet6_dev *idev) if (idev->cnf.forwarding) __ipv6_dev_mc_dec(idev, &in6addr_linklocal_allrouters);
- while ((i = rtnl_dereference(idev->mc_list))) { - rcu_assign_pointer(idev->mc_list, rtnl_dereference(i->next)); + mutex_lock(&idev->mc_lock); + while ((i = mc_dereference(idev->mc_list, idev))) { + rcu_assign_pointer(idev->mc_list, mc_dereference(i->next, idev));
ip6_mc_clear_src(i); ma_put(i); } + mutex_unlock(&idev->mc_lock); }
static void ipv6_mc_rejoin_groups(struct inet6_dev *idev) @@ -2750,12 +2825,14 @@ static void ipv6_mc_rejoin_groups(struct inet6_dev *idev)
ASSERT_RTNL();
+ mutex_lock(&idev->mc_lock); if (mld_in_v1_mode(idev)) { - for_each_mc_rtnl(idev, pmc) + for_each_mc_mclock(idev, pmc) igmp6_join_group(pmc); } else { mld_send_report(idev, NULL); } + mutex_unlock(&idev->mc_lock); }
static int ipv6_mc_netdev_event(struct notifier_block *this,
On 3/25/21 5:16 PM, Taehee Yoo wrote:
The purpose of this lock is to avoid a bottleneck in the query/report event handler logic.
By previous patches, almost all mld data is protected by RTNL. So, the query and report event handler, which is data path logic acquires RTNL too. Therefore if a lot of query and report events are received, it uses RTNL for a long time. So it makes the control-plane bottleneck because of using RTNL. In order to avoid this bottleneck, mc_lock is added.
mc_lock protect only per-interface mld data and per-interface mld data is used in the query/report event handler logic. So, no longer rtnl_lock is needed in the query/report event handler logic. Therefore bottleneck will be disappeared by mc_lock.
What testsuite have you run exactly to validate this monster patch ?
Have you used CONFIG_LOCKDEP=y / CONFIG_DEBUG_ATOMIC_SLEEP=y ?
Suggested-by: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: Taehee Yoo ap420073@gmail.com
[...]
/*
- device multicast group del
*/
- device multicast group del
int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr) { @@ -943,8 +967,9 @@ int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr)
ASSERT_RTNL();
- mutex_lock(&idev->mc_lock); for (map = &idev->mc_list;
(ma = rtnl_dereference(*map));
if (ipv6_addr_equal(&ma->mca_addr, addr)) { if (--ma->mca_users == 0) {(ma = mc_dereference(*map, idev)); map = &ma->next) {
This can be called with rcu_bh held, thus :
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:928 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 4624, name: kworker/1:2 4 locks held by kworker/1:2/4624: #0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline] #0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: atomic64_set include/asm-generic/atomic-instrumented.h:856 [inline] #0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: atomic_long_set include/asm-generic/atomic-long.h:41 [inline] #0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:616 [inline] #0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:643 [inline] #0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: process_one_work+0x871/0x1600 kernel/workqueue.c:2246 #1: ffffc90009adfda8 ((addr_chk_work).work){+.+.}-{0:0}, at: process_one_work+0x8a5/0x1600 kernel/workqueue.c:2250 #2: ffffffff8d66d328 (rtnl_mutex){+.+.}-{3:3}, at: addrconf_verify_work+0xa/0x20 net/ipv6/addrconf.c:4572 #3: ffffffff8bf74300 (rcu_read_lock_bh){....}-{1:2}, at: addrconf_verify_rtnl+0x2b/0x1150 net/ipv6/addrconf.c:4459 Preemption disabled at: [<ffffffff87b39f41>] local_bh_disable include/linux/bottom_half.h:19 [inline] [<ffffffff87b39f41>] rcu_read_lock_bh include/linux/rcupdate.h:727 [inline] [<ffffffff87b39f41>] addrconf_verify_rtnl+0x41/0x1150 net/ipv6/addrconf.c:4461 CPU: 1 PID: 4624 Comm: kworker/1:2 Not tainted 5.12.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: ipv6_addrconf addrconf_verify_work Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 ___might_sleep.cold+0x1f1/0x237 kernel/sched/core.c:8328 __mutex_lock_common kernel/locking/mutex.c:928 [inline] __mutex_lock+0xa9/0x1120 kernel/locking/mutex.c:1096 __ipv6_dev_mc_dec+0x5f/0x340 net/ipv6/mcast.c:970 addrconf_leave_solict net/ipv6/addrconf.c:2182 [inline] addrconf_leave_solict net/ipv6/addrconf.c:2174 [inline] __ipv6_ifa_notify+0x5b6/0xa90 net/ipv6/addrconf.c:6077 ipv6_ifa_notify net/ipv6/addrconf.c:6100 [inline] ipv6_del_addr+0x463/0xae0 net/ipv6/addrconf.c:1294 addrconf_verify_rtnl+0xd59/0x1150 net/ipv6/addrconf.c:4488 addrconf_verify_work+0xf/0x20 net/ipv6/addrconf.c:4573 process_one_work+0x98d/0x1600 kernel/workqueue.c:2275 worker_thread+0x64c/0x1120 kernel/workqueue.c:2421 kthread+0x3b1/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
On 3/30/21 1:59 PM, Eric Dumazet wrote:
On 3/25/21 5:16 PM, Taehee Yoo wrote:
The purpose of this lock is to avoid a bottleneck in the query/report event handler logic.
By previous patches, almost all mld data is protected by RTNL. So, the query and report event handler, which is data path logic acquires RTNL too. Therefore if a lot of query and report events are received, it uses RTNL for a long time. So it makes the control-plane bottleneck because of using RTNL. In order to avoid this bottleneck, mc_lock is added.
mc_lock protect only per-interface mld data and per-interface mld data is used in the query/report event handler logic. So, no longer rtnl_lock is needed in the query/report event handler logic. Therefore bottleneck will be disappeared by mc_lock.
What testsuite have you run exactly to validate this monster patch ?
Have you used CONFIG_LOCKDEP=y / CONFIG_DEBUG_ATOMIC_SLEEP=y ?
Suggested-by: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: Taehee Yoo ap420073@gmail.com
[...]
/*
- device multicast group del
*/
- device multicast group del
int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr) { @@ -943,8 +967,9 @@ int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr)
ASSERT_RTNL();
- mutex_lock(&idev->mc_lock); for (map = &idev->mc_list;
(ma = rtnl_dereference(*map));
if (ipv6_addr_equal(&ma->mca_addr, addr)) { if (--ma->mca_users == 0) {(ma = mc_dereference(*map, idev)); map = &ma->next) {
This can be called with rcu_bh held, thus :
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:928 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 4624, name: kworker/1:2 4 locks held by kworker/1:2/4624: #0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline] #0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: atomic64_set include/asm-generic/atomic-instrumented.h:856 [inline] #0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: atomic_long_set include/asm-generic/atomic-long.h:41 [inline] #0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:616 [inline] #0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:643 [inline] #0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: process_one_work+0x871/0x1600 kernel/workqueue.c:2246 #1: ffffc90009adfda8 ((addr_chk_work).work){+.+.}-{0:0}, at: process_one_work+0x8a5/0x1600 kernel/workqueue.c:2250 #2: ffffffff8d66d328 (rtnl_mutex){+.+.}-{3:3}, at: addrconf_verify_work+0xa/0x20 net/ipv6/addrconf.c:4572 #3: ffffffff8bf74300 (rcu_read_lock_bh){....}-{1:2}, at: addrconf_verify_rtnl+0x2b/0x1150 net/ipv6/addrconf.c:4459 Preemption disabled at: [<ffffffff87b39f41>] local_bh_disable include/linux/bottom_half.h:19 [inline] [<ffffffff87b39f41>] rcu_read_lock_bh include/linux/rcupdate.h:727 [inline] [<ffffffff87b39f41>] addrconf_verify_rtnl+0x41/0x1150 net/ipv6/addrconf.c:4461 CPU: 1 PID: 4624 Comm: kworker/1:2 Not tainted 5.12.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: ipv6_addrconf addrconf_verify_work Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 ___might_sleep.cold+0x1f1/0x237 kernel/sched/core.c:8328 __mutex_lock_common kernel/locking/mutex.c:928 [inline] __mutex_lock+0xa9/0x1120 kernel/locking/mutex.c:1096 __ipv6_dev_mc_dec+0x5f/0x340 net/ipv6/mcast.c:970 addrconf_leave_solict net/ipv6/addrconf.c:2182 [inline] addrconf_leave_solict net/ipv6/addrconf.c:2174 [inline] __ipv6_ifa_notify+0x5b6/0xa90 net/ipv6/addrconf.c:6077 ipv6_ifa_notify net/ipv6/addrconf.c:6100 [inline] ipv6_del_addr+0x463/0xae0 net/ipv6/addrconf.c:1294 addrconf_verify_rtnl+0xd59/0x1150 net/ipv6/addrconf.c:4488 addrconf_verify_work+0xf/0x20 net/ipv6/addrconf.c:4573 process_one_work+0x98d/0x1600 kernel/workqueue.c:2275 worker_thread+0x64c/0x1120 kernel/workqueue.c:2421 kthread+0x3b1/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
I will test this fix:
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 120073ffb666b18678e3145d91dac59fa865a592..8f3883f4cb4a15a0749b8f0fe00061e483ea26ca 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4485,7 +4485,9 @@ static void addrconf_verify_rtnl(void) age >= ifp->valid_lft) { spin_unlock(&ifp->lock); in6_ifa_hold(ifp); + rcu_read_unlock_bh(); ipv6_del_addr(ifp); + rcu_read_lock_bh(); goto restart; } else if (ifp->prefered_lft == INFINITY_LIFE_TIME) { spin_unlock(&ifp->lock);
On 3/30/21 9:24 PM, Eric Dumazet wrote:
On 3/30/21 1:59 PM, Eric Dumazet wrote:
On 3/25/21 5:16 PM, Taehee Yoo wrote:
The purpose of this lock is to avoid a bottleneck in the query/report event handler logic.
By previous patches, almost all mld data is protected by RTNL. So, the query and report event handler, which is data path logic acquires RTNL too. Therefore if a lot of query and report events are received, it uses RTNL for a long time. So it makes the control-plane bottleneck because of using RTNL. In order to avoid this bottleneck, mc_lock is added.
mc_lock protect only per-interface mld data and per-interface mld data is used in the query/report event handler logic. So, no longer rtnl_lock is needed in the query/report event handler
logic.
Therefore bottleneck will be disappeared by mc_lock.
What testsuite have you run exactly to validate this monster patch ?
I've been using an application, which calls setsockopt() with the below options. IPV6_ADD_MEMBERSHIP IPV6_DROP_MEMBERSHIP MCAST_JOIN_SOURCE_GROUP MCAST_LEAVE_SOURCE_GROUP MCAST_BLOCK_SOURCE MCAST_UNBLOCK_SOURCE MCAST_MSFILTER And checks out /proc/net/mcfilter6 and /proc/net/igmp6.
Have you used CONFIG_LOCKDEP=y / CONFIG_DEBUG_ATOMIC_SLEEP=y ?
Yes, I'm using both configs.
Suggested-by: Cong Wang xiyou.wangcong@gmail.com Signed-off-by: Taehee Yoo ap420073@gmail.com
[...]
/*
- device multicast group del
*/ int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct
- device multicast group del
in6_addr *addr)
{ @@ -943,8 +967,9 @@ int __ipv6_dev_mc_dec(struct inet6_dev *idev,
const struct in6_addr *addr)
ASSERT_RTNL();
- mutex_lock(&idev->mc_lock); for (map = &idev->mc_list;
(ma = rtnl_dereference(*map));
if (ipv6_addr_equal(&ma->mca_addr, addr)) { if (--ma->mca_users == 0) {(ma = mc_dereference(*map, idev)); map = &ma->next) {
This can be called with rcu_bh held, thus :
BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:928
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 4624, name:
kworker/1:2
4 locks held by kworker/1:2/4624: #0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0},
at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
#0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0},
at: atomic64_set include/asm-generic/atomic-instrumented.h:856 [inline]
#0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0},
at: atomic_long_set include/asm-generic/atomic-long.h:41 [inline]
#0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0},
at: set_work_data kernel/workqueue.c:616 [inline]
#0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0},
at: set_work_pool_and_clear_pending kernel/workqueue.c:643 [inline]
#0: ffff88802135d138 ((wq_completion)ipv6_addrconf){+.+.}-{0:0},
at: process_one_work+0x871/0x1600 kernel/workqueue.c:2246
#1: ffffc90009adfda8 ((addr_chk_work).work){+.+.}-{0:0}, at:
process_one_work+0x8a5/0x1600 kernel/workqueue.c:2250
#2: ffffffff8d66d328 (rtnl_mutex){+.+.}-{3:3}, at:
addrconf_verify_work+0xa/0x20 net/ipv6/addrconf.c:4572
#3: ffffffff8bf74300 (rcu_read_lock_bh){....}-{1:2}, at:
addrconf_verify_rtnl+0x2b/0x1150 net/ipv6/addrconf.c:4459
Preemption disabled at: [<ffffffff87b39f41>] local_bh_disable include/linux/bottom_half.h:19
[inline]
[<ffffffff87b39f41>] rcu_read_lock_bh include/linux/rcupdate.h:727
[inline]
[<ffffffff87b39f41>] addrconf_verify_rtnl+0x41/0x1150
net/ipv6/addrconf.c:4461
CPU: 1 PID: 4624 Comm: kworker/1:2 Not tainted 5.12.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Workqueue: ipv6_addrconf addrconf_verify_work Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 ___might_sleep.cold+0x1f1/0x237 kernel/sched/core.c:8328 __mutex_lock_common kernel/locking/mutex.c:928 [inline] __mutex_lock+0xa9/0x1120 kernel/locking/mutex.c:1096 __ipv6_dev_mc_dec+0x5f/0x340 net/ipv6/mcast.c:970 addrconf_leave_solict net/ipv6/addrconf.c:2182 [inline] addrconf_leave_solict net/ipv6/addrconf.c:2174 [inline] __ipv6_ifa_notify+0x5b6/0xa90 net/ipv6/addrconf.c:6077 ipv6_ifa_notify net/ipv6/addrconf.c:6100 [inline] ipv6_del_addr+0x463/0xae0 net/ipv6/addrconf.c:1294 addrconf_verify_rtnl+0xd59/0x1150 net/ipv6/addrconf.c:4488 addrconf_verify_work+0xf/0x20 net/ipv6/addrconf.c:4573 process_one_work+0x98d/0x1600 kernel/workqueue.c:2275 worker_thread+0x64c/0x1120 kernel/workqueue.c:2421 kthread+0x3b1/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
I will test this fix:
Thanks a lot!
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index
120073ffb666b18678e3145d91dac59fa865a592..8f3883f4cb4a15a0749b8f0fe00061e483ea26ca 100644
--- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4485,7 +4485,9 @@ static void addrconf_verify_rtnl(void) age >= ifp->valid_lft) { spin_unlock(&ifp->lock); in6_ifa_hold(ifp);
rcu_read_unlock_bh(); ipv6_del_addr(ifp);
rcu_read_lock_bh(); goto restart; } else if (ifp->prefered_lft ==
INFINITY_LIFE_TIME) {
spin_unlock(&ifp->lock);
On 3/26/21 1:16 AM, Taehee Yoo wrote:
This patchset changes the context of MLD module. Before this patchset, MLD functions are atomic context so it couldn't use sleepable functions and flags.
There are several reasons why MLD functions are under atomic context.
- It uses timer API.
Timer expiration functions are executed in the atomic context. 2. atomic locks MLD functions use rwlock and spinlock to protect their own resources.
So, in order to switch context, this patchset converts resources to use RCU and removes atomic locks and timer API.
- The first patch convert from the timer API to delayed work.
Timer API is used for delaying some works. MLD protocol has a delay mechanism, which is used for replying to a query. If a listener receives a query from a router, it should send a response after some delay. But because of timer expire function is executed in the atomic context, this patch convert from timer API to the delayed work.
- The fourth patch deletes inet6_dev->mc_lock.
The mc_lock has protected inet6_dev->mc_tomb pointer. But this pointer is already protected by RTNL and it isn't be used by datapath. So, it isn't be needed and because of this, many atomic context critical sections are deleted.
- The fifth patch convert ip6_sf_socklist to RCU.
ip6_sf_socklist has been protected by ipv6_mc_socklist->sflock(rwlock). But this is already protected by RTNL So if it is converted to use RCU in order to be used in the datapath, the sflock is no more needed. So, its control path context can be switched to sleepable.
- The sixth patch convert ip6_sf_list to RCU.
The reason for this patch is the same as the previous patch.
- The seventh patch convert ifmcaddr6 to RCU.
The reason for this patch is the same as the previous patch.
- Add new workqueues for processing query/report event.
By this patch, query and report events are processed by workqueue So context is sleepable, not atomic. While this logic, it acquires RTNL.
- Add new mc_lock.
The purpose of this lock is to protect per-interface mld data. Per-interface mld data is usually used by query/report event handler. So, query/report event workers need only this lock instead of RTNL. Therefore, it could reduce bottleneck.
Changelog: v2 -> v3:
- Do not use msecs_to_jiffies().
(by Cong Wang) 2. Do not add unnecessary rtnl_lock() and rtnl_unlock(). (by Cong Wang) 3. Fix sparse warnings because of rcu annotation. (by kernel test robot) - Remove some rcu_assign_pointer(), which was used for non-rcu pointer. - Add union for rcu pointer. - Use rcu API in mld_clear_zeros(). - Remove remained rcu_read_unlock(). - Use rcu API for tomb resources. 4. withdraw prevopus 2nd and 3rd patch. - "separate two flags from ifmcaddr6->mca_flags" - "add a new delayed_work, mc_delrec_work" 5. Add 6th and 7th patch.
v1 -> v2:
- Withdraw unnecessary refactoring patches.
(by Cong Wang, Eric Dumazet, David Ahern) a) convert from array to list. b) function rename. 2. Separate big one patch into small several patches. 3. Do not rename 'ifmcaddr6->mca_lock'. In the v1 patch, this variable was changed to 'ifmcaddr6->mca_work_lock'. But this is actually not needed. 4. Do not use atomic_t for 'ifmcaddr6->mca_sfcount' and 'ipv6_mc_socklist'->sf_count'. 5. Do not add mld_check_leave_group() function. 6. Do not add ip6_mc_del_src_bulk() function. 7. Do not add ip6_mc_add_src_bulk() function. 8. Do not use rcu_read_lock() in the qeth_l3_add_mcast_rtnl(). (by Julian Wiedmann)
Taehee Yoo (7): mld: convert from timer to delayed work mld: get rid of inet6_dev->mc_lock mld: convert ipv6_mc_socklist->sflist to RCU mld: convert ip6_sf_list to RCU mld: convert ifmcaddr6 to RCU mld: add new workqueues for process mld events mld: add mc_lock for protecting per-interface mld data
drivers/s390/net/qeth_l3_main.c | 6 +- include/net/if_inet6.h | 37 +- include/net/mld.h | 3 + net/batman-adv/multicast.c | 6 +- net/ipv6/addrconf.c | 9 +- net/ipv6/addrconf_core.c | 2 +- net/ipv6/af_inet6.c | 2 +- net/ipv6/icmp.c | 4 +- net/ipv6/mcast.c | 1080 ++++++++++++++++++------------- 9 files changed, 678 insertions(+), 471 deletions(-)
b.a.t.m.a.n@lists.open-mesh.org