Commit 95d1815f authored by Jakub Kicinski
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

1) Incorrect error check in nft_expr_inner_parse(), from Dan Carpenter.

2) Add DATA_SENT state to SCTP connection tracking helper, from
   Sriram Yagnaraman.

3) Consolidate nf_confirm for ipv4 and ipv6, from Florian Westphal.

4) Add bitmask support for ipset, from Vishwanath Pai.

5) Handle icmpv6 redirects as RELATED, from Florian Westphal.

6) Add WARN_ON_ONCE() to impossible case in flowtable datapath,
   from Li Qiong.

7) A large batch of IPVS updates to replace the timer-based estimators
   with kthreads that scale with the number of CPUs and the workload
   (millions of estimators).

Julian Anastasov says:

	This patchset implements stats estimation in kthread context.
It replaces the code that runs on a single CPU in timer context every 2
seconds, causing latency splats as shown in reports [1], [2], [3].
The solution targets setups with thousands of IPVS services,
destinations and multi-CPU boxes.

	The estimation is spread over multiple (configured) CPUs and
multiple time slots (timer ticks) by using multiple chains organized
under RCU rules. When stats are not needed, it is recommended to use
run_estimation=0, as was already possible before this change.

RCU Locking:

- As stats are now RCU-locked, tot_stats, svc and dest, which
hold estimator structures, are now always freed from an RCU
callback. This ensures an RCU grace period after the
ip_vs_stop_estimator() call.

Kthread data:

- every kthread works on its own data structure, and all
such structures are attached to an array. For now, the number
of kthreads is limited based on the number of CPUs.

- even when a kthread structure exists, its task
may not be running, e.g. before the first service is added,
while the sysctl var is set to an empty cpulist, or
when run_estimation is set to 0 to disable the estimation.

- the allocated kthread context may grow from 1 to 50
tick structures, which saves memory for setups with a
small number of estimators.

- a task and its structure may be released if all
estimators are unlinked from its chains, leaving the
slot in the array empty

- every kthread data structure allows a limited number
of estimators. Kthread 0 is also used to initially
calculate the maximum number of estimators to allow in every
chain, considering a sub-100-microsecond cond_resched
rate. This number can range from 1 to hundreds.

- kthread 0 has the additional job of optimizing the
adding of estimators: they are first added to a
temp list (est_temp_list), and kthread 0 later
distributes them to the other kthreads. The optimization
is based on the fact that a newly added estimator
should be estimated after 2 seconds, so there is time
to offload the adding to a chain from the controlling
process to kthread 0.

- to add new estimators, we use the last added kthread
context (est_add_ktid). The new estimators are linked into
the chains just before the estimated one, based on add_row.
This ensures their estimation will start after 2 seconds.
If estimators are added in bursts, a common case when all
services and dests are initially configured, we may
spread the estimators over more chains and, as a result,
reduce the initial delay below 2 seconds.

Many thanks to Jiri Wiesner for his valuable comments
and for spending a lot of time reviewing and testing
the changes on different platforms with 48-256 CPUs and
1-8 NUMA nodes under different cpufreq governors.

The new IPVS estimators do not use the workqueue infrastructure
because:

- The estimation can take a long time when using many IPVS rules (e.g.
  millions of estimator structures), especially when the box has multiple
  CPUs, due to the for_each_possible_cpu usage that expects packets from
  any CPU. With the est_nice sysctl we have more control over how to
  prioritize the estimation kthreads compared to other processes/kthreads
  that have latency requirements (such as servers). As a benefit, we can
  see these kthreads in top and decide if we need further control to limit
  their CPU usage (max number of structures to estimate per kthread).

- with kthreads we run code that is read-mostly, with no write/lock
  operations needed to process the estimators in 2-second intervals.

- work items are one-shot: as estimators are processed every
  2 seconds, they would need to be re-added every time. This again
  loads the timers (add_timer) if we use delayed work, as there are
  no kthreads to do the timing.

[1] Report from Yunhong Jiang:
    https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/
[2] https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2
[3] Report from Dust:
    https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  ipvs: run_estimation should control the kthread tasks
  ipvs: add est_cpulist and est_nice sysctl vars
  ipvs: use kthreads for stats estimation
  ipvs: use u64_stats_t for the per-cpu counters
  ipvs: use common functions for stats allocation
  ipvs: add rcu protection to stats
  netfilter: flowtable: add a 'default' case to flowtable datapath
  netfilter: conntrack: set icmpv6 redirects as RELATED
  netfilter: ipset: Add support for new bitmask parameter
  netfilter: conntrack: merge ipv4+ipv6 confirm functions
  netfilter: conntrack: add sctp DATA_SENT state
  netfilter: nft_inner: fix IS_ERR() vs NULL check
====================

Link: https://lore.kernel.org/r/20221211101204.1751-1-pablo@netfilter.org


Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parents 15eb1621 144361c1
+22 −2
@@ -129,6 +129,26 @@ drop_packet - INTEGER
	threshold. When the mode 3 is set, the always mode drop rate
	is controlled by the /proc/sys/net/ipv4/vs/am_droprate.

est_cpulist - CPULIST
	Allowed CPUs for estimation kthreads

	Syntax: standard cpulist format
	empty list - stop kthread tasks and estimation
	default - the system's housekeeping CPUs for kthreads

	Example:
	"all": all possible CPUs
	"0-N": all possible CPUs, N denotes last CPU number
	"0,1-N:1/2": first and all CPUs with odd number
	"": empty list

est_nice - INTEGER
	default 0
	Valid range: -20 (more favorable) .. 19 (less favorable)

	Niceness value to use for the estimation kthreads (scheduling
	priority)

expire_nodest_conn - BOOLEAN
	- 0 - disabled (default)
	- not 0 - enabled
@@ -304,8 +324,8 @@ run_estimation - BOOLEAN
	0 - disabled
	not 0 - enabled (default)

-	If disabled, the estimation will be stop, and you can't see
-	any update on speed estimation data.
+	If disabled, the estimation will be suspended and kthread tasks
+	stopped.

	You can always re-enable estimation by setting this value to 1.
	But be careful, the first estimation after re-enable is not
+10 −0
@@ -515,6 +515,16 @@ ip_set_init_skbinfo(struct ip_set_skbinfo *skbinfo,
	*skbinfo = ext->skbinfo;
}

static inline void
nf_inet_addr_mask_inplace(union nf_inet_addr *a1,
			  const union nf_inet_addr *mask)
{
	a1->all[0] &= mask->all[0];
	a1->all[1] &= mask->all[1];
	a1->all[2] &= mask->all[2];
	a1->all[3] &= mask->all[3];
}

#define IP_SET_INIT_KEXT(skb, opt, set)			\
	{ .bytes = (skb)->len, .packets = 1, .target = true,\
	  .timeout = ip_set_adt_opt_timeout(opt, set) }
+160 −11
@@ -29,6 +29,7 @@
#include <net/netfilter/nf_conntrack.h>
#endif
#include <net/net_namespace.h>		/* Netw namespace */
#include <linux/sched/isolation.h>

#define IP_VS_HDR_INVERSE	1
#define IP_VS_HDR_ICMP		2
@@ -42,6 +43,8 @@ static inline struct netns_ipvs *net_ipvs(struct net* net)
/* Connections' size value needed by ip_vs_ctl.c */
extern int ip_vs_conn_tab_size;

extern struct mutex __ip_vs_mutex;

struct ip_vs_iphdr {
	int hdr_flags;	/* ipvs flags */
	__u32 off;	/* Where IP or IPv4 header starts */
@@ -351,11 +354,11 @@ struct ip_vs_seq {

/* counters per cpu */
struct ip_vs_counters {
-	__u64		conns;		/* connections scheduled */
-	__u64		inpkts;		/* incoming packets */
-	__u64		outpkts;	/* outgoing packets */
-	__u64		inbytes;	/* incoming bytes */
-	__u64		outbytes;	/* outgoing bytes */
+	u64_stats_t	conns;		/* connections scheduled */
+	u64_stats_t	inpkts;		/* incoming packets */
+	u64_stats_t	outpkts;	/* outgoing packets */
+	u64_stats_t	inbytes;	/* incoming bytes */
+	u64_stats_t	outbytes;	/* outgoing bytes */
};
/* Stats per cpu */
struct ip_vs_cpu_stats {
@@ -363,9 +366,12 @@ struct ip_vs_cpu_stats {
	struct u64_stats_sync   syncp;
};

/* Default nice for estimator kthreads */
#define IPVS_EST_NICE		0

/* IPVS statistics objects */
struct ip_vs_estimator {
-	struct list_head	list;
+	struct hlist_node	list;

	u64			last_inbytes;
	u64			last_outbytes;
@@ -378,6 +384,10 @@ struct ip_vs_estimator {
	u64			outpps;
	u64			inbps;
	u64			outbps;

	s32			ktid:16,	/* kthread ID, -1=temp list */
				ktrow:8,	/* row/tick ID for kthread */
				ktcid:8;	/* chain ID for kthread tick */
};

/*
@@ -405,6 +415,76 @@ struct ip_vs_stats {
	struct ip_vs_kstats	kstats0;	/* reset values */
};

struct ip_vs_stats_rcu {
	struct ip_vs_stats	s;
	struct rcu_head		rcu_head;
};

int ip_vs_stats_init_alloc(struct ip_vs_stats *s);
struct ip_vs_stats *ip_vs_stats_alloc(void);
void ip_vs_stats_release(struct ip_vs_stats *stats);
void ip_vs_stats_free(struct ip_vs_stats *stats);

/* Process estimators in multiple timer ticks (20/50/100, see ktrow) */
#define IPVS_EST_NTICKS		50
/* Estimation uses a 2-second period containing ticks (in jiffies) */
#define IPVS_EST_TICK		((2 * HZ) / IPVS_EST_NTICKS)

/* Limit of CPU load per kthread (8 for 12.5%), ratio of CPU capacity (1/C).
 * Value of 4 and above ensures kthreads will take work without exceeding
 * the CPU capacity under different circumstances.
 */
#define IPVS_EST_LOAD_DIVISOR	8

/* Kthreads should not have work that exceeds the CPU load above 50% */
#define IPVS_EST_CPU_KTHREADS	(IPVS_EST_LOAD_DIVISOR / 2)

/* Desired number of chains per timer tick (chain load factor in 100us units),
 * 48=4.8ms of 40ms tick (12% CPU usage):
 * 2 sec * 1000 ms in sec * 10 (100us in ms) / 8 (12.5%) / 50
 */
#define IPVS_EST_CHAIN_FACTOR	\
	ALIGN_DOWN(2 * 1000 * 10 / IPVS_EST_LOAD_DIVISOR / IPVS_EST_NTICKS, 8)

/* Compiled number of chains per tick
 * The defines should match cond_resched_rcu
 */
#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
#define IPVS_EST_TICK_CHAINS	IPVS_EST_CHAIN_FACTOR
#else
#define IPVS_EST_TICK_CHAINS	1
#endif

#if IPVS_EST_NTICKS > 127
#error Too many timer ticks for ktrow
#endif

/* Multiple chains processed in same tick */
struct ip_vs_est_tick_data {
	struct hlist_head	chains[IPVS_EST_TICK_CHAINS];
	DECLARE_BITMAP(present, IPVS_EST_TICK_CHAINS);
	DECLARE_BITMAP(full, IPVS_EST_TICK_CHAINS);
	int			chain_len[IPVS_EST_TICK_CHAINS];
};

/* Context for estimation kthread */
struct ip_vs_est_kt_data {
	struct netns_ipvs	*ipvs;
	struct task_struct	*task;		/* task if running */
	struct ip_vs_est_tick_data __rcu *ticks[IPVS_EST_NTICKS];
	DECLARE_BITMAP(avail, IPVS_EST_NTICKS);	/* tick has space for ests */
	unsigned long		est_timer;	/* estimation timer (jiffies) */
	struct ip_vs_stats	*calc_stats;	/* Used for calculation */
	int			tick_len[IPVS_EST_NTICKS];	/* est count */
	int			id;		/* ktid per netns */
	int			chain_max;	/* max ests per tick chain */
	int			tick_max;	/* max ests per tick */
	int			est_count;	/* attached ests to kthread */
	int			est_max_count;	/* max ests per kthread */
	int			add_row;	/* row for new ests */
	int			est_row;	/* estimated row */
};

struct dst_entry;
struct iphdr;
struct ip_vs_conn;
@@ -688,6 +768,7 @@ struct ip_vs_dest {
	union nf_inet_addr	vaddr;		/* virtual IP address */
	__u32			vfwmark;	/* firewall mark of service */

	struct rcu_head		rcu_head;
	struct list_head	t_list;		/* in dest_trash */
	unsigned int		in_rs_table:1;	/* we are in rs_table */
};
@@ -869,7 +950,7 @@ struct netns_ipvs {
	atomic_t		conn_count;      /* connection counter */

	/* ip_vs_ctl */
-	struct ip_vs_stats		tot_stats;  /* Statistics & est. */
+	struct ip_vs_stats_rcu	*tot_stats;      /* Statistics & est. */

	int			num_services;    /* no of virtual services */
	int			num_services6;   /* IPv6 virtual services */
@@ -932,6 +1013,12 @@ struct netns_ipvs {
	int			sysctl_schedule_icmp;
	int			sysctl_ignore_tunneled;
	int			sysctl_run_estimation;
#ifdef CONFIG_SYSCTL
	cpumask_var_t		sysctl_est_cpulist;	/* kthread cpumask */
	int			est_cpulist_valid;	/* cpulist set */
	int			sysctl_est_nice;	/* kthread nice */
	int			est_stopped;		/* stop tasks */
#endif

	/* ip_vs_lblc */
	int			sysctl_lblc_expiration;
@@ -942,9 +1029,17 @@ struct netns_ipvs {
	struct ctl_table_header	*lblcr_ctl_header;
	struct ctl_table	*lblcr_ctl_table;
	/* ip_vs_est */
-	struct list_head	est_list;	/* estimator list */
-	spinlock_t		est_lock;
-	struct timer_list	est_timer;	/* Estimation timer */
+	struct delayed_work	est_reload_work;/* Reload kthread tasks */
+	struct mutex		est_mutex;	/* protect kthread tasks */
+	struct hlist_head	est_temp_list;	/* Ests during calc phase */
+	struct ip_vs_est_kt_data **est_kt_arr;	/* Array of kthread data ptrs */
+	unsigned long		est_max_threads;/* Hard limit of kthreads */
+	int			est_calc_phase;	/* Calculation phase */
+	int			est_chain_max;	/* Calculated chain_max */
+	int			est_kt_count;	/* Allocated ptrs */
+	int			est_add_ktid;	/* ktid where to add ests */
+	atomic_t		est_genid;	/* kthreads reload genid */
+	atomic_t		est_genid_done;	/* applied genid */
	/* ip_vs_sync */
	spinlock_t		sync_lock;
	struct ipvs_master_sync_state *ms;
@@ -1077,6 +1172,19 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
	return ipvs->sysctl_run_estimation;
}

static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
{
	if (ipvs->est_cpulist_valid)
		return ipvs->sysctl_est_cpulist;
	else
		return housekeeping_cpumask(HK_TYPE_KTHREAD);
}

static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
{
	return ipvs->sysctl_est_nice;
}

#else

static inline int sysctl_sync_threshold(struct netns_ipvs *ipvs)
@@ -1174,6 +1282,16 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
	return 1;
}

static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
{
	return housekeeping_cpumask(HK_TYPE_KTHREAD);
}

static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
{
	return IPVS_EST_NICE;
}

#endif

/* IPVS core functions
@@ -1475,10 +1593,41 @@ int stop_sync_thread(struct netns_ipvs *ipvs, int state);
void ip_vs_sync_conn(struct netns_ipvs *ipvs, struct ip_vs_conn *cp, int pkts);

/* IPVS rate estimator prototypes (from ip_vs_est.c) */
-void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
+int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
void ip_vs_zero_estimator(struct ip_vs_stats *stats);
void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats);
void ip_vs_est_reload_start(struct netns_ipvs *ipvs);
int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
			    struct ip_vs_est_kt_data *kd);
void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);

static inline void ip_vs_est_stopped_recalc(struct netns_ipvs *ipvs)
{
#ifdef CONFIG_SYSCTL
	/* Stop tasks while cpulist is empty or if disabled with flag */
	ipvs->est_stopped = !sysctl_run_estimation(ipvs) ||
			    (ipvs->est_cpulist_valid &&
			     cpumask_empty(sysctl_est_cpulist(ipvs)));
#endif
}

static inline bool ip_vs_est_stopped(struct netns_ipvs *ipvs)
{
#ifdef CONFIG_SYSCTL
	return ipvs->est_stopped;
#else
	return false;
#endif
}

static inline int ip_vs_est_max_threads(struct netns_ipvs *ipvs)
{
	unsigned int limit = IPVS_EST_CPU_KTHREADS *
			     cpumask_weight(sysctl_est_cpulist(ipvs));

	return max(1U, limit);
}

/* Various IPVS packet transmitters (from ip_vs_xmit.c) */
int ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
+1 −2
@@ -71,8 +71,7 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb)
	return ret;
}

-unsigned int nf_confirm(struct sk_buff *skb, unsigned int protoff,
-			struct nf_conn *ct, enum ip_conntrack_info ctinfo);
+unsigned int nf_confirm(void *priv, struct sk_buff *skb, const struct nf_hook_state *state);

void print_tuple(struct seq_file *s, const struct nf_conntrack_tuple *tuple,
		 const struct nf_conntrack_l4proto *proto);
+2 −0
@@ -85,6 +85,7 @@ enum {
	IPSET_ATTR_CADT_LINENO = IPSET_ATTR_LINENO,	/* 9 */
	IPSET_ATTR_MARK,	/* 10 */
	IPSET_ATTR_MARKMASK,	/* 11 */
	IPSET_ATTR_BITMASK,	/* 12 */
	/* Reserve empty slots */
	IPSET_ATTR_CADT_MAX = 16,
	/* Create-only specific attributes */
@@ -153,6 +154,7 @@ enum ipset_errno {
	IPSET_ERR_COMMENT,
	IPSET_ERR_INVALID_MARKMASK,
	IPSET_ERR_SKBINFO,
	IPSET_ERR_BITMASK_NETMASK_EXCL,

	/* Type specific error codes */
	IPSET_ERR_TYPE_SPECIFIC = 4352,