Commit f736e0f1 authored by Paul E. McKenney

Merge branches 'fixes.2020.04.27a', 'kfree_rcu.2020.04.27a', 'rcu-tasks.2020.04.27a', 'stall.2020.04.27a' and 'torture.2020.05.07a' into HEAD

fixes.2020.04.27a:  Miscellaneous fixes.
kfree_rcu.2020.04.27a:  Changes related to kfree_rcu().
rcu-tasks.2020.04.27a:  Addition of new RCU-tasks flavors.
stall.2020.04.27a:  RCU CPU stall-warning updates.
torture.2020.05.07a:  Torture-test updates.
+16 −45
@@ -1943,56 +1943,27 @@ invoked from a CPU-hotplug notifier.
Scheduler and RCU
~~~~~~~~~~~~~~~~~

RCU depends on the scheduler, and the scheduler uses RCU to protect some
of its data structures. The preemptible-RCU ``rcu_read_unlock()``
implementation must therefore be written carefully to avoid deadlocks
involving the scheduler's runqueue and priority-inheritance locks. In
particular, ``rcu_read_unlock()`` must tolerate an interrupt where the
interrupt handler invokes both ``rcu_read_lock()`` and
``rcu_read_unlock()``. This possibility requires ``rcu_read_unlock()``
to use negative nesting levels to avoid destructive recursion via
interrupt handler's use of RCU.

This scheduler-RCU requirement came as a `complete
surprise <https://lwn.net/Articles/453002/>`__.

As noted above, RCU makes use of kthreads, and it is necessary to avoid
excessive CPU-time accumulation by these kthreads. This requirement was
no surprise, but RCU's violation of it when running context-switch-heavy
workloads when built with ``CONFIG_NO_HZ_FULL=y`` `did come as a
surprise
RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
accumulation by these kthreads. This requirement was no surprise, but
RCU's violation of it when running context-switch-heavy workloads when
built with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise
[PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__.
RCU has made good progress towards meeting this requirement, even for
context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
room for further improvement.

It is forbidden to hold any of scheduler's runqueue or
priority-inheritance spinlocks across an ``rcu_read_unlock()`` unless
interrupts have been disabled across the entire RCU read-side critical
section, that is, up to and including the matching ``rcu_read_lock()``.
Violating this restriction can result in deadlocks involving these
scheduler spinlocks. There was hope that this restriction might be
lifted when interrupt-disabled calls to ``rcu_read_unlock()`` started
deferring the reporting of the resulting RCU-preempt quiescent state
until the end of the corresponding interrupts-disabled region.
Unfortunately, timely reporting of the corresponding quiescent state to
expedited grace periods requires a call to ``raise_softirq()``, which
can acquire these scheduler spinlocks. In addition, real-time systems
using RCU priority boosting need this restriction to remain in effect
because deferred quiescent-state reporting would also defer deboosting,
which in turn would degrade real-time latencies.

In theory, if a given RCU read-side critical section could be guaranteed
to be less than one second in duration, holding a scheduler spinlock
across that critical section's ``rcu_read_unlock()`` would require only
that preemption be disabled across the entire RCU read-side critical
section, not interrupts. Unfortunately, given the possibility of vCPU
preemption, long-running interrupts, and so on, it is not possible in
practice to guarantee that a given RCU read-side critical section will
complete in less than one second. Therefore, as noted above, if
scheduler spinlocks are held across a given call to
``rcu_read_unlock()``, interrupts must be disabled across the entire RCU
read-side critical section.
There is no longer any prohibition against holding any of the
scheduler's runqueue or priority-inheritance spinlocks across an
``rcu_read_unlock()``, even if interrupts and preemption were enabled
somewhere within the corresponding RCU read-side critical section.
Therefore, it is now perfectly legal to execute ``rcu_read_lock()``
with preemption enabled, acquire one of the scheduler locks, and hold
that lock across the matching ``rcu_read_unlock()``.

Similarly, the RCU flavor consolidation has removed the need for negative
nesting.  The fact that interrupt-disabled regions of code act as RCU
read-side critical sections implicitly avoids earlier issues that used
to result in destructive recursion via an interrupt handler's use of RCU.

Tracing and RCU
~~~~~~~~~~~~~~~
+19 −0
@@ -4210,12 +4210,24 @@
			Duration of CPU stall (s) to test RCU CPU stall
			warnings, zero to disable.

	rcutorture.stall_cpu_block= [KNL]
			Sleep while stalling if set.  This will result
			in warnings from preemptible RCU in addition
			to any other stall-related activity.

	rcutorture.stall_cpu_holdoff= [KNL]
			Time to wait (s) after boot before inducing stall.

	rcutorture.stall_cpu_irqsoff= [KNL]
			Disable interrupts while stalling if set.

	rcutorture.stall_gp_kthread= [KNL]
			Duration (s) of forced sleep within RCU
			grace-period kthread to test RCU CPU stall
			warnings, zero to disable.  If both stall_cpu
			and stall_gp_kthread are specified, the
			kthread is starved first, then the CPU.

	rcutorture.stat_interval= [KNL]
			Time (s) between statistics printk()s.

@@ -4286,6 +4298,13 @@
			only normal grace-period primitives.  No effect
			on CONFIG_TINY_RCU kernels.

	rcupdate.rcu_task_ipi_delay= [KNL]
			Set time in jiffies during which RCU tasks will
			avoid sending IPIs, starting with the beginning
			of a given grace period.  Setting a large
			number avoids disturbing real-time workloads,
			but lengthens grace periods.

	rcupdate.rcu_task_stall_timeout= [KNL]
			Set timeout in jiffies for RCU task stall warning
			messages.  Disable with a value less than or equal
+43 −10
@@ -37,6 +37,7 @@
/* Exported common interfaces */
void call_rcu(struct rcu_head *head, rcu_callback_t func);
void rcu_barrier_tasks(void);
void rcu_barrier_tasks_rude(void);
void synchronize_rcu(void);

#ifdef CONFIG_PREEMPT_RCU
@@ -129,25 +130,57 @@ static inline void rcu_init_nohz(void) { }
 * Note a quasi-voluntary context switch for RCU-tasks's benefit.
 * This is a macro rather than an inline function to avoid #include hell.
 */
#ifdef CONFIG_TASKS_RCU_GENERIC

# ifdef CONFIG_TASKS_RCU
#define rcu_tasks_qs(t) \
# define rcu_tasks_classic_qs(t, preempt)				\
	do {								\
		if (READ_ONCE((t)->rcu_tasks_holdout)) \
		if (!(preempt) && READ_ONCE((t)->rcu_tasks_holdout))	\
			WRITE_ONCE((t)->rcu_tasks_holdout, false);	\
	} while (0)
#define rcu_note_voluntary_context_switch(t) rcu_tasks_qs(t)
void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
void synchronize_rcu_tasks(void);
# else
# define rcu_tasks_classic_qs(t, preempt) do { } while (0)
# define call_rcu_tasks call_rcu
# define synchronize_rcu_tasks synchronize_rcu
# endif

# ifdef CONFIG_TASKS_RCU_TRACE
# define rcu_tasks_trace_qs(t)						\
	do {								\
		if (!likely(READ_ONCE((t)->trc_reader_checked)) &&	\
		    !unlikely(READ_ONCE((t)->trc_reader_nesting))) {	\
			smp_store_release(&(t)->trc_reader_checked, true); \
			smp_mb(); /* Readers partitioned by store. */	\
		}							\
	} while (0)
# else
# define rcu_tasks_trace_qs(t) do { } while (0)
# endif

#define rcu_tasks_qs(t, preempt)					\
do {									\
	rcu_tasks_classic_qs((t), (preempt));				\
	rcu_tasks_trace_qs((t));					\
} while (0)

# ifdef CONFIG_TASKS_RUDE_RCU
void call_rcu_tasks_rude(struct rcu_head *head, rcu_callback_t func);
void synchronize_rcu_tasks_rude(void);
# endif

#define rcu_note_voluntary_context_switch(t) rcu_tasks_qs(t, false)
void exit_tasks_rcu_start(void);
void exit_tasks_rcu_finish(void);
#else /* #ifdef CONFIG_TASKS_RCU */
#define rcu_tasks_qs(t)	do { } while (0)
#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
#define rcu_tasks_qs(t, preempt) do { } while (0)
#define rcu_note_voluntary_context_switch(t) do { } while (0)
#define call_rcu_tasks call_rcu
#define synchronize_rcu_tasks synchronize_rcu
static inline void exit_tasks_rcu_start(void) { }
static inline void exit_tasks_rcu_finish(void) { }
#endif /* #else #ifdef CONFIG_TASKS_RCU */
#endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */

/**
 * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU
@@ -158,7 +191,7 @@ static inline void exit_tasks_rcu_finish(void) { }
 */
#define cond_resched_tasks_rcu_qs() \
do { \
	rcu_tasks_qs(current); \
	rcu_tasks_qs(current, false); \
	cond_resched(); \
} while (0)
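
The new preempt argument to rcu_tasks_qs() can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct task_model and still_holdout_after_switch() are hypothetical names invented for illustration, and READ_ONCE()/WRITE_ONCE() are simplified stand-ins for the kernel macros. Only the body of rcu_tasks_classic_qs() is copied from the hunk above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical miniature of task_struct holding only the field used here. */
struct task_model {
	bool rcu_tasks_holdout;
};

/* Same logic as the new rcu_tasks_classic_qs() macro in the diff. */
#define rcu_tasks_classic_qs(t, preempt)				\
	do {								\
		if (!(preempt) && READ_ONCE((t)->rcu_tasks_holdout))	\
			WRITE_ONCE((t)->rcu_tasks_holdout, false);	\
	} while (0)

/*
 * Model one context switch and report whether the task is still a
 * holdout afterwards.  Hypothetical helper, not a kernel function.
 */
static bool still_holdout_after_switch(struct task_model *t, bool preempt)
{
	assert(t != NULL);
	rcu_tasks_classic_qs(t, preempt);
	return READ_ONCE(t->rcu_tasks_holdout);
}
```

In this model, a switch with preempt set to true leaves a holdout task marked as a holdout, while a switch with preempt set to false clears the flag. That matches the diff, where rcu_note_voluntary_context_switch() and cond_resched_tasks_rcu_qs() both pass false.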

+88 −0
/* SPDX-License-Identifier: GPL-2.0+ */
/*
 * Read-Copy Update mechanism for mutual exclusion, adapted for tracing.
 *
 * Copyright (C) 2020 Paul E. McKenney.
 */

#ifndef __LINUX_RCUPDATE_TRACE_H
#define __LINUX_RCUPDATE_TRACE_H

#include <linux/sched.h>
#include <linux/rcupdate.h>

#ifdef CONFIG_DEBUG_LOCK_ALLOC

extern struct lockdep_map rcu_trace_lock_map;

static inline int rcu_read_lock_trace_held(void)
{
	return lock_is_held(&rcu_trace_lock_map);
}

#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */

static inline int rcu_read_lock_trace_held(void)
{
	return 1;
}

#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */

#ifdef CONFIG_TASKS_TRACE_RCU

void rcu_read_unlock_trace_special(struct task_struct *t, int nesting);

/**
 * rcu_read_lock_trace - mark beginning of RCU-trace read-side critical section
 *
 * When synchronize_rcu_trace() is invoked by one task, then that task
 * is guaranteed to block until all other tasks exit their read-side
 * critical sections.  Similarly, if call_rcu_trace() is invoked on one
 * task while other tasks are within RCU read-side critical sections,
 * invocation of the corresponding RCU callback is deferred until after
 * all the other tasks exit their critical sections.
 *
 * For more details, please see the documentation for rcu_read_lock().
 */
static inline void rcu_read_lock_trace(void)
{
	struct task_struct *t = current;

	WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) &&
	    t->trc_reader_special.b.need_mb)
		smp_mb(); // Pairs with update-side barriers
	rcu_lock_acquire(&rcu_trace_lock_map);
}

/**
 * rcu_read_unlock_trace - mark end of RCU-trace read-side critical section
 *
 * Pairs with a preceding call to rcu_read_lock_trace(), and nesting is
 * allowed.  Invoking a rcu_read_unlock_trace() when there is no matching
 * rcu_read_lock_trace() is verboten, and will result in lockdep complaints.
 *
 * For more details, please see the documentation for rcu_read_unlock().
 */
static inline void rcu_read_unlock_trace(void)
{
	int nesting;
	struct task_struct *t = current;

	rcu_lock_release(&rcu_trace_lock_map);
	nesting = READ_ONCE(t->trc_reader_nesting) - 1;
	if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
		WRITE_ONCE(t->trc_reader_nesting, nesting);
		return;  // We assume shallow reader nesting.
	}
	rcu_read_unlock_trace_special(t, nesting);
}

void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
void synchronize_rcu_tasks_trace(void);
void rcu_barrier_tasks_trace(void);

#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */

#endif /* __LINUX_RCUPDATE_TRACE_H */
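
The fast-path test in rcu_read_unlock_trace() can be followed more easily in a single-threaded userspace model. The sketch below makes several assumptions: struct trc_task_model and the *_model functions are hypothetical names, the memory barriers and lockdep calls are omitted, and the slow path is a placeholder because the diff shows only the declaration of rcu_read_unlock_trace_special().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of the per-task state used by the readers. */
struct trc_task_model {
	int trc_reader_nesting;  /* depth of nested read-side sections */
	bool trc_reader_special; /* stand-in for trc_reader_special.s */
	int slow_path_calls;     /* model-only counter for illustration */
};

/*
 * Placeholder for rcu_read_unlock_trace_special().  The real body is
 * not part of this diff, so this model only records the new nesting
 * level and counts the call.
 */
static void unlock_special_model(struct trc_task_model *t, int nesting)
{
	t->trc_reader_nesting = nesting;
	t->slow_path_calls++;
}

/* Model of rcu_read_lock_trace(): bump the nesting depth. */
static void read_lock_trace_model(struct trc_task_model *t)
{
	t->trc_reader_nesting++;
}

/* Model of rcu_read_unlock_trace(): fast path unless outermost and special. */
static void read_unlock_trace_model(struct trc_task_model *t)
{
	int nesting;

	assert(t->trc_reader_nesting > 0); /* unmatched unlock is a bug */
	nesting = t->trc_reader_nesting - 1;
	if (!t->trc_reader_special || nesting) {
		t->trc_reader_nesting = nesting; /* common case */
		return;
	}
	unlock_special_model(t, nesting); /* outermost unlock, special set */
}
```

In this model, an inner unlock always takes the fast path, even when the special flag is set. Only the outermost unlock, where the nesting count drops to zero, reaches the slow path.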
+19 −0
@@ -31,4 +31,23 @@ do { \

#define wait_rcu_gp(...) _wait_rcu_gp(false, __VA_ARGS__)

/**
 * synchronize_rcu_mult - Wait concurrently for multiple grace periods
 * @...: List of call_rcu() functions for different grace periods to wait on
 *
 * This macro waits concurrently for multiple types of RCU grace periods.
 * For example, synchronize_rcu_mult(call_rcu, call_rcu_tasks) would wait
 * on concurrent RCU and RCU-tasks grace periods.  Waiting on a given SRCU
 * domain requires you to write a wrapper function for that SRCU domain's
 * call_srcu() function, with this wrapper supplying the pointer to the
 * corresponding srcu_struct.
 *
 * The first argument tells Tiny RCU's _wait_rcu_gp() not to
 * bother waiting for RCU, because in Tiny RCU any context from which
 * synchronize_rcu_mult() can be called is automatically already a full
 * grace period.
 */
#define synchronize_rcu_mult(...) \
	_wait_rcu_gp(IS_ENABLED(CONFIG_TINY_RCU), __VA_ARGS__)

#endif /* _LINUX_SCHED_RCUPDATE_WAIT_H */
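
The synchronize_rcu_mult() comment asks callers to wrap call_srcu() so that the wrapper has the same two-argument shape as call_rcu(). The userspace sketch below models that wrapper pattern only. Every type and function in it is a hypothetical stand-in (the *_model names, my_srcu, call_my_srcu), and the model treats each grace period as ending immediately, which real RCU does not do.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins for the kernel types. */
struct rcu_head_model { int unused; };
typedef void (*rcu_callback_model_t)(struct rcu_head_model *head);
typedef void (*call_rcu_model_t)(struct rcu_head_model *head,
				 rcu_callback_model_t func);

/* Stand-in for an SRCU domain (srcu_struct in the kernel). */
struct srcu_model { int callbacks_posted; };

/* Three-argument form, shaped like call_srcu(): domain comes first. */
static void call_srcu_model(struct srcu_model *ssp,
			    struct rcu_head_model *head,
			    rcu_callback_model_t func)
{
	ssp->callbacks_posted++;
	func(head); /* model assumption: grace period ends at once */
}

static struct srcu_model my_srcu;

/* The wrapper: call_rcu()-shaped, with the domain pointer supplied here. */
static void call_my_srcu(struct rcu_head_model *head,
			 rcu_callback_model_t func)
{
	call_srcu_model(&my_srcu, head, func);
}

/* One waiter per flavor; head is first so the cast below is valid. */
struct gp_waiter { struct rcu_head_model head; int done; };

static void wake_waiter(struct rcu_head_model *head)
{
	((struct gp_waiter *)head)->done = 1;
}

#define MAX_FLAVORS 4

/* Post one callback per flavor, then count how many have completed. */
static int wait_mult_model(const call_rcu_model_t *funcs, int n)
{
	struct gp_waiter w[MAX_FLAVORS];
	int i, ndone = 0;

	assert(n >= 0 && n <= MAX_FLAVORS);
	for (i = 0; i < n; i++) {
		w[i].done = 0;
		funcs[i](&w[i].head, wake_waiter);
	}
	for (i = 0; i < n; i++)
		ndone += w[i].done;
	return ndone;
}
```

In kernel code, the analogous wrapper would call the real call_srcu() with a pointer to the domain's srcu_struct. The wrapper would then be passed as one of the arguments to synchronize_rcu_mult(), alongside functions such as call_rcu.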