Commit 717f8e51 authored by openeuler-ci-bot, committed by Gitee

!3066 rcu: Add RCU stall diagnosis information

Merge Pull Request from: @ci-robot 
 
PR sync from: Zhen Lei <thunder.leizhen@huawei.com>
https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/P6C4E4WHUS76SMDAU4NM6HFAKILY7H7P/ 
v2 --> v3:
Add "# CONFIG_RCU_CPU_STALL_CPUTIME is not set" into openeuler_defconfig.

v1 --> v2:
Remove patch 1 in v1: rcu: Prevent lockdep-RCU splats on lock acquisition/release
It had already been merged.

v1:
In some extreme cases, such as an I/O pressure test, CPU usage may reach
100% and cause an RCU stall. In such cases the printed information about
'current' is not useful. Displaying the number of hard interrupts, soft
interrupts, and context switches generated within half of the CPU stall
timeout, together with the CPU time they consumed, helps us make a
general judgment. In other cases, it lets us make a preliminary
determination of whether an infinite loop is running with local_irq,
local_bh, or preemption disabled.
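
The reasoning above can be illustrated with a small user-space model. This is only a sketch, not the code from this series: `struct stall_snapshot`, `stall_delta()`, `stall_classify()`, and the classification order are hypothetical names and assumptions chosen for illustration. The idea is to record counters at half of the stall timeout, subtract them from the values seen when the stall is reported, and read the deltas as a hint.

```c
#include <stdint.h>

/* Hypothetical model of one CPU's counters; not a kernel structure. */
struct stall_snapshot {
	uint64_t nr_hardirqs;	/* hard interrupts handled so far */
	uint64_t nr_softirqs;	/* soft interrupts handled so far */
	uint64_t nr_csw;	/* context switches so far */
	uint64_t irq_ms;	/* cumulative hardirq time, in ms */
	uint64_t softirq_ms;	/* cumulative softirq time, in ms */
	uint64_t system_ms;	/* cumulative system time, in ms */
};

/* Assumed heuristic outcomes; the names are illustrative only. */
enum stall_hint {
	STALL_HINT_IRQS_OFF,	/* no hardirqs arrived during the window */
	STALL_HINT_BH_OFF,	/* hardirqs arrived, but no softirqs ran */
	STALL_HINT_PREEMPT_OFF,	/* irqs and softirqs ran, no context switch */
	STALL_HINT_UNKNOWN,
};

/* Activity between the half-timeout snapshot and the stall report. */
static struct stall_snapshot stall_delta(const struct stall_snapshot *then,
					 const struct stall_snapshot *now)
{
	struct stall_snapshot d = {
		.nr_hardirqs = now->nr_hardirqs - then->nr_hardirqs,
		.nr_softirqs = now->nr_softirqs - then->nr_softirqs,
		.nr_csw      = now->nr_csw - then->nr_csw,
		.irq_ms      = now->irq_ms - then->irq_ms,
		.softirq_ms  = now->softirq_ms - then->softirq_ms,
		.system_ms   = now->system_ms - then->system_ms,
	};
	return d;
}

/* Preliminary guess only; a zero delta is a hint, not proof. */
static enum stall_hint stall_classify(const struct stall_snapshot *d)
{
	if (d->nr_hardirqs == 0)
		return STALL_HINT_IRQS_OFF;
	if (d->nr_softirqs == 0)
		return STALL_HINT_BH_OFF;
	if (d->nr_csw == 0)
		return STALL_HINT_PREEMPT_OFF;
	return STALL_HINT_UNKNOWN;
}
```

For example, a window with several hundred hard interrupts but zero soft interrupts and zero context switches would be classified as `STALL_HINT_BH_OFF`, which suggests looking for a loop that runs with bottom halves disabled.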

Neeraj Upadhyay (1):
  rcu: Check and report missed fqs timer wakeup on RCU stall

Paul E. McKenney (2):
  rcu: For RCU grace-period kthread starvation, dump last CPU it ran on
  rcu: Do not NMI offline CPUs

Zhen Lei (7):
  sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task()
  sched/debug: Show the registers of 'current' in dump_cpu_task()
  sched: Add helper kstat_cpu_softirqs_sum()
  sched: Add helper nr_context_switches_cpu()
  rcu: Add RCU stall diagnosis information
  rcu: Align the output of RCU CPU stall warning messages
  config: update openeuler_defconfig for arm64 and x86


-- 
2.25.1
 
https://gitee.com/openeuler/kernel/issues/I7OIXK 
 
Link:https://gitee.com/openeuler/kernel/pulls/3066

 

Reviewed-by: Zucheng Zheng <zhengzucheng@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Liu Chao <liuchao173@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
parents a559abee f4d6ffbf
+22 −1
@@ -92,7 +92,9 @@ warnings:
	buggy timer hardware through bugs in the interrupt or exception
	path (whether hardware, firmware, or software) through bugs
	in Linux's timer subsystem through bugs in the scheduler, and,
	yes, even including bugs in RCU itself.
	yes, even including bugs in RCU itself.  It can also result in
	the ``rcu_.*timer wakeup didn't happen for`` console-log message,
	which will include additional debugging information.

-	A bug in the RCU implementation.

@@ -292,6 +294,25 @@ kthread is waiting for a short timeout, the "state" precedes value of the
task_struct ->state field, and the "cpu" indicates that the grace-period
kthread last ran on CPU 5.

If the relevant grace-period kthread does not wake from FQS wait in a
reasonable time, then the following additional line is printed::

	kthread timer wakeup didn't happen for 23804 jiffies! g7076 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402

The "23804" indicates that kthread's timer expired more than 23 thousand
jiffies ago.  The rest of the line has meaning similar to the kthread
starvation case.

Additionally, the following line is printed::

	Possible timer handling issue on cpu=4 timer-softirq=11142

Here "cpu" indicates that the grace-period kthread last ran on CPU 4,
where it queued the fqs timer.  The number following the "timer-softirq"
is the current ``TIMER_SOFTIRQ`` count on cpu 4.  If this value does not
change on successive RCU CPU stall warnings, there is further reason to
suspect a timer problem.


Multiple Warnings From One Stall
================================
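
The documentation hunk above suggests comparing the "timer-softirq" value across successive RCU CPU stall warnings. A minimal user-space sketch of that comparison follows. The helper names `parse_timer_issue()` and `timer_softirq_stuck()` are hypothetical, and the sketch assumes the console line contains the exact text shown in the hunk, possibly after a log prefix.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the CPU number and TIMER_SOFTIRQ count from one console line.
 * Returns 1 on success, 0 if the line does not contain the message.
 */
static int parse_timer_issue(const char *line, int *cpu, unsigned int *count)
{
	const char *p = strstr(line, "Possible timer handling issue on cpu=");

	if (!p)
		return 0;
	return sscanf(p,
		      "Possible timer handling issue on cpu=%d timer-softirq=%u",
		      cpu, count) == 2;
}

/*
 * Compare two successive warnings.
 * Returns 1 if both name the same CPU and the count did not change,
 * 0 if the count changed or the CPUs differ, -1 if either line fails
 * to parse.
 */
static int timer_softirq_stuck(const char *first, const char *second)
{
	int cpu1, cpu2;
	unsigned int n1, n2;

	if (!parse_timer_issue(first, &cpu1, &n1) ||
	    !parse_timer_issue(second, &cpu2, &n2))
		return -1;
	return cpu1 == cpu2 && n1 == n2;
}
```

A return value of 1 means the count stayed the same between the two warnings, which, per the documentation text, is further reason to suspect a timer problem on that CPU.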
+6 −0
@@ -4751,6 +4751,12 @@
	rcupdate.rcu_cpu_stall_timeout= [KNL]
			Set timeout for RCU CPU stall warning messages.

	rcupdate.rcu_cpu_stall_cputime= [KNL]
			Provide statistics on the cputime and count of
			interrupts and tasks during the sampling period. For
			multiple continuous RCU stalls, all sampling periods
			begin at half of the first RCU stall timeout.

	rcupdate.rcu_expedited= [KNL]
			Use expedited grace-period primitives, for
			example, synchronize_rcu_expedited() instead
+1 −0
@@ -7299,6 +7299,7 @@ CONFIG_DEBUG_LIST=y
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_CPU_STALL_CPUTIME is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging
+1 −0
@@ -8352,6 +8352,7 @@ CONFIG_DEBUG_LIST=y
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_CPU_STALL_CPUTIME is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging
+12 −0
@@ -49,6 +49,7 @@ DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
#define kstat_cpu(cpu) per_cpu(kstat, cpu)
#define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)

extern unsigned long long nr_context_switches_cpu(int cpu);
extern unsigned long long nr_context_switches(void);

extern unsigned int kstat_irqs_cpu(unsigned int irq, int cpu);
@@ -64,6 +65,17 @@ static inline unsigned int kstat_softirqs_cpu(unsigned int irq, int cpu)
       return kstat_cpu(cpu).softirqs[irq];
}

static inline unsigned int kstat_cpu_softirqs_sum(int cpu)
{
	int i;
	unsigned int sum = 0;

	for (i = 0; i < NR_SOFTIRQS; i++)
		sum += kstat_softirqs_cpu(i, cpu);

	return sum;
}

/*
 * Number of interrupts per specific IRQ source, since bootup
 */