Commit 7a21ddf0 authored by Zhen Lei's avatar Zhen Lei Committed by Paul E. McKenney
Browse files

doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information



This commit documents the additional RCU CPU stall warning output
produced by kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted
with rcupdate.rcu_cpu_stall_cputime=1.

[ paulmck: Apply wordsmithing. ]

Signed-off-by: default avatarZhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: default avatarFrederic Weisbecker <frederic@kernel.org>
Signed-off-by: default avatarPaul E. McKenney <paulmck@kernel.org>
parent 148750d7
Loading
Loading
Loading
Loading
+88 −0
Original line number Diff line number Diff line
@@ -390,3 +390,91 @@ for example, "P3421".

It is entirely possible to see stall warnings from normal and from
expedited grace periods at about the same time during the same run.

RCU_CPU_STALL_CPUTIME
=====================

In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted with
rcupdate.rcu_cpu_stall_cputime=1, the following additional information
is supplied with each RCU CPU stall warning::

rcu:          hardirqs   softirqs   csw/system
rcu:  number:      624         45            0
rcu: cputime:       69          1         2425   ==> 2500(ms)

These statistics are collected during the sampling period. The values
in row "number:" are the number of hard interrupts, number of soft
interrupts, and number of context switches on the stalled CPU. The
first three values in row "cputime:" indicate the CPU time in
milliseconds consumed by hard interrupts, soft interrupts, and tasks
on the stalled CPU.  The last number is the measurement interval, again
in milliseconds.  Because user-mode tasks normally do not cause RCU CPU
stalls, these tasks are typically kernel tasks, which is why only the
system CPU time are considered.

The sampling period is shown as follows:
:<------------first timeout---------->:<-----second timeout----->:
:<--half timeout-->:<--half timeout-->:                          :
:                  :<--first period-->:                          :
:                  :<-----------second sampling period---------->:
:                  :                  :                          :
:          snapshot time point    1st-stall                  2nd-stall


The following describes four typical scenarios:

1. A CPU looping with interrupts disabled.::

   rcu:          hardirqs   softirqs   csw/system
   rcu:  number:        0          0            0
   rcu: cputime:        0          0            0   ==> 2500(ms)

   Because interrupts have been disabled throughout the measurement
   interval, there are no interrupts and no context switches.
   Furthermore, because CPU time consumption was measured using interrupt
   handlers, the system CPU consumption is misleadingly measured as zero.
   This scenario will normally also have "(0 ticks this GP)" printed on
   this CPU's summary line.

2. A CPU looping with bottom halves disabled.

   This is similar to the previous example, but with non-zero number of
   and CPU time consumed by hard interrupts, along with non-zero CPU
   time consumed by in-kernel execution.::

   rcu:          hardirqs   softirqs   csw/system
   rcu:  number:      624          0            0
   rcu: cputime:       49          0         2446   ==> 2500(ms)

   The fact that there are zero softirqs gives a hint that these were
   disabled, perhaps via local_bh_disable().  It is of course possible
   that there were no softirqs, perhaps because all events that would
   result in softirq execution are confined to other CPUs.  In this case,
   the diagnosis should continue as shown in the next example.

3. A CPU looping with preemption disabled.

   Here, only the number of context switches is zero.::

   rcu:          hardirqs   softirqs   csw/system
   rcu:  number:      624         45            0
   rcu: cputime:       69          1         2425   ==> 2500(ms)

   This situation hints that the stalled CPU was looping with preemption
   disabled.

4. No looping, but massive hard and soft interrupts.::

   rcu:          hardirqs   softirqs   csw/system
   rcu:  number:       xx         xx            0
   rcu: cputime:       xx         xx            0   ==> 2500(ms)

   Here, the number and CPU time of hard interrupts are all non-zero,
   but the number of context switches and the in-kernel CPU time consumed
   are zero. The number and cputime of soft interrupts will usually be
   non-zero, but could be zero, for example, if the CPU was spinning
   within a single hard interrupt handler.

   If this type of RCU CPU stall warning can be reproduced, you can
   narrow it down by looking at /proc/interrupts or by writing code to
   trace each interrupt, for example, by referring to show_interrupts().