  1. Jan 25, 2024
    • tick/sched: Preserve number of idle sleeps across CPU hotplug events · 9a574ea9
      Tim Chen authored
      Commit 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs
      CPU hotplug") preserved total idle sleep time and iowait sleeptime across
      CPU hotplug events.
      
      Similar reasoning applies to the number of idle calls and idle sleeps to
      get the proper average of sleep time per idle invocation.
      
      Preserve those fields too.
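
      The idea can be sketched in user space as follows. This is only an
      illustration, not the kernel change itself: the struct
      tick_state_sketch, its field layout, and reset_keep_idle_stats() are
      hypothetical names chosen for the example. It assumes the per-CPU
      state is wiped with memset() on hotplug and that the four idle
      counters are saved before the wipe and restored afterwards.

```c
#include <string.h>

/* Hypothetical stand-in for the per-CPU tick state; not the kernel struct. */
struct tick_state_sketch {
    unsigned long idle_calls;        /* number of idle invocations */
    unsigned long idle_sleeps;       /* number of idle sleeps */
    long long idle_sleeptime_ns;     /* total idle sleep time */
    long long iowait_sleeptime_ns;   /* total iowait sleep time */
    int other_state;                 /* everything else, reset on hotplug */
};

/* Wipe the state but carry the idle statistics across the reset. */
static void reset_keep_idle_stats(struct tick_state_sketch *ts)
{
    unsigned long idle_calls = ts->idle_calls;
    unsigned long idle_sleeps = ts->idle_sleeps;
    long long idle_sleeptime_ns = ts->idle_sleeptime_ns;
    long long iowait_sleeptime_ns = ts->iowait_sleeptime_ns;

    memset(ts, 0, sizeof(*ts));

    ts->idle_calls = idle_calls;
    ts->idle_sleeps = idle_sleeps;
    ts->idle_sleeptime_ns = idle_sleeptime_ns;
    ts->iowait_sleeptime_ns = iowait_sleeptime_ns;
}
```

      If only the sleep times were restored, dividing them by a counter
      that restarted from zero would overstate the average sleep time per
      idle invocation.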
      
      Fixes: 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug")
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240122233534.3094238-1-tim.c.chen@linux.intel.com
    • clocksource: Skip watchdog check for large watchdog intervals · 64464955
      Jiri Wiesner authored
      There have been reports of the watchdog marking clocksources unstable on
      machines with 8 NUMA nodes:
      
        clocksource: timekeeping watchdog on CPU373:
        Marking clocksource 'tsc' as unstable because the skew is too large:
        clocksource:   'hpet' wd_nsec: 14523447520
        clocksource:   'tsc'  cs_nsec: 14524115132
      
      The measured clocksource skew - the absolute difference between cs_nsec
      and wd_nsec - was 668 microseconds:
      
        cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
      
      The kernel used 200 microseconds for the uncertainty_margin of both the
      clocksource and watchdog, resulting in a threshold of 400 microseconds (the
      md variable). Both the cs_nsec and the wd_nsec value indicate that the
      readout interval was circa 14.5 seconds.  The observed behaviour is that
      watchdog checks failed for large readout intervals on 8 NUMA node
      machines. This indicates that the size of the skew was directly proportional
      to the length of the readout interval on those machines. The measured
      clocksource skew, 668 microseconds, was evaluated against a threshold (the
      md variable) that is suited for readout intervals of roughly
      WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
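
      The arithmetic above can be reproduced with a small user-space
      sketch. The helper names (skew_nsec, threshold_nsec, skew_ppm) are
      made up for this illustration and are not kernel functions; the
      input values are the ones from the report, and the 200 microsecond
      margins are the ones stated above.

```c
#include <stdint.h>

/* Readings from the report quoted above, in nanoseconds. */
#define REPORT_WD_NSEC 14523447520LL
#define REPORT_CS_NSEC 14524115132LL

/* 200 microseconds of uncertainty margin, as stated in the text. */
#define MARGIN_NSEC 200000LL

/* Absolute difference between the clocksource and watchdog readings. */
static int64_t skew_nsec(int64_t cs_nsec, int64_t wd_nsec)
{
    int64_t delta = cs_nsec - wd_nsec;

    return delta < 0 ? -delta : delta;
}

/* The threshold (md) is the sum of both uncertainty margins. */
static int64_t threshold_nsec(int64_t cs_margin, int64_t wd_margin)
{
    return cs_margin + wd_margin;
}

/* Skew in microseconds per second of readout interval, rounded down. */
static int64_t skew_ppm(int64_t skew, int64_t interval_nsec)
{
    return skew * 1000000 / interval_nsec;
}
```

      With the reported values the skew is 667612 ns against a threshold
      of 400000 ns, so the check fails, yet normalised to the 14.5 second
      interval the skew is only about 46 microseconds per second.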
      
      The intention of 2e27e793 ("clocksource: Reduce clocksource-skew
      threshold") was to tighten the threshold for evaluating skew and set the
      lower bound for the uncertainty_margin of clocksources to twice
      WATCHDOG_MAX_SKEW. Later in c37e85c1 ("clocksource: Loosen clocksource
      watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
      125 microseconds to fit the limit of NTP, which is able to use a
      clocksource that suffers from up to 500 microseconds of skew per second.
      Both the TSC and the HPET use the default uncertainty_margin. When the
      readout interval gets stretched, the default uncertainty_margin is no
      longer a suitable lower bound for evaluating skew - it imposes a limit
      that is far stricter than the skew with which NTP can deal.
      
      The root causes of the skew being directly proportional to the length of
      the readout interval are:
      
        * the inaccuracy of the shift/mult pairs of clocksources and the watchdog
        * the conversion to nanoseconds is imprecise for large readout intervals
      
      Prevent this by skipping the current watchdog check if the readout
      interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout
      interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin
      (of the TSC and HPET) corresponds to a limit on clocksource skew of 250
      ppm (microseconds of skew per second).  To keep the limit imposed by NTP
      (500 microseconds of skew per second) for all possible readout intervals,
      the margins would have to be scaled so that the threshold value is
      proportional to the length of the actual readout interval.
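
      A minimal user-space sketch of the skip logic follows. It is not
      the kernel patch: the enum wd_verdict, the function check_skew(),
      and the nanosecond constants are hypothetical, and the sketch
      assumes the readout interval is judged by the watchdog's own
      reading (wd_nsec).

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL
/* WATCHDOG_INTERVAL is HZ >> 1 jiffies, i.e. 0.5 second. */
#define WATCHDOG_INTERVAL_NSEC (NSEC_PER_SEC / 2)
/* Readout intervals longer than this are not evaluated. */
#define WATCHDOG_INTERVAL_MAX_NSEC (2 * WATCHDOG_INTERVAL_NSEC)

enum wd_verdict { WD_SKIP, WD_OK, WD_UNSTABLE };

/*
 * Compare the clocksource against the watchdog, but only when the
 * readout interval is short enough for the threshold md to be meaningful.
 */
static enum wd_verdict check_skew(int64_t cs_nsec, int64_t wd_nsec,
                                  int64_t md_nsec)
{
    int64_t skew = cs_nsec - wd_nsec;

    if (skew < 0)
        skew = -skew;

    /* Stretched interval: skip this check instead of judging the skew. */
    if (wd_nsec > WATCHDOG_INTERVAL_MAX_NSEC)
        return WD_SKIP;

    return skew > md_nsec ? WD_UNSTABLE : WD_OK;
}
```

      Under these assumptions the reported 14.5 second readout is skipped
      rather than marked unstable, while a 0.5 second readout with more
      than 400 microseconds of skew is still flagged.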
      
      As for why the readout interval may get stretched: Since the watchdog is
      executed in softirq context the expiration of the watchdog timer can get
      severely delayed on account of a ksoftirqd thread not getting to run in a
      timely manner. Surely, a system with such belated softirq execution is not
      working well and the scheduling issue should be looked into but the
      clocksource watchdog should be able to deal with it accordingly.
      
      Fixes: 2e27e793 ("clocksource: Reduce clocksource-skew threshold")
      Suggested-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Feng Tang <feng.tang@intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240122172350.GA740@incl
  2. Jan 22, 2024