Skip to content
  1. Aug 02, 2023
  2. Aug 01, 2023
  3. Jul 31, 2023
    • Janusz Krzysztofik's avatar
      drm/i915: Fix premature release of request's reusable memory · 946e047a
      Janusz Krzysztofik authored
      Infinite waits for completion of GPU activity have been observed in CI,
      mostly inside __i915_active_wait(), triggered by igt@gem_barrier_race or
      igt@perf@stress-open-close.  Root cause analysis, based of ftrace dumps
      generated with a lot of extra trace_printk() calls added to the code,
      revealed loops of request dependencies being accidentally built,
      preventing the requests from being processed, each waiting for completion
      of another one's activity.
      
      After we substitute a new request for a last active one tracked on a
      timeline, we set up a dependency of our new request to wait on completion
      of current activity of that previous one.  While doing that, we must take
      care of keeping the old request still in memory until we use its
      attributes for setting up that await dependency, or we can happen to set
      up the await dependency on an unrelated request that already reuses the
      memory previously allocated to the old one, already released.  Combined
      with perf adding consecutive kernel context remote requests to different
      user context timelines, unresolvable loops of await dependencies can be
      built, leading do infinite waits.
      
      We obtain a pointer to the previous request to wait upon when we
      substitute it with a pointer to our new request in an active tracker,
      e.g. in intel_timeline.last_request.  In some processing paths we protect
      that old request from being freed before we use it by getting a reference
      to it under RCU protection, but in others, e.g.  __i915_request_commit()
      -> __i915_request_add_to_timeline() -> __i915_request_ensure_ordering(),
      we don't.  But anyway, since the requests' memory is SLAB_FAILSAFE_BY_RCU,
      that RCU protection is not sufficient against reuse of memory.
      
      We could protect i915_request's memory from being prematurely reused by
      calling its release function via call_rcu() and using rcu_read_lock()
      consequently, as proposed in v1.  However, that approach leads to
      significant (up to 10 times) increase of SLAB utilization by i915_request
      SLAB cache.  Another potential approach is to take a reference to the
      previous active fence.
      
      When updating an active fence tracker, we first lock the new fence,
      substitute a pointer of the current active fence with the new one, then we
      lock the substituted fence.  With this approach, there is a time window
      after the substitution and before the lock when the request can be
      concurrently released by an interrupt handler and its memory reused, then
      we may happen to lock and return a new, unrelated request.
      
      Always get a reference to the current active fence first, before
      replacing it with a new one.  Having it protected from premature release
      and reuse, lock it and then replace with the new one but only if not
      yet signalled via a potential concurrent interrupt nor replaced with
      another one by a potential concurrent thread, otherwise retry, starting
      from getting a reference to the new current one.  Adjust users to not
      get a reference to the previous active fence themselves and always put the
      reference got by __i915_active_fence_set() when no longer needed.
      
      v3: Fix lockdep splat reports and other issues caused by incorrect use of
          try_cmpxchg() (use (cmpxchg() != prev) instead)
      v2: Protect request's memory by getting a reference to it in favor of
          delegating its release to call_rcu() (Chris)
      
      Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8211
      Fixes: df9f85d8
      
       ("drm/i915: Serialise i915_active_fence_set() with itself")
      Suggested-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: default avatarJanusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
      Cc: <stable@vger.kernel.org> # v5.6+
      Reviewed-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Signed-off-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20230720093543.832147-2-janusz.krzysztofik@linux.intel.com
      946e047a
  4. Jul 28, 2023
    • Alan Previn's avatar
      drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests · b1cef13e
      Alan Previn authored
      
      
      On MTL, if the GSC Proxy init flows haven't completed, submissions to the
      GSC engine will fail. Those init flows are dependent on the mei's
      gsc_proxy component that is loaded in parallel with i915 and a
      worker that could potentially start after i915 driver init is done.
      
      That said, all subsytems that access the GSC engine today does check
      for such init flow completion before using the GSC engine. However,
      selftests currently don't wait on anything before starting.
      
      To fix this, add a waiter function at the start of __run_selftests
      that waits for gsc-proxy init flows to complete. Selftests shouldn't
      care if the proxy-init failed as that should be flagged elsewhere.
      
      Difference from prior versions:
         v7: - Change the fw status to INTEL_UC_FIRMWARE_LOAD_FAIL if the
               proxy-init fails so that intel_gsc_uc_fw_proxy_get_status
               catches it. (Daniele)
         v6: - Add a helper that returns something more than a boolean
               so we selftest can stop waiting if proxy-init hadn't
               completed but failed (Daniele).
         v5: - Move the call to __wait_gsc_proxy_completed from common
               __run_selftests dispatcher to the group-level selftest
               function (Trvtko).
             - change the pr_info to pr_warn if we hit the timeout.
         v4: - Remove generalized waiters function table framework (Tvrtko).
             - Remove mention of CI-framework-timeout from comments (Tvrtko).
         v3: - Rebase to latest drm-tip.
         v2: - Based on internal testing, increase the timeout for gsc-proxy
               specific case to 8 seconds.
      
      Signed-off-by: default avatarAlan Previn <alan.previn.teres.alexis@intel.com>
      Reviewed-by: default avatarDaniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
      Signed-off-by: default avatarDaniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20230720230126.375566-1-alan.previn.teres.alexis@intel.com
      b1cef13e
  5. Jul 26, 2023
  6. Jul 25, 2023
  7. Jul 24, 2023
  8. Jul 21, 2023
  9. Jul 19, 2023
  10. Jul 17, 2023
  11. Jul 13, 2023
  12. Jul 12, 2023
  13. Jul 10, 2023
  14. Jul 08, 2023
  15. Jul 07, 2023
  16. Jul 06, 2023
  17. Jul 04, 2023
  18. Jun 26, 2023
  19. Jun 24, 2023
  20. Jun 22, 2023