  1. Oct 16, 2023
    • arm64: Split kpti_install_ng_mappings() · 42c5a3b0
      Mark Rutland authored
      
      
      The arm64_cpu_capabilities::cpu_enable callbacks are intended for
      cpu-local feature enablement (e.g. poking system registers). These get
      called for each online CPU when boot/system cpucaps get finalized and
      enabled, and get called whenever a CPU is subsequently onlined.
      
      For KPTI with the ARM64_UNMAP_KERNEL_AT_EL0 cpucap, we use the
      kpti_install_ng_mappings() function as the cpu_enable callback. This
      does a mixture of cpu-local configuration (setting VBAR_EL1 to the
      appropriate trampoline vectors) and some global configuration (rewriting
      the swapper page tables to use non-global mappings) that must happen at
      most once.
      
      This patch splits kpti_install_ng_mappings() into a cpu-local
      cpu_enable_kpti() initialization function and a system-wide
      kpti_install_ng_mappings() function. The cpu_enable_kpti() function is
      responsible for selecting the necessary cpu-local vectors each time a
      CPU is onlined, and the kpti_install_ng_mappings() function performs the
      one-time rewrite of the translation tables to use non-global mappings.
      Splitting the two makes the code a bit easier to follow and also allows
      the page table rewriting code to be marked as __init such that it can be
      freed after use.
      
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      42c5a3b0
    • arm64: Fixup user features at boot time · 7f632d33
      Mark Rutland authored
      
      
      For ARM64_WORKAROUND_2658417, we use a cpu_enable() callback to hide the
      ID_AA64ISAR1_EL1.BF16 ID register field. This is a little awkward as
      CPUs may attempt to apply the workaround concurrently, requiring that we
      protect the bulk of the callback with a raw_spinlock, and requiring some
      pointless work every time a CPU is subsequently hotplugged in.
      
      This patch makes this a little simpler by handling the masking once at
      boot time. A new user_feature_fixup() function is called at the start of
      setup_user_features() to mask the feature, matching the style of
      elf_hwcap_fixup(). The ARM64_WORKAROUND_2658417 cpucap is added to
      cpucap_is_possible() so that code can be elided entirely when this is
      not possible.
      
      Note that the ARM64_WORKAROUND_2658417 capability is matched with
      ERRATA_MIDR_RANGE(), which implicitly gives the capability a
      ARM64_CPUCAP_LOCAL_CPU_ERRATUM type, which forbids the late onlining of
      a CPU with the erratum if the erratum was not present at boot time.
      Therefore this patch doesn't change the behaviour for late onlining.
      
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      7f632d33
    • arm64: Rework setup_cpu_features() · 075f48c9
      Mark Rutland authored
      
      
      Currently setup_cpu_features() handles a mixture of one-time kernel
      feature setup (e.g. cpucaps) and one-time user feature setup (e.g. ELF
      hwcaps). Subsequent patches will rework other one-time setup and expand
      the logic currently in setup_cpu_features(), and in preparation for this
      it would be helpful to split the kernel and user setup into separate
      functions.
      
      This patch splits setup_user_features() out of setup_cpu_features(),
      with a few additional cleanups of note:
      
      * setup_cpu_features() is renamed to setup_system_features() to make it
        clear that it handles system-wide feature setup rather than cpu-local
        feature setup.
      
      * setup_system_capabilities() is folded into setup_system_features().
      
      * Presence of TTBR0 PAN is logged immediately after
        update_cpu_capabilities(), so that this is guaranteed to appear
        alongside all the other detected system cpucaps.
      
      * The 'cwg' variable is removed as its value is only consumed once and
        it's simpler to use cache_type_cwg() directly without assigning its
        return value to a variable.
      
      * The call to setup_user_features() is moved after alternatives are
        patched, which will allow user feature setup code to depend on
        alternative branches and allow for simplifications in subsequent
        patches.
      
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      075f48c9
    • arm64: Add cpus_have_final_boot_cap() · 7bf46aa1
      Mark Rutland authored
      
      
      The cpus_have_final_cap() function can be used to test a cpucap while
      also verifying that we do not consume the cpucap until system
      capabilities have been finalized. It would be helpful if we could do
      likewise for boot cpucaps.
      
      This patch adds a new cpus_have_final_boot_cap() helper which can be
      used to test a cpucap while also verifying that boot capabilities have
      been finalized. Users will be added in subsequent patches.
      
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      7bf46aa1
    • arm64: Add cpucap_is_possible() · de66cb37
      Mark Rutland authored
      
      
      Many cpucaps can only be set when certain CONFIG_* options are selected,
      and we need to check the CONFIG_* option before the cap in order to
      avoid generating redundant code. Due to this, we have a growing number
      of helpers in <asm/cpufeature.h> of the form:
      
      | static __always_inline bool system_supports_foo(void)
      | {
      |         return IS_ENABLED(CONFIG_ARM64_FOO) &&
      |                 cpus_have_const_cap(ARM64_HAS_FOO);
      | }
      
      This is unfortunate as it forces us to use cpus_have_const_cap()
      unnecessarily, resulting in redundant code being generated by the
      compiler. In the vast majority of cases, we only require that feature
      checks indicate the presence of a feature after cpucaps have been
      finalized, and so it would be sufficient to use alternative_has_cap_*().
      However some code needs to handle a feature before alternatives have
      been patched, and must test the system_cpucaps bitmap via
      cpus_have_const_cap(). In other cases we'd like to check for
      unintentional usage of a cpucap before alternatives are patched, and so
      it would be preferable to use cpus_have_final_cap().
      
      Placing the IS_ENABLED() checks in each callsite is tedious and
      error-prone, and the same applies for writing wrappers for each
      combination of cpucap and alternative_has_cap_*() / cpus_have_cap() /
      cpus_have_final_cap(). It would be nicer if we could centralize the
      knowledge of which cpucaps are possible, and have
      alternative_has_cap_*(), cpus_have_cap(), and cpus_have_final_cap()
      handle this automatically.
      
      This patch adds a new cpucap_is_possible() function which will be
      responsible for checking the CONFIG_* option, and updates the low-level
      cpucap checks to use this. The existing CONFIG_* checks in
      <asm/cpufeature.h> are moved over to cpucap_is_possible(), but the (now
      trivial) wrapper functions are retained for now.
      
      There should be no functional change as a result of this patch alone.
      
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      de66cb37
    • arm64: Factor out cpucap definitions · 484de085
      Mark Rutland authored
      
      
      For clarity it would be nice to factor cpucap manipulation out of
      <asm/cpufeature.h>, and the obvious place would be <asm/cpucap.h>, but
      this will clash somewhat with <generated/asm/cpucaps.h>.
      
      Rename <generated/asm/cpucaps.h> to <generated/asm/cpucap-defs.h>,
      matching what we do for <generated/asm/sysreg-defs.h>, and introduce a
      new <asm/cpucaps.h> which includes the generated header.
      
      Subsequent patches will fill out <asm/cpucaps.h>.
      
      There should be no functional change as a result of this patch.
      
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      484de085
    • arm64/arm: xen: enlighten: Fix KPTI checks · 20f3b8ea
      Mark Rutland authored
      
      
      When KPTI is in use, we cannot register a runstate region as XEN
      requires that this is always a valid VA, which we cannot guarantee. Due
      to this, xen_starting_cpu() must avoid registering each CPU's runstate
      region, and xen_guest_init() must avoid setting up features that depend
      upon it.
      
      We tried to ensure that in commit:
      
        f88af722 ("xen/arm: do not setup the runstate info page if kpti is enabled")
      
      ... where we added checks for xen_kernel_unmapped_at_usr(), which wraps
      arm64_kernel_unmapped_at_el0() on arm64 and is always false on 32-bit
      arm.
      
      Unfortunately, as xen_guest_init() is an early_initcall, this happens
      before secondary CPUs are booted and arm64 has finalized the
      ARM64_UNMAP_KERNEL_AT_EL0 cpucap which backs
      arm64_kernel_unmapped_at_el0(), and so this can subsequently be set as
      secondary CPUs are onlined. On a big.LITTLE system where the boot CPU
      does not require KPTI but some secondary CPUs do, this will result in
      xen_guest_init() initializing features that depend on the runstate
      region, and xen_starting_cpu() registering the runstate region on some
      CPUs before KPTI is subsequently enabled, resulting in the problems the
      aforementioned commit tried to avoid.
      
      Handle this more robustly by deferring the initialization of the
      runstate region until secondary CPUs have been initialized and the
      ARM64_UNMAP_KERNEL_AT_EL0 cpucap has been finalized. The per-cpu work is
      moved into a new hotplug starting function which is registered later
      when we're certain that KPTI will not be used.
      
      Fixes: f88af722 ("xen/arm: do not setup the runstate info page if kpti is enabled")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Bertrand Marquis <bertrand.marquis@arm.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      20f3b8ea
    • clocksource/drivers/arm_arch_timer: Initialize evtstrm after finalizing cpucaps · 166b76a0
      Mark Rutland authored
      
      
      We attempt to initialize each CPU's arch_timer event stream in
      arch_timer_evtstrm_enable(), which we call from the
      arch_timer_starting_cpu() cpu hotplug callback which is registered early
      in boot. As this is registered before we initialize the system cpucaps,
      the test for ARM64_HAS_ECV will always be false for CPUs present at boot
      time, and will only be taken into account for CPUs onlined late
      (including those which are hotplugged out and in again).
      
      Due to this, CPUs present at boot time may not use the intended divider
      and scale factor to generate the event stream, and may differ from other
      CPUs.
      
      Correct this by only initializing the event stream after cpucaps have been
      finalized, registering a separate CPU hotplug callback for the event stream
      configuration. Since the caps must be finalized by this point, use
      cpus_have_final_cap() to verify this.
      
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Marc Zyngier <maz@kernel.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      166b76a0
  2. Sep 25, 2023
    • Linux 6.6-rc3 · 6465e260
      Linus Torvalds authored
      6465e260
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 8a511e7e
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
      "ARM:
      
         - Fix EL2 Stage-1 MMIO mappings where a random address was used
      
         - Fix SMCCC function number comparison when the SVE hint is set
      
        RISC-V:
      
         - Fix KVM_GET_REG_LIST API for ISA_EXT registers
      
         - Fix reading ISA_EXT register of a missing extension
      
         - Fix ISA_EXT register handling in get-reg-list test
      
         - Fix filtering of AIA registers in get-reg-list test
      
        x86:
      
         - Fixes for TSC_AUX virtualization
      
         - Stop zapping page tables asynchronously, since we don't zap them as
           often as before"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX
        KVM: SVM: Fix TSC_AUX virtualization setup
        KVM: SVM: INTERCEPT_RDTSCP is never intercepted anyway
        KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously
        KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe()
        KVM: x86/mmu: Open code leaf invalidation from mmu_notifier
        KVM: riscv: selftests: Selectively filter-out AIA registers
        KVM: riscv: selftests: Fix ISA_EXT register handling in get-reg-list
        RISC-V: KVM: Fix riscv_vcpu_get_isa_ext_single() for missing extensions
        RISC-V: KVM: Fix KVM_GET_REG_LIST API for ISA_EXT registers
        KVM: selftests: Assert that vasprintf() is successful
        KVM: arm64: nvhe: Ignore SVE hint in SMCCC function ID
        KVM: arm64: Properly return allocated EL2 VA from hyp_alloc_private_va_range()
      8a511e7e
    • Merge tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 5edc6bb3
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Fix the "bytes" output of the per_cpu stat file
      
         The tracefs/per_cpu/cpu*/stats "bytes" was giving bogus values as the
         accounting was not accurate. It is supposed to show how many used
         bytes are still in the ring buffer, but even when the ring buffer was
         empty it would still show there were bytes used.
      
       - Fix a bug in eventfs where reading a dynamic event directory (open)
         and then creating a dynamic event that goes into that directory screws
         up the accounting.
      
         On close, the newly created event dentry will get a "dput" without
         ever having a "dget" done for it. The fix is to allocate an array on
         dir open to save what dentries were actually "dget" on, and what ones
         to "dput" on close.
      
      * tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        eventfs: Remember what dentries were created on dir open
        ring-buffer: Fix bytes info in per_cpu buffer stats
      5edc6bb3
    • Merge tag 'cxl-fixes-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl · 2ad78f8c
      Linus Torvalds authored
      Pull cxl fixes from Dan Williams:
       "A collection of regression fixes, bug fixes, and some small cleanups
        to the Compute Express Link code.
      
        The regressions arrived in the v6.5 dev cycle and missed the v6.6
        merge window due to my personal absences this cycle. The most
        important fixes are for scenarios where the CXL subsystem fails to
        parse valid region configurations established by platform firmware.
        This is important because agreement between OS and BIOS on the CXL
        configuration is fundamental to implementing "OS native" error
        handling, i.e. address translation and component failure
        identification.
      
         Another important fix addresses a driver load error when the BIOS lets
         the Linux PCI core handle AER events, but not CXL memory errors.
      
         The other fixes might have end-user impact, but for now are only known
        to trigger in our test/emulation environment.
      
        Summary:
      
         - Fix multiple scenarios where platform firmware defined regions fail
           to be assembled by the CXL core.
      
         - Fix a spurious driver-load failure on platforms that enable OS
           native AER, but not OS native CXL error handling.
      
         - Fix a regression detecting "poison" commands when "security"
           commands are also defined.
      
         - Fix a cxl_test regression with the move to centralize CXL port
           register enumeration in the CXL core.
      
         - Miscellaneous small fixes and cleanups"
      
      * tag 'cxl-fixes-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
        cxl/acpi: Annotate struct cxl_cxims_data with __counted_by
        cxl/port: Fix cxl_test register enumeration regression
        cxl/region: Refactor granularity select in cxl_port_setup_targets()
        cxl/region: Match auto-discovered region decoders by HPA range
        cxl/mbox: Fix CEL logic for poison and security commands
        cxl/pci: Replace host_bridge->native_aer with pcie_aer_is_native()
        PCI/AER: Export pcie_aer_is_native()
        cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers
      2ad78f8c
  3. Sep 24, 2023
  4. Sep 23, 2023
    • Merge tag 'kvm-riscv-fixes-6.6-1' of https://github.com/kvm-riscv/linux into HEAD · 5804c19b
      Paolo Bonzini authored
      KVM/riscv fixes for 6.6, take #1
      
      - Fix KVM_GET_REG_LIST API for ISA_EXT registers
      - Fix reading ISA_EXT register of a missing extension
      - Fix ISA_EXT register handling in get-reg-list test
      - Fix filtering of AIA registers in get-reg-list test
      5804c19b
    • KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX · 916e3e5f
      Tom Lendacky authored
      
      
      When the TSC_AUX MSR is virtualized, the TSC_AUX value is swap type "B"
      within the VMSA. This means that the guest value is loaded on VMRUN and
      the host value is restored from the host save area on #VMEXIT.
      
      Since the value is restored on #VMEXIT, the KVM user return MSR support
      for TSC_AUX can be replaced by populating the host save area with the
      current host value of TSC_AUX. And, since TSC_AUX is not changed by Linux
      post-boot, the host save area can be set once in svm_hardware_enable().
      This eliminates the two WRMSR instructions associated with the user return
      MSR support.
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <d381de38eb0ab6c9c93dda8503b72b72546053d7.1694811272.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      916e3e5f
    • KVM: SVM: Fix TSC_AUX virtualization setup · e0096d01
      Tom Lendacky authored
      
      
      The checks for virtualizing TSC_AUX occur during the vCPU reset processing
      path. However, at the time of initial vCPU reset processing, when the vCPU
      is first created, not all of the guest CPUID information has been set. In
      this case the RDTSCP and RDPID feature support for the guest is not in
      place and so TSC_AUX virtualization is not established.
      
      This continues for each vCPU created for the guest. On the first boot of
      an AP, vCPU reset processing is executed as a result of an APIC INIT
      event, this time with all of the guest CPUID information set, resulting
      in TSC_AUX virtualization being enabled, but only for the APs. The BSP
      always sees a TSC_AUX value of 0 which probably went unnoticed because,
      at least for Linux, the BSP TSC_AUX value is 0.
      
      Move the TSC_AUX virtualization enablement out of the init_vmcb() path and
      into the vcpu_after_set_cpuid() path to allow for proper initialization of
      the support after the guest CPUID information has been set.
      
      With the TSC_AUX virtualization support now in the vcpu_set_after_cpuid()
      path, the intercepts must be either cleared or set based on the guest
      CPUID input.
      
      Fixes: 296d5a17 ("KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts")
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <4137fbcb9008951ab5f0befa74a0399d2cce809a.1694811272.git.thomas.lendacky@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e0096d01
    • KVM: SVM: INTERCEPT_RDTSCP is never intercepted anyway · e8d93d5d
      Paolo Bonzini authored
      
      
      svm_recalc_instruction_intercepts() is always called at least once
      before the vCPU is started, so the setting or clearing of the RDTSCP
      intercept can be dropped from the TSC_AUX virtualization support.
      
      Extracted from a patch by Tom Lendacky.
      
      Cc: stable@vger.kernel.org
      Fixes: 296d5a17 ("KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e8d93d5d
    • KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously · 0df9dab8
      Sean Christopherson authored
      
      
      Stop zapping invalidated TDP MMU roots via work queue now that KVM
      preserves TDP MMU roots until they are explicitly invalidated.  Zapping
      roots asynchronously was effectively a workaround to avoid stalling a vCPU
      for an extended duration if a vCPU unloaded a root, which at the time
      happened whenever the guest toggled CR0.WP (a frequent operation for some
      guest kernels).
      
      While a clever hack, zapping roots via an unbound worker had subtle,
      unintended consequences on host scheduling, especially when zapping
      multiple roots, e.g. as part of a memslot deletion.  Because the work of zapping a
      root is no longer bound to the task that initiated the zap, things like
      the CPU affinity and priority of the original task get lost.  Losing the
      affinity and priority can be especially problematic if unbound workqueues
      aren't affined to a small number of CPUs, as zapping multiple roots can
      cause KVM to heavily utilize the majority of CPUs in the system, *beyond*
      the CPUs KVM is already using to run vCPUs.
      
      When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root
      zap can result in KVM occupying all logical CPUs for ~8ms, and result in
      high priority tasks not being scheduled in in a timely manner.  In v5.15,
      which doesn't preserve unloaded roots, the issues were even more noticeable
      as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.
      
      Consuming all CPUs for an extended duration can lead to significant jitter
      throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots
      is a semi-frequent operation as memslots are deleted and recreated with
      different host virtual addresses to react to host GPU drivers allocating
      and freeing GPU blobs.  On ChromeOS, the jitter manifests as audio blips
      during games due to the audio server's tasks not getting scheduled in
      promptly, despite the tasks having a high realtime priority.
      
      Deleting memslots isn't exactly a fast path and should be avoided when
      possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the
      memslot shenanigans, but KVM is squarely in the wrong.  Not to mention
      that removing the async zapping eliminates a non-trivial amount of
      complexity.
      
      Note, one of the subtle behaviors hidden behind the async zapping is that
      KVM would zap invalidated roots only once (ignoring partial zaps from
      things like mmu_notifier events).  Preserve this behavior by adding a flag
      to identify roots that are scheduled to be zapped versus roots that have
      already been zapped but not yet freed.
      
      Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can
      encounter invalid roots, as it's not at all obvious why zapping
      invalidated roots shouldn't simply zap all invalid roots.
      
      Reported-by: Pattara Teerapong <pteerapong@google.com>
      Cc: David Stevens <stevensd@google.com>
      Cc: Yiwei Zhang <zzyiwei@google.com>
      Cc: Paul Hsia <paulhsia@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20230916003916.2545000-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0df9dab8
    • KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe() · 441a5dfc
      Paolo Bonzini authored
      
      
      All callers except the MMU notifier want to process all address spaces.
      Remove the address space ID argument of for_each_tdp_mmu_root_yield_safe()
      and switch the MMU notifier to use __for_each_tdp_mmu_root_yield_safe().
      
      Extracted out of a patch by Sean Christopherson <seanjc@google.com>
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      441a5dfc
    • Merge tag 'hardening-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · d90b0276
      Linus Torvalds authored
      Pull hardening fixes from Kees Cook:
      
       - Fix UAPI stddef.h to avoid C++-ism (Alexey Dobriyan)
      
       - Fix harmless UAPI stddef.h header guard endif (Alexey Dobriyan)
      
      * tag 'hardening-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        uapi: stddef.h: Fix __DECLARE_FLEX_ARRAY for C++
        uapi: stddef.h: Fix header guard location
      d90b0276
    • Merge tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 3abc79dc
      Linus Torvalds authored
      Pull xfs fixes from Chandan Babu:
      
       - Fix an integer overflow bug when processing an fsmap call
      
       - Fix crash due to CPU hot remove event racing with filesystem mount
         operation
      
       - During read-only mount, XFS does not allow the contents of the log to
         be recovered when there are one or more unrecognized rocompat features
         in the primary superblock, since the log might have intent items
         which the kernel does not know how to process
      
       - During recovery of log intent items, XFS now reserves log space
         sufficient for one cycle of a permanent transaction to execute.
         Otherwise, this could lead to livelocks due to non-availability of
         log space
      
       - On an fs which has an ondisk unlinked inode list, trying to delete a
         file or allocating an O_TMPFILE file can cause the fs to shut down
         if the first inode in the ondisk inode list is not present in the
         inode cache. The bug is solved by explicitly loading the first inode
         in the ondisk unlinked inode list into the inode cache if it is not
         already cached
      
         A similar problem arises when the uncached inode is present in the
         middle of the ondisk unlinked inode list. This second bug is
         triggered when executing operations like quotacheck and bulkstat. In
         this case, XFS now reads in the entire ondisk unlinked inode list
      
       - Enable LARP mode only on recent v5 filesystems
      
       - Fix an out-of-bounds memory access in scrub
      
       - Fix a performance bug when locating the tail of the log during
         mounting a filesystem
      
      * tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail
        xfs: only call xchk_stats_merge after validating scrub inputs
        xfs: require a relatively recent V5 filesystem for LARP mode
        xfs: make inode unlinked bucket recovery work with quotacheck
        xfs: load uncached unlinked inodes into memory on demand
        xfs: reserve less log space when recovering log intent items
        xfs: fix log recovery when unknown rocompat bits are set
        xfs: reload entire unlinked bucket lists
        xfs: allow inode inactivation during a ro mount log recovery
        xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
        xfs: remove CPU hotplug infrastructure
        xfs: remove the all-mounts list
        xfs: use per-mount cpumask to track nonempty percpu inodegc lists
        xfs: fix an agbno overflow in __xfs_getfsmap_datadev
        xfs: fix per-cpu CIL structure aggregation racing with dying cpus
        xfs: fix select in config XFS_ONLINE_SCRUB_STATS
      3abc79dc
    • cxl/acpi: Annotate struct cxl_cxims_data with __counted_by · c66650d2
      Kees Cook authored
      Prepare for the coming implementation by GCC and Clang of the __counted_by
      attribute. Flexible array members annotated with __counted_by can have
      their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
      (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
      functions).
      
      As found with Coccinelle[1], add __counted_by for struct cxl_cxims_data.
      Additionally, since the element count member must be set before accessing
      the annotated flexible array member, move its initialization earlier.
      
      [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
      
      
      
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Alison Schofield <alison.schofield@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: linux-cxl@vger.kernel.org
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarVishal Verma <vishal.l.verma@intel.com>
      Reviewed-by: default avatarDave Jiang <dave.jiang@intel.com>
      Link: https://lore.kernel.org/r/20230922175319.work.096-kees@kernel.org
      
      
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      c66650d2
    • Dan Williams's avatar
      cxl/port: Fix cxl_test register enumeration regression · a76b6251
      Dan Williams authored
      
      
      The cxl_test unit test environment models a CXL topology for
      sysfs/user-ABI regression testing. It uses interface mocking via the
      "--wrap=" linker option to redirect cxl_core routines that parse
      hardware registers with versions that just publish objects, like
      devm_cxl_enumerate_decoders().
      
      Starting with:
      
      Commit 19ab69a6 ("cxl/port: Store the port's Component Register mappings in struct cxl_port")
      
      ...port register enumeration is moved into devm_cxl_add_port(). This
conflicts with the "cxl_test avoids emulating registers" stance, so
      either the port code needs to be refactored (too violent), or modified
      so that register enumeration is skipped on "fake" cxl_test ports
      (annoying, but straightforward).
      
This conflict has happened previously, and the "check for platform
device" workaround to avoid intrusive refactoring was deployed in those
scenarios. In general, refactoring should only benefit production code;
test code needs to remain minimally intrusive to the greatest extent
possible.
      
      This was missed previously because it may sometimes just cause warning
      messages to be emitted, but it can also cause test failures. The
      backport to -stable is only nice to have for clean cxl_test runs.
      
      Fixes: 19ab69a6 ("cxl/port: Store the port's Component Register mappings in struct cxl_port")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarAlison Schofield <alison.schofield@intel.com>
      Reviewed-by: default avatarDave Jiang <dave.jiang@intel.com>
      Tested-by: default avatarDave Jiang <dave.jiang@intel.com>
      Link: https://lore.kernel.org/r/169476525052.1013896.6235102957693675187.stgit@dwillia2-xfh.jf.intel.com
      
      
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      a76b6251
    • Steven Rostedt (Google)'s avatar
      eventfs: Remember what dentries were created on dir open · ef36b4f9
      Steven Rostedt (Google) authored
      Using the following code with libtracefs:
      
      	int dfd;
      
      	// create the directory events/kprobes/kp1
      	tracefs_kprobe_raw(NULL, "kp1", "schedule_timeout", "time=$arg1");
      
      	// Open the kprobes directory
      	dfd = tracefs_instance_file_open(NULL, "events/kprobes", O_RDONLY);
      
      	// Do a lookup of the kprobes/kp1 directory (by looking at enable)
      	tracefs_file_exists(NULL, "events/kprobes/kp1/enable");
      
      	// Now create a new entry in the kprobes directory
      	tracefs_kprobe_raw(NULL, "kp2", "schedule_hrtimeout", "expires=$arg1");
      
      	// Do another lookup to create the dentries
	tracefs_file_exists(NULL, "events/kprobes/kp2/enable");
      
      	// Close the directory
      	close(dfd);
      
In the sequence above, the first open (dfd) will call
dcache_dir_open_wrapper(), which creates the dentries and ups their ref
counts.
      
      Now the creation of "kp2" will add another dentry within the kprobes
      directory.
      
Upon the close of dfd, eventfs_release() will now do a dput for all the
entries in kprobes. But this is where the problem lies. The open only
upped the ref count of kp1's dentry, not kp2's. The close then
decrements both kp1 and kp2, which leaves kp2 with a negative count.
      
Doing a "trace-cmd reset", which deletes all the kprobes, then causes
the kernel to crash due to the messed-up accounting of the ref counts.
      
      To solve this, save all the dentries that are opened in the
      dcache_dir_open_wrapper() into an array, and use this array to know what
      dentries to do a dput on in eventfs_release().
      
      Since the dcache_dir_open_wrapper() calls dcache_dir_open() which uses the
      file->private_data, we need to also add a wrapper around dcache_readdir()
      that uses the cursor assigned to the file->private_data. This is because
      the dentries need to also be saved in the file->private_data. To do this
      create the structure:
      
        struct dentry_list {
      	void		*cursor;
      	struct dentry	**dentries;
        };
      
      Which will hold both the cursor and the dentries. Some shuffling around is
      needed to make sure that dcache_dir_open() and dcache_readdir() only see
      the cursor.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230919211804.230edf1e@gandalf.local.home/
      Link: https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d4fa@gandalf.local.home
      
      
      
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Ajay Kaher <akaher@vmware.com>
      Fixes: 63940449 ("eventfs: Implement eventfs lookup, read, open functions")
      Reported-by: default avatar"Masami Hiramatsu (Google)" <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      ef36b4f9
    • Zheng Yejian's avatar
      ring-buffer: Fix bytes info in per_cpu buffer stats · 45d99ea4
      Zheng Yejian authored
      The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of
      bytes in cpu buffer that have not been consumed. However, currently
      after consuming data by reading file 'trace_pipe', the 'bytes' info
      was not changed as expected.
      
        # cat per_cpu/cpu0/stats
        entries: 0
        overrun: 0
        commit overrun: 0
        bytes: 568             <--- 'bytes' is problematical !!!
        oldest event ts:  8651.371479
        now ts:  8653.912224
        dropped events: 0
        read events: 8
      
The root cause is incorrect accounting of cpu_buffer->read_bytes. To fix it:
  1. When accounting 'read_bytes', count the consumed event in
     rb_advance_reader();
  2. When accounting 'entries_bytes', exclude the discarded padding event
     that is smaller than the minimum size, because it is invisible to the
     reader. Then use rb_page_commit() instead of BUF_PAGE_SIZE where
     accounting for page-based reads/removals/overruns.

Also correct the comments of ring_buffer_bytes_cpu() in this patch.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengyejian1@huawei.com
      
      
      
      Cc: stable@vger.kernel.org
      Fixes: c64e148a ("trace: Add ring buffer stats to measure rate of events")
      Signed-off-by: default avatarZheng Yejian <zhengyejian1@huawei.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      45d99ea4
    • Linus Torvalds's avatar
      Merge tag 'thermal-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 8018e02a
      Linus Torvalds authored
      Pull thermal control fix from Rafael Wysocki:
       "Unbreak the trip point update sysfs interface that has been broken
        since the 6.3 cycle (Rafael Wysocki)"
      
      * tag 'thermal-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        thermal: sysfs: Fix trip_point_hyst_store()
      8018e02a
    • Linus Torvalds's avatar
      Merge tag 'acpi-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · b184c040
      Linus Torvalds authored
      Pull ACPI fixes from Rafael Wysocki:
       "These fix a general ACPI processor driver regression and an ia64 build
        issue, both introduced recently.
      
        Specifics:
      
         - Fix recently introduced uninitialized memory access issue in the
           ACPI processor driver (Michal Wilczynski)
      
         - Fix ia64 build inadvertently broken by recent ACPI processor driver
           changes, which is prudent to do for 6.6 even though ia64 support is
           slated for removal in 6.7 (Ard Biesheuvel)"
      
      * tag 'acpi-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI: processor: Fix uninitialized access of buf in acpi_set_pdc_bits()
        acpi: Provide ia64 dummy implementation of acpi_proc_quirk_mwait_check()
      b184c040
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 36fcf381
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
       "Small crop of relatively boring arm64 fixes for -rc3.
      
        That's not to say we don't have any juicy bugs, however, it's just
        that fixes for those are likely to come via -mm and -tip for a hugetlb
        and an atomics issue respectively. I get left with the
        documentation...
      
         - Fix detection of "ClearBHB" and "Hinted Conditional Branch" features
      
         - Fix broken wildcarding for Arm PMU MAINTAINERS entry
      
         - Add missing documentation for userspace-visible ID register fields"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: Document missing userspace visible fields in ID_AA64ISAR2_EL1
        arm64/hbc: Document HWCAP2_HBC
        arm64/sme: Include ID_AA64PFR1_EL1.SME in cpu-feature-registers.rst
        arm64: cpufeature: Fix CLRBHB and BC detection
        MAINTAINERS: Use wildcard pattern for ARM PMU headers
      36fcf381
    • Linus Torvalds's avatar
      Merge tag 'x86_urgent_for_v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · b61ec8d0
      Linus Torvalds authored
      Pull x86 rethunk fixes from Borislav Petkov:
       "Fix the patching ordering between static calls and return thunks"
      
      * tag 'x86_urgent_for_v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86,static_call: Fix static-call vs return-thunk
        x86/alternatives: Remove faulty optimization
      b61ec8d0
    • Linus Torvalds's avatar
      Merge tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · e583bffe
      Linus Torvalds authored
      Pull misc x86 fixes from Ingo Molnar:
      
       - Fix a kexec bug
      
 - Fix a UML build bug
      
       - Fix a handful of SRSO related bugs
      
       - Fix a shadow stacks handling bug & robustify related code
      
      * tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/shstk: Add warning for shadow stack double unmap
        x86/shstk: Remove useless clone error handling
        x86/shstk: Handle vfork clone failure correctly
        x86/srso: Fix SBPB enablement for spec_rstack_overflow=off
        x86/srso: Don't probe microcode in a guest
        x86/srso: Set CPUID feature bits independently of bug or mitigation status
        x86/srso: Fix srso_show_state() side effect
        x86/asm: Fix build of UML with KASAN
        x86/mm, kexec, ima: Use memblock_free_late() from ima_free_kexec_buffer()
      e583bffe
    • Linus Torvalds's avatar
      Merge tag 'sched-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5b47b576
      Linus Torvalds authored
      Pull scheduler fix from Ingo Molnar:
       "Fix a PF_IDLE initialization bug that generated warnings on tiny-RCU"
      
      * tag 'sched-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        kernel/sched: Modify initial boot task idle setup
      5b47b576
    • Linus Torvalds's avatar
      Merge tag 'locking-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 725e2d7e
      Linus Torvalds authored
      Pull locking fixes from Ingo Molnar:
       "Fix a include/linux/atomic/atomic-arch-fallback.h breakage that
        generated incorrect code, and fix a lockdep reporting race that may
        result in lockups"
      
      * tag 'locking-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
        locking/atomic: scripts: fix fallback ifdeffery
      725e2d7e
    • Peter Zijlstra's avatar
      x86,static_call: Fix static-call vs return-thunk · aee9d30b
      Peter Zijlstra authored
      
      
      Commit
      
        7825451f ("static_call: Add call depth tracking support")
      
      failed to realize the problem fixed there is not specific to call depth
      tracking but applies to all return-thunk uses.
      
      Move the fix to the appropriate place and condition.
      
      Fixes: ee88d363 ("x86,static_call: Use alternative RET encoding")
      Reported-by: default avatarDavid Kaplan <David.Kaplan@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarBorislav Petkov (AMD) <bp@alien8.de>
      Reviewed-by: default avatarIngo Molnar <mingo@kernel.org>
      Tested-by: default avatarBorislav Petkov (AMD) <bp@alien8.de>
      Cc: <stable@kernel.org>
      aee9d30b