  1. Apr 20, 2021
      KVM: SVM: Drop vcpu_svm.vmcb_pa · d1788191
      Sean Christopherson authored
      
      
      Remove vmcb_pa from vcpu_svm and simply read current_vmcb->pa directly in
      the one path where it is consumed.  Unlike svm->vmcb, use of the current
      vmcb's address is very limited, as evidenced by the fact that its use
      can be trimmed to a single dereference.
      
Opportunistically add a comment about using vmcb01 for VMLOAD/VMSAVE, as
at first glance using vmcb01 instead of vmcb_pa looks wrong.
      
      No functional change intended.
      
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210406171811.4043363-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: SVM: Don't set current_vmcb->cpu when switching vmcb · 17e5e964
      Sean Christopherson authored
      Do not update the new vmcb's last-run cpu when switching to a different
      vmcb.  If the vCPU is migrated between its last run and a vmcb switch,
      e.g. for nested VM-Exit, then setting the cpu without marking the vmcb
      dirty will lead to KVM running the vCPU on a different physical cpu with
      stale clean bit settings.
      
                                vcpu->cpu    current_vmcb->cpu    hardware
        pre_svm_run()           cpu0         cpu0                 cpu0,clean
        kvm_arch_vcpu_load()    cpu1         cpu0                 cpu0,clean
        svm_switch_vmcb()       cpu1         cpu1                 cpu0,clean
        pre_svm_run()           cpu1         cpu1                 kaboom
      
      Simply delete the offending code; unlike VMX, which needs to update the
      cpu at switch time due to the need to do VMPTRLD, SVM only cares about
      which cpu last ran the vCPU.
      
Fixes: af18fa77 ("KVM: nSVM: Track the physical cpu of the vmcb vmrun through the vmcb")
      Cc: Cathy Avery <cavery@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210406171811.4043363-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: SVM: Make sure GHCB is mapped before updating · a3ba26ec
      Tom Lendacky authored
      Access to the GHCB is mainly in the VMGEXIT path and it is known that the
      GHCB will be mapped. But there are two paths where it is possible the GHCB
      might not be mapped.
      
      The sev_vcpu_deliver_sipi_vector() routine will update the GHCB to inform
      the caller of the AP Reset Hold NAE event that a SIPI has been delivered.
      However, if a SIPI is performed without a corresponding AP Reset Hold,
      then the GHCB might not be mapped (depending on the previous VMEXIT),
      which will result in a NULL pointer dereference.
      
      The svm_complete_emulated_msr() routine will update the GHCB to inform
      the caller of a RDMSR/WRMSR operation about any errors. While it is likely
that the GHCB will be mapped in this situation, add a safeguard
in this path to be certain a NULL pointer dereference is not encountered.
      
      Fixes: f1c6366e ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Fixes: 647daca2 ("KVM: SVM: Add support for booting APs in an SEV-ES guest")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Message-Id: <a5d3ebb600a91170fc88599d5a575452b3e31036.1617979121.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: X86: Do not yield to self · a1fa4cbd
      Wanpeng Li authored
      
      
If the target is the vCPU itself there is no need to yield, and
skipping the self-target also prevents a malicious guest from
exploiting this path.
      
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1617941911-5338-3-git-send-email-wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: X86: Count attempted/successful directed yield · 4a7132ef
      Wanpeng Li authored
      
      
To analyze some performance issues with lock contention and scheduling,
it is useful to know when directed yields succeed and when they fail.
      
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1617941911-5338-2-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      x86/kvm: Don't bother __pv_cpu_mask when !CONFIG_SMP · 2b519b57
      Wanpeng Li authored
      
      
Enabling PV TLB shootdown when !CONFIG_SMP makes no sense, so move it
inside CONFIG_SMP. In addition, avoid defining and allocating
__pv_cpu_mask when !CONFIG_SMP, and get rid of the 'alloc' variable
in kvm_alloc_cpumask.
      
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1617941911-5338-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns · 4c6654bd
      Ben Gardon authored
      
      
      To avoid saddling a vCPU thread with the work of tearing down an entire
      paging structure, take a reference on each root before they become
      obsolete, so that the thread initiating the fast invalidation can tear
      down the paging structure and (most likely) release the last reference.
      As a bonus, this teardown can happen under the MMU lock in read mode so
      as not to block the progress of vCPU threads.
      
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-14-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86/mmu: Fast invalidation for TDP MMU · b7cccd39
      Ben Gardon authored
      
      
      Provide a real mechanism for fast invalidation by marking roots as
      invalid so that their reference count will quickly fall to zero
      and they will be torn down.
      
One negative side effect of this approach is that a vCPU thread will
      likely drop the last reference to a root and be saddled with the work of
      tearing down an entire paging structure. This issue will be resolved in
      a later commit.
      
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-13-bgardon@google.com>
[Move the loop to tdp_mmu.c, otherwise compilation fails on 32-bit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. Apr 19, 2021
  3. Apr 17, 2021
      KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot · 8931a454
      Sean Christopherson authored
      
      
      Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been
      detected in the memslots, i.e. don't take the lock for notifications that
      don't affect the guest.
      
      For small VMs, spurious locking is a minor annoyance.  And for "volatile"
      setups where the majority of notifications _are_ relevant, this barely
      qualifies as an optimization.
      
      But, for large VMs (hundreds of threads) with static setups, e.g. no
      page migration, no swapping, etc..., the vast majority of MMU notifier
      callbacks will be unrelated to the guest, e.g. will often be in response
      to the userspace VMM adjusting its own virtual address space.  In such
      large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from
      handling page faults.  In some scenarios it can even be "fatal" in the
      sense that it causes unacceptable brownouts, e.g. when rebuilding huge
      pages after live migration, a significant percentage of vCPUs will be
      attempting to handle page faults.
      
x86's TDP MMU implementation is especially susceptible to spurious
locking due to its taking mmu_lock for read when handling page faults.
      Because rwlock is fair, a single writer will stall future readers, while
      the writer is itself stalled waiting for in-progress readers to complete.
      This is exacerbated by the MMU notifiers often firing multiple times in
      quick succession, e.g. moving a page will (always?) invoke three separate
notifiers: .invalidate_range_start(), .invalidate_range_end(), and
      .change_pte().  Unnecessarily taking mmu_lock each time means even a
      single spurious sequence can be problematic.
      
      Note, this optimizes only the unpaired callbacks.  Optimizing the
      .invalidate_range_{start,end}() pairs is more complex and will be done in
      a future patch.
      
Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: Move MMU notifier's mmu_lock acquisition into common helper · f922bd9b
      Sean Christopherson authored
      
      
      Acquire and release mmu_lock in the __kvm_handle_hva_range() helper
      instead of requiring the caller to do the same.  This paves the way for
      future patches to take mmu_lock if and only if an overlapping memslot is
      found, without also having to introduce the on_lock() shenanigans used
      to manipulate the notifier count and sequence.
      
      No functional change intended.
      
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: Kill off the old hva-based MMU notifier callbacks · b4c5936c
      Sean Christopherson authored
      
      
      Yank out the hva-based MMU notifier APIs now that all architectures that
      use the notifiers have moved to the gfn-based APIs.
      
      No functional change intended.
      
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: PPC: Convert to the gfn-based MMU notifier callbacks · b1c5356e
      Sean Christopherson authored
      
      
Move PPC to the gfn-based MMU notifier APIs, and update all 15 bajillion
      PPC-internal hooks to work with gfns instead of hvas.
      
      No meaningful functional change intended, though the exact order of
      operations is slightly different since the memslot lookups occur before
      calling into arch code.
      
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks · d923ff25
      Sean Christopherson authored
      
      
      Move MIPS to the gfn-based MMU notifier APIs, which do the hva->gfn
      lookup in common code, and whose code is nearly identical to MIPS'
      lookup.
      
      No meaningful functional change intended, though the exact order of
      operations is slightly different since the memslot lookups occur before
      calling into arch code.
      
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: arm64: Convert to the gfn-based MMU notifier callbacks · cd4c7183
      Sean Christopherson authored
      
      
Move arm64 to the gfn-based MMU notifier APIs, which do the hva->gfn
      lookup in common code.
      
      No meaningful functional change intended, though the exact order of
      operations is slightly different since the memslot lookups occur before
      calling into arch code.
      
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: Move x86's MMU notifier memslot walkers to generic code · 3039bcc7
      Sean Christopherson authored
      
      
      Move the hva->gfn lookup for MMU notifiers into common code.  Every arch
      does a similar lookup, and some arch code is all but identical across
      multiple architectures.
      
      In addition to consolidating code, this will allow introducing
      optimizations that will benefit all architectures without incurring
      multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
      relevant range exists in the memslots.
      
      The use of __always_inline to avoid indirect call retpolines, as done by
      x86, may also benefit other architectures.
      
      Consolidating the lookups also fixes a wart in x86, where the legacy MMU
      and TDP MMU each do their own memslot walks.
      
      Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host addresses, will need to touch far less
arch-specific code.
      
      MIPS, PPC, and arm64 will be converted one at a time in future patches.
      
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: Assert that notifier count is elevated in .change_pte() · c13fda23
      Sean Christopherson authored
      In KVM's .change_pte() notification callback, replace the notifier
      sequence bump with a WARN_ON assertion that the notifier count is
      elevated.  An elevated count provides stricter protections than bumping
the sequence, and the sequence is guaranteed to be bumped before the
      count hits zero.
      
      When .change_pte() was added by commit 828502d3 ("ksm: add
      mmu_notifier set_pte_at_notify()"), bumping the sequence was necessary
      as .change_pte() would be invoked without any surrounding notifications.
      
However, since commit 6bdb913f ("mm: wrap calls to set_pte_at_notify
with invalidate_range_start and invalidate_range_end"), all calls to
      .change_pte() are guaranteed to be surrounded by start() and end(), and
      so are guaranteed to run with an elevated notifier count.
      
      Note, wrapping .change_pte() with .invalidate_range_{start,end}() is a
      bug of sorts, as invalidating the secondary MMU's (KVM's) PTE defeats
      the purpose of .change_pte().  Every arch's kvm_set_spte_hva() assumes
      .change_pte() is called when the relevant SPTE is present in KVM's MMU,
      as the original goal was to accelerate Kernel Samepage Merging (KSM) by
      updating KVM's SPTEs without requiring a VM-Exit (due to invalidating
      the SPTE).  I.e. it means that .change_pte() is effectively dead code
      on _all_ architectures.
      
      x86 and MIPS are clearcut nops if the old SPTE is not-present, and that
      is guaranteed due to the prior invalidation.  PPC simply unmaps the SPTE,
      which again should be a nop due to the invalidation.  arm64 is a bit
      murky, but it's also likely a nop because kvm_pgtable_stage2_map() is
      called without a cache pointer, which means it will map an entry if and
      only if an existing PTE was found.
      
      For now, take advantage of the bug to simplify future consolidation of
KVM's MMU notifier code.  Doing so will not greatly complicate fixing
      .change_pte(), assuming it's even worth fixing.  .change_pte() has been
      broken for 8+ years and no one has complained.  Even if there are
      KSM+KVM users that care deeply about its performance, the benefits of
      avoiding VM-Exits via .change_pte() need to be reevaluated to justify
      the added complexity and testing burden.  Ripping out .change_pte()
      entirely would be a lot easier.
      
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: MIPS: defer flush to generic MMU notifier code · fe9a5b05
      Paolo Bonzini authored
      
      
      Return 1 from kvm_unmap_hva_range and kvm_set_spte_hva if a flush is
      needed, so that the generic code can coalesce the flushes.
      
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: MIPS: let generic code call prepare_flush_shadow · 566a0bee
      Paolo Bonzini authored
      
      
      Since all calls to kvm_flush_remote_tlbs must be preceded by
      kvm_mips_callbacks->prepare_flush_shadow, repurpose
      kvm_arch_flush_remote_tlb to invoke it.  This makes it possible
      to use the TLB flushing mechanism provided by the generic MMU
      notifier code.
      
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: MIPS: rework flush_shadow_* callbacks into one that prepares the flush · 5194552f
      Paolo Bonzini authored
      
      
      Both trap-and-emulate and VZ have a single implementation that covers
      both .flush_shadow_all and .flush_shadow_memslot, and both of them end
      with a call to kvm_flush_remote_tlbs.
      
      Unify the callbacks into one and extract the call to kvm_flush_remote_tlbs.
The next patches will pull it further out of the architecture-specific
MMU notifier functions kvm_unmap_hva_range and kvm_set_spte_hva.
      
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: constify kvm_arch_flush_remote_tlbs_memslot · 6c9dd6d2
      Paolo Bonzini authored
      
      
      memslots are stored in RCU and there should be no need to
      change them.
      
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: Explicitly use GFP_KERNEL_ACCOUNT for 'struct kvm_vcpu' allocations · 85f47930
      Sean Christopherson authored
      
      
Use GFP_KERNEL_ACCOUNT when allocating vCPUs to make it more obvious that
the allocations are accounted, to make it easier to audit KVM's
      allocations in the future, and to be consistent with other cache usage in
      KVM.
      
      When using SLAB/SLUB, this is a nop as the cache itself is created with
      SLAB_ACCOUNT.
      
When using SLOB, there are caveats within caveats.  SLOB doesn't honor
SLAB_ACCOUNT, so passing GFP_KERNEL_ACCOUNT will result in vCPU
allocations now being accounted.  But, even that depends on internal
SLOB details as SLOB will only go to the page allocator when its cache is
depleted.  That just happens to be extremely likely for vCPUs because the
size of kvm_vcpu is larger than a page for almost all combinations of
architecture and page size.  Whether or not the SLOB behavior is by
design is unknown; it's just as likely that no SLOB users care about
accounting and so no one has bothered to implement support in SLOB.
Regardless, accounting vCPU allocations will not break SLOB+KVM+cgroup
users, if any exist.
      
Reviewed-by: Wanpeng Li <kernellwp@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210406190740.4055679-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: MMU: protect TDP MMU pages only down to required level · dbb6964e
      Paolo Bonzini authored
      When using manual protection of dirty pages, it is not necessary
      to protect nested page tables down to the 4K level; instead KVM
      can protect only hugepages in order to split them lazily, and
      delay write protection at 4K-granularity until KVM_CLEAR_DIRTY_LOG.
      This was overlooked in the TDP MMU, so do it there as well.
      
Fixes: a6a0b05d ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
      Cc: Ben Gardon <bgardon@google.com>
Reviewed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: s390x: implement KVM_CAP_SET_GUEST_DEBUG2 · a43b80b7
      Maxim Levitsky authored
      
      
Define KVM_GUESTDBG_VALID_MASK and use it to implement this capability.
      Compile tested only.
      
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401135451.1004564-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: aarch64: implement KVM_CAP_SET_GUEST_DEBUG2 · fa18aca9
      Maxim Levitsky authored
      
      
      Move KVM_GUESTDBG_VALID_MASK to kvm_host.h
      and use it to return the value of this capability.
      Compile tested only.
      
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401135451.1004564-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86: implement KVM_CAP_SET_GUEST_DEBUG2 · 7e582ccb
      Maxim Levitsky authored
      
      
      Store the supported bits into KVM_GUESTDBG_VALID_MASK
      macro, similar to how arm does this.
      
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401135451.1004564-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: introduce KVM_CAP_SET_GUEST_DEBUG2 · 8b13c364
      Paolo Bonzini authored
      
      
      This capability will allow the user to know which KVM_GUESTDBG_* bits
      are supported.
      
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401135451.1004564-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86: pending exceptions must not be blocked by an injected event · 4020da3b
      Maxim Levitsky authored
      
      
Injected interrupts/NMIs should not block a pending exception, but
rather either be lost if the nested hypervisor doesn't intercept the
pending exception (as on stock x86), or be delivered in the
exitintinfo/IDT_VECTORING_INFO field as part of a VMexit that
corresponds to the pending exception.

The only reason for an exception to be blocked is when a nested run
is pending (which can't really happen currently, but is still worth
checking for).
      
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401143817.1030695-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: selftests: remove redundant semi-colon · b9c36fde
      Yang Yingliang authored
      
      
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Message-Id: <20210401142514.1688199-1-yangyingliang@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>