Skip to content
  1. Mar 21, 2022
  2. Mar 19, 2022
    • Paolo Bonzini's avatar
      Merge tag 'kvmarm-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD · 714797c9
      Paolo Bonzini authored
      KVM/arm64 updates for 5.18
      
      - Proper emulation of the OSLock feature of the debug architecture
      
      - Scalibility improvements for the MMU lock when dirty logging is on
      
      - New VMID allocator, which will eventually help with SVA in VMs
      
      - Better support for PMUs in heterogenous systems
      
      - PSCI 1.1 support, enabling support for SYSTEM_RESET2
      
      - Implement CONFIG_DEBUG_LIST at EL2
      
      - Make CONFIG_ARM64_ERRATUM_2077057 default y
      
      - Reduce the overhead of VM exit when no interrupt is pending
      
      - Remove traces of 32bit ARM host support from the documentation
      
      - Updated vgic selftests
      
      - Various cleanups, doc updates and spelling fixes
      714797c9
  3. Mar 18, 2022
  4. Mar 16, 2022
  5. Mar 14, 2022
  6. Mar 11, 2022
  7. Mar 10, 2022
  8. Mar 09, 2022
  9. Mar 08, 2022
    • Suravee Suthikulpanit's avatar
      KVM: SVM: Allow AVIC support on system w/ physical APIC ID > 255 · 4a204f78
      Suravee Suthikulpanit authored
      
      
      Expand KVM's mask for the AVIC host physical ID to the full 12 bits defined
      by the architecture.  The number of bits consumed by hardware is model
      specific, e.g. early CPUs ignored bits 11:8, but there is no way for KVM
      to enumerate the "true" size.  So, KVM must allow using all bits, else it
      risks rejecting completely legal x2APIC IDs on newer CPUs.
      
      This means KVM relies on hardware to not assign x2APIC IDs that exceed the
      "true" width of the field, but presumably hardware is smart enough to tie
      the width to the max x2APIC ID.  KVM also relies on hardware to support at
      least 8 bits, as the legacy xAPIC ID is writable by software.  But, those
      assumptions are unavoidable due to the lack of any way to enumerate the
      "true" width.
      
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Fixes: 44a95dae ("KVM: x86: Detect and Initialize AVIC support")
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Message-Id: <20220211000851.185799-1-suravee.suthikulpanit@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4a204f78
    • Sean Christopherson's avatar
      KVM: selftests: Add test to populate a VM with the max possible guest mem · b58c55d5
      Sean Christopherson authored
      
      
      Add a selftest that enables populating a VM with the maximum amount of
      guest memory allowed by the underlying architecture.  Abuse KVM's
      memslots by mapping a single host memory region into multiple memslots so
      that the selftest doesn't require a system with terabytes of RAM.
      
      Default to 512gb of guest memory, which isn't all that interesting, but
      should work on all MMUs and doesn't take an exorbitant amount of memory
      or time.  E.g. testing with ~64tb of guest memory takes the better part
      of an hour, and requires 200gb of memory for KVM's page tables when using
      4kb pages.
      
      To inflicit maximum abuse on KVM' MMU, default to 4kb pages (or whatever
      the not-hugepage size is) in the backing store (memfd).  Use memfd for
      the host backing store to ensure that hugepages are guaranteed when
      requested, and to give the user explicit control of the size of hugepage
      being tested.
      
      By default, spin up as many vCPUs as there are available to the selftest,
      and distribute the work of dirtying each 4kb chunk of memory across all
      vCPUs.  Dirtying guest memory forces KVM to populate its page tables, and
      also forces KVM to write back accessed/dirty information to struct page
      when the guest memory is freed.
      
      On x86, perform two passes with a MMU context reset between each pass to
      coerce KVM into dropping all references to the MMU root, e.g. to emulate
      a vCPU dropping the last reference.  Perform both passes and all
      rendezvous on all architectures in the hope that arm64 and s390x can gain
      similar shenanigans in the future.
      
      Measure and report the duration of each operation, which is helpful not
      only to verify the test is working as intended, but also to easily
      evaluate the performance differences different page sizes.
      
      Provide command line options to limit the amount of guest memory, set the
      size of each slot (i.e. of the host memory region), set the number of
      vCPUs, and to enable usage of hugepages.
      
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-29-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b58c55d5
    • Sean Christopherson's avatar
      KVM: selftests: Define cpu_relax() helpers for s390 and x86 · 17ae5ebc
      Sean Christopherson authored
      
      
      Add cpu_relax() for s390 and x86 for use in arch-agnostic tests.  arm64
      already defines its own version.
      
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-28-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      17ae5ebc
    • Sean Christopherson's avatar
      KVM: selftests: Split out helper to allocate guest mem via memfd · a4187c9b
      Sean Christopherson authored
      
      
      Extract the code for allocating guest memory via memfd out of
      vm_userspace_mem_region_add() and into a new helper, kvm_memfd_alloc().
      A future selftest to populate a guest with the maximum amount of guest
      memory will abuse KVM's memslots to alias guest memory regions to a
      single memfd-backed host region, i.e. needs to back a guest with memfd
      memory without a 1:1 association between a memslot and a memfd instance.
      
      No functional change intended.
      
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-27-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a4187c9b
    • Sean Christopherson's avatar
      KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils · 3d7d6043
      Sean Christopherson authored
      
      
      Move set_memory_region_test's KVM_SET_USER_MEMORY_REGION helper to KVM's
      utils so that it can be used by other tests.  Provide a raw version as
      well as an assert-success version to reduce the amount of boilerplate
      code need for basic usage.
      
      No functional change intended.
      
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-26-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3d7d6043
    • Sean Christopherson's avatar
      KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE · 396fd74d
      Sean Christopherson authored
      
      
      Disallow calling tdp_mmu_set_spte_atomic() with a REMOVED "old" SPTE.
      This solves a conundrum introduced by commit 3255530a ("KVM: x86/mmu:
      Automatically update iter->old_spte if cmpxchg fails"); if the helper
      doesn't update old_spte in the REMOVED case, then theoretically the
      caller could get stuck in an infinite loop as it will fail indefinitely
      on the REMOVED SPTE.  E.g. until recently, clear_dirty_gfn_range() didn't
      check for a present SPTE and would have spun until getting rescheduled.
      
      In practice, only the page fault path should "create" a new SPTE, all
      other paths should only operate on existing, a.k.a. shadow present,
      SPTEs.  Now that the page fault path pre-checks for a REMOVED SPTE in all
      cases, require all other paths to indirectly pre-check by verifying the
      target SPTE is a shadow-present SPTE.
      
      Note, this does not guarantee the actual SPTE isn't REMOVED, nor is that
      scenario disallowed.  The invariant is only that the caller mustn't
      invoke tdp_mmu_set_spte_atomic() if the SPTE was REMOVED when last
      observed by the caller.
      
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-25-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      396fd74d
    • Sean Christopherson's avatar
      KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE · 58298b06
      Sean Christopherson authored
      
      
      Explicitly check for a REMOVED leaf SPTE prior to attempting to map
      the final SPTE when handling a TDP MMU fault.  Functionally, this is a
      nop as tdp_mmu_set_spte_atomic() will eventually detect the frozen SPTE.
      Pre-checking for a REMOVED SPTE is a minor optmization, but the real goal
      is to allow tdp_mmu_set_spte_atomic() to have an invariant that the "old"
      SPTE is never a REMOVED SPTE.
      
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarBen Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-24-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      58298b06
    • Paolo Bonzini's avatar
      KVM: x86/mmu: Zap defunct roots via asynchronous worker · efd995da
      Paolo Bonzini authored
      
      
      Zap defunct roots, a.k.a. roots that have been invalidated after their
      last reference was initially dropped, asynchronously via the existing work
      queue instead of forcing the work upon the unfortunate task that happened
      to drop the last reference.
      
      If a vCPU task drops the last reference, the vCPU is effectively blocked
      by the host for the entire duration of the zap.  If the root being zapped
      happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
      being active, the zap can take several hundred seconds.  Unsurprisingly,
      most guests are unhappy if a vCPU disappears for hundreds of seconds.
      
      E.g. running a synthetic selftest that triggers a vCPU root zap with
      ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
      Offloading the zap to a worker drops the block time to <100ms.
      
      There is an important nuance to this change.  If the same work item
      was queued twice before the work function has run, it would only
      execute once and one reference would be leaked.  Therefore, now that
      queueing and flushing items is not anymore protected by kvm->slots_lock,
      kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
      skip already invalid roots.  On the other hand, kvm_mmu_zap_all_fast()
      must return only after those skipped roots have been zapped as well.
      These two requirements can be satisfied only if _all_ places that
      change invalid to true now schedule the worker before releasing the
      mmu_lock.  There are just two, kvm_tdp_mmu_put_root() and
      kvm_tdp_mmu_invalidate_all_roots().
      
      Co-developed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarBen Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-23-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      efd995da