  1. Jan 21, 2020
    • kvm: Refactor handling of VM debugfs files · 09cbcef6
      Milan Pandurov authored
      
      
       We can store a reference to kvm_stats_debugfs_item instead of copying
       its values into kvm_stat_data.
       This allows us to remove duplicated code and the use of a temporary
       kvm_stat_data inside vm_stat_get et al.
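       As a rough sketch of the idea (field and struct names here are illustrative,
       not necessarily the exact ones in the patch), the per-file private data keeps
       a pointer to the shared descriptor instead of duplicating its fields:
       
         /* Shared, statically defined descriptor for one stats file. */
         struct kvm_stats_debugfs_item {
                 const char *name;
                 int offset;
                 int mode;
         };
       
         /* Before (sketch): each file copied offset/mode into its own data. */
         struct kvm_stat_data_old {
                 struct kvm *kvm;
                 int offset;
                 int mode;
         };
       
         /* After (sketch): reference the descriptor, so vm_stat_get() et al. can
          * use it directly and the temporary copies go away. */
         struct kvm_stat_data {
                 struct kvm *kvm;
                 struct kvm_stats_debugfs_item *dbgfs_item;
         };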
      
       Signed-off-by: Milan Pandurov <milanpa@amazon.de>
       Reviewed-by: Alexander Graf <graf@amazon.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      09cbcef6
    • KVM: x86/mmu: Apply max PA check for MMIO sptes to 32-bit KVM · e30a7d62
      Sean Christopherson authored
      
      
      Remove the bogus 64-bit only condition from the check that disables MMIO
      spte optimization when the system supports the max PA, i.e. doesn't have
      any reserved PA bits.  32-bit KVM always uses PAE paging for the shadow
      MMU, and per Intel's SDM:
      
        PAE paging translates 32-bit linear addresses to 52-bit physical
        addresses.
      
      The kernel's restrictions on max physical addresses are limits on how
      much memory the kernel can reasonably use, not what physical addresses
      are supported by hardware.
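       In code terms, the change amounts to dropping the 64-bit-only qualifier from
       the check (sketch; names follow kvm_set_mmio_spte_mask() but may not match
       the tree exactly):
       
         /* Before (sketch): MMIO spte optimization disabled only on 64-bit KVM. */
         if (IS_ENABLED(CONFIG_X86_64) && shadow_phys_bits == 52)
                 mask &= ~1ull;
       
         /* After (sketch): PAE paging also yields 52-bit physical addresses, so
          * the check must apply to 32-bit KVM as well. */
         if (shadow_phys_bits == 52)
                 mask &= ~1ull;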
      
      Fixes: ce88decf ("KVM: MMU: mmio page fault support")
      Cc: stable@vger.kernel.org
       Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e30a7d62
    • KVM: nVMX: vmread should not set rflags to specify success in case of #PF · a4d956b9
      Miaohe Lin authored
      
      
       If writing to the vmread destination operand results in a #PF, vmread
       should not call nested_vmx_succeed() to set rflags to indicate success.
       This mirrors what is done for VMPTRST (see handle_vmptrst()).
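       A sketch of the corrected memory-operand path (simplified, mirroring the
       handle_vmptrst() pattern; surrounding details elided):
       
         if (kvm_write_guest_virt_system(vcpu, gva, &value, len, &e)) {
                 kvm_inject_page_fault(vcpu, &e);
                 return 1;  /* do NOT fall through to nested_vmx_succeed() */
         }
         return nested_vmx_succeed(vcpu);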
      
       Reviewed-by: Liran Alon <liran.alon@oracle.com>
       Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
       Cc: stable@vger.kernel.org
       Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a4d956b9
    • KVM: x86/mmu: Micro-optimize nEPT's bad memptype/XWR checks · b5c3c1b3
      Sean Christopherson authored
      Rework the handling of nEPT's bad memtype/XWR checks to micro-optimize
      the checks as much as possible.  Move the check to a separate helper,
      __is_bad_mt_xwr(), which allows the guest_rsvd_check usage in
      paging_tmpl.h to omit the check entirely for paging32/64 (bad_mt_xwr is
      always zero for non-nEPT) while retaining the bitwise-OR of the current
      code for the shadow_zero_check in walk_shadow_page_get_mmio_spte().
      
      Add a comment for the bitwise-OR usage in the mmio spte walk to avoid
      future attempts to "fix" the code, which is what prompted this
      optimization in the first place[*].
      
       Opportunistically remove the superfluous '!= 0' and parentheses, and
       use BIT_ULL() instead of open coding its equivalent.
      
      The net effect is that code generation is largely unchanged for
      walk_shadow_page_get_mmio_spte(), marginally better for
      ept_prefetch_invalid_gpte(), and significantly improved for
      paging32/64_prefetch_invalid_gpte().
      
       Note, walk_shadow_page_get_mmio_spte() can't use a templated version of
       the memtype/XWR check as it works on the host's shadow PTEs, e.g. checks
       that KVM hasn't borked its EPT tables.  Even if it could be templated,
       the benefits of having a single implementation far outweigh the few uops
       that would be saved for NPT or non-TDP paging, e.g. most compilers
       inline it all the way up to kvm_mmu_page_fault().
      
      [*] https://lkml.kernel.org/r/20200108001859.25254-1-sean.j.christopherson@intel.com
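       For reference, the helper reduces to a single bitmap test (sketch; the real
       code lives alongside the other rsvd_bits_validate helpers):
       
         /* bad_mt_xwr is a 64-bit bitmap of illegal memtype/XWR combinations,
          * indexed by the low bits of the PTE.  It is always zero for non-nEPT
          * MMUs, so paging32/64 can skip the check entirely. */
         static inline bool __is_bad_mt_xwr(struct rsvd_bits_validate *rsvd_check,
                                            u64 pte)
         {
                 return rsvd_check->bad_mt_xwr & BIT_ULL(pte & 0x3f);
         }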
      
      
      
      Cc: Jim Mattson <jmattson@google.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Arvind Sankar <nivedita@alum.mit.edu>
       Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b5c3c1b3
    • KVM: x86/mmu: Reorder the reserved bit check in prefetch_invalid_gpte() · f8052a05
      Sean Christopherson authored
      
      
       Move the !PRESENT and !ACCESSED checks in FNAME(prefetch_invalid_gpte)
       above the call to is_rsvd_bits_set().  For a well-behaved guest, the
       !PRESENT and !ACCESSED checks are far more likely to evaluate true than
       the reserved bit checks, and they do not require additional memory
       accesses (a C-level sketch of the reorder follows the dumps below).
      
      Before:
       Dump of assembler code for function paging32_prefetch_invalid_gpte:
         0x0000000000044240 <+0>:     callq  0x44245 <paging32_prefetch_invalid_gpte+5>
         0x0000000000044245 <+5>:     mov    %rcx,%rax
         0x0000000000044248 <+8>:     shr    $0x7,%rax
         0x000000000004424c <+12>:    and    $0x1,%eax
         0x000000000004424f <+15>:    lea    0x0(,%rax,4),%r8
         0x0000000000044257 <+23>:    add    %r8,%rax
         0x000000000004425a <+26>:    mov    %rcx,%r8
         0x000000000004425d <+29>:    and    0x120(%rsi,%rax,8),%r8
         0x0000000000044265 <+37>:    mov    0x170(%rsi),%rax
         0x000000000004426c <+44>:    shr    %cl,%rax
         0x000000000004426f <+47>:    and    $0x1,%eax
         0x0000000000044272 <+50>:    or     %rax,%r8
         0x0000000000044275 <+53>:    jne    0x4427c <paging32_prefetch_invalid_gpte+60>
         0x0000000000044277 <+55>:    test   $0x1,%cl
         0x000000000004427a <+58>:    jne    0x4428a <paging32_prefetch_invalid_gpte+74>
         0x000000000004427c <+60>:    mov    %rdx,%rsi
         0x000000000004427f <+63>:    callq  0x44080 <drop_spte>
         0x0000000000044284 <+68>:    mov    $0x1,%eax
         0x0000000000044289 <+73>:    retq
         0x000000000004428a <+74>:    xor    %eax,%eax
         0x000000000004428c <+76>:    and    $0x20,%ecx
         0x000000000004428f <+79>:    jne    0x44289 <paging32_prefetch_invalid_gpte+73>
         0x0000000000044291 <+81>:    mov    %rdx,%rsi
         0x0000000000044294 <+84>:    callq  0x44080 <drop_spte>
         0x0000000000044299 <+89>:    mov    $0x1,%eax
         0x000000000004429e <+94>:    jmp    0x44289 <paging32_prefetch_invalid_gpte+73>
       End of assembler dump.
      
      After:
       Dump of assembler code for function paging32_prefetch_invalid_gpte:
         0x0000000000044240 <+0>:     callq  0x44245 <paging32_prefetch_invalid_gpte+5>
         0x0000000000044245 <+5>:     test   $0x1,%cl
         0x0000000000044248 <+8>:     je     0x4424f <paging32_prefetch_invalid_gpte+15>
         0x000000000004424a <+10>:    test   $0x20,%cl
         0x000000000004424d <+13>:    jne    0x4425d <paging32_prefetch_invalid_gpte+29>
         0x000000000004424f <+15>:    mov    %rdx,%rsi
         0x0000000000044252 <+18>:    callq  0x44080 <drop_spte>
         0x0000000000044257 <+23>:    mov    $0x1,%eax
         0x000000000004425c <+28>:    retq
         0x000000000004425d <+29>:    mov    %rcx,%rax
         0x0000000000044260 <+32>:    mov    (%rsi),%rsi
         0x0000000000044263 <+35>:    shr    $0x7,%rax
         0x0000000000044267 <+39>:    and    $0x1,%eax
         0x000000000004426a <+42>:    lea    0x0(,%rax,4),%r8
         0x0000000000044272 <+50>:    add    %r8,%rax
         0x0000000000044275 <+53>:    mov    %rcx,%r8
         0x0000000000044278 <+56>:    and    0x120(%rsi,%rax,8),%r8
         0x0000000000044280 <+64>:    mov    0x170(%rsi),%rax
         0x0000000000044287 <+71>:    shr    %cl,%rax
         0x000000000004428a <+74>:    and    $0x1,%eax
         0x000000000004428d <+77>:    mov    %rax,%rcx
         0x0000000000044290 <+80>:    xor    %eax,%eax
         0x0000000000044292 <+82>:    or     %rcx,%r8
         0x0000000000044295 <+85>:    je     0x4425c <paging32_prefetch_invalid_gpte+28>
         0x0000000000044297 <+87>:    mov    %rdx,%rsi
         0x000000000004429a <+90>:    callq  0x44080 <drop_spte>
         0x000000000004429f <+95>:    mov    $0x1,%eax
         0x00000000000442a4 <+100>:   jmp    0x4425c <paging32_prefetch_invalid_gpte+28>
       End of assembler dump.
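       In C terms, the reorder looks roughly like this (sketch using the
       paging_tmpl.h naming conventions; exact masks and helper names may differ):
       
         static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
                                                  struct kvm_mmu_page *sp,
                                                  u64 *spte, u64 gpte)
         {
                 /* Register-only checks first: most likely to fire for a
                  * well-behaved guest, and no extra memory accesses needed. */
                 if (!FNAME(is_present_gpte)(gpte))
                         goto no_present;
       
                 if (PT_HAVE_ACCESSED_DIRTY(vcpu->arch.mmu) &&
                     !(gpte & PT_GUEST_ACCESSED_MASK))
                         goto no_present;
       
                 /* Reserved bit check last: it has to load the rsvd_bits masks. */
                 if (is_rsvd_bits_set(vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
                         goto no_present;
       
                 return false;
       
         no_present:
                 drop_spte(vcpu->kvm, spte);
                 return true;
         }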
      
       Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f8052a05
    • KVM: SVM: Override default MMIO mask if memory encryption is enabled · 52918ed5
      Tom Lendacky authored
      
      
      The KVM MMIO support uses bit 51 as the reserved bit to cause nested page
      faults when a guest performs MMIO. The AMD memory encryption support uses
      a CPUID function to define the encryption bit position. Given this, it is
      possible that these bits can conflict.
      
      Use svm_hardware_setup() to override the MMIO mask if memory encryption
      support is enabled. Various checks are performed to ensure that the mask
      is properly defined and rsvd_bits() is used to generate the new mask (as
      was done prior to the change that necessitated this patch).
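       A condensed sketch of the override (the helper name and exact CPUID/MSR
       plumbing are illustrative; the actual patch performs additional checks):
       
         static __init void svm_adjust_mmio_mask(void)
         {
                 unsigned int enc_bit, mask_bit;
                 u64 msr, mask;
       
                 /* No memory encryption leaf, or SME/SEV not enabled: keep the
                  * default MMIO mask. */
                 if (cpuid_eax(0x80000000) < 0x8000001f)
                         return;
                 rdmsrl(MSR_K8_SYSCFG, msr);
                 if (!(msr & MSR_K8_SYSCFG_MEM_ENCRYPT))
                         return;
       
                 enc_bit = cpuid_ebx(0x8000001f) & 0x3f;
                 mask_bit = boot_cpu_data.x86_phys_bits;
       
                 /* Don't reuse the encryption bit as the MMIO reserved bit. */
                 if (enc_bit == mask_bit)
                         mask_bit++;
       
                 /* Only a bit below 52 is architecturally reserved. */
                 mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;
       
                 kvm_mmu_set_mmio_spte_mask(mask, mask, PT_WRITABLE_MASK | PT_USER_MASK);
         }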
      
      Fixes: 28a1f3ac ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
       Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      52918ed5
    • KVM: vmx: delete meaningless nested_vmx_prepare_msr_bitmap() declaration · d8010a77
      Miaohe Lin authored
      
      
       The nested_vmx_prepare_msr_bitmap() declaration appears below the
       function's implementation, so it is meaningless and should be removed.
      
       Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
       Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d8010a77
    • KVM: x86: Refactor and rename bit() to feature_bit() macro · 87382003
      Sean Christopherson authored
      
      
       Rename bit() to __feature_bit() to give it a more descriptive name, and
       add a macro, feature_bit(), that stuffs in the X86_FEATURE_ prefix to
       keep line lengths manageable for code that hardcodes the bit to be
       retrieved.
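       The end result is along these lines (sketch):
       
         /* Produces the bit mask within the feature's 32-bit CPUID word; the
          * word itself is implied by the reverse CPUID table. */
         #define __feature_bit(x)        BIT((x) & 31)
         #define feature_bit(name)       __feature_bit(X86_FEATURE_##name)
       
         /* e.g. feature_bit(LA57) instead of __feature_bit(X86_FEATURE_LA57) */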
      
      No functional change intended.
      
      Cc: Jim Mattson <jmattson@google.com>
       Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      87382003
    • KVM: x86: Expand build-time assertion on reverse CPUID usage · a7c48c3f
      Sean Christopherson authored
      
      
      Add build-time checks to ensure KVM isn't trying to do a reverse CPUID
      lookup on Linux-defined feature bits, along with comments to explain
      the gory details of X86_FEATUREs and bit().
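       The assertions take roughly this shape (sketch; the full list covers every
       Linux-defined CPUID word):
       
         static __always_inline void reverse_cpuid_check(unsigned int x86_leaf)
         {
                 /* Linux-defined ("scattered") feature words have no architectural
                  * CPUID leaf to reverse into, so reject them at build time. */
                 BUILD_BUG_ON(x86_leaf == CPUID_LNX_1);
                 BUILD_BUG_ON(x86_leaf == CPUID_LNX_2);
                 BUILD_BUG_ON(x86_leaf == CPUID_LNX_3);
                 /* ... one assertion per Linux-defined word ... */
                 BUILD_BUG_ON(x86_leaf >= ARRAY_SIZE(reverse_cpuid));
                 BUILD_BUG_ON(reverse_cpuid[x86_leaf].function == 0);
         }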
      
      No functional change intended.
      
       Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a7c48c3f
    • KVM: x86: Add CPUID_7_1_EAX to the reverse CPUID table · daa0d8c3
      Sean Christopherson authored
      
      
       Add an entry for CPUID_7_1_EAX in the reverse_cpuid array in preparation
       for incorporating the array in bit() build-time assertions, specifically
       to avoid an assertion on F(AVX512_BF16) in do_cpuid_7_mask().
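       The new entry is a one-liner mapping the kernel's CPUID_7_1_EAX word back
       to CPUID.(EAX=7,ECX=1):EAX (sketch):
       
         [CPUID_7_1_EAX] = {7, 1, CPUID_EAX},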
      
      No functional change intended.
      
       Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      daa0d8c3
    • KVM: x86: Move bit() helper to cpuid.h · a0a2260c
      Sean Christopherson authored
      
      
      Move bit() to cpuid.h in preparation for incorporating the reverse_cpuid
      array in bit() build-time assertions.  Opportunistically use the BIT()
      macro instead of open-coding the shift.
      
      No functional change intended.
      
       Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a0a2260c
    • KVM: x86: Add dedicated emulator helpers for querying CPUID features · 5ae78e95
      Sean Christopherson authored
      
      
      Add feature-specific helpers for querying guest CPUID support from the
      emulator instead of having the emulator do a full CPUID and perform its
      own bit tests.  The primary motivation is to eliminate the emulator's
      usage of bit() so that future patches can add more extensive build-time
      assertions on the usage of bit() without having to expose yet more code
      to the emulator.
      
       Note, providing a generic guest_cpuid_has() to the emulator doesn't work
       due to the existing build-time assertions in guest_cpuid_has(), which
       require the feature being checked to be a compile-time constant.
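       Each helper is a thin, feature-specific wrapper, which also satisfies the
       compile-time-constant requirement noted above (sketch; the patch adds one
       per feature the emulator queries):
       
         static bool emulator_guest_has_long_mode(struct x86_emulate_ctxt *ctxt)
         {
                 return guest_cpuid_has(emul_to_vcpu(ctxt), X86_FEATURE_LM);
         }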
      
      No functional change intended.
      
       Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5ae78e95
    • KVM: x86: Add macro to ensure reserved cr4 bits checks stay in sync · 345599f9
      Sean Christopherson authored
      
      
      Add a helper macro to generate the set of reserved cr4 bits for both
      host and guest to ensure that adding a check on guest capabilities is
      also added for host capabilities, and vice versa.
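       The macro takes the "has feature" predicate as a parameter, so the same list
       of bits serves both the host (boot CPU) and guest (CPUID) cases (abbreviated
       sketch):
       
         #define __cr4_reserved_bits(__cpu_has, __c)                     \
         ({                                                              \
                 u64 __reserved_bits = CR4_RESERVED_BITS;                \
                                                                         \
                 if (!__cpu_has(__c, X86_FEATURE_XSAVE))                 \
                         __reserved_bits |= X86_CR4_OSXSAVE;             \
                 if (!__cpu_has(__c, X86_FEATURE_SMEP))                  \
                         __reserved_bits |= X86_CR4_SMEP;                \
                 /* ... one clause per cr4-controlled feature ... */     \
                 __reserved_bits;                                        \
         })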
      
       Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      345599f9
    • KVM: x86: Drop special XSAVE handling from guest_cpuid_has() · 96be4e06
      Sean Christopherson authored
      
      
      Now that KVM prevents setting host-reserved CR4 bits, drop the dedicated
      XSAVE check in guest_cpuid_has() in favor of open coding similar checks
      in the SVM/VMX XSAVES enabling flows.
      
       Note, checking boot_cpu_has(X86_FEATURE_XSAVE) in the XSAVES flows is
       technically redundant with respect to the CR4 reserved bit checks, e.g.
       XSAVES #UDs if CR4.OSXSAVE=0 and arch.xsaves_enabled is consumed if and
       only if CR4.OSXSAVE=1 in the guest.  Keep (add?) the explicit
       boot_cpu_has() checks to help document KVM's usage of arch.xsaves_enabled.
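       The open-coded form in the vendor XSAVES setup is roughly (sketch):
       
         /* Mirror both host and guest capability checks explicitly. */
         vcpu->arch.xsaves_enabled = boot_cpu_has(X86_FEATURE_XSAVE) &&
                                     guest_cpuid_has(vcpu, X86_FEATURE_XSAVE) &&
                                     guest_cpuid_has(vcpu, X86_FEATURE_XSAVES);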
      
      No functional change intended.
      
       Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      96be4e06
    • KVM: x86: Ensure all logical CPUs have consistent reserved cr4 bits · f1cdecf5
      Sean Christopherson authored
      
      
      Check the current CPU's reserved cr4 bits against the mask calculated
      for the boot CPU to ensure consistent behavior across all CPUs.
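       Conceptually the per-CPU compat check reduces to this (sketch, written in
       terms of the reserved-bits helper described above):
       
         /* Reject a CPU whose cr4-reserved mask differs from the boot CPU's. */
         struct cpuinfo_x86 *c = &cpu_data(smp_processor_id());
       
         if (__cr4_reserved_bits(cpu_has, c) !=
             __cr4_reserved_bits(cpu_has, &boot_cpu_data))
                 return -EIO;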
      
       Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f1cdecf5
    • KVM: x86: Don't let userspace set host-reserved cr4 bits · b11306b5
      Sean Christopherson authored
      
      
       Calculate the host-reserved cr4 bits at runtime based on the system's
       capabilities (using logic similar to __do_cpuid_func()), and use the
       dynamically generated mask for the reserved bit check in kvm_set_cr4()
       instead of using the static CR4_RESERVED_BITS define.  This prevents
       userspace from "enabling" features in cr4 that are not supported by the
       system, e.g. by ignoring KVM_GET_SUPPORTED_CPUID and specifying a bogus
       CPUID for the vCPU.
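       In outline (sketch): compute the host-reserved mask once at hardware setup
       and fold it into the kvm_set_cr4() check.
       
         /* Derived at setup time from boot_cpu_has() checks. */
         static u64 cr4_reserved_bits = CR4_RESERVED_BITS;
       
         if (!boot_cpu_has(X86_FEATURE_XSAVE))
                 cr4_reserved_bits |= X86_CR4_OSXSAVE;
         if (!boot_cpu_has(X86_FEATURE_LA57))
                 cr4_reserved_bits |= X86_CR4_LA57;
         /* ... and so on for the other cr4-controlled features ... */
       
         /* In kvm_set_cr4() (sketch): reject unsupported bits. */
         if (cr4 & cr4_reserved_bits)
                 return 1;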
      
      Allowing userspace to set unsupported bits in cr4 can lead to a variety
      of undesirable behavior, e.g. failed VM-Enter, and in general increases
      KVM's attack surface.  A crafty userspace can even abuse CR4.LA57 to
      induce an unchecked #GP on a WRMSR.
      
      On a platform without LA57 support:
      
        KVM_SET_CPUID2 // CPUID_7_0_ECX.LA57 = 1
        KVM_SET_SREGS  // CR4.LA57 = 1
        KVM_SET_MSRS   // KERNEL_GS_BASE = 0x0004000000000000
        KVM_RUN
      
      leads to a #GP when writing KERNEL_GS_BASE into hardware:
      
        unchecked MSR access error: WRMSR to 0xc0000102 (tried to write 0x0004000000000000)
        at rIP: 0xffffffffa00f239a (vmx_prepare_switch_to_guest+0x10a/0x1d0 [kvm_intel])
        Call Trace:
         kvm_arch_vcpu_ioctl_run+0x671/0x1c70 [kvm]
         kvm_vcpu_ioctl+0x36b/0x5d0 [kvm]
         do_vfs_ioctl+0xa1/0x620
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x4c/0x170
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7fc08133bf47
      
      Note, the above sequence fails VM-Enter due to invalid guest state.
      Userspace can allow VM-Enter to succeed (after the WRMSR #GP) by adding
      a KVM_SET_SREGS w/ CR4.LA57=0 after KVM_SET_MSRS, in which case KVM will
      technically leak the host's KERNEL_GS_BASE into the guest.  But, as
      KERNEL_GS_BASE is a userspace-defined value/address, the leak is largely
      benign as a malicious userspace would simply be exposing its own data to
      the guest, and attacking a benevolent userspace would require multiple
      bugs in the userspace VMM.
      
      Cc: stable@vger.kernel.org
      Cc: Jun Nakajima <jun.nakajima@intel.com>
       Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b11306b5