  1. Apr 22, 2021
    • Brijesh Singh's avatar
      KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command · 15fb7de1
      Brijesh Singh authored
      
      
      The command is used for copying the incoming buffer into the
      SEV guest memory space.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <c5d0e3e719db7bb37ea85d79ed4db52e9da06257.1618498113.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      15fb7de1
    • Brijesh Singh's avatar
      KVM: SVM: Add support for KVM_SEV_RECEIVE_START command · af43cbbf
      Brijesh Singh authored
      
      
      The command is used to create the encryption context for an incoming
      SEV guest. The encryption context can be later used by the hypervisor
      to import the incoming data into the SEV guest memory space.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <c7400111ed7458eee01007c4d8d57cdf2cbb0fc2.1618498113.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      af43cbbf
    • Steve Rutherford's avatar
      KVM: SVM: Add support for KVM_SEV_SEND_CANCEL command · 5569e2e7
      Steve Rutherford authored
      
      
      After completion of SEND_START, but before SEND_FINISH, the source VMM can
      issue the SEND_CANCEL command to stop a migration. This is necessary so
      that a cancelled migration can restart with a new target later.
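The ordering described above can be sketched as a small state machine. This is a simplified model under stated assumptions, not KVM or firmware code: the type and function names (send_state, send_cmd, send_step) are hypothetical, and the real SEV firmware tracks more state than the two values used here.

```c
#include <stdbool.h>

/* Hypothetical model of the source-side migration state; not a KVM type. */
enum send_state { SEND_IDLE, SEND_ACTIVE };

/* Hypothetical command identifiers mirroring the command names in the text. */
enum send_cmd {
	CMD_SEND_START,
	CMD_SEND_UPDATE_DATA,
	CMD_SEND_FINISH,
	CMD_SEND_CANCEL,
};

/* Returns true and advances *state if cmd is valid in the current state. */
static bool send_step(enum send_state *state, enum send_cmd cmd)
{
	switch (cmd) {
	case CMD_SEND_START:
		if (*state != SEND_IDLE)
			return false;
		*state = SEND_ACTIVE;
		return true;
	case CMD_SEND_UPDATE_DATA:
		/* Data can only be exported inside an active send context. */
		return *state == SEND_ACTIVE;
	case CMD_SEND_FINISH:
	case CMD_SEND_CANCEL:
		/* Both leave the active context; afterwards a new START is allowed. */
		if (*state != SEND_ACTIVE)
			return false;
		*state = SEND_IDLE;
		return true;
	}
	return false;
}
```

In this model, a START, UPDATE_DATA, CANCEL sequence returns to the idle state, so a later START toward a new target succeeds.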
      
      Reviewed-by: Nathan Tempelman <natet@google.com>
      Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Steve Rutherford <srutherford@google.com>
      Message-Id: <20210412194408.2458827-1-srutherford@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5569e2e7
    • Brijesh Singh's avatar
      KVM: SVM: Add KVM_SEV_SEND_FINISH command · fddecf6a
      Brijesh Singh authored
      
      
      The command is used to finalize the encryption context created with
      the KVM_SEV_SEND_START command.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <5082bd6a8539d24bc55a1dd63a1b341245bb168f.1618498113.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fddecf6a
    • Brijesh Singh's avatar
      KVM: SVM: Add KVM_SEND_UPDATE_DATA command · d3d1af85
      Brijesh Singh authored
      
      
      The command is used for encrypting the guest memory region using the encryption
      context created with KVM_SEV_SEND_START.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <d6a6ea740b0c668b30905ae31eac5ad7da048bb3.1618498113.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d3d1af85
    • Brijesh Singh's avatar
      KVM: SVM: Add KVM_SEV SEND_START command · 4cfdd47d
      Brijesh Singh authored
      
      
      The command is used to create an outgoing SEV guest encryption context.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <2f1686d0164e0f1b3d6a41d620408393e0a48376.1618498113.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4cfdd47d
    • Wanpeng Li's avatar
      KVM: Boost vCPU candidate in user mode which is delivering interrupt · 52acd22f
      Wanpeng Li authored
      
      
      Both a lock-holder vCPU and a halted IPI receiver are candidates for a
      boost. However, the PLE handler was originally designed to deal with the
      lock holder preemption problem. The Intel PLE occurs when the spinlock
      waiter is in kernel mode. This assumption doesn't hold for IPI receivers,
      which can be in either kernel or user mode. A vCPU candidate in user mode
      will not be boosted even if it should respond to IPIs. Some benchmarks
      like pbzip2, swaptions etc. do the TLB shootdown in kernel mode and most
      of the time they are running in user mode. This can lead to a large number
      of continuous PLE events because the IPI sender causes PLE events
      repeatedly until the receiver is scheduled, while the receiver is not a
      candidate for a boost.
      
      This patch boosts the vCPU candidate in user mode which is delivering an
      interrupt. We observe that the speed of pbzip2 improves by 10% in a 96-vCPU
      VM in an over-subscribed scenario (the host machine is a 2-socket, 48-core,
      96-HT Intel CLX box). There is no performance regression for other
      benchmarks like Unixbench spawn (most of the time contending on a read/write
      lock in kernel mode) or ebizzy (most of the time contending on a read/write
      sem and doing TLB shootdown in kernel mode).
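The boost decision described in this commit can be sketched as a predicate over a few per-vCPU flags. This is an illustrative model only: struct vcpu_model and skip_boost() are hypothetical names, and the three flags stand in for whatever state KVM actually consults.

```c
#include <stdbool.h>

/* Hypothetical per-vCPU snapshot; the field names are assumptions. */
struct vcpu_model {
	bool preempted;             /* vCPU was scheduled out by the host */
	bool in_kernel;             /* guest was in kernel mode when preempted */
	bool has_pending_interrupt; /* an interrupt is waiting to be delivered */
};

/*
 * Returns true if a spinning vCPU should NOT yield to this candidate.
 * A preempted user-mode candidate is skipped only when it has no pending
 * interrupt; with one pending it stays eligible for a boost.
 */
static bool skip_boost(const struct vcpu_model *v, bool yield_to_kernel_mode)
{
	return v->preempted && yield_to_kernel_mode &&
	       !v->has_pending_interrupt && !v->in_kernel;
}
```

With these flags, an IPI receiver preempted in user mode is no longer skipped, which is the case the commit message describes for pbzip2 and swaptions.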
      
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1618542490-14756-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      52acd22f
    • Paolo Bonzini's avatar
      KVM: selftests: Always run vCPU thread with blocked SIG_IPI · bf1e15a8
      Paolo Bonzini authored
      
      
      The main thread could start to send SIG_IPI at any time, even before the
      signal is blocked on the vcpu thread.  Therefore, start the vcpu thread
      with the signal blocked.
      
      Without this patch, on very busy cores the dirty_log_test could fail directly
      on receiving a SIGUSR1 without a handler (when vcpu runs far slower than main).
      
      Reported-by: Peter Xu <peterx@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bf1e15a8
    • Peter Xu's avatar
      KVM: selftests: Sync data verify of dirty logging with guest sync · 016ff1a4
      Peter Xu authored
      This fixes a bug that can trigger with e.g. "taskset -c 0 ./dirty_log_test" or
      when the testing host is very busy.
      
      A similar attempt was made previously [1], but it was not enough; the
      reason is stated in the reply [2].
      
      As a summary (partly quoting from [2]):
      
      The problem, I think, is that one guest memory write operation (of this
      specific test) contains a few micro-steps when the page is under KVM dirty
      tracking (here I'm only considering write-protect rather than PML, but PML
      should be similar, at least when the log buffer is full):
      
        (1) Guest read 'iteration' number into register, prepare to write, page fault
        (2) Set dirty bit in either dirty bitmap or dirty ring
        (3) Return to guest, data written
      
      When we verify the data, we assumed that all these steps are "atomic", say,
      when (1) happened for this page, we assume (2) & (3) must have happened.  We
      had some tricks to work around the non-atomicity of the above three steps:
      the previous version of this patch wanted to fix atomicity of steps (2)+(3)
      by explicitly letting the main thread wait for at least one vmenter of the
      vcpu thread, which should work.  However, what I overlooked is that we still
      have a race when (1) and (2) can be interrupted.
      
      One example calltrace of how we could read an old iteration and get
      interrupted before even setting the dirty bit and flushing data:
      
          __schedule+1742
          __cond_resched+52
          __get_user_pages+530
          get_user_pages_unlocked+197
          hva_to_pfn+206
          try_async_pf+132
          direct_page_fault+320
          kvm_mmu_page_fault+103
          vmx_handle_exit+288
          vcpu_enter_guest+2460
          kvm_arch_vcpu_ioctl_run+325
          kvm_vcpu_ioctl+526
          __x64_sys_ioctl+131
          do_syscall_64+51
          entry_SYSCALL_64_after_hwframe+68
      
      It means the iteration number cached in the vcpu register can be very old by
      the time the dirty bit is set and the data is flushed.
      
      So far I don't see an easy way to guarantee atomicity of steps 1-3 other than
      to sync at the GUEST_SYNC() point of guest code when we verify the dirty
      bits, which is what this patch does.
      
      [1] https://lore.kernel.org/lkml/20210413213641.23742-1-peterx@redhat.com/
      [2] https://lore.kernel.org/lkml/20210417140956.GV4440@xz-x1/
      
      
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Andrew Jones <drjones@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210417143602.215059-2-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      016ff1a4
    • Nathan Tempelman's avatar
      KVM: x86: Support KVM VMs sharing SEV context · 54526d1f
      Nathan Tempelman authored
      
      
      Add a capability for userspace to mirror SEV encryption context from
      one vm to another. On our side, this is intended to support a
      Migration Helper vCPU, but it can also be used generically to support
      other in-guest workloads scheduled by the host. The intention is for
      the primary guest and the mirror to have nearly identical memslots.
      
      The primary benefits of this are that:
      1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
      can't accidentally clobber each other.
      2) The VMs can have different memory-views, which is necessary for post-copy
      migration (the migration vCPUs on the target need to read and write to
      pages, when the primary guest would VMEXIT).
      
      This does not change the threat model for AMD SEV. Any memory involved
      is still owned by the primary guest and its initial state is still
      attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
      to circumvent SEV, they could achieve the same effect by simply attaching
      a vCPU to the primary VM.
      This patch deliberately leaves userspace in charge of the memslots for the
      mirror, as it already has the power to mess with them in the primary guest.
      
      This patch does not support SEV-ES (much less SNP), as it does not
      handle handing off attested VMSAs to the mirror.
      
      For additional context, we need a Migration Helper because SEV PSP
      migration is far too slow for our live migration on its own. Using
      an in-guest migrator lets us speed this up significantly.
      
      Signed-off-by: Nathan Tempelman <natet@google.com>
      Message-Id: <20210408223214.2582277-1-natet@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      54526d1f
    • Krish Sadhukhan's avatar
      nSVM: Check addresses of MSR and IO permission maps · ee695f22
      Krish Sadhukhan authored
      
      
      According to section "Canonicalization and Consistency Checks" in APM vol 2,
      the following guest state is illegal:
      
          "The MSR or IOIO intercept tables extend to a physical address that
           is greater than or equal to the maximum supported physical address."
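The quoted rule can be sketched as a range check on a permission-map base address. This is a sketch under stated assumptions, not the KVM implementation: bitmap_pa_is_legal() and the macro names are hypothetical, a 4 KiB page is assumed, and the sizes assume a 3-page IOPM and a 2-page MSRPM.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u                 /* assumption: 4 KiB pages */
#define IOPM_BYTES  (3u * PAGE_SIZE_BYTES)    /* assumption: 3-page IOPM */
#define MSRPM_BYTES (2u * PAGE_SIZE_BYTES)    /* assumption: 2-page MSRPM */

/*
 * Returns true if a table of `size` bytes starting at `pa` lies entirely
 * below 1 << maxphyaddr. maxphyaddr is assumed to be less than 64.
 */
static bool bitmap_pa_is_legal(uint64_t pa, uint32_t size, unsigned maxphyaddr)
{
	uint64_t limit = 1ULL << maxphyaddr;
	uint64_t last;

	if (size == 0 || pa >= limit)
		return false;
	last = pa + size - 1;            /* address of the table's last byte */
	return last >= pa && last < limit; /* reject wraparound and overrun */
}
```

For example, with a 36-bit physical address width, an IOPM placed two pages below the limit is rejected because its third page would reach the limit, while an MSRPM at the same address is accepted.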
      
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Message-Id: <20210412215611.110095-5-krish.sadhukhan@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ee695f22
  2. Apr 20, 2021
    • Krish Sadhukhan's avatar
      KVM: SVM: Define actual size of IOPM and MSRPM tables · 47903dc1
      Krish Sadhukhan authored
      
      
      Define the actual size of the IOPM and MSRPM tables so that the actual size
      can be used when initializing them and when checking the consistency of their
      physical address.
      These #defines are placed in svm.h so that they can be shared.
      
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Message-Id: <20210412215611.110095-2-krish.sadhukhan@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      47903dc1
    • Sean Christopherson's avatar
      KVM: x86: Add capability to grant VM access to privileged SGX attribute · fe7e9488
      Sean Christopherson authored
      
      
      Add a capability, KVM_CAP_SGX_ATTRIBUTE, that can be used by userspace
      to grant a VM access to a privileged attribute, with args[0] holding a
      file handle to a valid SGX attribute file.
      
      The SGX subsystem restricts access to a subset of enclave attributes to
      provide additional security for an uncompromised kernel, e.g. to prevent
      malware from using the PROVISIONKEY to ensure its nodes are running
      inside a genuine SGX enclave and/or to obtain a stable fingerprint.
      
      To prevent userspace from circumventing such restrictions by running an
      enclave in a VM, KVM restricts guest access to privileged attributes by
      default.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <0b099d65e933e068e3ea934b0523bab070cb8cea.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fe7e9488
    • Sean Christopherson's avatar
      KVM: VMX: Enable SGX virtualization for SGX1, SGX2 and LC · 72add915
      Sean Christopherson authored
      
      
      Enable SGX virtualization now that KVM has the VM-Exit handlers needed
      to trap-and-execute ENCLS to ensure correctness and/or enforce the CPU
      model exposed to the guest.  Add a KVM module param, "sgx", to allow an
      admin to disable SGX virtualization independent of the kernel.
      
      When supported in hardware and the kernel, advertise SGX1, SGX2 and SGX
      LC to userspace via CPUID and wire up the ENCLS_EXITING bitmap based on
      the guest's SGX capabilities, i.e. to allow ENCLS to be executed in an
      SGX-enabled guest.  With the exception of the provision key, all SGX
      attribute bits may be exposed to the guest.  Guest access to the
      provision key, which is controlled via securityfs, will be added in a
      future patch.
      
      Note, KVM does not yet support exposing ENCLS_C leafs or ENCLV leafs.
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <a99e9c23310c79f2f4175c1af4c4cbcef913c3e5.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      72add915
    • Sean Christopherson's avatar
      KVM: VMX: Add ENCLS[EINIT] handler to support SGX Launch Control (LC) · b6f084ca
      Sean Christopherson authored
      
      
      Add a VM-Exit handler to trap-and-execute EINIT when SGX LC is enabled
      in the host.  When SGX LC is enabled, the host kernel may rewrite the
      hardware values at will, e.g. to launch enclaves with different signers,
      thus KVM needs to intercept EINIT to ensure it is executed with the
      correct LE hash (even if the guest sees a hardwired hash).
      
      Switching the LE hash MSRs on VM-Enter/VM-Exit is not a viable option as
      writing the MSRs is prohibitively expensive, e.g. on SKL hardware each
      WRMSR is ~400 cycles.  And because EINIT takes tens of thousands of
      cycles to execute, the ~1500 cycle overhead to trap-and-execute EINIT is
      unlikely to be noticed by the guest, let alone impact its overall SGX
      performance.
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <57c92fa4d2083eb3be9e6355e3882fc90cffea87.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b6f084ca
    • Sean Christopherson's avatar
      KVM: VMX: Add emulation of SGX Launch Control LE hash MSRs · 8f102445
      Sean Christopherson authored
      
      
      Emulate the four Launch Enclave public key hash MSRs (LE hash MSRs) that
      exist on CPUs that support SGX Launch Control (LC).  SGX LC modifies the
      behavior of ENCLS[EINIT] to use the LE hash MSRs when verifying the key
      used to sign an enclave.  On CPUs without LC support, the LE hash is
      hardwired into the CPU to an Intel controlled key (the Intel key is also
      the reset value of the LE hash MSRs). Track the guest's desired hash so
      that a future patch can stuff the hash into the hardware MSRs when
      executing EINIT on behalf of the guest, when those MSRs are writable in
      host.
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Co-developed-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <c58ef601ddf88f3a113add837969533099b1364a.1618196135.git.kai.huang@intel.com>
      [Add a comment regarding the MSRs being available until SGX is locked.
       - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8f102445
    • Sean Christopherson's avatar
      KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions · 70210c04
      Sean Christopherson authored
      
      
      Add an ECREATE handler that will be used to intercept ECREATE for the
      purpose of enforcing an enclave's MISCSELECT, ATTRIBUTES and XFRM, i.e.
      to allow userspace to restrict SGX features via CPUID.  ECREATE will be
      intercepted when any of the aforementioned masks diverges from hardware
      in order to enforce the desired CPUID model, i.e. inject #GP if the
      guest attempts to set a bit that hasn't been enumerated as allowed-1 in
      CPUID.
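The allowed-1 check described above can be sketched as a mask comparison on each of the three fields. This is an illustration only: struct secs_fields, sets_disallowed_bits() and ecreate_should_gp() are hypothetical names, and the struct is a stand-in for the relevant SECS fields rather than the architectural layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of enclave-creation fields checked against CPUID. */
struct secs_fields {
	uint32_t miscselect;
	uint64_t attributes;
	uint64_t xfrm;
};

/* True if `requested` sets any bit that is not enumerated as allowed-1. */
static bool sets_disallowed_bits(uint64_t requested, uint64_t allowed1)
{
	return (requested & ~allowed1) != 0;
}

/* True if the request violates the guest's CPUID model (i.e. inject #GP). */
static bool ecreate_should_gp(const struct secs_fields *req,
			      const struct secs_fields *allowed)
{
	return sets_disallowed_bits(req->miscselect, allowed->miscselect) ||
	       sets_disallowed_bits(req->attributes, allowed->attributes) ||
	       sets_disallowed_bits(req->xfrm, allowed->xfrm);
}
```

A request that only sets bits inside every allowed mask passes; one stray bit in any of the three fields is enough to fail the check.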
      
      Note, access to the PROVISIONKEY is not yet supported.
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Co-developed-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <c3a97684f1b71b4f4626a1fc3879472a95651725.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      70210c04
    • Sean Christopherson's avatar
      KVM: VMX: Frame in ENCLS handler for SGX virtualization · 9798adbc
      Sean Christopherson authored
      
      
      Introduce sgx.c and sgx.h, along with the framework for handling ENCLS
      VM-Exits.  Add a bool, enable_sgx, that will eventually be wired up to a
      module param to control whether or not SGX virtualization is enabled at
      runtime.
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <1c782269608b2f5e1034be450f375a8432fb705d.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9798adbc
    • Sean Christopherson's avatar
      KVM: VMX: Add basic handling of VM-Exit from SGX enclave · 3c0c2ad1
      Sean Christopherson authored
      
      
      Add support for handling VM-Exits that originate from a guest SGX
      enclave.  In SGX, an "enclave" is a new CPL3-only execution environment,
      wherein the CPU and memory state is protected by hardware to make the
      state inaccessible to code running outside of the enclave.  When exiting
      an enclave due to an asynchronous event (from the perspective of the
      enclave), e.g. exceptions, interrupts, and VM-Exits, the enclave's state
      is automatically saved and scrubbed (the CPU loads synthetic state), and
      then reloaded when re-entering the enclave.  E.g. after an instruction
      based VM-Exit from an enclave, vmcs.GUEST_RIP will not contain the RIP
      of the enclave instruction that triggered VM-Exit, but will instead point
      to a RIP in the enclave's untrusted runtime (the guest userspace code
      that coordinates entry/exit to/from the enclave).
      
      To help a VMM recognize and handle exits from enclaves, SGX adds bits to
      existing VMCS fields, VM_EXIT_REASON.VMX_EXIT_REASON_FROM_ENCLAVE and
      GUEST_INTERRUPTIBILITY_INFO.GUEST_INTR_STATE_ENCLAVE_INTR.  Define the
      new architectural bits, and add a boolean to struct vcpu_vmx to cache
      VMX_EXIT_REASON_FROM_ENCLAVE.  Clear the bit in exit_reason so that
      checks against exit_reason do not need to account for SGX, e.g.
      "if (exit_reason == EXIT_REASON_EXCEPTION_NMI)" continues to work.
      
      KVM is largely a passive observer of the new bits, e.g. KVM needs to
      account for the bits when propagating information to a nested VMM, but
      otherwise doesn't need to act differently for the majority of VM-Exits
      from enclaves.
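The idea of caching the from-enclave flag and clearing it from the exit reason can be sketched as follows. This is a minimal model, not the vcpu_vmx code: struct exit_info and decode_exit_reason() are hypothetical, and the flag is assumed to live in bit 27 of the raw exit reason.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the from-enclave flag is bit 27 of the raw exit reason. */
#define EXIT_REASON_FROM_ENCLAVE_BIT (1u << 27)

/* Hypothetical decoded form: the basic reason plus the cached flag. */
struct exit_info {
	uint32_t reason;     /* exit reason with the enclave bit cleared */
	bool from_enclave;   /* cached copy of the enclave bit */
};

/* Splits the raw value so later equality checks can ignore SGX. */
static struct exit_info decode_exit_reason(uint32_t raw)
{
	struct exit_info info;

	info.from_enclave = (raw & EXIT_REASON_FROM_ENCLAVE_BIT) != 0;
	info.reason = raw & ~EXIT_REASON_FROM_ENCLAVE_BIT;
	return info;
}
```

After decoding, a comparison such as `info.reason == 0` gives the same answer whether or not the exit came from an enclave, which is why existing checks against exit_reason keep working.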
      
      The one scenario that is directly impacted is emulation, which is for
      all intents and purposes impossible[1] since KVM does not have access to
      the RIP or instruction stream that triggered the VM-Exit.  The inability
      to emulate is a non-issue for KVM, as most instructions that might
      trigger VM-Exit unconditionally #UD in an enclave (before the VM-Exit
      check).  For the few instructions that conditionally #UD, KVM either never
      sets the exiting control, e.g. PAUSE_EXITING[2], or sets it if and only
      if the feature is not exposed to the guest in order to inject a #UD,
      e.g. RDRAND_EXITING.
      
      But, because it is still possible for a guest to trigger emulation,
      e.g. MMIO, inject a #UD if KVM ever attempts emulation after a VM-Exit
      from an enclave.  This is architecturally accurate for instruction
      VM-Exits, and for MMIO it's the least bad choice, e.g. it's preferable
      to killing the VM.  In practice, only broken or particularly stupid
      guests should ever encounter this behavior.
      
      Add a WARN in skip_emulated_instruction to detect any attempt to
      modify the guest's RIP during an SGX enclave VM-Exit as all such flows
      should either be unreachable or must handle exits from enclaves before
      getting to skip_emulated_instruction.
      
      [1] Impossible for all practical purposes.  Not truly impossible
          since KVM could implement some form of para-virtualization scheme.
      
      [2] PAUSE_LOOP_EXITING only affects CPL0 and enclaves exist only at
          CPL3, so we also don't need to worry about that interaction.
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <315f54a8507d09c292463ef29104e1d4c62e9090.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3c0c2ad1
    • Sean Christopherson's avatar
      KVM: x86: Add reverse-CPUID lookup support for scattered SGX features · 01de8682
      Sean Christopherson authored
      
      
      Define a new KVM-only feature word for advertising and querying SGX
      sub-features in CPUID.0x12.0x0.EAX.  Because SGX1 and SGX2 are scattered
      in the kernel's feature word, they need to be translated so that the
      bit numbers match those of hardware.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <e797c533f4c71ae89265bbb15a02aef86b67cbec.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      01de8682
    • Sean Christopherson's avatar
      KVM: x86: Add support for reverse CPUID lookup of scattered features · 4e66c0cb
      Sean Christopherson authored
      Introduce a scheme that allows KVM's CPUID magic to support features
      that are scattered in the kernel's feature words.  To advertise and/or
      query guest support for CPUID-based features, KVM requires the bit
      number of an X86_FEATURE_* to match the bit number in its associated
      CPUID entry.  For scattered features, this does not hold true.
      
      Add a framework to allow defining KVM-only words, stored in
      kvm_cpu_caps after the shared kernel caps, that can be used to gather
      the scattered feature bits by translating X86_FEATURE_* flags into their
      KVM-defined feature.
      
      Note, because reverse_cpuid_check() effectively forces kvm_cpu_caps
      lookups to be resolved at compile time, there is no runtime cost for
      translating from kernel-defined to kvm-defined features.
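The translation scheme can be sketched with features encoded as word * 32 + bit. Everything here is hypothetical: the word indices, the scattered bit positions and the names (FEATURE, translate_feature, feature_bit) are made up for illustration, and the only architectural assumption is that SGX1 and SGX2 are bits 0 and 1 of CPUID.0x12.0x0.EAX.

```c
#include <stdint.h>

/* Encode a feature as (word index * 32 + bit position within the word). */
#define FEATURE(word, bit) ((uint32_t)(word) * 32 + (bit))

/* Hypothetical word indices: one kernel scattered word, one KVM-only word. */
enum { WORD_SCATTERED = 11, WORD_KVM_ONLY_12_EAX = 20 };

/* Kernel-defined flags: bit numbers do not match hardware (made-up values). */
#define KERNEL_FEATURE_SGX1 FEATURE(WORD_SCATTERED, 8)
#define KERNEL_FEATURE_SGX2 FEATURE(WORD_SCATTERED, 9)

/* KVM-defined flags: bit numbers match CPUID.0x12.0x0.EAX (assumed 0 and 1). */
#define KVM_FEATURE_SGX1 FEATURE(WORD_KVM_ONLY_12_EAX, 0)
#define KVM_FEATURE_SGX2 FEATURE(WORD_KVM_ONLY_12_EAX, 1)

/* Map a scattered kernel flag to its KVM-only equivalent; others pass through. */
static inline uint32_t translate_feature(uint32_t feature)
{
	switch (feature) {
	case KERNEL_FEATURE_SGX1:
		return KVM_FEATURE_SGX1;
	case KERNEL_FEATURE_SGX2:
		return KVM_FEATURE_SGX2;
	default:
		return feature;
	}
}

/* Bit position to use when reading or writing the matching CPUID register. */
static inline uint32_t feature_bit(uint32_t feature)
{
	return translate_feature(feature) % 32;
}
```

When the argument is a compile-time constant, an optimizing compiler can fold the switch away, which is the sense in which the translation has no runtime cost.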
      
      More details here:  https://lkml.kernel.org/r/X/jxCOLG+HUO4QlZ@google.com
      
      
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <16cad8d00475f67867fb36701fc7fb7c1ec86ce1.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4e66c0cb
    • Sean Christopherson's avatar
      KVM: x86: Define new #PF SGX error code bit · 00e7646c
      Sean Christopherson authored
      
      
      Page faults that are signaled by the SGX Enclave Page Cache Map (EPCM),
      as opposed to the traditional IA32/EPT page tables, set an SGX bit in
      the error code to indicate that the #PF was induced by SGX.  KVM will
      need to emulate this behavior as part of its trap-and-execute scheme for
      virtualizing SGX Launch Control, e.g. to inject SGX-induced #PFs if
      EINIT faults in the host, and to support live migration.
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <e170c5175cb9f35f53218a7512c9e3db972b97a2.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      00e7646c
    • Sean Christopherson's avatar
      KVM: x86: Export kvm_mmu_gva_to_gpa_{read,write}() for SGX (VMX) · 54f958cd
      Sean Christopherson authored
      
      
      Export the gva_to_gpa() helpers for use by SGX virtualization when
      executing ENCLS[ECREATE] and ENCLS[EINIT] on behalf of the guest.
      To execute ECREATE and EINIT, KVM must obtain the GPA of the target
      Secure Enclave Control Structure (SECS) in order to get its
      corresponding HVA.
      
      Because the SECS must reside in the Enclave Page Cache (EPC), copying
      the SECS's data to a host-controlled buffer via existing exported
      helpers is not a viable option as the EPC is not readable or writable
      by the kernel.
      
      SGX virtualization will also use gva_to_gpa() to obtain HVAs for
      non-EPC pages in order to pass user pointers directly to ECREATE and
      EINIT, which avoids having to copy pages worth of data into the kernel.
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <02f37708321bcdfaa2f9d41c8478affa6e84b04d.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      54f958cd
    • Yanan Wang's avatar
      KVM: selftests: Add a test for kvm page table code · b9c2bd50
      Yanan Wang authored
      
      
      This test serves as a performance tester and a bug reproducer for
      kvm page table code (GPA->HPA mappings), so it gives guidance for
      people trying to make some improvement for kvm.
      
      The function guest_code() can cover the conditions where a single vcpu or
      multiple vcpus access guest pages within the same memory region, in three
      VM stages (before dirty logging, during dirty logging, after dirty logging).
      In addition, the backing source memory type (ANONYMOUS/THP/HUGETLB) of the
      tested memory region can be specified by users, which means users can choose
      whether normal page mappings or block mappings are created in the test.
      
      If ANONYMOUS memory is specified, kvm will create normal page mappings
      for the tested memory region before dirty logging, and update attributes
      of the page mappings from RO to RW during dirty logging. If THP/HUGETLB
      memory is specified, kvm will create block mappings for the tested memory
      region before dirty logging, and split the block mappings into normal page
      mappings during dirty logging, and coalesce the page mappings back into
      block mappings after dirty logging is stopped.
      
      So in summary, as a performance tester, this test can present the
      performance of kvm creating/updating normal page mappings, or the
      performance of kvm creating/splitting/recovering block mappings,
      through execution time.
      
      When coalescing the page mappings back into block mappings after dirty
      logging is stopped, we must first invalidate *all* the TLB entries for
      the page mappings right before installing the block entry, because a TLB
      conflict abort could occur if the TLB entries are not fully invalidated.
      We have hit this TLB conflict twice in the aarch64 software
      implementation and fixed it. Because this test can simulate the
      transition from dirty logging enabled to dirty logging stopped for a VM
      with block mappings, it can also reproduce this TLB conflict abort
      caused by inadequate TLB invalidation when coalescing tables.
      
      Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20210330080856.14940-11-wangyanan55@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b9c2bd50
    • Yanan Wang's avatar
      KVM: selftests: Adapt vm_userspace_mem_region_add to new helpers · a4b3c8b5
      Yanan Wang authored
      
      
      With VM_MEM_SRC_ANONYMOUS_THP specified in vm_userspace_mem_region_add(),
      we have to get the transparent hugepage size for HVA alignment. With the
      new helpers, we can use get_backing_src_pagesz() to check whether THP is
      configured and then get the exact configured hugepage size.
      
      As different architectures may have different THP page sizes configured,
      this can get the accurate THP page sizes on any platform.
      
      Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20210330080856.14940-10-wangyanan55@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a4b3c8b5
    • Yanan Wang's avatar
      KVM: selftests: List all hugetlb src types specified with page sizes · 623653b7
      Yanan Wang authored
      
      
      With VM_MEM_SRC_ANONYMOUS_HUGETLB, we currently can only use system
      default hugetlb pages to back the testing guest memory. In order to
      add flexibility, now list all the known hugetlb backing src types with
      different page sizes, so that we can specify use of hugetlb pages of the
      exact granularity that we want. And as all the known hugetlb page sizes
      are listed, it's appropriate for all architectures.
      
      In addition, the helper get_backing_src_pagesz() is added to get the
      granularity of the different backing src types (anonymous, thp, hugetlb).
      
      Suggested-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20210330080856.14940-9-wangyanan55@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      623653b7
    • Yanan Wang's avatar
      KVM: selftests: Add a helper to get system default hugetlb page size · 5579fa68
      Yanan Wang authored
      
      
      If HUGETLB is configured in the host kernel, the system default hugetlb
      page size can be read via *cat /proc/meminfo*. If it is not configured,
      /proc/meminfo contains no hugetlb page information. So add a helper that
      determines whether HUGETLB is configured and then gets the default page
      size by reading /proc/meminfo.
      
      This helper can be useful when a program wants to use the default hugetlb
      pages of the system and doesn't know the default page size.
      
      Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20210330080856.14940-8-wangyanan55@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5579fa68
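      The /proc/meminfo lookup described in 5579fa68 can be sketched roughly
      as below. This is a minimal illustration, not the selftest's actual code:
      the function names hugepagesize_from_meminfo() and
      default_hugetlb_pagesz() are hypothetical, and the 8 KiB read buffer is
      an assumption that happens to fit typical /proc/meminfo output.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan meminfo-style text for the "Hugepagesize:" line
 * and return the size in bytes. Returns 0 when the line is absent, which is
 * what we expect to see when HUGETLB is not configured.
 */
size_t hugepagesize_from_meminfo(const char *text)
{
	const char *line = text;

	while (line && *line) {
		unsigned long kb;

		/* The value is reported in kB, e.g. "Hugepagesize:    2048 kB". */
		if (sscanf(line, "Hugepagesize: %lu kB", &kb) == 1)
			return (size_t)kb * 1024;

		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return 0;
}

/* Hypothetical wrapper: read /proc/meminfo into a buffer and parse it. */
size_t default_hugetlb_pagesz(void)
{
	char buf[8192]; /* assumed large enough for typical meminfo output */
	size_t n;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return 0;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return hugepagesize_from_meminfo(buf);
}
```

      Splitting the parser from the file read keeps the parsing logic testable
      on hosts without hugetlb support.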
    • Yanan Wang's avatar
      KVM: selftests: Add a helper to get system configured THP page size · 3b70c4d1
      Yanan Wang authored
      
      
      Tests that exercise transparent hugepages should know the system
      configured THP hugepage size, which can be used for alignment or for
      guest memory accesses by vcpus. So it makes sense to add a helper to get
      the transparent hugepage size.
      
      With VM_MEM_SRC_ANONYMOUS_THP specified in vm_userspace_mem_region_add(),
      we now stat /sys/kernel/mm/transparent_hugepage to check whether THP is
      configured in the host kernel before madvise(). Based on this, we can also
      read file /sys/kernel/mm/transparent_hugepage/hpage_pmd_size to get THP
      hugepage size.
      
      Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20210330080856.14940-7-wangyanan55@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3b70c4d1
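      The stat-then-read sequence described in 3b70c4d1 might look roughly
      like the sketch below. The sysfs paths come from the commit message;
      the function names read_size_from_file() and thp_pagesz_or_zero() are
      hypothetical, and returning 0 on failure is an assumption of this
      sketch rather than the behaviour of the real helper.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>

#define THP_SYSFS_DIR "/sys/kernel/mm/transparent_hugepage"

/* Hypothetical helper: read one unsigned decimal value, 0 on any failure. */
size_t read_size_from_file(const char *path)
{
	size_t size = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;
	if (fscanf(f, "%zu", &size) != 1)
		size = 0;
	fclose(f);
	return size;
}

/* Hypothetical wrapper: return the THP size in bytes, or 0 without THP. */
size_t thp_pagesz_or_zero(void)
{
	struct stat st;

	/* The directory only exists when THP is configured in the host kernel. */
	if (stat(THP_SYSFS_DIR, &st) != 0)
		return 0;
	return read_size_from_file(THP_SYSFS_DIR "/hpage_pmd_size");
}
```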
    • Yanan Wang's avatar
      KVM: selftests: Make a generic helper to get vm guest mode strings · 6436430e
      Yanan Wang authored
      
      
      For generality and conciseness, make an API which can be used in all
      kvm libs and selftests to get vm guest mode strings. The index i is
      checked in the API to guard against possible faults.
      
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20210330080856.14940-6-wangyanan55@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6436430e
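      The index-checked string lookup described in 6436430e follows a common
      pattern, sketched below. The enum, the strings and the function name
      are all hypothetical placeholders, and returning NULL for a bad index
      is an assumption of this sketch (the commit message only says that the
      index is checked).

```c
#include <stddef.h>

/* Hypothetical mode list standing in for the real guest mode enum. */
enum sketch_vm_mode {
	SKETCH_MODE_A,
	SKETCH_MODE_B,
	SKETCH_NUM_MODES,
};

/* Return the string for mode i, or NULL if i is out of range. */
const char *sketch_vm_mode_string(unsigned int i)
{
	static const char * const strings[] = {
		[SKETCH_MODE_A] = "mode A",
		[SKETCH_MODE_B] = "mode B",
	};

	/* Catch a table that falls out of sync with the enum at build time. */
	_Static_assert(sizeof(strings) / sizeof(strings[0]) == SKETCH_NUM_MODES,
		       "mode string table out of sync with the enum");

	if (i >= SKETCH_NUM_MODES)
		return NULL;
	return strings[i];
}
```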
    • Yanan Wang's avatar
      KVM: selftests: Print the errno besides error-string in TEST_ASSERT · c412d6ac
      Yanan Wang authored
      
      
      Printing the errno alongside the error string in TEST_ASSERT, in the
      format "errno=%d - %s", explicitly indicates that the string is error
      information. In addition, the errno is easier to use for debugging
      than the error string.
      
      Suggested-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20210330080856.14940-5-wangyanan55@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c412d6ac
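      To illustrate the "errno=%d - %s" format from c412d6ac, here is a small
      standalone sketch. format_errno() is a hypothetical function written
      for this example; it is not part of TEST_ASSERT and only shows what the
      resulting text looks like.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical formatter: write "errno=<number> - <description>" into buf.
 * Returns the snprintf() result, i.e. the length the full text would need.
 */
int format_errno(char *buf, size_t len, int err)
{
	return snprintf(buf, len, "errno=%d - %s", err, strerror(err));
}
```

      Calling format_errno(buf, sizeof(buf), errno) right after a failed
      call yields the numeric code for quick lookup together with the
      human-readable description.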
    • Yanan Wang's avatar
      tools/headers: sync headers of asm-generic/hugetlb_encode.h · fa76c775
      Yanan Wang authored
      
      
      This patch syncs contents of tools/include/asm-generic/hugetlb_encode.h
      and include/uapi/asm-generic/hugetlb_encode.h. Arch powerpc supports 16KB
      hugepages and ARM64 supports 32MB/512MB hugepages. The corresponding mmap
      flags have already been added in include/uapi/asm-generic/hugetlb_encode.h,
      but not tools/include/asm-generic/hugetlb_encode.h.
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20210330080856.14940-2-wangyanan55@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fa76c775
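      For context on the mmap flags that fa76c775 syncs: the hugetlb encode
      header stores log2 of the page size in the flag bits starting at bit
      26. The sketch below computes that encoding for an arbitrary
      power-of-two size. hugetlb_flag_encode() and the SKETCH_ macro are
      hypothetical names for this example; the header itself provides
      per-size constants instead of a function.

```c
#include <stddef.h>

/* Same shift value as the hugetlb encode header uses (bit 26). */
#define SKETCH_HUGETLB_FLAG_ENCODE_SHIFT 26

/*
 * Hypothetical helper: return log2(pagesz) shifted into the flag position,
 * or 0 if pagesz is zero or not a power of two.
 */
unsigned int hugetlb_flag_encode(size_t pagesz)
{
	unsigned int log2 = 0;

	if (pagesz == 0 || (pagesz & (pagesz - 1)) != 0)
		return 0;
	while ((pagesz >>= 1) != 0)
		log2++;
	return log2 << SKETCH_HUGETLB_FLAG_ENCODE_SHIFT;
}
```

      With this encoding, the 16KB, 32MB and 512MB sizes mentioned above map
      to log2 values 14, 25 and 29 respectively.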
    • Haiwei Li's avatar
      KVM: vmx: add mismatched size assertions in vmcs_check32() · 870c575a
      Haiwei Li authored
      
      
      Add compile-time assertions in vmcs_check32() to disallow accesses to
      64-bit and 64-bit high fields via vmcs_{read,write}32().  Upper level KVM
      code should never do partial accesses to VMCS fields.  KVM handles the
      split accesses automatically in vmcs_{read,write}64() when running as a
      32-bit kernel.
      
      Reviewed-and-tested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
      Message-Id: <20210409022456.23528-1-lihaiwei.kernel@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      870c575a
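      The idea behind the assertions in 870c575a can be shown with a
      userspace sketch. In a VMCS field encoding, bits 14:13 give the field
      width and bit 0 the access type, so a 64-bit field and its high half
      can be recognised by masking. All names below are hypothetical, and
      the sketch uses C11 _Static_assert in place of the kernel's own
      build-time assertion macros.

```c
#include <stdbool.h>

/* Bits 14:13 encode the field width, bit 0 encodes full vs. high access. */
#define SKETCH_VMCS_WIDTH_ACCESS_MASK 0x6001UL
#define SKETCH_VMCS_64BIT_FULL        0x2000UL
#define SKETCH_VMCS_64BIT_HIGH        0x2001UL

/* True if the encoding names a full 64-bit field. */
bool sketch_vmcs_is_64bit(unsigned long field)
{
	return (field & SKETCH_VMCS_WIDTH_ACCESS_MASK) == SKETCH_VMCS_64BIT_FULL;
}

/* True if the encoding names the high half of a 64-bit field. */
bool sketch_vmcs_is_64bit_high(unsigned long field)
{
	return (field & SKETCH_VMCS_WIDTH_ACCESS_MASK) == SKETCH_VMCS_64BIT_HIGH;
}

/* Reject 64-bit and 64-bit high encodings at compile time. */
#define SKETCH_VMCS_CHECK32(field)                                            \
	_Static_assert(((field) & SKETCH_VMCS_WIDTH_ACCESS_MASK) !=           \
				SKETCH_VMCS_64BIT_FULL &&                      \
			((field) & SKETCH_VMCS_WIDTH_ACCESS_MASK) !=          \
				SKETCH_VMCS_64BIT_HIGH,                        \
			"32-bit accessor invalid for 64-bit field")

/* 0x4000 has width bits 10b (32-bit), so this assertion passes. */
SKETCH_VMCS_CHECK32(0x4000);
/* SKETCH_VMCS_CHECK32(0x2000) would fail to compile: 64-bit field. */
```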
    • Sean Christopherson's avatar
      KVM: Add proper lockdep assertion in I/O bus unregister · 7c896d37
      Sean Christopherson authored
      
      
      Convert a comment above kvm_io_bus_unregister_dev() into an actual
      lockdep assertion, and opportunistically add curly braces to a multi-line
      for-loop.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210412222050.876100-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7c896d37
    • Sean Christopherson's avatar
      KVM: Stop looking for coalesced MMIO zones if the bus is destroyed · 5d3c4c79
      Sean Christopherson authored
      Abort the walk of coalesced MMIO zones if kvm_io_bus_unregister_dev()
      fails to allocate memory for the new instance of the bus.  If it can't
      instantiate a new bus, unregister_dev() destroys all devices _except_ the
      target device.   But, it doesn't tell the caller that it obliterated the
      bus and invoked the destructor for all devices that were on the bus.  In
      the coalesced MMIO case, this can result in a deleted list entry
      dereference due to attempting to continue iterating on coalesced_zones
      after future entries (in the walk) have been deleted.
      
      Opportunistically add curly braces to the for-loop, which encompasses
      many lines but sneaks by without braces due to the guts being a single
      if statement.
      
      Fixes: f6588660 ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
      Cc: stable@vger.kernel.org
      Reported-by: Hao Sun <sunhao.th@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210412222050.876100-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5d3c4c79
    • Sean Christopherson's avatar
      KVM: Destroy I/O bus devices on unregister failure _after_ sync'ing SRCU · 2ee37574
      Sean Christopherson authored
      If allocating a new instance of an I/O bus fails when unregistering a
      device, wait to destroy the device until after all readers are guaranteed
      to see the new null bus.  Destroying devices before the bus is nullified
      could lead to use-after-free since readers expect the devices on their
      reference of the bus to remain valid.
      
      Fixes: f6588660 ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210412222050.876100-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2ee37574
    • Emanuele Giuseppe Esposito's avatar
      doc/virt/kvm: move KVM_CAP_PPC_MULTITCE in section 8 · 24e7475f
      Emanuele Giuseppe Esposito authored
      
      
      KVM_CAP_PPC_MULTITCE is a capability, not an ioctl.
      Therefore move it from section 4.97 to the new 8.31 (other capabilities).
      
      To fill the gap, move KVM_X86_SET_MSR_FILTER (was 4.126) to
      4.97, and shift the Xen-related ioctls (were 4.127 - 4.130) by
      one place (4.126 - 4.129).
      
      Also fix a minor typo in the KVM_GET_MSR_INDEX_LIST ioctl description
      (section 4.3).
      
      Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
      Message-Id: <20210316170814.64286-1-eesposit@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      24e7475f
    • Keqian Zhu's avatar
      KVM: x86: Remove unused function declaration · d90b15ed
      Keqian Zhu authored
      
      
      kvm_mmu_slot_largepage_remove_write_access() is declared but not used,
      so just remove it.
      
      Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
      Message-Id: <20210406063504.17552-1-zhukeqian1@huawei.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d90b15ed
    • Sean Christopherson's avatar
      KVM: SVM: Enhance and clean up the vmcb tracking comment in pre_svm_run() · 44f1b558
      Sean Christopherson authored
      
      
      Explicitly document why a vmcb must be marked dirty and assigned a new
      asid when it will be run on a different cpu.  The "what" is relatively
      obvious, whereas the "why" requires reading the APM and/or KVM code.
      
      Opportunistically remove a spurious period and several unnecessary
      newlines in the comment.
      
      No functional change intended.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210406171811.4043363-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      44f1b558
    • Sean Christopherson's avatar
      KVM: SVM: Add a comment to clarify what vcpu_svm.vmcb points at · 554cf314
      Sean Christopherson authored
      
      
      Add a comment above the declaration of vcpu_svm.vmcb to call out that it
      is simply a shorthand for current_vmcb->ptr.  The myriad accesses to
      svm->vmcb are quite confusing without this crucial detail.
      
      No functional change intended.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210406171811.4043363-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      554cf314