  1. Sep 12, 2017
• KVM: PPC: Book3S HV: Fix bug causing host SLB to be restored incorrectly · 67f8a8c1
      Paul Mackerras authored
      Aneesh Kumar reported seeing host crashes when running recent kernels
      on POWER8.  The symptom was an oops like this:
      
      Unable to handle kernel paging request for data at address 0xf00000000786c620
      Faulting instruction address: 0xc00000000030e1e4
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE SMP NR_CPUS=2048 NUMA PowerNV
      Modules linked in: powernv_op_panel
      CPU: 24 PID: 6663 Comm: qemu-system-ppc Tainted: G        W 4.13.0-rc7-43932-gfc36c59 #2
      task: c000000fdeadfe80 task.stack: c000000fdeb68000
      NIP:  c00000000030e1e4 LR: c00000000030de6c CTR: c000000000103620
      REGS: c000000fdeb6b450 TRAP: 0300   Tainted: G        W        (4.13.0-rc7-43932-gfc36c59)
      MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24044428  XER: 20000000
      CFAR: c00000000030e134 DAR: f00000000786c620 DSISR: 40000000 SOFTE: 0
      GPR00: 0000000000000000 c000000fdeb6b6d0 c0000000010bd000 000000000000e1b0
      GPR04: c00000000115e168 c000001fffa6e4b0 c00000000115d000 c000001e1b180386
      GPR08: f000000000000000 c000000f9a8913e0 f00000000786c600 00007fff587d0000
      GPR12: c000000fdeb68000 c00000000fb0f000 0000000000000001 00007fff587cffff
      GPR16: 0000000000000000 c000000000000000 00000000003fffff c000000fdebfe1f8
      GPR20: 0000000000000004 c000000fdeb6b8a8 0000000000000001 0008000000000040
      GPR24: 07000000000000c0 00007fff587cffff c000000fdec20bf8 00007fff587d0000
      GPR28: c000000fdeca9ac0 00007fff587d0000 00007fff587c0000 00007fff587d0000
      NIP [c00000000030e1e4] __get_user_pages_fast+0x434/0x1070
      LR [c00000000030de6c] __get_user_pages_fast+0xbc/0x1070
      Call Trace:
      [c000000fdeb6b6d0] [c00000000139dab8] lock_classes+0x0/0x35fe50 (unreliable)
      [c000000fdeb6b7e0] [c00000000030ef38] get_user_pages_fast+0xf8/0x120
      [c000000fdeb6b830] [c000000000112318] kvmppc_book3s_hv_page_fault+0x308/0xf30
      [c000000fdeb6b960] [c00000000010e10c] kvmppc_vcpu_run_hv+0xfdc/0x1f00
      [c000000fdeb6bb20] [c0000000000e915c] kvmppc_vcpu_run+0x2c/0x40
      [c000000fdeb6bb40] [c0000000000e5650] kvm_arch_vcpu_ioctl_run+0x110/0x300
      [c000000fdeb6bbe0] [c0000000000d6468] kvm_vcpu_ioctl+0x528/0x900
      [c000000fdeb6bd40] [c0000000003bc04c] do_vfs_ioctl+0xcc/0x950
      [c000000fdeb6bde0] [c0000000003bc930] SyS_ioctl+0x60/0x100
      [c000000fdeb6be30] [c00000000000b96c] system_call+0x58/0x6c
      Instruction dump:
      7ca81a14 2fa50000 41de0010 7cc8182a 68c60002 78c6ffe2 0b060000 3cc2000a
      794a3664 390610d8 e9080000 7d485214 <e90a0020> 7d435378 790507e1 408202f0
      ---[ end trace fad4a342d0414aa2 ]---
      
      It turns out that what has happened is that the SLB entry for the
vmemmap region hasn't been reloaded on exit from a guest, and it has
      the wrong page size.  Then, when the host next accesses the vmemmap
      region, it gets a page fault.
      
      Commit a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with
      KVM", 2017-07-24) modified the guest exit code so that it now only clears
out the SLB for hash guests.  The code tests the radix flag and puts the
      result in a non-volatile CR field, CR2, and later branches based on CR2.
      
      Unfortunately, the kvmppc_save_tm function, which gets called between
      those two points, modifies all the user-visible registers in the case
      where the guest was in transactional or suspended state, except for a
few which it restores (namely r1, r2, r9 and r13).  Thus the
hash/radix indication in CR2 gets corrupted.
      
      This fixes the problem by re-doing the comparison just before the
      result is needed.  For good measure, this also adds comments next to
      the call sites of kvmppc_save_tm and kvmppc_restore_tm pointing out
      that non-volatile register state will be lost.
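
A minimal C sketch of the bug pattern (the real fix is powerpc
assembly in book3s_hv_rmhandlers.S; every name below is an
illustrative stand-in, not the kernel's):

    #include <stdbool.h>
    #include <stdio.h>

    static bool radix_guest;            /* stand-in for the guest's radix flag */

    static bool guest_is_radix(void) { return radix_guest; }

    /* Models kvmppc_save_tm(): free to clobber all user-visible
     * registers except r1, r2, r9 and r13, including non-volatile
     * CR fields such as CR2. */
    static void save_tm_state(void) { }

    static void guest_exit_path(void)
    {
            bool cached = guest_is_radix();     /* early test, parked in CR2 */

            save_tm_state();                    /* may corrupt the cached result */
            (void)cached;                       /* so it cannot be trusted here */

            if (!guest_is_radix())              /* the fix: redo the comparison */
                    puts("hash guest: clear SLB and reload host entries");
    }

    int main(void) { guest_exit_path(); return 0; }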
      
      Cc: stable@vger.kernel.org # v4.13
Fixes: a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with KVM")
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
• KVM: PPC: Book3S HV: Hold kvm->lock around call to kvmppc_update_lpcr · cf5f6f31
      Paul Mackerras authored
      Commit 468808bd ("KVM: PPC: Book3S HV: Set process table for HPT
      guests on POWER9", 2017-01-30) added a call to kvmppc_update_lpcr()
      which doesn't hold the kvm->lock mutex around the call, as required.
      This adds the lock/unlock pair, and for good measure, includes
      the kvmppc_setup_partition_table() call in the locked region, since
      it is altering global state of the VM.
      
      This error appears not to have any fatal consequences for the host;
      the consequences would be that the VCPUs could end up running with
      different LPCR values, or an update to the LPCR value by userspace
      using the one_reg interface could get overwritten, or the update
      done by kvmhv_configure_mmu() could get overwritten.
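
A hedged sketch of the locked region (simplified from
kvmhv_configure_mmu() in arch/powerpc/kvm/book3s_hv.c; the
computation of lpcr and mask is elided):

    mutex_lock(&kvm->lock);
    kvmppc_setup_partition_table(kvm);      /* alters VM-global state */
    kvmppc_update_lpcr(kvm, lpcr, mask);    /* now serialized, as required */
    mutex_unlock(&kvm->lock);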
      
      Cc: stable@vger.kernel.org # v4.10+
Fixes: 468808bd ("KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
• KVM: PPC: Book3S HV: Don't access XIVE PIPR register using byte accesses · d222af07
      Benjamin Herrenschmidt authored
      The XIVE interrupt controller on POWER9 machines doesn't support byte
      accesses to any register in the thread management area other than the
      CPPR (current processor priority register).  In particular, when
      reading the PIPR (pending interrupt priority register), we need to
      do a 32-bit or 64-bit load.
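
A hedged sketch of the access-width change (the offset macro comes
from the XIVE TIMA layout in xive-regs.h; the byte position of PIPR
within the word is an assumption here, not copied from the patch):

    /* XIVE rejects byte loads here, so fetch the whole 64-bit
     * thread-management word and extract PIPR from it. */
    u64 qw = be64_to_cpu(__raw_readq(tima + TM_QW3_HV_PHYS));
    u8 pipr = qw & 0xff;    /* PIPR assumed to be the last byte of the word */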
      
      Cc: stable@vger.kernel.org # v4.13
Fixes: 2c4fb78f ("KVM: PPC: Book3S HV: Workaround POWER9 DD1.0 bug causing IPB bit loss")
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2. Aug 30, 2017
• KVM: PPC: Book3S HV: Protect updates to spapr_tce_tables list · edd03602
Paul Mackerras authored
      Al Viro pointed out that while one thread of a process is executing
      in kvm_vm_ioctl_create_spapr_tce(), another thread could guess the
      file descriptor returned by anon_inode_getfd() and close() it before
      the first thread has added it to the kvm->arch.spapr_tce_tables list.
      That highlights a more general problem: there is no mutual exclusion
      between writers to the spapr_tce_tables list, leading to the
      possibility of the list becoming corrupted, which could cause a
      host kernel crash.
      
      To fix the mutual exclusion problem, we add a mutex_lock/unlock
pair around the list_del_rcu() in kvm_spapr_tce_release().  Also,
      this moves the call to anon_inode_getfd() inside the region
      protected by the kvm->lock mutex, after we have done the check for
      a duplicate LIOBN.  This means that if another thread does guess the
      file descriptor and closes it, its call to kvm_spapr_tce_release()
      will not do any harm because it will have to wait until the first
      thread has released kvm->lock.  With this, there are no failure
      points in kvm_vm_ioctl_create_spapr_tce() after the call to
      anon_inode_getfd().
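
A hedged sketch of the release side after the fix (abridged from
kvm_spapr_tce_release(); the rest of the teardown is elided):

    static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
    {
            struct kvmppc_spapr_tce_table *stt = filp->private_data;

            /* Writers to kvm->arch.spapr_tce_tables now exclude each
             * other, so a racing close() must wait for the creator to
             * drop kvm->lock. */
            mutex_lock(&stt->kvm->lock);
            list_del_rcu(&stt->list);
            mutex_unlock(&stt->kvm->lock);

            /* remainder of the teardown elided */
            return 0;
    }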
      
      The other things that the second thread could do with the guessed
      file descriptor are to mmap it or to pass it as a parameter to a
      KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE ioctl on a KVM device fd.  An mmap
      call won't cause any harm because kvm_spapr_tce_mmap() and
      kvm_spapr_tce_fault() don't access the spapr_tce_tables list or
      the kvmppc_spapr_tce_table.list field, and the fields that they do use
      have been properly initialized by the time of the anon_inode_getfd()
      call.
      
      The KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE ioctl calls
      kvm_spapr_tce_attach_iommu_group(), which scans the spapr_tce_tables
      list looking for the kvmppc_spapr_tce_table struct corresponding to
      the fd given as the parameter.  Either it will find the new entry
      or it won't; if it doesn't, it just returns an error, and if it
      does, it will function normally.  So, in each case there is no
      harmful effect.
      
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3. Aug 25, 2017
• kvm: nVMX: Validate the virtual-APIC address on nested VM-entry · 712b12d7
Jim Mattson authored
According to the SDM, if the "use TPR shadow" VM-execution control is
1, bits 11:0 of the virtual-APIC address must be 0 and the address
should not set any bits beyond the processor's physical-address width.
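
A hedged sketch of the check (the helper name and call site are
approximations, not the patch's exact code):

    if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW) &&
        (!IS_ALIGNED(vmcs12->virtual_apic_page_addr, PAGE_SIZE) ||
         (vmcs12->virtual_apic_page_addr >> cpuid_maxphyaddr(vcpu))))
            return VMXERR_ENTRY_INVALID_CONTROL_FIELD;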
      
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
• KVM: PPC: Book3S: Fix race and leak in kvm_vm_ioctl_create_spapr_tce() · 47c5310a
Paul Mackerras authored
      Nixiaoming pointed out that there is a memory leak in
      kvm_vm_ioctl_create_spapr_tce() if the call to anon_inode_getfd()
      fails; the memory allocated for the kvmppc_spapr_tce_table struct
      is not freed, and nor are the pages allocated for the iommu
      tables.  In addition, we have already incremented the process's
      count of locked memory pages, and this doesn't get restored on
      error.
      
      David Hildenbrand pointed out that there is a race in that the
      function checks early on that there is not already an entry in the
kvm->arch.spapr_tce_tables list with the same LIOBN, but an entry with the
      same LIOBN could get added between then and when the new entry is
      added to the list.
      
      This fixes all three problems.  To simplify things, we now call
      anon_inode_getfd() before placing the new entry in the list.  The
      check for an existing entry is done while holding the kvm->lock
      mutex, immediately before adding the new entry to the list.
      Finally, on failure we now call kvmppc_account_memlimit to
      decrement the process's count of locked memory pages.
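
A hedged sketch of the reordered tail of
kvm_vm_ioctl_create_spapr_tce() (labels and the accounting helper's
arguments are simplified):

    mutex_lock(&kvm->lock);
    /* Duplicate-LIOBN check, now done right before insertion. */
    list_for_each_entry(siter, &kvm->arch.spapr_tce_tables, list) {
            if (siter->liobn == args->liobn) {
                    ret = -EBUSY;
                    goto fail_unlock;
            }
    }
    /* The fd is created before the entry is published, so there are
     * no failure points after anon_inode_getfd() succeeds. */
    ret = anon_inode_getfd("kvm-spapr-tce", &kvm_spapr_tce_fops, stt,
                           O_RDWR | O_CLOEXEC);
    if (ret >= 0)
            list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
 fail_unlock:
    mutex_unlock(&kvm->lock);
    if (ret < 0)
            kvmppc_account_memlimit(npages, false);  /* undo the accounting */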
      
Reported-by: Nixiaoming <nixiaoming@huawei.com>
Reported-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
• KVM, pkeys: do not use PKRU value in vcpu->arch.guest_fpu.state · 38cfd5e3
      Paolo Bonzini authored
The host PKRU is restored right after vcpu exit (commit 1be0e61c), so
KVM_GET_XSAVE will return the host PKRU value instead.  Fix this by
      using the guest PKRU explicitly in fill_xsave and load_xsave.  This
      part is based on a patch by Junkang Fu.
      
      The host PKRU data may also not match the value in vcpu->arch.guest_fpu.state,
      because it could have been changed by userspace since the last time
      it was saved, so skip loading it in kvm_load_guest_fpu.
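
A hedged sketch of the fill_xsave() side (the surrounding copy loop
is elided; variable names approximate):

    if (xfeature == XFEATURE_MASK_PKRU)
            /* Use the vcpu's own PKRU, not guest_fpu.state, which may
             * hold the host value at this point. */
            memcpy(dest + offset, &vcpu->arch.pkru, sizeof(vcpu->arch.pkru));
    else
            memcpy(dest + offset, src, size);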
      
Reported-by: Junkang Fu <junkang.fjk@alibaba-inc.com>
      Cc: Yang Zhang <zy107165@alibaba-inc.com>
Fixes: 1be0e61c
      Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
• KVM: x86: simplify handling of PKRU · b9dd21e1
Paolo Bonzini authored
Move it to struct kvm_vcpu_arch, replacing guest_pkru_valid with a
simple comparison against the host value of the register.  In addition,
the write of PKRU can be skipped if the guest has not enabled the feature.
      Once we do this, we need not test OSPKE in the host anymore, because
      guest_CR4.PKE=1 implies host_CR4.PKE=1.
      
      The static PKU test is kept to elide the code on older CPUs.
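
A hedged sketch of the entry-side switch after the change (from the
vmx_vcpu_run() path; the exit side is symmetric):

    if (static_cpu_has(X86_FEATURE_PKU) &&
        kvm_read_cr4_bits(vcpu, X86_CR4_PKE) &&
        vcpu->arch.pkru != vmx->host_pkru)
            __write_pkru(vcpu->arch.pkru);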
      
Suggested-by: Yang Zhang <zy107165@alibaba-inc.com>
Fixes: 1be0e61c
      Cc: stable@vger.kernel.org
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
• KVM: x86: block guest protection keys unless the host has them enabled · c469268c
      Paolo Bonzini authored
If the host has protection keys disabled, we cannot read and write the
guest PKRU: RDPKRU and WRPKRU fail with #GP(0) if CR4.PKE=0.  Block
      the PKU cpuid bit in that case.
      
      This ensures that guest_CR4.PKE=1 implies host_CR4.PKE=1.
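
A hedged sketch of the cpuid filtering (approximating the change in
the CPUID 7.0.ECX handling):

    /* PKU is only implemented with TDP, and RDPKRU/WRPKRU #GP unless
     * the host has CR4.PKE=1, so hide the bit in both cases. */
    if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
            entry->ecx &= ~F(PKU);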
      
Fixes: 1be0e61c
      Cc: stable@vger.kernel.org
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
• KVM: nVMX: Fix trying to cancel vmlaunch/vmresume · bfcf83b1
Wanpeng Li authored
      ------------[ cut here ]------------
      WARNING: CPU: 7 PID: 3861 at /home/kernel/ssd/kvm/arch/x86/kvm//vmx.c:11299 nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
      CPU: 7 PID: 3861 Comm: qemu-system-x86 Tainted: G        W  OE   4.13.0-rc4+ #11
      RIP: 0010:nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
      Call Trace:
       ? kvm_multiple_exception+0x149/0x170 [kvm]
       ? handle_emulation_failure+0x79/0x230 [kvm]
       ? load_vmcs12_host_state+0xa80/0xa80 [kvm_intel]
       ? check_chain_key+0x137/0x1e0
       ? reexecute_instruction.part.168+0x130/0x130 [kvm]
       nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
       ? nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
       vmx_queue_exception+0x197/0x300 [kvm_intel]
       kvm_arch_vcpu_ioctl_run+0x1b0c/0x2c90 [kvm]
       ? kvm_arch_vcpu_runnable+0x220/0x220 [kvm]
       ? preempt_count_sub+0x18/0xc0
       ? restart_apic_timer+0x17d/0x300 [kvm]
       ? kvm_lapic_restart_hv_timer+0x37/0x50 [kvm]
       ? kvm_arch_vcpu_load+0x1d8/0x350 [kvm]
       kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
       ? kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
       ? kvm_dev_ioctl+0xbe0/0xbe0 [kvm]
      
      The flag "nested_run_pending", which can override the decision of which should run
      next, L1 or L2. nested_run_pending=1 means that we *must* run L2 next, not L1. This
      is necessary in particular when L1 did a VMLAUNCH of L2 and therefore expects L2 to
      be run (and perhaps be injected with an event it specified, etc.). Nested_run_pending
      is especially intended to avoid switching  to L1 in the injection decision-point.
      
      This can be handled just like the other cases in vmx_check_nested_events, instead of
      having a special case in vmx_queue_exception.
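
A hedged sketch of the new case in vmx_check_nested_events()
(condition and helper names approximate):

    if (vcpu->arch.exception.pending &&
        nested_vmx_check_exception(vcpu, &exit_qual)) {
            if (vmx->nested.nested_run_pending)
                    return -EBUSY;  /* must enter L2 first, don't switch to L1 */
            nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
            return 0;
    }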
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
• KVM: X86: Fix loss of exception which has not yet been injected · 664f8e26
Wanpeng Li authored
      vmx_complete_interrupts() assumes that the exception is always injected,
      so it can be dropped by kvm_clear_exception_queue().  However,
      an exception cannot be injected immediately if it is: 1) originally
      destined to a nested guest; 2) trapped to cause a vmexit; 3) happening
      right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true.
      
      This patch applies to exceptions the same algorithm that is used for
      NMIs, replacing exception.reinject with "exception.injected" (equivalent
      to nmi_injected).
      
      exception.pending now represents an exception that is queued and whose
      side effects (e.g., update RFLAGS.RF or DR7) have not been applied yet.
      If exception.pending is true, the exception might result in a nested
      vmexit instead, too (in which case the side effects must not be applied).
      
      exception.injected instead represents an exception that is going to be
      injected into the guest at the next vmentry.
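
The resulting split, sketched from the commit text (struct
abbreviated; field layout as of this series):

    struct kvm_queued_exception {
            bool pending;       /* queued; side effects not yet applied,
                                 * may still become a nested vmexit */
            bool injected;      /* will be injected at the next vmentry */
            bool has_error_code;
            u8 nr;
            u32 error_code;
    } exception;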
      
Reported-by: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
• KVM: VMX: use kvm_event_needs_reinjection · 274bba52
Wanpeng Li authored
Use the kvm_event_needs_reinjection() encapsulation instead of
open-coding the test.
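
The helper being adopted (sketched from arch/x86/kvm/x86.h as of this
series; field names follow the previous patch's pending/injected split):

    static inline bool kvm_event_needs_reinjection(struct kvm_vcpu *vcpu)
    {
            return vcpu->arch.exception.injected ||
                   vcpu->arch.interrupt.pending ||
                   vcpu->arch.nmi_injected;
    }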
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>