  1. Jan 31, 2017
    • KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size · f98a8bf9
      David Gibson authored
      The KVM_PPC_ALLOCATE_HTAB ioctl() is used to set the size of the hashed
      page table (HPT) that userspace expects a guest VM to have, and is also
      used to clear that HPT when necessary (e.g. guest reboot).
      
      At present, once the ioctl() has been called for the first time, the HPT
      size can never be changed: later calls clear the HPT but keep the size
      set by the first call.
      
      With the upcoming HPT resize implementation, we're going to need to allow
      userspace to resize the HPT at reset (to change it back to the default
      size if the guest changed it).
      
      So, we need to allow this ioctl() to change the HPT size.
      
      This patch also updates Documentation/virtual/kvm/api.txt to reflect
      the new behaviour.  In fact the documentation was already slightly
      incorrect since commit 572abd56 ("KVM: PPC: Book3S HV: Don't fall back
      to smaller HPT size in allocation ioctl").
      
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      f98a8bf9
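      As an editorial illustration of the behaviour described above (not part
      of the commit), a minimal userspace sketch might look like the following;
      the order values are purely illustrative and error handling is minimal:

          /* Ask for a 64 MiB HPT, then request a different size later,
           * e.g. at guest reset, which this commit now allows. */
          #include <fcntl.h>
          #include <stdio.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>

          int main(void)
          {
                  int kvm = open("/dev/kvm", O_RDWR);
                  int vm  = ioctl(kvm, KVM_CREATE_VM, 0);
                  __u32 order = 26;                /* 1 << 26 bytes = 64 MiB */

                  if (ioctl(vm, KVM_PPC_ALLOCATE_HTAB, &order) < 0)
                          perror("KVM_PPC_ALLOCATE_HTAB");

                  order = 24;                      /* resize to 16 MiB at "reset" */
                  if (ioctl(vm, KVM_PPC_ALLOCATE_HTAB, &order) < 0)
                          perror("KVM_PPC_ALLOCATE_HTAB (resize)");
                  return 0;
          }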
    • KVM: PPC: Book3S HV: Split HPT allocation from activation · aae0777f
      David Gibson authored
      
      
      Currently, kvmppc_alloc_hpt() both allocates a new hashed page table (HPT)
      and sets it up as the active page table for a VM.  For the upcoming HPT
      resize implementation we're going to want to allocate HPTs separately from
      activating them.
      
      So, split the allocation itself out into kvmppc_allocate_hpt() and perform
      the activation with a new kvmppc_set_hpt() function.  Likewise we split
      kvmppc_free_hpt(), which just frees the HPT, from kvmppc_release_hpt()
      which unsets it as an active HPT, then frees it.
      
      We also move the logic to fall back to smaller HPT sizes if the first try
      fails into the single caller which used that behaviour,
      kvmppc_hv_setup_htab_rma().  This introduces a slight semantic change, in
      that previously if the initial attempt at CMA allocation failed, we would
      fall back to attempting smaller sizes with the page allocator.  Now, at
      each size we try CMA first and then the page allocator.  As far as I can tell
      this change should be harmless.
      
      To match, we make kvmppc_free_hpt() just free the actual HPT itself.  The
      kvmppc_free_lpid() call that used to be there is moved to the single caller.
      
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      aae0777f
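      To make the new fallback order concrete, here is a hedged sketch (not the
      literal kernel code) of how the caller steps down through HPT sizes, on
      the assumption that kvmppc_allocate_hpt() itself tries CMA and then the
      page allocator at each order; the bounds and variable names are
      illustrative:

          /* Try progressively smaller HPTs until one allocation succeeds. */
          int err = -ENOMEM;
          int order;

          for (order = 37; order >= 18 && err == -ENOMEM; order--)
                  err = kvmppc_allocate_hpt(&info, order);   /* CMA, then buddy */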
    • KVM: PPC: Book3S HV: Don't store values derivable from HPT order · 3d089f84
      David Gibson authored
      
      
      Currently the kvm_hpt_info structure stores the hashed page table's order,
      and also the number of HPTEs it contains and a mask for its size.  The
      last two can be easily derived from the order, so remove them and just
      calculate them as necessary with a couple of helper inlines.
      
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Thomas Huth <thuth@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      3d089f84
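      The derivation is simple enough to show as a small standalone sketch:
      each HPTE is 16 bytes and each HPTE group is 128 bytes, so both values
      fall out of the order.  The helper names below are illustrative rather
      than necessarily the ones the commit adds:

          #include <stdio.h>

          static inline unsigned long hpt_npte(unsigned int order)
          {
                  return 1UL << (order - 4);        /* 16 bytes per HPTE */
          }

          static inline unsigned long hpt_mask(unsigned int order)
          {
                  return (1UL << (order - 7)) - 1;  /* 128 bytes per HPTE group */
          }

          int main(void)
          {
                  unsigned int order = 26;          /* 64 MiB HPT */
                  printf("HPTEs: %lu, hash mask: 0x%lx\n",
                         hpt_npte(order), hpt_mask(order));
                  return 0;
          }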
    • KVM: PPC: Book3S HV: Gather HPT related variables into sub-structure · 3f9d4f5a
      David Gibson authored
      
      
      Currently, the powerpc kvm_arch structure contains a number of variables
      tracking the state of the guest's hashed page table (HPT) in KVM HV.  This
      patch gathers them all together into a single kvm_hpt_info substructure.
      This makes life more convenient for the upcoming HPT resizing
      implementation.
      
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      3f9d4f5a
    • KVM: PPC: Book3S HV: Rename kvm_alloc_hpt() for clarity · db9a290d
      David Gibson authored
      
      
      The difference between kvm_alloc_hpt() and kvmppc_alloc_hpt() is not at
      all obvious from the names.  In practice kvmppc_alloc_hpt() allocates an HPT
      by whatever means, and calls kvm_alloc_hpt() which will attempt to allocate
      it with CMA only.
      
      To make this less confusing, rename kvm_alloc_hpt() to kvm_alloc_hpt_cma().
      Similarly, kvm_release_hpt() is renamed kvm_free_hpt_cma().
      
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Thomas Huth <thuth@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      db9a290d
    • KVM: PPC: Book3S HV: HPT resizing documentation and reserved numbers · ef1ead0c
      David Gibson authored
      
      
      This adds a new powerpc-specific KVM_CAP_SPAPR_RESIZE_HPT capability to
      advertise whether KVM is capable of handling the PAPR extensions for
      resizing the hashed page table during guest runtime.  It also adds
      definitions for two new VM ioctl()s to implement this extension, and
      documentation of the same.
      
      Note that HPT resizing is already possible with KVM PR without kernel
      modification, since the HPT is managed within userspace (qemu).  The
      capability defined here will only be set where an in-kernel implementation
      of resizing is necessary, i.e. for KVM HV.  To determine if the userspace
      resize implementation can be used, it's necessary to check
      KVM_CAP_PPC_ALLOC_HTAB.  Unfortunately older kernels incorrectly set
      KVM_CAP_PPC_ALLOC_HTAB even with KVM PR.  If userspace wants to support
      resizing with KVM PR on such kernels, it will need a workaround.
      
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      ef1ead0c
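      The capability checks described above can be probed from userspace along
      these lines (an editorial sketch, assuming kernel headers new enough to
      define KVM_CAP_SPAPR_RESIZE_HPT; error handling omitted):

          #include <fcntl.h>
          #include <stdio.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>

          int main(void)
          {
                  int kvm    = open("/dev/kvm", O_RDWR);
                  int resize = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_RESIZE_HPT);
                  int htab   = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_PPC_ALLOC_HTAB);

                  if (resize > 0)
                          printf("in-kernel (HV) HPT resize available\n");
                  else if (htab <= 0)
                          printf("HPT lives in userspace (PR): resize needs no kernel help\n");
                  else
                          printf("no resize support (or an older kernel misreporting ALLOC_HTAB)\n");
                  return 0;
          }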
    • Documentation: Correct duplicate section number in kvm/api.txt · ccc4df4e
      David Gibson authored
      
      
      Both KVM_CREATE_SPAPR_TCE_64 and KVM_REINJECT_CONTROL have section number
      4.98 in Documentation/virtual/kvm/api.txt, presumably due to a naive merge.
      This corrects the duplication.
      
      [paulus@ozlabs.org - correct section numbers for following sections,
       KVM_PPC_CONFIGURE_V3_MMU and KVM_PPC_GET_RMMU_INFO, as well.]
      
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      ccc4df4e
    • Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next · 167c76e0
      Paul Mackerras authored
      
      
      This merges in the POWER9 radix MMU host and guest support, which
      was put into a topic branch because it touches both powerpc and
      KVM code.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      167c76e0
    • KVM: PPC: Book3S HV: Enable radix guest support · 8cf4ecc0
      Paul Mackerras authored
      
      
      This adds a few last pieces of the support for radix guests:
      
      * Implement the backends for the KVM_PPC_CONFIGURE_V3_MMU and
        KVM_PPC_GET_RMMU_INFO ioctls for radix guests
      
      * On POWER9, allow secondary threads to be on/off-lined while guests
        are running.
      
      * Set up LPCR and the partition table entry for radix guests.
      
      * Don't allocate the rmap array in the kvm_memory_slot structure
        on radix.
      
      * Don't try to initialize the HPT for radix guests, since they don't
        have an HPT.
      
      * Take out the code that prevents the HV KVM module from
        initializing on radix hosts.
      
      At this stage, we only support radix guests if the host is running
      in radix mode, and only support HPT guests if the host is running in
      HPT mode.  Thus a guest cannot switch from one mode to the other,
      which enables some simplifications.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8cf4ecc0
    • KVM: PPC: Book3S HV: Invalidate ERAT on guest entry/exit for POWER9 DD1 · f11f6f79
      Paul Mackerras authored
      
      
      On POWER9 DD1, we need to invalidate the ERAT (effective to real
      address translation cache) when changing the PIDR register, which
      we do as part of guest entry and exit.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f11f6f79
    • KVM: PPC: Book3S HV: Allow guest exit path to have MMU on · 53af3ba2
      Paul Mackerras authored
      
      
      If we allow LPCR[AIL] to be set for radix guests, then interrupts from
      the guest to the host can be delivered by the hardware with relocation
      on, and thus the code path starting at kvmppc_interrupt_hv can be
      executed in virtual mode (MMU on) for radix guests (previously it was
      only ever executed in real mode).
      
      Most of the code is indifferent to whether the MMU is on or off, but
      the calls to OPAL that use the real-mode OPAL entry code need to
      be switched to use the virtual-mode code instead.  The affected
      calls are the calls to the OPAL XICS emulation functions in
      kvmppc_read_one_intr() and related functions.  We test the MSR[IR]
      bit to detect whether we are in real or virtual mode, and call the
      opal_rm_* or opal_* function as appropriate.
      
      The other place that depends on the MMU being off is the optimization
      where the guest exit code jumps to the external interrupt vector or
      hypervisor doorbell interrupt vector, or returns to its caller (which
      is __kvmppc_vcore_entry).  If the MMU is on and we are returning to
      the caller, then we don't need to use an rfid instruction since the
      MMU is already on; a simple blr suffices.  If there is an external
      or hypervisor doorbell interrupt to handle, we branch to the
      relocation-on version of the interrupt vector.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      53af3ba2
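      The MSR[IR] test described above amounts to something like the sketch
      below; opal_xics_foo()/opal_rm_xics_foo() are hypothetical stand-ins for
      the affected XICS emulation calls, not real OPAL entry points:

          /* MSR[IR] set => instruction relocation on => use virtual-mode OPAL. */
          if (mfmsr() & MSR_IR)
                  rc = opal_xics_foo(args);
          else
                  rc = opal_rm_xics_foo(args);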
    • KVM: PPC: Book3S HV: Invalidate TLB on radix guest vcpu movement · a29ebeaf
      Paul Mackerras authored
      
      
      With radix, the guest can do TLB invalidations itself using the tlbie
      (global) and tlbiel (local) TLB invalidation instructions.  Linux guests
      use local TLB invalidations for translations that have only ever been
      accessed on one vcpu.  However, that doesn't mean that the translations
      have only been accessed on one physical cpu (pcpu) since vcpus can move
      around from one pcpu to another.  Thus a tlbiel might leave behind stale
      TLB entries on a pcpu where the vcpu previously ran, and if that task
      then moves back to that previous pcpu, it could see those stale TLB
      entries and thus access memory incorrectly.  The usual symptom of this
      is random segfaults in userspace programs in the guest.
      
      To cope with this, we detect when a vcpu is about to start executing on
      a thread in a core that is a different core from the last time it
      executed.  If that is the case, then we mark the core as needing a
      TLB flush and then send an interrupt to any thread in the core that is
      currently running a vcpu from the same guest.  This will get those vcpus
      out of the guest, and the first one to re-enter the guest will do the
      TLB flush.  The reason for interrupting the vcpus executing on the old
      core is to cope with the following scenario:
      
        CPU 0                   CPU 1                   CPU 4
        (core 0)                (core 0)                (core 1)

        VCPU 0 runs task X      VCPU 1 runs
        core 0 TLB gets
        entries from task X
        VCPU 0 moves to CPU 4
                                                        VCPU 0 runs task X
                                                        Unmap pages of task X
                                                        tlbiel

                                (still VCPU 1)          task X moves to VCPU 1
                                task X runs
                                task X sees stale TLB
                                entries
      
      That is, as soon as the VCPU starts executing on the new core, it
      could unmap and tlbiel some page table entries, and then the task
      could migrate to one of the VCPUs running on the old core and
      potentially see stale TLB entries.
      
      Since the TLB is shared between all the threads in a core, we only
      use the bit of kvm->arch.need_tlb_flush corresponding to the first
      thread in the core.  To ensure that we don't have a window where we
      can miss a flush, this moves the clearing of the bit from before the
      actual flush to after it.  This way, two threads might both do the
      flush, but we prevent the situation where one thread can enter the
      guest before the flush is finished.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a29ebeaf
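      The flush-before-clear ordering described in the last paragraph looks
      roughly like the sketch below; flush_guest_tlb() is a hypothetical
      placeholder for the actual tlbiel loop, and the surrounding locking is
      omitted:

          /* Only the bit for the first thread of the core is meaningful. */
          int first = cpu_first_thread_sibling(pcpu);

          if (cpumask_test_cpu(first, &kvm->arch.need_tlb_flush)) {
                  flush_guest_tlb(kvm);           /* do the flush first... */
                  /* ...and only then clear the bit, so no other thread can
                   * enter the guest between the clear and the flush. */
                  cpumask_clear_cpu(first, &kvm->arch.need_tlb_flush);
          }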
    • KVM: PPC: Book3S HV: Make HPT-specific hypercalls return error in radix mode · 65dae540
      Paul Mackerras authored
      
      
      If the guest is in radix mode, then it doesn't have a hashed page
      table (HPT), so all of the hypercalls that manipulate the HPT can't
      work and should return an error.  This adds checks to make them
      return H_FUNCTION ("function not supported").
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      65dae540
    • KVM: PPC: Book3S HV: Implement dirty page logging for radix guests · 8f7b79b8
      Paul Mackerras authored
      
      
      This adds code to keep track of dirty pages when requested (that is,
      when memslot->dirty_bitmap is non-NULL) for radix guests.  We use the
      dirty bits in the PTEs in the second-level (partition-scoped) page
      tables, together with a bitmap of pages that were dirty when their
      PTE was invalidated (e.g., when the page was paged out).  This bitmap
      is stored in the first half of the memslot->dirty_bitmap area, and
      kvm_vm_ioctl_get_dirty_log_hv() now uses the second half for the
      bitmap that gets returned to userspace.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8f7b79b8
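      A standalone sketch of the bitmap split described above (sizes and
      pointer arithmetic are purely illustrative of the "first half tracked by
      KVM, second half returned to userspace" layout):

          #include <stdio.h>
          #include <stdlib.h>
          #include <string.h>

          int main(void)
          {
                  unsigned long npages = 1UL << 16;               /* pages in the memslot */
                  size_t half = ((npages + 63) / 64) * sizeof(unsigned long);

                  unsigned long *dirty_bitmap = calloc(2, half);  /* both halves */
                  unsigned long *kvm_half  = dirty_bitmap;        /* KVM's working copy */
                  unsigned long *user_half = (void *)((char *)dirty_bitmap + half);

                  /* "get dirty log": harvest into the second half, then clear. */
                  memcpy(user_half, kvm_half, half);
                  memset(kvm_half, 0, half);

                  printf("each half is %zu bytes for %lu pages\n", half, npages);
                  free(dirty_bitmap);
                  return 0;
          }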
    • KVM: PPC: Book3S HV: MMU notifier callbacks for radix guests · 01756099
      Paul Mackerras authored
      
      
      This adapts our implementations of the MMU notifier callbacks
      (unmap_hva, unmap_hva_range, age_hva, test_age_hva, set_spte_hva)
      to call radix functions when the guest is using radix.  These
      implementations are much simpler than for HPT guests because we
      have only one PTE to deal with, so we don't need to traverse
      rmap chains.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      01756099
    • KVM: PPC: Book3S HV: Page table construction and page faults for radix guests · 5a319350
      Paul Mackerras authored
      
      
      This adds the code to construct the second-level ("partition-scoped" in
      architecturese) page tables for guests using the radix MMU.  Apart from
      the PGD level, which is allocated when the guest is created, the rest
      of the tree is all constructed in response to hypervisor page faults.
      
      As well as hypervisor page faults for missing pages, we also get faults
      for reference/change (RC) bits needing to be set, as well as various
      other error conditions.  For now, we only set the R or C bit in the
      guest page table if the same bit is set in the host PTE for the
      backing page.
      
      This code can take advantage of the guest being backed with either
      transparent or ordinary 2MB huge pages, and insert 2MB page entries
      into the guest page tables.  There is no support for 1GB huge pages
      yet.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      5a319350
    • KVM: PPC: Book3S HV: Modify guest entry/exit paths to handle radix guests · f4c51f84
      Paul Mackerras authored
      
      
      This adds code to branch around the parts that radix guests don't
      need - clearing and loading the SLB with the guest SLB contents,
      saving the guest SLB contents on exit, and restoring the host SLB
      contents.
      
      Since the host is now using radix, we need to save and restore the
      host value for the PID register.
      
      On hypervisor data/instruction storage interrupts, we don't do the
      guest HPT lookup on radix, but just save the guest physical address
      for the fault (from the ASDR register) in the vcpu struct.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f4c51f84
    • KVM: PPC: Book3S HV: Add basic infrastructure for radix guests · 9e04ba69
      Paul Mackerras authored
      
      
      This adds a field in struct kvm_arch and an inline helper to
      indicate whether a guest is a radix guest or not, plus a new file
      to contain the radix MMU code, which currently contains just a
      translate function which knows how to traverse the guest page
      tables to translate an address.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      9e04ba69
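      From the description, the helper is essentially a one-liner along these
      lines (reproduced from memory, so treat the exact field and function
      names as an assumption):

          static inline bool kvm_is_radix(struct kvm *kvm)
          {
                  return kvm->arch.radix;
          }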
    • KVM: PPC: Book3S HV: Use ASDR for HPT guests on POWER9 · ef8c640c
      Paul Mackerras authored
      
      
      POWER9 adds a register called ASDR (Access Segment Descriptor
      Register), which is set by hypervisor data/instruction storage
      interrupts to contain the segment descriptor for the address
      being accessed, assuming the guest is using HPT translation.
      (For radix guests, it contains the guest real address of the
      access.)
      
      Thus, for HPT guests on POWER9, we can use this register rather
      than looking up the SLB with the slbfee. instruction.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ef8c640c
    • KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9 · 468808bd
      Paul Mackerras authored
      
      
      This adds the implementation of the KVM_PPC_CONFIGURE_V3_MMU ioctl
      for HPT guests on POWER9.  With this, we can return 1 for the
      KVM_CAP_PPC_MMU_HASH_V3 capability.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      468808bd
    • KVM: PPC: Book3S HV: Add userspace interfaces for POWER9 MMU · c9270132
      Paul Mackerras authored
      
      
      This adds two capabilities and two ioctls to allow userspace to
      find out about and configure the POWER9 MMU in a guest.  The two
      capabilities tell userspace whether KVM can support a guest using
      the radix MMU, or using the hashed page table (HPT) MMU with a
      process table and segment tables.  (Note that the MMUs in the
      POWER9 processor cores do not use the process and segment tables
      when in HPT mode, but the nest MMU does).
      
      The KVM_PPC_CONFIGURE_V3_MMU ioctl allows userspace to specify
      whether a guest will use the radix MMU or the HPT MMU, and to
      specify the size and location (in guest space) of the process
      table.
      
      The KVM_PPC_GET_RMMU_INFO ioctl gives userspace information about
      the radix MMU.  It returns a list of supported radix tree geometries
      (base page size and number of bits indexed at each level of the
      radix tree) and the encoding used to specify the various page
      sizes for the TLB invalidate entry instruction.
      
      Initially, both capabilities return 0 and the ioctls return -EINVAL,
      until the necessary infrastructure for them to operate correctly
      is added.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c9270132
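      For orientation, a userspace call to the new configure ioctl might look
      like the sketch below (editorial, not from the commit; the process_table
      value is illustrative and would normally encode the table's guest real
      address and size per the partition/process table format):

          #include <fcntl.h>
          #include <stdio.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>

          int main(void)
          {
                  int kvm = open("/dev/kvm", O_RDWR);
                  int vm  = ioctl(kvm, KVM_CREATE_VM, 0);
                  struct kvm_ppc_mmuv3_cfg cfg = {
                          .flags         = KVM_PPC_MMUV3_RADIX | KVM_PPC_MMUV3_GTSE,
                          .process_table = 0x1000000ULL | 0x0c,   /* illustrative */
                  };

                  if (ioctl(vm, KVM_PPC_CONFIGURE_V3_MMU, &cfg) < 0)
                          perror("KVM_PPC_CONFIGURE_V3_MMU");
                  return 0;
          }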
    • powerpc/64: Allow for relocation-on interrupts from guest to host · bc355125
      Paul Mackerras authored
      
      
      With host and guest both using radix translation, it is feasible
      for the host to take interrupts that come from the guest with
      relocation on, and that is in fact what the POWER9 hardware will
      do when LPCR[AIL] = 3.  All such interrupts use HSRR0/1 not SRR0/1
      except for system call with LEV=1 (hcall).
      
      Therefore this adds the KVM tests to the _HV variants of the
      relocation-on interrupt handlers, and adds the KVM test to the
      relocation-on system call entry point.
      
      We also instantiate the relocation-on versions of the hypervisor
      data storage and instruction interrupt handlers, since these can
      occur with relocation on in radix guests.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bc355125
    • powerpc/64: Make type of partition table flush depend on partition type · 16ed1416
      Paul Mackerras authored
      
      
      When changing a partition table entry on POWER9, we do a particular
      form of the tlbie instruction which flushes all TLBs and caches of
      the partition table for a given logical partition ID (LPID).
      This instruction has a field in the instruction word, labelled R
      (radix), which should be 1 if the partition was previously a radix
      partition and 0 if it was an HPT partition.  This implements that
      logic.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      16ed1416
    • powerpc/64: Export pgtable_cache and pgtable_cache_add for KVM · ba9b399a
      Paul Mackerras authored
      
      
      This exports the pgtable_cache array and the pgtable_cache_add
      function so that HV KVM can use them for allocating radix page
      tables for guests.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ba9b399a
    • powerpc/64: More definitions for POWER9 · dbcbfee0
      Paul Mackerras authored
      
      
      This adds definitions for bits in the DSISR register which are used
      by POWER9 for various translation-related exception conditions, and
      for some more bits in the partition table entry that will be needed
      by KVM.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      dbcbfee0
    • powerpc/64: Enable use of radix MMU under hypervisor on POWER9 · cc3d2940
      Paul Mackerras authored
      
      
      To use radix as a guest, we first need to tell the hypervisor via the
      ibm,client-architecture-support call that we support POWER9 and
      architecture v3.00, and that we can do either radix or hash and that
      we would like to choose later using an hcall (the H_REGISTER_PROC_TBL
      hcall).
      
      Then we need to check whether the hypervisor agreed to us using
      radix.  We need to do this very early on in the kernel boot process
      before any of the MMU initialization is done.  If the hypervisor
      doesn't agree, we can't use radix and therefore clear the radix
      MMU feature bit.
      
      Later, when we have set up our process table, which points to the
      radix tree for each process, we need to install that using the
      H_REGISTER_PROC_TBL hcall.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      cc3d2940
    • powerpc/pseries: Fixes for the "ibm,architecture-vec-5" options · 3f4ab2f8
      Paul Mackerras authored
      
      
      This fixes the byte index values for some of the option bits in
      the "ibm,architecture-vec-5" property. The "platform facilities options"
      bits are in byte 17 not byte 14, so the upper 8 bits of their
      definitions need to be 0x11 not 0x0E. The "sub processor support" option
      is in byte 21 not byte 15.
      
      Note none of these options are actually looked up in
      "ibm,architecture-vec-5" at this time, so there is no bug.
      
      When checking whether option bits are set, we should check that
      the offset of the byte being checked is less than the vector
      length that we got from the hypervisor.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      3f4ab2f8
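      The bounds check described in the last paragraph boils down to something
      like this hypothetical helper (names are editorial, not from the patch):

          #include <stdbool.h>
          #include <stdint.h>

          /* Only trust an option byte that lies within the vector length the
           * hypervisor actually returned. */
          bool vec5_option_set(const uint8_t *vec5, uint8_t vec5_len,
                               unsigned int byte, uint8_t mask)
          {
                  return byte < vec5_len && (vec5[byte] & mask);
          }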
    • powerpc/64: Don't try to use radix MMU under a hypervisor · 18569c1f
      Paul Mackerras authored
      Currently, if the kernel is running on a POWER9 processor under a
      hypervisor, it will try to use the radix MMU even though it doesn't have
      the necessary code to use radix under a hypervisor (it doesn't negotiate
      use of radix, and it doesn't do the H_REGISTER_PROC_TBL hcall). The
      result is that the guest kernel will crash when it tries to turn on the
      MMU.
      
      This fixes it by looking for the /chosen/ibm,architecture-vec-5
      property, and if it exists, clears the radix MMU feature bit, before we
      decide whether to initialize for radix or HPT. This property is created
      by the hypervisor as a result of the guest calling the
      ibm,client-architecture-support method to indicate its capabilities, so
      it will indicate whether the hypervisor agreed to us using radix.
      
      Systems without a hypervisor may have this property also (for example,
      skiboot creates it), so we check the HV bit in the MSR to see whether we
      are running as a guest or not. If we are in hypervisor mode, then we can
      do whatever we like including using the radix MMU.
      
      The reason for using this property is that in future, when we have
      support for using radix under a hypervisor, we will need to check this
      property to see whether the hypervisor agreed to us using radix.
      
      Fixes: 2bfd65e4 ("powerpc/mm/radix: Add radix callbacks for early init routines")
      Cc: stable@vger.kernel.org # v4.7+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      18569c1f
    • KVM: PPC: Book3S: 64-bit CONFIG_RELOCATABLE support for interrupts · a97a65d5
      Nicholas Piggin authored
      
      
      64-bit Book3S exception handlers must find the dynamic kernel base
      to add to the target address when branching beyond __end_interrupts,
      in order to support kernel running at non-0 physical address.
      
      Support this in KVM by branching with CTR, similarly to regular
      interrupt handlers. The guest CTR is saved in HSTATE_SCRATCH1 and
      restored after the branch.
      
      Without this, the host kernel hangs and crashes randomly when it is
      running at a non-0 address and a KVM guest is started.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a97a65d5
  2. Jan 27, 2017
    • KVM: PPC: Book3S PR: Refactor program interrupt related code into separate function · fcd4f3c6
      Thomas Huth authored
      
      
      The function kvmppc_handle_exit_pr() is quite huge and thus hard to read,
      and even contains a "spaghetti-code"-like goto between the different case
      labels of the big switch statement. This can be made much more readable
      by moving the code related to injecting program interrupts / instruction
      emulation into a separate function instead.
      
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      fcd4f3c6
    • KVM: PPC: Book3S HV: Fix H_PROD to actually wake the target vcpu · 8464c884
      Paul Mackerras authored
      
      
      The H_PROD hypercall is supposed to wake up an idle vcpu.  We have
      an implementation, but because Linux doesn't use it except when
      doing cpu hotplug, it was never tested properly.  AIX does use it,
      and reported it broken.  It turns out we were waking the wrong
      vcpu (the one doing H_PROD, not the target of the prod) and we
      weren't handling the case where the target needs an IPI to wake
      it.  Fix it by using the existing kvmppc_fast_vcpu_kick_hv()
      function, which is intended for this kind of thing, and by using
      the target vcpu not the current vcpu.
      
      We were also not looking at the prodded flag when checking whether a
      ceded vcpu should wake up, so this adds checks for the prodded flag
      alongside the checks for pending exceptions.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      8464c884
    • KVM: PPC: Book3S: Move 64-bit KVM interrupt handler out from alt section · 7ede5317
      Nicholas Piggin authored
      
      
      A subsequent patch to make KVM handlers relocation-safe makes them
      unusable from within alt section "else" cases (due to the way fixed
      addresses are taken from within fixed section head code).
      
      Stop open-coding the KVM handlers, and add them both as normal. A more
      optimal fix may be to allow some level of alternate feature patching in
      the exception macros themselves, but for now this will do.
      
      The TRAMP_KVM handlers must be moved to the "virt" fixed section area
      (name is arbitrary) in order to be closer to .text and avoid the dreaded
      "relocation truncated to fit" error.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      7ede5317
    • KVM: PPC: Book3S: Change interrupt call to reduce scratch space use on HV · d3918e7f
      Nicholas Piggin authored
      
      
      Change the calling convention to put the trap number together with
      CR in two halves of r12, which frees up HSTATE_SCRATCH2 in the HV
      handler.
      
      The 64-bit PR handler entry translates the calling convention back
      to match the previous call convention (i.e., shared with 32-bit), for
      simplicity.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      d3918e7f
    • KVM: PPC: Book 3S: XICS: Don't lock twice when checking for resend · 21acd0e4
      Li Zhong authored
      
      
      This patch improves the code that takes the lock twice to check the resend
      flag and do the actual resending, by checking the resend flag locklessly
      and adding a boolean parameter check_resend to icp_[rm_]deliver_irq(), so
      that the resend flag can be checked under the lock when doing the delivery.
      
      We need to make sure that when we clear the ics's bit in the icp's
      resend_map, we don't miss the resend flag of the irqs that set the bit.
      This is ordered through the barrier in test_and_clear_bit(), and a newly
      added wmb between setting the irq's resend flag and setting the icp's
      resend_map.
      
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      21acd0e4
    • KVM: PPC: Book 3S: XICS: Implement ICS P/Q states · 17d48610
      Li Zhong authored
      
      
      This patch implements P(Presented)/Q(Queued) states for ICS irqs.
      
      When an interrupt is to be presented, set P and present it only if P was
      not already set.  If P is already set, don't present again; set Q instead.
      When the interrupt is EOI'ed, move Q into P (clearing Q), and if P is now
      set, re-present.
      
      The asserted flag used by LSI is also incorporated into the P bit.
      
      When the irq state is saved, the P/Q bits are saved as well; they need
      some qemu modifications to be recognized and passed around for restore.
      A saved KVM_XICS_PENDING bit should also imply a saved KVM_XICS_PRESENTED
      bit, but it is possible that some old code doesn't have/recognize the P
      bit, so when we restore we also set P for the PENDING bit.
      
      The idea and much of the code come from Ben.
      
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      17d48610
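      A toy, standalone model of the P/Q rules above (an editorial illustration;
      the real code of course works on the ICS state word and also folds in the
      LSI asserted flag as noted):

          #include <stdbool.h>
          #include <stdio.h>

          struct irq_pq { bool p, q; };

          static bool irq_present(struct irq_pq *s)    /* returns: deliver now? */
          {
                  if (!s->p) { s->p = true; return true; }
                  s->q = true;                          /* already presented: queue */
                  return false;
          }

          static bool irq_eoi(struct irq_pq *s)         /* returns: re-present? */
          {
                  s->p = s->q;
                  s->q = false;
                  return s->p;
          }

          int main(void)
          {
                  struct irq_pq s = { false, false };
                  printf("first present delivers: %d\n", irq_present(&s));
                  printf("second present queues:  %d\n", irq_present(&s));
                  printf("EOI re-presents:        %d\n", irq_eoi(&s));
                  printf("next EOI goes idle:     %d\n", irq_eoi(&s));
                  return 0;
          }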
    • KVM: PPC: Book 3S: XICS: Fix potential issue with duplicate IRQ resends · bf5a71d5
      Li Zhong authored
      
      
      It is possible that in the following order, one irq is resent twice:
      
              CPU 1                                   CPU 2
      
      ics_check_resend()
        lock ics_lock
          see resend set
        unlock ics_lock
                                             /* change affinity of the irq */
                                             kvmppc_xics_set_xive()
                                               write_xive()
                                                 lock ics_lock
                                                   see resend set
                                                 unlock ics_lock
      
                                               icp_deliver_irq() /* resend */
      
        icp_deliver_irq() /* resend again */
      
      It doesn't have any user-visible effect at present, but needs to be avoided
      when the following patch implementing the P/Q stuff is applied.
      
      This patch clears the resend flag before releasing the ics lock, when we
      know we will do a re-delivery after checking the flag, or setting the flag.
      
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      bf5a71d5
    • KVM: PPC: Book 3S: XICS: correct the real mode ICP rejecting counter · 37451bc9
      Li Zhong authored
      Some counters were added in commit 6e0365b7 ("KVM: PPC: Book3S HV:
      Add ICP real mode counters") to provide some performance statistics for
      determining whether further optimization of the real mode functions is
      needed.
      
      The n_reject counter counts how many times ICP rejects an irq because of
      priority in real mode. The redelivery of an lsi that is still asserted
      after eoi doesn't fall into this category, so the increment there is
      removed.
      
      Also, the counter needs to be incremented in icp_rm_deliver_irq() when
      that function rejects another irq.
      
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      37451bc9
    • KVM: PPC: Book 3S: XICS cleanup: remove XICS_RM_REJECT · 5efa6605
      Li Zhong authored
      Commit b0221556 ("KVM: PPC: Book3S HV: Move virtual mode ICP functions
      to real-mode") removed the setting of the XICS_RM_REJECT flag.  Since
      that commit, nothing else sets the flag any more, so we can remove the
      flag and the remaining code that handles it, including the counter that
      counts how many times it got set.
      
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      5efa6605
    • KVM: PPC: Book3S HV: Don't try to signal cpu -1 · 3deda5e5
      Paul Mackerras authored
      If the target vcpu for kvmppc_fast_vcpu_kick_hv() is not running on
      any CPU, then we will have vcpu->arch.thread_cpu == -1, and as it
      happens, kvmppc_fast_vcpu_kick_hv will call kvmppc_ipi_thread with
      -1 as the cpu argument.  Although this is not meaningful, in the past,
      before commit 1704a81c ("KVM: PPC: Book3S HV: Use msgsnd for IPIs
      to other cores on POWER9", 2016-11-18), it was harmless because CPU
      -1 is not in the same core as any real CPU thread.  On a POWER9,
      however, we don't do the "same core" check, so we were trying to
      do a msgsnd to thread -1, which is invalid.  To avoid this, we add
      a check to see that vcpu->arch.thread_cpu is >= 0 before calling
      kvmppc_ipi_thread() with it.  Since vcpu->arch.thread_cpu can change
      asynchronously, we use READ_ONCE to ensure that the value we check is
      the same value that we use as the argument to kvmppc_ipi_thread().
      
      Fixes: 1704a81c ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      3deda5e5
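      The fix described above amounts to a kernel-style check like the
      following sketch (editorial, not the literal patch):

          /* Sample thread_cpu once; only kick if it names a real CPU. */
          int cpu = READ_ONCE(vcpu->arch.thread_cpu);

          if (cpu >= 0)
                  kvmppc_ipi_thread(cpu);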
  3. Jan 18, 2017
    • kvm: x86: Expose Intel VPOPCNTDQ feature to guest · a17f3227
      Piotr Luc authored
      
      
      Vector population count instructions for dwords and qwords are to be
      used in future Intel Xeon & Xeon Phi processors. Bit 14 of
      CPUID[level:0x07, ECX] indicates that the new instructions are
      supported by a processor.
      
      The spec can be found in the Intel Software Developer Manual (SDM)
      or in the Instruction Set Extensions Programming Reference (ISE).
      
      Signed-off-by: Piotr Luc <piotr.luc@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      a17f3227
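      Userspace (or a guest) can probe the bit described above with a short
      standalone check (an editorial sketch; requires an x86 compiler providing
      cpuid.h):

          #include <cpuid.h>
          #include <stdio.h>

          int main(void)
          {
                  unsigned int eax, ebx, ecx, edx;

                  /* CPUID.(EAX=7, ECX=0):ECX bit 14 => AVX512_VPOPCNTDQ */
                  __cpuid_count(7, 0, eax, ebx, ecx, edx);
                  printf("VPOPCNTDQ %ssupported\n", (ecx & (1u << 14)) ? "" : "not ");
                  return 0;
          }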