Commit 49d57592 authored by Linus Torvalds
Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Provide a virtual cache topology to the guest to avoid
     inconsistencies with migration on heterogeneous systems. Non-secure
     software has no practical need to traverse the caches by set/way in
     the first place

   - Add support for taking stage-2 access faults in parallel. This was
     an accidental omission in the original parallel faults
     implementation, but should provide a marginal improvement to
     machines w/o FEAT_HAFDBS (such as hardware from the fruit company)

   - A preamble to adding support for nested virtualization to KVM,
     including vEL2 register state, rudimentary nested exception
     handling and masking unsupported features for nested guests

   - Fixes to the PSCI relay that avoid an unexpected host SVE trap when
     resuming a CPU when running pKVM

   - VGIC maintenance interrupt support for the AIC

   - Improvements to the arch timer emulation, primarily aimed at
     reducing the trap overhead of running nested

   - Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
     interest of CI systems

   - Avoid VM-wide stop-the-world operations when a vCPU accesses its
     own redistributor

   - Serialize when toggling CPACR_EL1.SMEN to avoid unexpected
     exceptions in the host

   - Aesthetic and comment/kerneldoc fixes

   - Drop the vestiges of the old Columbia mailing list and add [Oliver]
     as co-maintainer

  RISC-V:

   - Fix wrong usage of PGDIR_SIZE instead of PUD_SIZE

   - Correctly place the guest in S-mode after redirecting a trap to the
     guest

   - Redirect illegal instruction traps to guest

   - SBI PMU support for guest

  s390:

   - Sort out confusion between virtual and physical addresses, which
     currently are the same on s390

   - A new ioctl that performs cmpxchg on guest memory

   - A few fixes

  x86:

   - Change tdp_mmu to a read-only parameter

   - Separate TDP and shadow MMU page fault paths

   - Enable Hyper-V invariant TSC control

   - Fix a variety of APICv and AVIC bugs, some of them real-world, some
     of them affecting architecturally legal scenarios that are unlikely
     to happen in practice

   - Mark APIC timer as expired if it's in one-shot mode and the count
     underflows while the vCPU task was being migrated

   - Advertise support for Intel's new fast REP string features

   - Fix a double-shootdown issue in the emergency reboot code

   - Ensure GIF=1 and disable SVM during an emergency reboot, i.e. give
     SVM similar treatment to VMX

   - Update Xen's TSC info CPUID sub-leaves as appropriate

   - Add support for Hyper-V's extended hypercalls, where "support" at
     this point is just forwarding the hypercalls to userspace

   - Clean up the kvm->lock vs. kvm->srcu sequences when updating the
     PMU and MSR filters

   - One-off fixes and cleanups

   - Fix and clean up the range-based TLB flushing code, used when KVM is
     running on Hyper-V

   - Add support for filtering PMU events using a mask. If userspace
     wants to restrict heavily what events the guest can use, it can now
     do so without needing an absurd number of filter entries

   - Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
     support is disabled

   - Add PEBS support for Intel Sapphire Rapids

   - Fix a mostly benign overflow bug in SEV's
     send|receive_update_data()

   - Move several SVM-specific flags into vcpu_svm

  x86 Intel:

   - Handle NMI VM-Exits before leaving the noinstr region

   - A few trivial cleanups in the VM-Enter flows

   - Stop enabling VMFUNC for L1 purely to document that KVM doesn't
     support EPTP switching (or any other VM function) for L1

   - Fix a crash when using eVMCS's enlightened MSR bitmaps

  Generic:

   - Clean up the hardware enable and initialization flow, which was
     scattered around multiple arch-specific hooks. Instead, just let
     the arch code call into generic code. Both x86 and ARM should
     benefit from not having to fight common KVM code's notion of how to
     do initialization

   - Account allocations in generic kvm_arch_alloc_vm()

   - Fix a memory leak if coalesced MMIO unregistration fails

  selftests:

   - On x86, cache the CPU vendor (AMD vs. Intel) and use the info to
     emit the correct hypercall instruction instead of relying on KVM to
     patch in VMMCALL

   - Use TAP interface for kvm_binary_stats_test and tsc_msrs_test"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (325 commits)
  KVM: SVM: hyper-v: placate modpost section mismatch error
  KVM: x86/mmu: Make tdp_mmu_allowed static
  KVM: arm64: nv: Use reg_to_encoding() to get sysreg ID
  KVM: arm64: nv: Only toggle cache for virtual EL2 when SCTLR_EL2 changes
  KVM: arm64: nv: Filter out unsupported features from ID regs
  KVM: arm64: nv: Emulate EL12 register accesses from the virtual EL2
  KVM: arm64: nv: Allow a sysreg to be hidden from userspace only
  KVM: arm64: nv: Emulate PSTATE.M for a guest hypervisor
  KVM: arm64: nv: Add accessors for SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2
  KVM: arm64: nv: Handle SMCs taken from virtual EL2
  KVM: arm64: nv: Handle trapped ERET from virtual EL2
  KVM: arm64: nv: Inject HVC exceptions to the virtual EL2
  KVM: arm64: nv: Support virtual EL2 exceptions
  KVM: arm64: nv: Handle HCR_EL2.NV system register traps
  KVM: arm64: nv: Add nested virt VCPU primitives for vEL2 VCPU state
  KVM: arm64: nv: Add EL2 system registers to vcpu context
  KVM: arm64: nv: Allow userspace to set PSR_MODE_EL2x
  KVM: arm64: nv: Reset VCPU to EL2 registers if VCPU nested virt is set
  KVM: arm64: nv: Introduce nested virtualization VCPU feature
  KVM: arm64: Use the S2 MMU context to iterate over S2 table
  ...
parents 01687e7c 45dd9bc7
+6 −1
@@ -2536,9 +2536,14 @@
			protected: nVHE-based mode with support for guests whose
				   state is kept private from the host.

			nested: VHE-based mode with support for nested
				virtualization. Requires at least ARMv8.3
				hardware.

			Defaults to VHE/nVHE based on hardware support. Setting
			mode to "protected" will disable kexec and hibernation
			for the host.
			for the host. "nested" is experimental and should be
			used with extreme caution.

	kvm-arm.vgic_v3_group0_trap=
			[KVM,ARM] Trap guest accesses to GICv3 group-0
+108 −16
@@ -3736,7 +3736,7 @@ The fields in each entry are defined as follows:
:Parameters: struct kvm_s390_mem_op (in)
:Returns: = 0 on success,
          < 0 on generic error (e.g. -EFAULT or -ENOMEM),
          > 0 if an exception occurred while walking the page tables
          16 bit program exception code if the access causes such an exception

Read or write data from/to the VM's memory.
The KVM_CAP_S390_MEM_OP_EXTENSION capability specifies what functionality is
@@ -3754,6 +3754,8 @@ Parameters are specified via the following structure::
		struct {
			__u8 ar;	/* the access register number */
			__u8 key;	/* access key, ignored if flag unset */
			__u8 pad1[6];	/* ignored */
			__u64 old_addr;	/* ignored if flag unset */
		};
		__u32 sida_offset; /* offset into the sida */
		__u8 reserved[32]; /* ignored */
@@ -3781,6 +3783,7 @@ Possible operations are:
  * ``KVM_S390_MEMOP_ABSOLUTE_WRITE``
  * ``KVM_S390_MEMOP_SIDA_READ``
  * ``KVM_S390_MEMOP_SIDA_WRITE``
  * ``KVM_S390_MEMOP_ABSOLUTE_CMPXCHG``

Logical read/write:
^^^^^^^^^^^^^^^^^^^
@@ -3829,7 +3832,7 @@ the checks required for storage key protection as one operation (as opposed to
user space getting the storage keys, performing the checks, and accessing
memory thereafter, which could lead to a delay between check and access).
Absolute accesses are permitted for the VM ioctl if KVM_CAP_S390_MEM_OP_EXTENSION
is > 0.
has the KVM_S390_MEMOP_EXTENSION_CAP_BASE bit set.
Currently absolute accesses are not permitted for VCPU ioctls.
Absolute accesses are permitted for non-protected guests only.

@@ -3837,7 +3840,26 @@ Supported flags:
  * ``KVM_S390_MEMOP_F_CHECK_ONLY``
  * ``KVM_S390_MEMOP_F_SKEY_PROTECTION``

The semantics of the flags are as for logical accesses.
The semantics of the flags common with logical accesses are as for logical
accesses.

Absolute cmpxchg:
^^^^^^^^^^^^^^^^^

Perform cmpxchg on absolute guest memory. Intended for use with the
KVM_S390_MEMOP_F_SKEY_PROTECTION flag.
Instead of doing an unconditional write, the access occurs only if the target
location contains the value pointed to by "old_addr".
This is performed as an atomic cmpxchg with the length specified by the "size"
parameter. "size" must be a power of two up to and including 16.
If the exchange did not take place because the target value doesn't match the
old value, the value "old_addr" points to is replaced by the target value.
User space can tell if an exchange took place by checking if this replacement
occurred. The cmpxchg op is permitted for the VM ioctl if
KVM_CAP_S390_MEM_OP_EXTENSION has flag KVM_S390_MEMOP_EXTENSION_CAP_CMPXCHG set.

Supported flags:
  * ``KVM_S390_MEMOP_F_SKEY_PROTECTION``
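
The outcome described above can be modelled in plain C. This is only a sketch of the documented semantics, not the kernel implementation: ``cmpxchg_size_ok()`` and ``cmpxchg_model()`` are hypothetical names, "target" stands in for guest memory, and the model is not atomic (the real operation is).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: "size" must be a power of two up to and including 16. */
static bool cmpxchg_size_ok(unsigned int size)
{
	return size != 0 && size <= 16 && (size & (size - 1)) == 0;
}

/*
 * Hypothetical, non-atomic model of the documented outcome (not kernel code).
 * "target" stands in for guest memory, "old" for the buffer that "old_addr"
 * points to. Returns true if the exchange took place.
 */
static bool cmpxchg_model(void *target, void *old, const void *new_val,
			  unsigned int size)
{
	assert(cmpxchg_size_ok(size));

	if (memcmp(target, old, size) == 0) {
		memcpy(target, new_val, size);	/* values matched: store new value */
		return true;
	}
	memcpy(old, target, size);	/* mismatch: hand back the current value */
	return false;
}
```

Under this model, a caller that keeps a copy of the expected value can detect a failed exchange by noticing that the "old" buffer changed, which mirrors how user space is expected to check whether the replacement occurred.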

SIDA read/write:
^^^^^^^^^^^^^^^^
@@ -4457,6 +4479,18 @@ not holding a previously reported uncorrected error).
:Parameters: struct kvm_s390_cmma_log (in, out)
:Returns: 0 on success, a negative value on error

Errors:

  ======     =============================================================
  ENOMEM     not enough memory can be allocated to complete the task
  ENXIO      if CMMA is not enabled
  EINVAL     if KVM_S390_CMMA_PEEK is not set but migration mode was not enabled
  EINVAL     if KVM_S390_CMMA_PEEK is not set but dirty tracking has been
             disabled (and thus migration mode was automatically disabled)
  EFAULT     if the userspace address is invalid or if no page table is
             present for the addresses (e.g. when using hugepages).
  ======     =============================================================

This ioctl is used to get the values of the CMMA bits on the s390
architecture. It is meant to be used in two scenarios:

@@ -4537,12 +4571,6 @@ mask is unused.

values points to the userspace buffer where the result will be stored.

This ioctl can fail with -ENOMEM if not enough memory can be allocated to
complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if
KVM_S390_CMMA_PEEK is not set but migration mode was not enabled, with
-EFAULT if the userspace address is invalid or if no page table is
present for the addresses (e.g. when using hugepages).

4.108 KVM_S390_SET_CMMA_BITS
----------------------------

@@ -5005,6 +5033,15 @@ using this ioctl.
:Parameters: struct kvm_pmu_event_filter (in)
:Returns: 0 on success, -1 on error

Errors:

  ======     ============================================================
  EFAULT     args[0] cannot be accessed
  EINVAL     args[0] contains invalid data in the filter or filter events
  E2BIG      nevents is too large
  EBUSY      not enough memory to allocate the filter
  ======     ============================================================

::

  struct kvm_pmu_event_filter {
@@ -5016,14 +5053,69 @@ using this ioctl.
	__u64 events[0];
  };

This ioctl restricts the set of PMU events that the guest can program.
The argument holds a list of events which will be allowed or denied.
The eventsel+umask of each event the guest attempts to program is compared
against the events field to determine whether the guest should have access.
The events field only controls general purpose counters; fixed purpose
counters are controlled by the fixed_counter_bitmap.
This ioctl restricts the set of PMU events the guest can program by limiting
which event select and unit mask combinations are permitted.

The argument holds a list of filter events which will be allowed or denied.

Filter events only control general purpose counters; fixed purpose counters
are controlled by the fixed_counter_bitmap.

Valid values for 'flags'::

``0``

To use this mode, clear the 'flags' field.

In this mode each event will contain an event select + unit mask.

When the guest attempts to program the PMU the guest's event select +
unit mask is compared against the filter events to determine whether the
guest should have access.

``KVM_PMU_EVENT_FLAG_MASKED_EVENTS``
:Capability: KVM_CAP_PMU_EVENT_MASKED_EVENTS

In this mode each filter event will contain an event select, mask, match, and
exclude value.  To encode a masked event use::

  KVM_PMU_ENCODE_MASKED_ENTRY()

An encoded event will follow this layout::

  Bits   Description
  ----   -----------
  7:0    event select (low bits)
  15:8   umask match
  31:16  unused
  35:32  event select (high bits)
  54:36  unused
  55     exclude bit
  63:56  umask mask

When the guest attempts to program the PMU, these steps are followed in
determining if the guest should have access:

 1. Match the event select from the guest against the filter events.
 2. If a match is found, match the guest's unit mask to the mask and match
    values of the included filter events.
    I.e. (unit mask & mask) == match && !exclude.
 3. If a match is found, match the guest's unit mask to the mask and match
    values of the excluded filter events.
    I.e. (unit mask & mask) == match && exclude.
 4.
   a. If an included match is found and an excluded match is not found, filter
      the event.
   b. For everything else, do not filter the event.
 5.
   a. If the event is filtered and it's an allow list, allow the guest to
      program the event.
   b. If the event is filtered and it's a deny list, do not allow the guest to
      program the event.

No flags are defined yet, the field must be zero.
When setting a new pmu event filter, -EINVAL will be returned if any of the
unused fields are set or if any of the high bits (35:32) in the event
select are set when called on Intel.
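
The masked-event layout and the matching steps above can be sketched in user-space C. This is an illustrative model, not KVM's code: ``encode_masked_entry()``, ``event_is_filtered()`` and ``guest_may_program()`` are hypothetical names that follow the bit layout table rather than the real ``KVM_PMU_ENCODE_MASKED_ENTRY()`` macro, and the handling of events that are *not* filtered (denied on an allow list, allowed on a deny list) is an assumption based on the usual allow/deny semantics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder following the documented bit layout. */
static uint64_t encode_masked_entry(uint16_t event_select, uint8_t mask,
				    uint8_t match, bool exclude)
{
	assert(event_select <= 0xfff);	/* event select is 12 bits wide */

	return (uint64_t)(event_select & 0xff) |		/* bits 7:0   */
	       ((uint64_t)match << 8) |				/* bits 15:8  */
	       ((uint64_t)((event_select >> 8) & 0xf) << 32) |	/* bits 35:32 */
	       ((uint64_t)exclude << 55) |			/* bit 55     */
	       ((uint64_t)mask << 56);				/* bits 63:56 */
}

static uint16_t entry_event_select(uint64_t e)
{
	return (uint16_t)((e & 0xff) | (((e >> 32) & 0xf) << 8));
}

/* Steps 1-4: included match found and no excluded match found. */
static bool event_is_filtered(const uint64_t *entries, size_t n,
			      uint16_t event_select, uint8_t umask)
{
	bool included = false, excluded = false;

	for (size_t i = 0; i < n; i++) {
		uint8_t mask = (uint8_t)(entries[i] >> 56);
		uint8_t match = (uint8_t)(entries[i] >> 8);

		if (entry_event_select(entries[i]) != event_select)
			continue;
		if ((umask & mask) != match)
			continue;
		if ((entries[i] >> 55) & 1)
			excluded = true;
		else
			included = true;
	}
	return included && !excluded;
}

/* Step 5; the "not filtered" outcomes are an assumption (see lead-in). */
static bool guest_may_program(const uint64_t *entries, size_t n,
			      uint16_t event_select, uint8_t umask,
			      bool allow_list)
{
	bool filtered = event_is_filtered(entries, n, event_select, umask);

	return allow_list ? filtered : !filtered;
}
```

For example, an allow list holding one included entry for event select 0xC0 with mask 0 and match 0 (every unit mask) plus one excluded entry with mask 0xFF and match 0x02 would, under this model, permit every unit mask of event 0xC0 except 0x02 using only two filter entries.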

Valid values for 'action'::

+4 −0
@@ -302,6 +302,10 @@ Allows userspace to start migration mode, needed for PGSTE migration.
Setting this attribute when migration mode is already active will have
no effects.

Dirty tracking must be enabled on all memslots, else -EINVAL is returned. When
dirty tracking is disabled on any memslot, migration mode is automatically
stopped.

:Parameters: none
:Returns:   -ENOMEM if there is not enough free memory to start migration mode;
	    -EINVAL if the state of the VM is invalid (e.g. no memory defined);
+16 −9
@@ -9,6 +9,8 @@ KVM Lock Overview

The acquisition orders for mutexes are as follows:

- cpus_read_lock() is taken outside kvm_lock

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock
@@ -226,15 +228,10 @@ time it will be set using the Dirty tracking mechanism described above.
:Type:		mutex
:Arch:		any
:Protects:	- vm_list

``kvm_count_lock``
^^^^^^^^^^^^^^^^^^

:Type:		raw_spinlock_t
:Arch:		any
:Protects:	- hardware virtualization enable/disable
:Comment:	'raw' because hardware enabling/disabling must be atomic /wrt
		migration.
		- kvm_usage_count
		- hardware virtualization enable/disable
:Comment:	KVM also disables CPU hotplug via cpus_read_lock() during
		enable/disable.

``kvm->mn_invalidate_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -292,3 +289,13 @@ time it will be set using the Dirty tracking mechanism described above.
		wakeup notification event since external interrupts from the
		assigned devices happens, we will find the vCPU on the list to
		wakeup.

``vendor_module_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:Type:		mutex
:Arch:		x86
:Protects:	loading a vendor module (kvm_amd or kvm_intel)
:Comment:	Exists because using kvm_lock leads to deadlock.  cpu_hotplug_lock is
    taken outside of kvm_lock, e.g. in KVM's CPU online/offline callbacks, and
    many operations need to take cpu_hotplug_lock when loading a vendor module,
    e.g. updating static calls.
+11 −0
@@ -37,3 +37,14 @@ Nested virtualization features
------------------------------

TBD

x2APIC
------
When KVM_X2APIC_API_USE_32BIT_IDS is enabled, KVM activates a hack/quirk that
allows sending events to a single vCPU using its x2APIC ID even if the target
vCPU has legacy xAPIC enabled, e.g. to bring up hotplugged vCPUs via INIT-SIPI
on VMs with > 255 vCPUs.  A side effect of the quirk is that, if multiple vCPUs
have the same physical APIC ID, KVM will deliver events targeting that APIC ID
only to the vCPU with the lowest vCPU ID.  If KVM_X2APIC_API_USE_32BIT_IDS is
not enabled, KVM follows x86 architecture when processing interrupts (all vCPUs
matching the target APIC ID receive the interrupt).
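
The difference between the two delivery behaviours can be illustrated with a small model. This is a hedged sketch only: ``deliver_targets()`` is a hypothetical function, the array index is assumed to double as the vCPU ID, and the result is a bitmap of vCPUs that would receive the event; none of this reflects KVM's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: apic_ids[i] is the physical APIC ID of the vCPU whose
 * vCPU ID is i. Returns a bitmap of the vCPUs that receive an event sent to
 * "target".
 */
static uint64_t deliver_targets(const uint32_t *apic_ids, size_t nr_vcpus,
				uint32_t target, bool use_32bit_ids)
{
	uint64_t hit = 0;

	assert(nr_vcpus <= 64);	/* the bitmap in this sketch holds 64 vCPUs */

	for (size_t i = 0; i < nr_vcpus; i++) {
		if (apic_ids[i] != target)
			continue;
		hit |= UINT64_C(1) << i;
		if (use_32bit_ids)
			break;	/* quirk: only the lowest vCPU ID receives it */
	}
	return hit;
}
```

With APIC IDs {0, 5, 5, 7} and target 5, this model delivers only to vCPU 1 when the quirk is active and to vCPUs 1 and 2 otherwise.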