Commit 8fa590bf authored by Linus Torvalds
Pull kvm updates from Paolo Bonzini:
 "ARM64:

   - Enable the per-vcpu dirty-ring tracking mechanism, together with an
     option to keep the good old dirty log around for pages that are
     dirtied by something other than a vcpu.

   - Switch to the relaxed parallel fault handling, using RCU to delay
     page table reclaim and giving better performance under load.

   - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
     option, which multi-process VMMs such as crosvm rely on (see merge
     commit 382b5b87: "Fix a number of issues with MTE, such as
     races on the tags being initialised vs the PG_mte_tagged flag as
     well as the lack of support for VM_SHARED when KVM is involved.
     Patches from Catalin Marinas and Peter Collingbourne").

   - Merge the pKVM shadow vcpu state tracking that allows the
     hypervisor to have its own view of a vcpu, keeping that state
     private.

   - Add support for the PMUv3p5 architecture revision, bringing support
     for 64bit counters on systems that support it, and fix the
     not-quite-compliant CHAIN-ed counter support for the machines that
     actually exist out there.

   - Fix a handful of minor issues around 52bit VA/PA support (64kB
     pages only) as a prefix of the oncoming support for 4kB and 16kB
     pages.

   - Pick a small set of documentation and spelling fixes, because no
     good merge window would be complete without those.

  s390:

   - Second batch of the lazy destroy patches

   - First batch of KVM changes for kernel virtual != physical address
     support

   - Removal of an unused function

  x86:

   - Allow compiling out SMM support

   - Cleanup and documentation of SMM state save area format

   - Preserve interrupt shadow in SMM state save area

   - Respond to generic signals during slow page faults

   - Fixes and optimizations for the non-executable huge page errata
     fix.

   - Reprogram all performance counters on PMU filter change

   - Cleanups to Hyper-V emulation and tests

   - Process Hyper-V TLB flushes from a nested guest (i.e. from an L2
     guest running on top of an L1 Hyper-V hypervisor)

   - Advertise several new Intel features

   - x86 Xen-for-KVM:

      - Allow the Xen runstate information to cross a page boundary

      - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured

      - Add support for 32-bit guests in SCHEDOP_poll

   - Notable x86 fixes and cleanups:

      - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).

      - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
        a few years back when eliminating unnecessary barriers when
        switching between vmcs01 and vmcs02.

      - Clean up vmread_error_trampoline() to make it more obvious that
        params must be passed on the stack, even for x86-64.

      - Let userspace set all supported bits in MSR_IA32_FEAT_CTL
        irrespective of the current guest CPUID.

      - Fudge around a race with TSC refinement that results in KVM
        incorrectly thinking a guest needs TSC scaling when running on a
        CPU with a constant TSC, but no hardware-enumerated TSC
        frequency.

      - Advertise (on AMD) that the SMM_CTL MSR is not supported

      - Remove unnecessary exports

  Generic:

   - Support for responding to signals during page faults; introduces
     new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks

  Selftests:

   - Fix an inverted check in the access tracking perf test, and restore
     support for asserting that there aren't too many idle pages when
     running on bare metal.

   - Fix build errors that occur in certain setups (unsure exactly what
     is unique about the problematic setup) due to glibc overriding
     static_assert() to a variant that requires a custom message.

   - Introduce actual atomics for clear/set_bit() in selftests

   - Add support for pinning vCPUs in dirty_log_perf_test.

   - Rename the so called "perf_util" framework to "memstress".

   - Add a lightweight pseudo RNG for guest use, and use it to randomize
     the access pattern and write vs. read percentage in the memstress
     tests.

   - Add a common ucall implementation; code dedup and pre-work for
     running SEV (and beyond) guests in selftests.

   - Provide a common constructor and arch hook, which will eventually
     be used by x86 to automatically select the right hypercall (AMD vs.
     Intel).

   - A bunch of added/enabled/fixed selftests for ARM64, covering
     memslots, breakpoints, stage-2 faults and access tracking.

   - x86-specific selftest changes:

      - Clean up x86's page table management.

      - Clean up and enhance the "smaller maxphyaddr" test, and add a
        related test to cover generic emulation failure.

      - Clean up the nEPT support checks.

      - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.

      - Fix an ordering issue in the AMX test introduced by recent
        conversions to use kvm_cpu_has(), and harden the code to guard
        against similar bugs in the future. Anything that triggers
        caching of KVM's supported CPUID, kvm_cpu_has() in this case,
        effectively hides opt-in XSAVE features if the caching occurs
        before the test opts in via prctl().

  Documentation:

   - Remove deleted ioctls from documentation

   - Clean up the docs for the x86 MSR filter.

   - Various fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
  KVM: x86: Add proper ReST tables for userspace MSR exits/flags
  KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
  KVM: arm64: selftests: Align VA space allocator with TTBR0
  KVM: arm64: Fix benign bug with incorrect use of VA_BITS
  KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
  KVM: x86: Advertise that the SMM_CTL MSR is not supported
  KVM: x86: remove unnecessary exports
  KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
  tools: KVM: selftests: Convert clear/set_bit() to actual atomics
  tools: Drop "atomic_" prefix from atomic test_and_set_bit()
  tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
  KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
  perf tools: Use dedicated non-atomic clear/set bit helpers
  tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
  KVM: arm64: selftests: Enable single-step without a "full" ucall()
  KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
  KVM: Remove stale comment about KVM_REQ_UNHALT
  KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
  KVM: Reference to kvm_userspace_memory_region in doc and comments
  KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
  ...
parents 057b40f4 549a715b
+165 −109
@@ -272,18 +272,6 @@ the VCPU file descriptor can be mmap-ed, including:
  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
  KVM_CAP_DIRTY_LOG_RING, see section 8.3.




4.6 KVM_SET_MEMORY_REGION
-------------------------

:Capability: basic
:Architectures: all
:Type: vm ioctl
:Parameters: struct kvm_memory_region (in)
:Returns: 0 on success, -1 on error

This ioctl is obsolete and has been removed.


4.7 KVM_CREATE_VCPU
4.7 KVM_CREATE_VCPU
-------------------
-------------------


@@ -368,17 +356,6 @@ see the description of the capability.
Note that the Xen shared info page, if configured, shall always be assumed
Note that the Xen shared info page, if configured, shall always be assumed
to be dirty. KVM will not explicitly mark it such.
to be dirty. KVM will not explicitly mark it such.


4.9 KVM_SET_MEMORY_ALIAS
------------------------

:Capability: basic
:Architectures: x86
:Type: vm ioctl
:Parameters: struct kvm_memory_alias (in)
:Returns: 0 (success), -1 (error)

This ioctl is obsolete and has been removed.



4.10 KVM_RUN
4.10 KVM_RUN
------------
------------
@@ -1332,7 +1309,7 @@ yet and must be cleared on entry.
	__u64 userspace_addr; /* start of the userspace allocated memory */
	__u64 userspace_addr; /* start of the userspace allocated memory */
  };
  };


  /* for kvm_memory_region::flags */
  /* for kvm_userspace_memory_region::flags */
  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
  #define KVM_MEM_READONLY	(1UL << 1)
  #define KVM_MEM_READONLY	(1UL << 1)


@@ -1377,10 +1354,6 @@ the memory region are automatically reflected into the guest. For example, an
mmap() that affects the region will be made visible immediately.  Another
mmap() that affects the region will be made visible immediately.  Another
example is madvise(MADV_DROP).
example is madvise(MADV_DROP).


It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
The KVM_SET_MEMORY_REGION does not allow fine grained control over memory
allocation and is deprecated.



4.36 KVM_SET_TSS_ADDR
4.36 KVM_SET_TSS_ADDR
---------------------
---------------------
@@ -3293,6 +3266,7 @@ valid entries found.
----------------------
----------------------


:Capability: KVM_CAP_DEVICE_CTRL
:Capability: KVM_CAP_DEVICE_CTRL
:Architectures: all
:Type: vm ioctl
:Type: vm ioctl
:Parameters: struct kvm_create_device (in/out)
:Parameters: struct kvm_create_device (in/out)
:Returns: 0 on success, -1 on error
:Returns: 0 on success, -1 on error
@@ -3333,6 +3307,7 @@ number.
:Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
:Capability: KVM_CAP_DEVICE_CTRL, KVM_CAP_VM_ATTRIBUTES for vm device,
             KVM_CAP_VCPU_ATTRIBUTES for vcpu device
             KVM_CAP_VCPU_ATTRIBUTES for vcpu device
             KVM_CAP_SYS_ATTRIBUTES for system (/dev/kvm) device (no set)
             KVM_CAP_SYS_ATTRIBUTES for system (/dev/kvm) device (no set)
:Architectures: x86, arm64, s390
:Type: device ioctl, vm ioctl, vcpu ioctl
:Type: device ioctl, vm ioctl, vcpu ioctl
:Parameters: struct kvm_device_attr
:Parameters: struct kvm_device_attr
:Returns: 0 on success, -1 on error
:Returns: 0 on success, -1 on error
@@ -4104,80 +4079,71 @@ flags values for ``struct kvm_msr_filter_range``:
``KVM_MSR_FILTER_READ``
``KVM_MSR_FILTER_READ``


  Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
  Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
  indicates that a read should immediately fail, while a 1 indicates that
  indicates that read accesses should be denied, while a 1 indicates that
  a read for a particular MSR should be handled regardless of the default
  a read for a particular MSR should be allowed regardless of the default
  filter action.
  filter action.


``KVM_MSR_FILTER_WRITE``
``KVM_MSR_FILTER_WRITE``


  Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
  Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
  indicates that a write should immediately fail, while a 1 indicates that
  indicates that write accesses should be denied, while a 1 indicates that
  a write for a particular MSR should be handled regardless of the default
  a write for a particular MSR should be allowed regardless of the default
  filter action.
  filter action.


``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE``

  Filter both read and write accesses to MSRs using the given bitmap. A 0
  in the bitmap indicates that both reads and writes should immediately fail,
  while a 1 indicates that reads and writes for a particular MSR are not
  filtered by this range.

flags values for ``struct kvm_msr_filter``:
flags values for ``struct kvm_msr_filter``:


``KVM_MSR_FILTER_DEFAULT_ALLOW``
``KVM_MSR_FILTER_DEFAULT_ALLOW``


  If no filter range matches an MSR index that is getting accessed, KVM will
  If no filter range matches an MSR index that is getting accessed, KVM will
  fall back to allowing access to the MSR.
  allow accesses to all MSRs by default.


``KVM_MSR_FILTER_DEFAULT_DENY``
``KVM_MSR_FILTER_DEFAULT_DENY``


  If no filter range matches an MSR index that is getting accessed, KVM will
  If no filter range matches an MSR index that is getting accessed, KVM will
  fall back to rejecting access to the MSR. In this mode, all MSRs that should
  deny accesses to all MSRs by default.
  be processed by KVM need to explicitly be marked as allowed in the bitmaps.

This ioctl allows userspace to define up to 16 bitmaps of MSR ranges to deny
guest MSR accesses that would normally be allowed by KVM.  If an MSR is not
covered by a specific range, the "default" filtering behavior applies.  Each
bitmap range covers MSRs from [base .. base+nmsrs).

If an MSR access is denied by userspace, the resulting KVM behavior depends on
whether or not KVM_CAP_X86_USER_SPACE_MSR's KVM_MSR_EXIT_REASON_FILTER is
enabled.  If KVM_MSR_EXIT_REASON_FILTER is enabled, KVM will exit to userspace
on denied accesses, i.e. userspace effectively intercepts the MSR access.  If
KVM_MSR_EXIT_REASON_FILTER is not enabled, KVM will inject a #GP into the guest
on denied accesses.


This ioctl allows user space to define up to 16 bitmaps of MSR ranges to
If an MSR access is allowed by userspace, KVM will emulate and/or virtualize
specify whether a certain MSR access should be explicitly filtered for or not.
the access in accordance with the vCPU model.  Note, KVM may still ultimately
inject a #GP if an access is allowed by userspace, e.g. if KVM doesn't support
the MSR, or to follow architectural behavior for the MSR.


If this ioctl has never been invoked, MSR accesses are not guarded and the
By default, KVM operates in KVM_MSR_FILTER_DEFAULT_ALLOW mode with no MSR range
default KVM in-kernel emulation behavior is fully preserved.
filters.


Calling this ioctl with an empty set of ranges (all nmsrs == 0) disables MSR
Calling this ioctl with an empty set of ranges (all nmsrs == 0) disables MSR
filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes
filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes
an error.
an error.


As soon as the filtering is in place, every MSR access is processed through
the filtering except for accesses to the x2APIC MSRs (from 0x800 to 0x8ff);
x2APIC MSRs are always allowed, independent of the ``default_allow`` setting,
and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base
register.

.. warning::
.. warning::
   MSR accesses coming from nested vmentry/vmexit are not filtered.
   MSR accesses as part of nested VM-Enter/VM-Exit are not filtered.
   This includes both writes to individual VMCS fields and reads/writes
   This includes both writes to individual VMCS fields and reads/writes
   through the MSR lists pointed to by the VMCS.
   through the MSR lists pointed to by the VMCS.


If a bit is within one of the defined ranges, read and write accesses are
   x2APIC MSR accesses cannot be filtered (KVM silently ignores filters that
guarded by the bitmap's value for the MSR index if the kind of access
   cover any x2APIC MSRs).
is included in the ``struct kvm_msr_filter_range`` flags.  If no range
cover this particular access, the behavior is determined by the flags
field in the kvm_msr_filter struct: ``KVM_MSR_FILTER_DEFAULT_ALLOW``
and ``KVM_MSR_FILTER_DEFAULT_DENY``.

Each bitmap range specifies a range of MSRs to potentially allow access on.
The range goes from MSR index [base .. base+nmsrs]. The flags field
indicates whether reads, writes or both reads and writes are filtered
by setting a 1 bit in the bitmap for the corresponding MSR index.

If an MSR access is not permitted through the filtering, it generates a
#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that
allows user space to deflect and potentially handle various MSR accesses
into user space.


Note, invoking this ioctl while a vCPU is running is inherently racy.  However,
Note, invoking this ioctl while a vCPU is running is inherently racy.  However,
KVM does guarantee that vCPUs will see either the previous filter or the new
KVM does guarantee that vCPUs will see either the previous filter or the new
filter, e.g. MSRs with identical settings in both the old and new filter will
filter, e.g. MSRs with identical settings in both the old and new filter will
have deterministic behavior.
have deterministic behavior.


Similarly, if userspace wishes to intercept on denied accesses,
KVM_MSR_EXIT_REASON_FILTER must be enabled before activating any filters, and
left enabled until after all filters are deactivated.  Failure to do so may
result in KVM injecting a #GP instead of exiting to userspace.
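
The lookup rules described above can be sketched as a small userspace model. This is not KVM's implementation: the struct and function names are hypothetical, the flag values are local stand-ins assumed to mirror the uapi definitions, the bit order within the bitmap (bit ``i`` of the range stored in byte ``i / 8``) is an assumption, and the x2APIC range 0x800..0x8ff is taken from the older wording of this section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the uapi flags (assumed values, see the note above). */
#define MSR_FILTER_READ         (1u << 0)
#define MSR_FILTER_WRITE        (1u << 1)
#define MSR_FILTER_DEFAULT_DENY (1u << 0)
#define MSR_FILTER_MAX_RANGES   16

/* Hypothetical model of one filter range covering [base .. base+nmsrs). */
struct msr_range {
	uint32_t flags;        /* MSR_FILTER_READ and/or MSR_FILTER_WRITE */
	uint32_t base;         /* first MSR index covered by the range */
	uint32_t nmsrs;        /* number of MSRs covered */
	const uint8_t *bitmap; /* bit i describes MSR base+i: 1 allow, 0 deny */
};

/* Hypothetical model of the whole filter. */
struct msr_filter {
	uint32_t flags;        /* 0 (default allow) or MSR_FILTER_DEFAULT_DENY */
	size_t nranges;
	struct msr_range ranges[MSR_FILTER_MAX_RANGES];
};

/* Return true if the filter allows an access of the given type. */
bool msr_access_allowed(const struct msr_filter *f, uint32_t index,
			uint32_t type)
{
	assert(type == MSR_FILTER_READ || type == MSR_FILTER_WRITE);
	assert(f->nranges <= MSR_FILTER_MAX_RANGES);

	/* x2APIC MSRs cannot be filtered, so they are always allowed here. */
	if (index >= 0x800 && index <= 0x8ff)
		return true;

	for (size_t i = 0; i < f->nranges; i++) {
		const struct msr_range *r = &f->ranges[i];

		/* Skip ranges that do not filter this kind of access. */
		if (!(r->flags & type))
			continue;
		/* Skip ranges that do not cover this MSR index. */
		if (index < r->base || index - r->base >= r->nmsrs)
			continue;

		uint32_t bit = index - r->base;
		return (r->bitmap[bit / 8] >> (bit % 8)) & 1;
	}

	/* No range matched: fall back to the default action. */
	return !(f->flags & MSR_FILTER_DEFAULT_DENY);
}
```

In this model, a default-deny filter with one read-only range at base 0x10 and bitmap byte 0x01 allows reads of MSR 0x10, denies reads of MSR 0x11, and denies all writes outside the x2APIC range.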

4.98 KVM_CREATE_SPAPR_TCE_64
4.98 KVM_CREATE_SPAPR_TCE_64
----------------------------
----------------------------


@@ -5163,10 +5129,13 @@ KVM_PV_ENABLE
  =====      =============================
  =====      =============================


KVM_PV_DISABLE
KVM_PV_DISABLE
  Deregister the VM from the Ultravisor and reclaim the memory that
  Deregister the VM from the Ultravisor and reclaim the memory that had
  had been donated to the Ultravisor, making it usable by the kernel
  been donated to the Ultravisor, making it usable by the kernel again.
  again.  All registered VCPUs are converted back to non-protected
  All registered VCPUs are converted back to non-protected ones. If a
  ones.
  previous protected VM had been prepared for asynchronous teardown with
  KVM_PV_ASYNC_CLEANUP_PREPARE and not subsequently torn down with
  KVM_PV_ASYNC_CLEANUP_PERFORM, it will be torn down in this call
  together with the current protected VM.


KVM_PV_VM_SET_SEC_PARMS
KVM_PV_VM_SET_SEC_PARMS
  Pass the image header from VM memory to the Ultravisor in
  Pass the image header from VM memory to the Ultravisor in
@@ -5289,6 +5258,36 @@ KVM_PV_DUMP
    authentication tag all of which are needed to decrypt the dump at a
    authentication tag all of which are needed to decrypt the dump at a
    later time.
    later time.


KVM_PV_ASYNC_CLEANUP_PREPARE
  :Capability: KVM_CAP_S390_PROTECTED_ASYNC_DISABLE

  Prepare the current protected VM for asynchronous teardown. Most
  resources used by the current protected VM will be set aside for a
  subsequent asynchronous teardown. The current protected VM will then
  resume execution immediately as non-protected. There can be at most
  one protected VM prepared for asynchronous teardown at any time. If
  a protected VM had already been prepared for teardown without
  subsequently calling KVM_PV_ASYNC_CLEANUP_PERFORM, this call will
  fail. In that case, the userspace process should issue a normal
  KVM_PV_DISABLE. The resources set aside with this call will need to
  be cleaned up with a subsequent call to KVM_PV_ASYNC_CLEANUP_PERFORM
  or KVM_PV_DISABLE, otherwise they will be cleaned up when KVM
  terminates. KVM_PV_ASYNC_CLEANUP_PREPARE can be called again as soon
  as cleanup starts, i.e. before KVM_PV_ASYNC_CLEANUP_PERFORM finishes.

KVM_PV_ASYNC_CLEANUP_PERFORM
  :Capability: KVM_CAP_S390_PROTECTED_ASYNC_DISABLE

  Tear down the protected VM previously prepared for teardown with
  KVM_PV_ASYNC_CLEANUP_PREPARE. The resources that had been set aside
  will be freed during the execution of this command. This PV command
  should ideally be issued by userspace from a separate thread. If a
  fatal signal is received (or the process terminates naturally), the
  command will terminate immediately without completing, and the normal
  KVM shutdown procedure will take care of cleaning up all remaining
  protected VMs, including the ones whose teardown was interrupted by
  process termination.
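
The interplay between KVM_PV_ASYNC_CLEANUP_PREPARE, KVM_PV_ASYNC_CLEANUP_PERFORM and KVM_PV_DISABLE described above can be sketched as a tiny state model. All names below are hypothetical and the code does not call into KVM. Two points are assumptions rather than statements from the text: that PREPARE fails when the current VM is not protected, and that PERFORM fails when nothing has been set aside.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the protected-VM teardown state. */
struct pv_state {
	bool protected_vm; /* the current VM is running as protected */
	bool set_aside;    /* a previous protected VM awaits async teardown */
};

/* Model of KVM_PV_ASYNC_CLEANUP_PREPARE: 0 on success, -1 on error. */
int pv_async_cleanup_prepare(struct pv_state *s)
{
	assert(s != NULL);
	if (!s->protected_vm)  /* assumption: nothing to set aside */
		return -1;
	if (s->set_aside)      /* at most one VM may be prepared at a time */
		return -1;
	s->set_aside = true;
	s->protected_vm = false; /* the VM resumes as non-protected */
	return 0;
}

/* Model of KVM_PV_ASYNC_CLEANUP_PERFORM: 0 on success, -1 on error. */
int pv_async_cleanup_perform(struct pv_state *s)
{
	assert(s != NULL);
	if (!s->set_aside)     /* assumption: nothing was prepared */
		return -1;
	/* The slot is released as soon as cleanup starts. */
	s->set_aside = false;
	return 0;
}

/* Model of KVM_PV_DISABLE: tears down the current and the set-aside VM. */
void pv_disable(struct pv_state *s)
{
	assert(s != NULL);
	s->protected_vm = false;
	s->set_aside = false;
}
```

The model makes the fallback path visible: when a second PREPARE fails because a VM is already set aside, a plain KVM_PV_DISABLE clears both the current and the previously prepared VM.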

4.126 KVM_XEN_HVM_SET_ATTR
4.126 KVM_XEN_HVM_SET_ATTR
--------------------------
--------------------------


@@ -5306,6 +5305,7 @@ KVM_PV_DUMP
	union {
	union {
		__u8 long_mode;
		__u8 long_mode;
		__u8 vector;
		__u8 vector;
		__u8 runstate_update_flag;
		struct {
		struct {
			__u64 gfn;
			__u64 gfn;
		} shared_info;
		} shared_info;
@@ -5383,6 +5383,14 @@ KVM_XEN_ATTR_TYPE_XEN_VERSION
  event channel delivery, so responding within the kernel without
  event channel delivery, so responding within the kernel without
  exiting to userspace is beneficial.
  exiting to userspace is beneficial.


KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG
  This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
  support for KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG. It enables the
  XEN_RUNSTATE_UPDATE flag which allows guest vCPUs to safely read
  other vCPUs' vcpu_runstate_info. Xen guests enable this feature via
  the VM_ASST_TYPE_runstate_update_flag of the HYPERVISOR_vm_assist
  hypercall.

4.127 KVM_XEN_HVM_GET_ATTR
4.127 KVM_XEN_HVM_GET_ATTR
--------------------------
--------------------------


@@ -6440,16 +6448,18 @@ if it decides to decode and emulate the instruction.


Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
may instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
exit for writes.
exit for writes.


The "reason" field specifies why the MSR trap occurred. User space will only
The "reason" field specifies why the MSR interception occurred. Userspace will
receive MSR exit traps when a particular reason was requested during through
only receive MSR exits when a particular reason was requested through
ENABLE_CAP. Currently valid exit reasons are:
ENABLE_CAP. Currently valid exit reasons are:


	KVM_MSR_EXIT_REASON_UNKNOWN - access to MSR that is unknown to KVM
============================ ========================================
	KVM_MSR_EXIT_REASON_INVAL - access to invalid MSRs or reserved bits
 KVM_MSR_EXIT_REASON_UNKNOWN access to MSR that is unknown to KVM
	KVM_MSR_EXIT_REASON_FILTER - access blocked by KVM_X86_SET_MSR_FILTER
 KVM_MSR_EXIT_REASON_INVAL   access to invalid MSRs or reserved bits
 KVM_MSR_EXIT_REASON_FILTER  access blocked by KVM_X86_SET_MSR_FILTER
============================ ========================================


For KVM_EXIT_X86_RDMSR, the "index" field tells userspace which MSR the guest
For KVM_EXIT_X86_RDMSR, the "index" field tells userspace which MSR the guest
wants to read. To respond to this request with a successful read, userspace
wants to read. To respond to this request with a successful read, userspace
@@ -6465,6 +6475,8 @@ wants to write. Once finished processing the event, user space must continue
vCPU execution. If the MSR write was unsuccessful, userspace also sets the
vCPU execution. If the MSR write was unsuccessful, userspace also sets the
"error" field to "1".
"error" field to "1".


See KVM_X86_SET_MSR_FILTER for details on the interaction with MSR filtering.

::
::




@@ -7229,8 +7241,8 @@ polling.
:Parameters: args[0] contains the mask of KVM_MSR_EXIT_REASON_* events to report
:Parameters: args[0] contains the mask of KVM_MSR_EXIT_REASON_* events to report
:Returns: 0 on success; -1 on error
:Returns: 0 on success; -1 on error


This capability enables trapping of #GP invoking RDMSR and WRMSR instructions
This capability allows userspace to intercept RDMSR and WRMSR instructions if
into user space.
access to an MSR is denied.  By default, KVM injects #GP on denied accesses.


When a guest requests to read or write an MSR, KVM may not implement all MSRs
When a guest requests to read or write an MSR, KVM may not implement all MSRs
that are relevant to a respective system. It also does not differentiate by
that are relevant to a respective system. It also does not differentiate by
@@ -7238,10 +7250,20 @@ CPU type.


To allow more fine grained control over MSR handling, userspace may enable
To allow more fine grained control over MSR handling, userspace may enable
this capability. With it enabled, MSR accesses that match the mask specified in
this capability. With it enabled, MSR accesses that match the mask specified in
args[0] and trigger a #GP event inside the guest by KVM will instead trigger
args[0] and would trigger a #GP inside the guest will instead trigger
KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications which user space
KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications.  Userspace
can then handle to implement model specific MSR handling and/or user notifications
can then implement model specific MSR handling and/or user notifications
to inform a user that an MSR was not handled.
to inform a user that an MSR was not emulated/virtualized by KVM.

The valid mask flags are:

============================ ===============================================
 KVM_MSR_EXIT_REASON_UNKNOWN intercept accesses to unknown (to KVM) MSRs
 KVM_MSR_EXIT_REASON_INVAL   intercept accesses that are architecturally
                             invalid according to the vCPU model and/or mode
 KVM_MSR_EXIT_REASON_FILTER  intercept accesses that are denied by userspace
                             via KVM_X86_SET_MSR_FILTER
============================ ===============================================
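
How the mask in args[0] selects between an exit to userspace and a #GP can be sketched as follows. This is a minimal model, not KVM code: the enum and function names are hypothetical, and the flag values are local stand-ins assumed to match the uapi definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the KVM_MSR_EXIT_REASON_* flags (assumed values). */
#define MSR_EXIT_REASON_INVAL   (1u << 0)
#define MSR_EXIT_REASON_UNKNOWN (1u << 1)
#define MSR_EXIT_REASON_FILTER  (1u << 2)

/* What happens to an MSR access that KVM would otherwise fail. */
enum msr_outcome {
	MSR_INJECT_GP,         /* default: inject #GP into the guest */
	MSR_EXIT_TO_USERSPACE  /* KVM_EXIT_X86_RDMSR or KVM_EXIT_X86_WRMSR */
};

/*
 * enabled_mask models args[0] passed when enabling the capability;
 * reason is the single reason why this particular access failed.
 */
enum msr_outcome msr_failure_outcome(uint32_t enabled_mask, uint32_t reason)
{
	assert(reason == MSR_EXIT_REASON_INVAL ||
	       reason == MSR_EXIT_REASON_UNKNOWN ||
	       reason == MSR_EXIT_REASON_FILTER);

	return (enabled_mask & reason) ? MSR_EXIT_TO_USERSPACE : MSR_INJECT_GP;
}
```

With only MSR_EXIT_REASON_FILTER enabled, an access denied by the MSR filter exits to userspace, while an access to an MSR unknown to KVM still results in a #GP.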


7.22 KVM_CAP_X86_BUS_LOCK_EXIT
7.22 KVM_CAP_X86_BUS_LOCK_EXIT
-------------------------------
-------------------------------
@@ -7384,8 +7406,9 @@ hibernation of the host; however the VMM needs to manually save/restore the
tags as appropriate if the VM is migrated.
tags as appropriate if the VM is migrated.


When this capability is enabled all memory in memslots must be mapped as
When this capability is enabled all memory in memslots must be mapped as
not-shareable (no MAP_SHARED), attempts to create a memslot with a
``MAP_ANONYMOUS`` or with a RAM-based file mapping (``tmpfs``, ``memfd``),
MAP_SHARED mmap will result in an -EINVAL return.
attempts to create a memslot with an invalid mmap will result in an
-EINVAL return.


When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
perform a bulk copy of tags to/from the guest.
perform a bulk copy of tags to/from the guest.
@@ -7901,7 +7924,7 @@ KVM_EXIT_X86_WRMSR exit notifications.
This capability indicates that KVM supports that accesses to user defined MSRs
This capability indicates that KVM supports that accesses to user defined MSRs
may be rejected. With this capability exposed, KVM exports new VM ioctl
may be rejected. With this capability exposed, KVM exports new VM ioctl
KVM_X86_SET_MSR_FILTER which user space can call to specify bitmaps of MSR
KVM_X86_SET_MSR_FILTER which user space can call to specify bitmaps of MSR
ranges that KVM should reject access to.
ranges that KVM should deny access to.


In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
trap and emulate MSRs that are outside of the scope of KVM as well as
trap and emulate MSRs that are outside of the scope of KVM as well as
@@ -7920,7 +7943,7 @@ regardless of what has actually been exposed through the CPUID leaf.
8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
----------------------------------------------------------
----------------------------------------------------------


:Architectures: x86
:Architectures: x86, arm64
:Parameters: args[0] - size of the dirty log ring
:Parameters: args[0] - size of the dirty log ring


KVM is capable of tracking dirty memory using ring buffers that are
KVM is capable of tracking dirty memory using ring buffers that are
@@ -8002,13 +8025,6 @@ flushing is done by the KVM_GET_DIRTY_LOG ioctl). To achieve that, one
needs to kick the vcpu out of KVM_RUN using a signal.  The resulting
needs to kick the vcpu out of KVM_RUN using a signal.  The resulting
vmexit ensures that all dirty GFNs are flushed to the dirty rings.
vmexit ensures that all dirty GFNs are flushed to the dirty rings.


NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG.  After enabling
KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
machine will switch to ring-buffer dirty page tracking and further
KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.

NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
should be exposed by weakly ordered architecture, in order to indicate
should be exposed by weakly ordered architecture, in order to indicate
the additional memory ordering requirements imposed on userspace when
the additional memory ordering requirements imposed on userspace when
@@ -8017,6 +8033,33 @@ Architecture with TSO-like ordering (such as x86) are allowed to
expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
to userspace.
to userspace.


After enabling the dirty rings, the userspace needs to detect the
capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the
ring structures can be backed by per-slot bitmaps. With this capability
advertised, it means the architecture can dirty guest pages without
vcpu/ring context, so that some of the dirty information will still be
maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
can't be enabled if the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL
hasn't been enabled, or if any memslot already exists.

Note that the bitmap here is only a backup of the ring structure. The
use of the ring and bitmap combination is only beneficial if there is
only a very small amount of memory that is dirtied out of vcpu/ring
context. Otherwise, the stand-alone per-slot bitmap mechanism needs to
be considered.

To collect dirty bits in the backup bitmap, userspace can use the same
KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG isn't needed as long as all
the generation of the dirty bits is done in a single pass. Collecting
the dirty bitmap should be the very last thing that the VMM does before
considering the state as complete. VMM needs to ensure that the dirty
state is final and avoid missing dirty pages from another ioctl ordered
after the bitmap collection.

NOTE: One example of using the backup bitmap is saving arm64 vgic/its
tables through KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} command on
KVM device "kvm-arm-vgic-its" when dirty ring is enabled.
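
The enabling constraint for KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP stated above can be written down as a small check. This is only a sketch of the documented rule with hypothetical names; it does not reflect how KVM tracks capabilities internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-VM view of the dirty-tracking configuration. */
struct dirty_vm {
	bool ring_acq_rel_enabled;     /* KVM_CAP_DIRTY_LOG_RING_ACQ_REL enabled */
	bool ring_with_bitmap_enabled; /* backup bitmap in use */
	unsigned int nr_memslots;      /* memslots created so far */
};

/* Model of enabling the backup bitmap: 0 on success, -1 on error. */
int enable_ring_with_bitmap(struct dirty_vm *vm)
{
	assert(vm != NULL);

	/* The acquire/release dirty ring must already be enabled. */
	if (!vm->ring_acq_rel_enabled)
		return -1;
	/* The bitmap must be requested before any memslot exists. */
	if (vm->nr_memslots > 0)
		return -1;

	vm->ring_with_bitmap_enabled = true;
	return 0;
}
```

In practice this means a VMM would enable the ring, then the backup bitmap, and only afterwards create memslots.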

8.30 KVM_CAP_XEN_HVM
8.30 KVM_CAP_XEN_HVM
--------------------
--------------------


@@ -8031,6 +8074,7 @@ PVHVM guests. Valid flags are::
  #define KVM_XEN_HVM_CONFIG_RUNSTATE			(1 << 3)
  #define KVM_XEN_HVM_CONFIG_RUNSTATE			(1 << 3)
  #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL		(1 << 4)
  #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL		(1 << 4)
  #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND		(1 << 5)
  #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND		(1 << 5)
  #define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG	(1 << 6)


The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG
The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG
ioctl is available, for the guest to set its hypercall page.
ioctl is available, for the guest to set its hypercall page.
@@ -8062,6 +8106,18 @@ KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID/TIMER/UPCALL_VECTOR vCPU attributes.
related to event channel delivery, timers, and the XENVER_version
related to event channel delivery, timers, and the XENVER_version
interception.
interception.


The KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG flag indicates that KVM supports
the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute in the KVM_XEN_SET_ATTR
and KVM_XEN_GET_ATTR ioctls. This controls whether KVM will set the
XEN_RUNSTATE_UPDATE flag in guest memory mapped vcpu_runstate_info during
updates of the runstate information. Note that versions of KVM which support
the RUNSTATE feature above, but not this RUNSTATE_UPDATE_FLAG feature, will
always set the XEN_RUNSTATE_UPDATE flag when updating the guest structure,
which is perhaps counterintuitive. When this flag is advertised, KVM will
behave more correctly, not using the XEN_RUNSTATE_UPDATE flag until/unless
specifically enabled (by the guest making the hypercall, causing the VMM
to enable the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute).
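
The three cases described above can be sketched as a small classifier. This
is an illustration only: ``xen_caps`` is assumed to be the flag mask a VMM
obtains by checking KVM_CAP_XEN_HVM, and the enum and function names are
invented for this example. The two flag values are the ones listed earlier in
this section.

```c
#include <assert.h>

/* Values quoted in the flag list above; normally provided by <linux/kvm.h>. */
#ifndef KVM_XEN_HVM_CONFIG_RUNSTATE
#define KVM_XEN_HVM_CONFIG_RUNSTATE             (1 << 3)
#endif
#ifndef KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG
#define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6)
#endif

/* Hypothetical names, used only in this sketch. */
enum runstate_flag_policy {
    RUNSTATE_UNAVAILABLE,       /* KVM does not update runstate info at all */
    RUNSTATE_FLAG_ALWAYS_SET,   /* older KVM: XEN_RUNSTATE_UPDATE always used */
    RUNSTATE_FLAG_CONTROLLABLE, /* VMM decides via the RUNSTATE_UPDATE_FLAG attr */
};

static enum runstate_flag_policy classify_xen_caps(int xen_caps)
{
    assert(xen_caps >= 0); /* a negative value would be an ioctl error */

    if (xen_caps & KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG)
        return RUNSTATE_FLAG_CONTROLLABLE;
    if (xen_caps & KVM_XEN_HVM_CONFIG_RUNSTATE)
        return RUNSTATE_FLAG_ALWAYS_SET;
    return RUNSTATE_UNAVAILABLE;
}
```

Only in the ``RUNSTATE_FLAG_CONTROLLABLE`` case does it make sense for the VMM
to set KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG in response to the guest's
hypercall.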

8.31 KVM_CAP_PPC_MULTITCE
-------------------------

+8 −6
@@ -23,21 +23,23 @@ the PV_TIME_FEATURES hypercall should be probed using the SMCCC 1.1
ARCH_FEATURES mechanism before calling it.

PV_TIME_FEATURES

    ============= ========    =================================================
    Function ID:  (uint32)    0xC5000020
    PV_call_id:   (uint32)    The function to query for support.
                              Currently only PV_TIME_ST is supported.
    Return value: (int64)     NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
                              PV-time feature is supported by the hypervisor.
    ============= ========    =================================================

PV_TIME_ST

    ============= ========    ==============================================
    Function ID:  (uint32)    0xC5000021
    Return value: (int64)     IPA of the stolen time data structure for this
                              VCPU. On failure:
                              NOT_SUPPORTED (-1)
    ============= ========    ==============================================

The IPA returned by PV_TIME_ST should be mapped by the guest as normal memory
with inner and outer write back caching attributes, in the inner shareable
@@ -76,5 +78,5 @@ It is advisable that one or more 64k pages are set aside for the purpose of
these structures and not used for other purposes, this enables the guest to map
the region using 64k pages and avoids conflicting attributes with other memory.

For the user space interface see
:ref:`Documentation/virt/kvm/devices/vcpu.rst <kvm_arm_vcpu_pvtime_ctrl>`.
+4 −1
@@ -52,7 +52,10 @@ KVM_DEV_ARM_VGIC_GRP_CTRL

    KVM_DEV_ARM_ITS_SAVE_TABLES
      save the ITS table data into guest RAM, at the location provisioned
      by the guest in corresponding registers/table entries. Should userspace
      require a form of dirty tracking to identify which pages are modified
      by the saving process, it should use a bitmap even if using another
      mechanism to track the memory dirtied by the vCPUs.

      The layout of the tables in guest memory defines an ABI. The entries
      are laid out in little endian format as described in the last paragraph.
+2 −0
@@ -171,6 +171,8 @@ configured values on other VCPUs. Userspace should configure the interrupt
numbers on at least one VCPU after creating all VCPUs and before running any
VCPUs.

.. _kvm_arm_vcpu_pvtime_ctrl:

3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL
==================================

+10 −0
@@ -11438,6 +11438,16 @@ F: arch/x86/kvm/svm/hyperv.*
F:	arch/x86/kvm/svm/svm_onhyperv.*
F:	arch/x86/kvm/vmx/evmcs.*
KVM X86 Xen (KVM/Xen)
M:	David Woodhouse <dwmw2@infradead.org>
M:	Paul Durrant <paul@xen.org>
M:	Sean Christopherson <seanjc@google.com>
M:	Paolo Bonzini <pbonzini@redhat.com>
L:	kvm@vger.kernel.org
S:	Supported
T:	git git://git.kernel.org/pub/scm/virt/kvm/kvm.git
F:	arch/x86/kvm/xen.*
KERNFS
M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
M:	Tejun Heo <tj@kernel.org>