Commit 36824f19 authored by Linus Torvalds
Pull kvm updates from Paolo Bonzini:
 "This covers all architectures (except MIPS) so I don't expect any
  other feature pull requests this merge window.

  ARM:

   - Add MTE support in guests, complete with tag save/restore interface

   - Reduce the impact of CMOs by moving them in the page-table code

   - Allow device block mappings at stage-2

   - Reduce the footprint of the vmemmap in protected mode

   - Support the vGIC on dumb systems such as the Apple M1

   - Add selftest infrastructure to support multiple configuration and
     apply that to PMU/non-PMU setups

   - Add selftests for the debug architecture

   - The usual crop of PMU fixes

  PPC:

   - Support for the H_RPT_INVALIDATE hypercall

   - Conversion of Book3S entry/exit to C

   - Bug fixes

  S390:

   - new HW facilities for guests

   - make inline assembly more robust with KASAN and co

  x86:

   - Allow userspace to handle emulation errors (unknown instructions)

   - Lazy allocation of the rmap (host physical -> guest physical
     address)

   - Support for virtualizing TSC scaling on VMX machines

   - Optimizations to avoid shattering huge pages at the beginning of
     live migration

   - Support for initializing the PDPTRs without loading them from
     memory

   - Many TLB flushing cleanups

   - Refuse to load if two-stage paging is available but NX is not (this
     has been a requirement in practice for over a year)

   - A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from
     CR0/CR4/EFER, using the MMU mode everywhere once it is computed
     from the CPU registers

   - Use PM notifier to notify the guest about host suspend or hibernate

   - Support for passing arguments to Hyper-V hypercalls using XMM
     registers

   - Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap
     on AMD processors

   - Hide Hyper-V hypercalls that are not included in the guest CPUID

   - Fixes for live migration of virtual machines that use the Hyper-V
     "enlightened VMCS" optimization of nested virtualization

   - Bugfixes (not many)

  Generic:

   - Support for retrieving statistics without debugfs

   - Cleanups for the KVM selftests API"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (314 commits)
  KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled
  kvm: x86: disable the narrow guest module parameter on unload
  selftests: kvm: Allows userspace to handle emulation errors.
  kvm: x86: Allow userspace to handle emulation errors
  KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is on
  KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault
  KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault
  KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT
  KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic
  KVM: x86: Enhance comments for MMU roles and nested transition trickiness
  KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE
  KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMU
  KVM: x86/mmu: Use MMU's role to determine PTTYPE
  KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers
  KVM: x86/mmu: Add a helper to calculate root from role_regs
  KVM: x86/mmu: Add helper to update paging metadata
  KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0
  KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls
  KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper
  KVM: x86/mmu: Get nested MMU's root level from the MMU's role
  ...
parents 9840cfcb b8917b4a
+353 −3
@@ -688,9 +688,14 @@ MSRs that have been set successfully.
Defines the vcpu responses to the cpuid instruction.  Applications
should use the KVM_SET_CPUID2 ioctl if available.

Note, when this IOCTL fails, KVM gives no guarantees that previous valid CPUID
configuration (if there is) is not corrupted. Userspace can get a copy of the
resulting CPUID configuration through KVM_GET_CPUID2 in case.
Caveat emptor:
  - If this IOCTL fails, KVM gives no guarantees that the previous valid CPUID
    configuration (if any) is not corrupted. Userspace can retrieve a copy
    of the resulting CPUID configuration through KVM_GET_CPUID2 if needed.
  - Using KVM_SET_CPUID{,2} after KVM_RUN, i.e. changing the guest vCPU model
    after running the guest, may cause guest instability.
  - Using heterogeneous CPUID configurations, modulo APIC IDs, topology, etc...
    may cause guest instability.

::

@@ -5034,6 +5039,260 @@ see KVM_XEN_VCPU_SET_ATTR above.
The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
with the KVM_XEN_VCPU_GET_ATTR ioctl.

4.130 KVM_ARM_MTE_COPY_TAGS
---------------------------

:Capability: KVM_CAP_ARM_MTE
:Architectures: arm64
:Type: vm ioctl
:Parameters: struct kvm_arm_copy_mte_tags
:Returns: number of bytes copied, < 0 on error (-EINVAL for incorrect
          arguments, -EFAULT if memory cannot be accessed).

::

  struct kvm_arm_copy_mte_tags {
	__u64 guest_ipa;
	__u64 length;
	void __user *addr;
	__u64 flags;
	__u64 reserved[2];
  };

Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
field must point to a buffer which the tags will be copied to or from.

``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
``KVM_ARM_TAGS_FROM_GUEST``.

The size of the buffer to store the tags is ``(length / 16)`` bytes
(granules in MTE are 16 bytes long). Each byte contains a single tag
value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
``PTRACE_POKEMTETAGS``.

If an error occurs before any data is copied then a negative error code is
returned. If some tags have been copied before an error occurs then the number
of bytes successfully copied is returned. If the call completes successfully
then ``length`` is returned.
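
The sizing rule above can be sketched as a small helper. This is a minimal
illustration, not kernel code: ``mte_tag_buf_size`` and ``MTE_GRANULE_SIZE``
are names invented for this example, and the page size is passed in by the
caller rather than queried from the system.

```c
#include <stddef.h>
#include <stdint.h>

/* MTE granule size in bytes, as stated in the text (one tag byte per granule). */
#define MTE_GRANULE_SIZE 16u

/*
 * Hypothetical helper: return the number of tag bytes needed to cover
 * `length` bytes of guest memory starting at `guest_ipa`, or 0 if either
 * value is not aligned to `page_size` (the text requires both to be
 * PAGE_SIZE aligned).
 */
static size_t mte_tag_buf_size(uint64_t guest_ipa, uint64_t length,
			       uint64_t page_size)
{
	if (page_size == 0 || guest_ipa % page_size || length % page_size)
		return 0;
	/* One byte of tag storage per 16-byte granule. */
	return (size_t)(length / MTE_GRANULE_SIZE);
}
```

A VMM would allocate a buffer of this size, point ``addr`` at it, and treat a
return value smaller than ``length`` as a partial copy.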

4.131 KVM_GET_SREGS2
--------------------

:Capability: KVM_CAP_SREGS2
:Architectures: x86
:Type: vcpu ioctl
:Parameters: struct kvm_sregs2 (out)
:Returns: 0 on success, -1 on error

Reads special registers from the vcpu.
This ioctl (when supported) replaces KVM_GET_SREGS.

::

struct kvm_sregs2 {
	/* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
	struct kvm_segment cs, ds, es, fs, gs, ss;
	struct kvm_segment tr, ldt;
	struct kvm_dtable gdt, idt;
	__u64 cr0, cr2, cr3, cr4, cr8;
	__u64 efer;
	__u64 apic_base;
	__u64 flags;
	__u64 pdptrs[4];
};

flags values for ``kvm_sregs2``:

``KVM_SREGS2_FLAGS_PDPTRS_VALID``

  Indicates that the struct contains valid PDPTR values.


4.132 KVM_SET_SREGS2
--------------------

:Capability: KVM_CAP_SREGS2
:Architectures: x86
:Type: vcpu ioctl
:Parameters: struct kvm_sregs2 (in)
:Returns: 0 on success, -1 on error

Writes special registers into the vcpu.
See KVM_GET_SREGS2 for the data structures.
This ioctl (when supported) replaces KVM_SET_SREGS.

4.133 KVM_GET_STATS_FD
----------------------

:Capability: KVM_CAP_STATS_BINARY_FD
:Architectures: all
:Type: vm ioctl, vcpu ioctl
:Parameters: none
:Returns: statistics file descriptor on success, < 0 on error

Errors:

  ======     ======================================================
  ENOMEM     if the fd could not be created due to lack of memory
  EMFILE     if the number of opened files exceeds the limit
  ======     ======================================================

The returned file descriptor can be used to read VM/vCPU statistics data in
binary format. The data in the file descriptor consists of four blocks
organized as follows:

+-------------+
|   Header    |
+-------------+
|  id string  |
+-------------+
| Descriptors |
+-------------+
| Stats Data  |
+-------------+

Apart from the header starting at offset 0, please be aware that it is
not guaranteed that the four blocks are adjacent or in the above order;
the offsets of the id, descriptors and data blocks are found in the
header.  However, all four blocks are aligned to 64 bit offsets in the
file and they do not overlap.

All blocks except the data block are immutable.  Userspace can read them
only one time after retrieving the file descriptor, and then use ``pread`` or
``lseek`` to read the statistics repeatedly.

All data is in system endianness.

The format of the header is as follows::

	struct kvm_stats_header {
		__u32 flags;
		__u32 name_size;
		__u32 num_desc;
		__u32 id_offset;
		__u32 desc_offset;
		__u32 data_offset;
	};

The ``flags`` field is not used at the moment. It is always read as 0.

The ``name_size`` field is the size (in bytes) of the statistics name string
(including trailing '\0') which is contained in the "id string" block and
appended at the end of every descriptor.

The ``num_desc`` field is the number of descriptors that are included in the
descriptor block.  (The actual number of values in the data block may be
larger, since each descriptor may comprise more than one value).

The ``id_offset`` field is the offset of the id string from the start of the
file indicated by the file descriptor. It is a multiple of 8.

The ``desc_offset`` field is the offset of the Descriptors block from the start
of the file indicated by the file descriptor. It is a multiple of 8.

The ``data_offset`` field is the offset of the Stats Data block from the start
of the file indicated by the file descriptor. It is a multiple of 8.

The id string block contains a string which identifies the file descriptor on
which KVM_GET_STATS_FD was invoked.  The size of the block, including the
trailing ``'\0'``, is indicated by the ``name_size`` field in the header.
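
The constraints stated above (offsets are multiples of 8, blocks do not
overlap the header at offset 0, ``name_size`` includes the trailing NUL) can
be checked before trusting a header. The following is a minimal sketch under
those assumptions; ``demo_stats_header`` is a local mirror of the layout shown
above and ``demo_stats_header_ok`` is a hypothetical name, not a KVM API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the header layout shown above (six 32-bit fields). */
struct demo_stats_header {
	uint32_t flags;
	uint32_t name_size;
	uint32_t num_desc;
	uint32_t id_offset;
	uint32_t desc_offset;
	uint32_t data_offset;
};

/*
 * Hypothetical sanity check: every block offset must be a multiple of 8
 * and must not fall inside the header, which starts at offset 0.
 */
static bool demo_stats_header_ok(const struct demo_stats_header *h)
{
	const uint32_t offs[3] = { h->id_offset, h->desc_offset, h->data_offset };

	for (int i = 0; i < 3; i++) {
		if (offs[i] % 8 != 0 || offs[i] < sizeof(*h))
			return false;
	}
	/* name_size counts the trailing '\0', so it cannot be zero. */
	return h->name_size > 0;
}
```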

The descriptors block only needs to be read once for the lifetime of the
file descriptor. It contains a sequence of ``struct kvm_stats_desc``, each
followed by a string of size ``name_size``::

	#define KVM_STATS_TYPE_SHIFT		0
	#define KVM_STATS_TYPE_MASK		(0xF << KVM_STATS_TYPE_SHIFT)
	#define KVM_STATS_TYPE_CUMULATIVE	(0x0 << KVM_STATS_TYPE_SHIFT)
	#define KVM_STATS_TYPE_INSTANT		(0x1 << KVM_STATS_TYPE_SHIFT)
	#define KVM_STATS_TYPE_PEAK		(0x2 << KVM_STATS_TYPE_SHIFT)

	#define KVM_STATS_UNIT_SHIFT		4
	#define KVM_STATS_UNIT_MASK		(0xF << KVM_STATS_UNIT_SHIFT)
	#define KVM_STATS_UNIT_NONE		(0x0 << KVM_STATS_UNIT_SHIFT)
	#define KVM_STATS_UNIT_BYTES		(0x1 << KVM_STATS_UNIT_SHIFT)
	#define KVM_STATS_UNIT_SECONDS		(0x2 << KVM_STATS_UNIT_SHIFT)
	#define KVM_STATS_UNIT_CYCLES		(0x3 << KVM_STATS_UNIT_SHIFT)

	#define KVM_STATS_BASE_SHIFT		8
	#define KVM_STATS_BASE_MASK		(0xF << KVM_STATS_BASE_SHIFT)
	#define KVM_STATS_BASE_POW10		(0x0 << KVM_STATS_BASE_SHIFT)
	#define KVM_STATS_BASE_POW2		(0x1 << KVM_STATS_BASE_SHIFT)

	struct kvm_stats_desc {
		__u32 flags;
		__s16 exponent;
		__u16 size;
		__u32 offset;
		__u32 unused;
		char name[];
	};

The ``flags`` field contains the type and unit of the statistics data described
by this descriptor. Its endianness is CPU native.
The following flags are supported:

Bits 0-3 of ``flags`` encode the type:
  * ``KVM_STATS_TYPE_CUMULATIVE``
    The statistics data is cumulative. The value of data can only be increased.
    Most of the counters used in KVM are of this type.
    The corresponding ``size`` field for this type is always 1.
    All cumulative statistics data are read/write.
  * ``KVM_STATS_TYPE_INSTANT``
    The statistics data is instantaneous. Its value can be increased or
    decreased. This type is usually used as a measurement of some resources,
    like the number of dirty pages, the number of large pages, etc.
    All instant statistics are read only.
    The corresponding ``size`` field for this type is always 1.
  * ``KVM_STATS_TYPE_PEAK``
    The statistics data is peak. The value of data can only be increased, and
    represents a peak value for a measurement, for example the maximum number
    of items in a hash table bucket, the longest time waited and so on.
    The corresponding ``size`` field for this type is always 1.

Bits 4-7 of ``flags`` encode the unit:
  * ``KVM_STATS_UNIT_NONE``
    There is no unit for the value of statistics data. This usually means that
    the value is a simple counter of an event.
  * ``KVM_STATS_UNIT_BYTES``
    It indicates that the statistics data is used to measure memory size, in the
    unit of Byte, KiByte, MiByte, GiByte, etc. The unit of the data is
    determined by the ``exponent`` field in the descriptor.
  * ``KVM_STATS_UNIT_SECONDS``
    It indicates that the statistics data is used to measure time or latency.
  * ``KVM_STATS_UNIT_CYCLES``
    It indicates that the statistics data is used to measure CPU clock cycles.

Bits 8-11 of ``flags``, together with ``exponent``, encode the scale of the
unit:
  * ``KVM_STATS_BASE_POW10``
    The scale is based on power of 10. It is used for measurement of time and
    CPU clock cycles.  For example, an exponent of -9 can be used with
    ``KVM_STATS_UNIT_SECONDS`` to express that the unit is nanoseconds.
  * ``KVM_STATS_BASE_POW2``
    The scale is based on power of 2. It is used for measurement of memory size.
    For example, an exponent of 20 can be used with ``KVM_STATS_UNIT_BYTES`` to
    express that the unit is MiB.
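
The bit layout described above can be decoded with plain masks and shifts.
The sketch below repeats the relevant constants from the listing above so that
it stands alone; ``demo_stat_kind``, ``demo_decode_flags`` and ``demo_scale``
are names invented for this illustration.

```c
#include <stdint.h>

/* Constants repeated from the definitions in the text. */
#define KVM_STATS_TYPE_SHIFT	0
#define KVM_STATS_TYPE_MASK	(0xF << KVM_STATS_TYPE_SHIFT)
#define KVM_STATS_UNIT_SHIFT	4
#define KVM_STATS_UNIT_MASK	(0xF << KVM_STATS_UNIT_SHIFT)
#define KVM_STATS_BASE_SHIFT	8
#define KVM_STATS_BASE_MASK	(0xF << KVM_STATS_BASE_SHIFT)
#define KVM_STATS_BASE_POW2	(0x1 << KVM_STATS_BASE_SHIFT)

/* Hypothetical container for the three decoded 4-bit fields. */
struct demo_stat_kind {
	unsigned int type;	/* bits 0-3 */
	unsigned int unit;	/* bits 4-7 */
	unsigned int base;	/* bits 8-11 */
};

static struct demo_stat_kind demo_decode_flags(uint32_t flags)
{
	struct demo_stat_kind k = {
		.type = (flags & KVM_STATS_TYPE_MASK) >> KVM_STATS_TYPE_SHIFT,
		.unit = (flags & KVM_STATS_UNIT_MASK) >> KVM_STATS_UNIT_SHIFT,
		.base = (flags & KVM_STATS_BASE_MASK) >> KVM_STATS_BASE_SHIFT,
	};
	return k;
}

/* Scale factor base^exponent, e.g. 2^20 for MiB or 10^-9 for nanoseconds. */
static double demo_scale(uint32_t flags, int16_t exponent)
{
	double base = ((flags & KVM_STATS_BASE_MASK) == KVM_STATS_BASE_POW2) ? 2.0 : 10.0;
	double scale = 1.0;
	int n = exponent < 0 ? -exponent : exponent;

	while (n--)
		scale = exponent < 0 ? scale / base : scale * base;
	return scale;
}
```

Multiplying a raw value by ``demo_scale()`` converts it to the base unit
(bytes, seconds or cycles) named by the unit field.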

The ``size`` field is the number of values of this statistics data. Its
value is 1 for most simple statistics, meaning that the statistic is a
single unsigned 64-bit value.

The ``offset`` field is the offset from the start of Data Block to the start of
the corresponding statistics data.

The ``unused`` field is reserved for future support for other types of
statistics data, like log/linear histogram. Its value is always 0 for the types
defined above.

The ``name`` field is the name string of the statistics data. The name string
starts at the end of ``struct kvm_stats_desc``.  The maximum length including
the trailing ``'\0'``, is indicated by ``name_size`` in the header.

The Stats Data block contains an array of 64-bit values in the same order
as the descriptors in Descriptors block.
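
Putting the layout together, the file position of a descriptor and of its
values follows from the header fields. The sketch below is illustrative only:
``demo_stats_desc`` mirrors the structure shown above, the helper names are
invented, and it assumes that the ``offset`` field is expressed in bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the descriptor layout shown above. */
struct demo_stats_desc {
	uint32_t flags;
	int16_t  exponent;
	uint16_t size;
	uint32_t offset;
	uint32_t unused;
	char     name[];
};

/*
 * File offset of descriptor `i`. Each descriptor is followed by a name of
 * `name_size` bytes, so consecutive descriptors are that much further apart.
 */
static uint64_t demo_desc_file_offset(uint32_t desc_offset, uint32_t name_size,
				      uint32_t i)
{
	uint64_t stride = sizeof(struct demo_stats_desc) + name_size;

	return (uint64_t)desc_offset + (uint64_t)i * stride;
}

/*
 * File offset of value `idx` (0 <= idx < size) of a statistic whose
 * descriptor reports `stat_offset`. Assumes byte offsets and 64-bit values.
 */
static uint64_t demo_value_file_offset(uint32_t data_offset, uint32_t stat_offset,
				       uint32_t idx)
{
	return (uint64_t)data_offset + stat_offset + (uint64_t)idx * sizeof(uint64_t);
}
```

A reader would typically ``pread`` the descriptors once at these offsets, then
``pread`` only the value offsets on every refresh.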

5. The kvm_run structure
========================

@@ -6323,6 +6582,7 @@ KVM_RUN_BUS_LOCK flag is used to distinguish between them.
This capability can be used to check / enable 2nd DAWR feature provided
by POWER10 processor.


7.24 KVM_CAP_VM_COPY_ENC_CONTEXT_FROM
-------------------------------------

@@ -6362,6 +6622,66 @@ default.

See Documentation/x86/sgx/2.Kernel-internals.rst for more details.

7.26 KVM_CAP_PPC_RPT_INVALIDATE
-------------------------------

:Capability: KVM_CAP_PPC_RPT_INVALIDATE
:Architectures: ppc
:Type: vm

This capability indicates that the kernel is capable of handling
the H_RPT_INVALIDATE hcall.

In order to enable the use of H_RPT_INVALIDATE in the guest,
user space might have to advertise it for the guest. For example,
an IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is
present in the "ibm,hypertas-functions" device-tree property.

This capability is enabled for hypervisors on platforms like POWER9
that support radix MMU.

7.27 KVM_CAP_EXIT_ON_EMULATION_FAILURE
--------------------------------------

:Architectures: x86
:Parameters: args[0] whether the feature should be enabled or not

When this capability is enabled, an emulation failure will result in an exit
to userspace with KVM_INTERNAL_ERROR (except when the emulator was invoked
to handle a VMware backdoor instruction). Furthermore, KVM will now provide up
to 15 instruction bytes for any exit to userspace resulting from an emulation
failure.  When these exits to userspace occur, use the emulation_failure struct
instead of the internal struct.  They both have the same layout, but the
emulation_failure struct matches the content better.  It also explicitly
defines the 'flags' field which is used to describe the fields in the struct
that are valid (i.e. if KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES is
set in the 'flags' field then both 'insn_size' and 'insn_bytes' have valid data
in them.)
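
As a rough sketch of how a VMM might consume such an exit, the helper below
returns the instruction bytes only when the flag says they are valid. The
structure is a local stand-in whose field order is an assumption made for
this example, and ``DEMO_EMUL_FLAG_INSN_BYTES`` uses an assumed bit value;
real code should use the ``emulation_failure`` member of ``struct kvm_run``
and the flag definition from ``<linux/kvm.h>``.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES. */
#define DEMO_EMUL_FLAG_INSN_BYTES (1ULL << 0)

/* Illustrative mirror of the emulation_failure data; layout is an assumption. */
struct demo_emulation_failure {
	uint32_t suberror;
	uint32_t ndata;
	uint64_t flags;
	uint8_t  insn_size;
	uint8_t  insn_bytes[15];
};

/*
 * Hypothetical helper: store a pointer to the instruction bytes in *bytes
 * and return how many are valid, or return 0 if the flag is not set.
 */
static size_t demo_insn_bytes(const struct demo_emulation_failure *ef,
			      const uint8_t **bytes)
{
	size_t n;

	if (!(ef->flags & DEMO_EMUL_FLAG_INSN_BYTES))
		return 0;
	n = ef->insn_size;
	if (n > sizeof(ef->insn_bytes))	/* never more than 15 bytes */
		n = sizeof(ef->insn_bytes);
	*bytes = ef->insn_bytes;
	return n;
}
```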

7.28 KVM_CAP_ARM_MTE
--------------------

:Architectures: arm64
:Parameters: none

This capability indicates that KVM (and the hardware) supports exposing the
Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
VMM before creating any VCPUs to allow the guest access. Note that MTE is only
available to a guest running in AArch64 mode and enabling this capability will
cause attempts to create AArch32 VCPUs to fail.

When enabled the guest is able to access tags associated with any memory given
to the guest. KVM will ensure that the tags are maintained during swap or
hibernation of the host; however the VMM needs to manually save/restore the
tags as appropriate if the VM is migrated.

When this capability is enabled all memory in memslots must be mapped as
not-shareable (no MAP_SHARED), attempts to create a memslot with a
MAP_SHARED mmap will result in an -EINVAL return.

When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
perform a bulk copy of tags to/from the guest.

8. Other capabilities.
======================

@@ -6891,3 +7211,33 @@ This capability is always enabled.
This capability indicates that the KVM virtual PTP service is
supported in the host. A VMM can check whether the service is
available to the guest on migration.

8.33 KVM_CAP_HYPERV_ENFORCE_CPUID
---------------------------------

:Architectures: x86

When enabled, KVM will disable emulated Hyper-V features provided to the
guest according to the bits in the Hyper-V CPUID feature leaves. Otherwise, all
currently implemented Hyper-V features are provided unconditionally when
Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001)
leaf.

8.34 KVM_CAP_EXIT_HYPERCALL
---------------------------

:Capability: KVM_CAP_EXIT_HYPERCALL
:Architectures: x86
:Type: vm

This capability, if enabled, will cause KVM to exit to userspace
with KVM_EXIT_HYPERCALL exit reason to process some hypercalls.

Calling KVM_CHECK_EXTENSION for this capability will return a bitmask
of hypercalls that can be configured to exit to userspace.
Right now, the only such hypercall is KVM_HC_MAP_GPA_RANGE.

The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
of the result of KVM_CHECK_EXTENSION.  KVM will forward to userspace
the hypercalls whose corresponding bit is in the argument, and return
ENOSYS for the others.
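
The subset rule above amounts to a single mask test. The following sketch is
illustrative: the helper name is invented, and both the hypercall number used
for ``DEMO_KVM_HC_MAP_GPA_RANGE`` and the convention that a hypercall's bit
position equals its number are assumptions to verify against
``<linux/kvm_para.h>``.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed hypercall number; take the real value from the kernel headers. */
#define DEMO_KVM_HC_MAP_GPA_RANGE 12

/*
 * `supported` is the bitmask returned by KVM_CHECK_EXTENSION and
 * `requested` is the argument intended for KVM_ENABLE_CAP. The request
 * is acceptable only if it sets no bit outside the supported set.
 */
static bool demo_exit_hypercall_mask_ok(uint64_t requested, uint64_t supported)
{
	return (requested & ~supported) == 0;
}
```

Checking the mask in userspace first gives a clearer error message than
relying on ``KVM_ENABLE_CAP`` to reject the request.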
+7 −0
@@ -96,6 +96,13 @@ KVM_FEATURE_MSI_EXT_DEST_ID 15 guest checks this feature bit
                                               before using extended destination
                                               ID bits in MSI address bits 11-5.

KVM_FEATURE_HC_MAP_GPA_RANGE       16          guest checks this feature bit before
                                               using the map gpa range hypercall
                                               to notify the page state change

KVM_FEATURE_MIGRATION_CONTROL      17          guest checks this feature bit before
                                               using MSR_KVM_MIGRATION_CONTROL

KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24          host will warn if no guest-side
                                               per-cpu warps are expected in
                                               kvmclock
+21 −0
@@ -169,3 +169,24 @@ a0: destination APIC ID

:Usage example: When sending a call-function IPI-many to vCPUs, yield if
	        any of the IPI target vCPUs was preempted.

8. KVM_HC_MAP_GPA_RANGE
-------------------------
:Architecture: x86
:Status: active
:Purpose: Request KVM to map a GPA range with the specified attributes.

a0: the guest physical address of the start page
a1: the number of (4kb) pages (must be contiguous in GPA space)
a2: attributes

    Where 'attributes' :
        * bits  3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
        * bit     4 - plaintext = 0, encrypted = 1
        * bits 63:5 - reserved (must be zero)
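
The attribute layout listed above can be sketched as a pair of helpers, one
that builds the ``a2`` value and one that rejects values with reserved bits
set. The function names are invented for this illustration and are not part
of any KVM header.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Build the attributes word: bits 3:0 carry the page size encoding
 * (0 = 4kb, 1 = 2mb, 2 = 1gb, ...) and bit 4 is set for encrypted pages.
 * Returns false if the encoding does not fit in four bits.
 */
static bool demo_map_gpa_attrs(unsigned int page_size_enc, bool encrypted,
			       uint64_t *attrs)
{
	if (page_size_enc > 0xF)
		return false;
	*attrs = (uint64_t)page_size_enc | ((uint64_t)encrypted << 4);
	return true;
}

/* Bits 63:5 are reserved and must be zero. */
static bool demo_map_gpa_attrs_valid(uint64_t attrs)
{
	return (attrs >> 5) == 0;
}
```

For example, a 2mb encrypted mapping yields ``0x11`` (encoding 1 plus bit 4).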

**Implementation note**: this hypercall is implemented in userspace via
the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability
before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID.  In
addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace
must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.
+5 −0
@@ -16,6 +16,11 @@ The acquisition orders for mutexes are as follows:
- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

- Unlike kvm->slots_lock, kvm->slots_arch_lock is released before
  synchronize_srcu(&kvm->srcu).  Therefore kvm->slots_arch_lock
  can be taken inside a kvm->srcu read-side critical section,
  while kvm->slots_lock cannot.

On x86:

- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
+2 −5
@@ -180,8 +180,8 @@ Shadow pages contain the following information:
  role.gpte_is_8_bytes:
    Reflects the size of the guest PTE for which the page is valid, i.e. '1'
    if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
  role.nxe:
    Contains the value of efer.nxe for which the page is valid.
  role.efer_nx:
    Contains the value of efer.nx for which the page is valid.
  role.cr0_wp:
    Contains the value of cr0.wp for which the page is valid.
  role.smep_andnot_wp:
@@ -192,9 +192,6 @@ Shadow pages contain the following information:
    Contains the value of cr4.smap && !cr0.wp for which the page is valid
    (pages for which this is true are different from other pages; see the
    treatment of cr0.wp=0 below).
  role.ept_sp:
    This is a virtual flag to denote a shadowed nested EPT page.  ept_sp
    is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
  role.smm:
    Is 1 if the page is valid in system management mode.  This field
    determines which of the kvm_memslots array was used to build this