Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm (36824f19)

Documentation/virt/kvm/api.rst

+353 −3

Original line number	Diff line number	Diff line
		@@ -688,9 +688,14 @@ MSRs that have been set successfully.
		Defines the vcpu responses to the cpuid instruction. Applications
		should use the KVM_SET_CPUID2 ioctl if available.

		Note, when this IOCTL fails, KVM gives no guarantees that previous valid CPUID
		configuration (if there is) is not corrupted. Userspace can get a copy of the
		resulting CPUID configuration through KVM_GET_CPUID2 in case.
		Caveat emptor:
		- If this IOCTL fails, KVM gives no guarantee that the previous valid CPUID
		configuration (if there is one) is not corrupted. In that case, userspace can
		get a copy of the resulting CPUID configuration through KVM_GET_CPUID2.
		- Using KVM_SET_CPUID{,2} after KVM_RUN, i.e. changing the guest vCPU model
		after running the guest, may cause guest instability.
		- Using heterogeneous CPUID configurations, modulo APIC IDs, topology, etc...
		may cause guest instability.

		::

		@@ -5034,6 +5039,260 @@ see KVM_XEN_VCPU_SET_ATTR above.
		The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
		with the KVM_XEN_VCPU_GET_ATTR ioctl.

		4.130 KVM_ARM_MTE_COPY_TAGS
		---------------------------

		:Capability: KVM_CAP_ARM_MTE
		:Architectures: arm64
		:Type: vm ioctl
		:Parameters: struct kvm_arm_copy_mte_tags
		:Returns: number of bytes copied, < 0 on error (-EINVAL for incorrect
		arguments, -EFAULT if memory cannot be accessed).

		::

		struct kvm_arm_copy_mte_tags {
		__u64 guest_ipa;
		__u64 length;
		void __user *addr;
		__u64 flags;
		__u64 reserved[2];
		};

		Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
		``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
		field must point to a buffer which the tags will be copied to or from.

		``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
		``KVM_ARM_TAGS_FROM_GUEST``.

		The size of the buffer to store the tags is ``(length / 16)`` bytes
		(granules in MTE are 16 bytes long). Each byte contains a single tag
		value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
		``PTRACE_POKEMTETAGS``.

		If an error occurs before any data is copied then a negative error code is
		returned. If some tags have been copied before an error occurs then the number
		of bytes successfully copied is returned. If the call completes successfully
		then ``length`` is returned.
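The alignment and buffer-size rules above can be sketched as a small helper. This is only an illustration, not part of the KVM API: the function name ``mte_tag_buffer_size`` and the explicit ``page_size`` parameter are assumptions made for the example.

```c
#include <stdint.h>

/* MTE granule size in bytes, as stated in the text above. */
#define MTE_GRANULE_SIZE 16

/*
 * Hypothetical helper: return the number of bytes the tag buffer needs
 * for a copy covering `length` bytes of guest memory, or 0 if
 * `guest_ipa` or `length` is not page aligned (arguments the ioctl is
 * documented to reject). `page_size` must be a power of two.
 */
uint64_t mte_tag_buffer_size(uint64_t guest_ipa, uint64_t length,
                             uint64_t page_size)
{
    uint64_t mask = page_size - 1;

    if ((guest_ipa & mask) || (length & mask))
        return 0;

    /* One tag byte per 16-byte granule. */
    return length / MTE_GRANULE_SIZE;
}
```

With a 4 KiB page size, a single page of guest memory therefore needs a 256-byte tag buffer.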

		4.131 KVM_GET_SREGS2
		--------------------

		:Capability: KVM_CAP_SREGS2
		:Architectures: x86
		:Type: vcpu ioctl
		:Parameters: struct kvm_sregs2 (out)
		:Returns: 0 on success, -1 on error

		Reads special registers from the vcpu.
		This ioctl (when supported) replaces KVM_GET_SREGS.

		::

		struct kvm_sregs2 {
		/* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
		struct kvm_segment cs, ds, es, fs, gs, ss;
		struct kvm_segment tr, ldt;
		struct kvm_dtable gdt, idt;
		__u64 cr0, cr2, cr3, cr4, cr8;
		__u64 efer;
		__u64 apic_base;
		__u64 flags;
		__u64 pdptrs[4];
		};

		flags values for ``kvm_sregs2``:

		``KVM_SREGS2_FLAGS_PDPTRS_VALID``

		Indicates that the struct contains valid PDPTR values.


		4.132 KVM_SET_SREGS2
		--------------------

		:Capability: KVM_CAP_SREGS2
		:Architectures: x86
		:Type: vcpu ioctl
		:Parameters: struct kvm_sregs2 (in)
		:Returns: 0 on success, -1 on error

		Writes special registers into the vcpu.
		See KVM_GET_SREGS2 for the data structures.
		This ioctl (when supported) replaces KVM_SET_SREGS.

		4.133 KVM_GET_STATS_FD
		----------------------

		:Capability: KVM_CAP_STATS_BINARY_FD
		:Architectures: all
		:Type: vm ioctl, vcpu ioctl
		:Parameters: none
		:Returns: statistics file descriptor on success, < 0 on error

		Errors:

		====== ======================================================
		ENOMEM if the fd could not be created due to lack of memory
		EMFILE if the number of opened files exceeds the limit
		====== ======================================================

		The returned file descriptor can be used to read VM/vCPU statistics data in
		binary format. The data in the file descriptor consists of four blocks
		organized as follows:

		+-------------+
		|   Header    |
		+-------------+
		|  id string  |
		+-------------+
		| Descriptors |
		+-------------+
		| Stats Data  |
		+-------------+

		Apart from the header starting at offset 0, please be aware that it is
		not guaranteed that the four blocks are adjacent or in the above order;
		the offsets of the id, descriptors and data blocks are found in the
		header. However, all four blocks are aligned to 64 bit offsets in the
		file and they do not overlap.

		All blocks except the data block are immutable. Userspace needs to read them
		only once after retrieving the file descriptor, and can then use ``pread`` or
		``lseek`` to read the statistics repeatedly.

		All data is in system endianness.

		The format of the header is as follows::

		struct kvm_stats_header {
		__u32 flags;
		__u32 name_size;
		__u32 num_desc;
		__u32 id_offset;
		__u32 desc_offset;
		__u32 data_offset;
		};

		The ``flags`` field is not used at the moment. It is always read as 0.

		The ``name_size`` field is the size (in bytes) of the statistics name string
		(including trailing '\0') which is contained in the "id string" block and
		appended at the end of every descriptor.

		The ``num_desc`` field is the number of descriptors that are included in the
		descriptor block. (The actual number of values in the data block may be
		larger, since each descriptor may comprise more than one value).

		The ``id_offset`` field is the offset of the id string from the start of the
		file indicated by the file descriptor. It is a multiple of 8.

		The ``desc_offset`` field is the offset of the Descriptors block from the start
		of the file indicated by the file descriptor. It is a multiple of 8.

		The ``data_offset`` field is the offset of the Stats Data block from the start
		of the file indicated by the file descriptor. It is a multiple of 8.
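The offset rules above lend themselves to a quick sanity check after the header has been read. The following is a minimal sketch: ``struct stats_header`` is a local mirror of the documented layout, and ``stats_header_is_sane`` is a hypothetical helper name, not a kernel or libc function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of struct kvm_stats_header as documented above. */
struct stats_header {
    uint32_t flags;
    uint32_t name_size;
    uint32_t num_desc;
    uint32_t id_offset;
    uint32_t desc_offset;
    uint32_t data_offset;
};

/*
 * Hypothetical sanity check: each of the three offsets must be a
 * multiple of 8 and, because the header starts at offset 0 and blocks
 * do not overlap, must lie at or beyond the end of the header.
 */
bool stats_header_is_sane(const struct stats_header *h)
{
    const uint32_t offs[3] = { h->id_offset, h->desc_offset, h->data_offset };

    for (int i = 0; i < 3; i++) {
        if (offs[i] % 8 != 0)
            return false;
        if (offs[i] < sizeof(*h))
            return false;
    }
    return true;
}
```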

		The id string block contains a string which identifies the file descriptor on
		which KVM_GET_STATS_FD was invoked. The size of the block, including the
		trailing ``'\0'``, is indicated by the ``name_size`` field in the header.

		The descriptors block needs to be read only once for the lifetime of the
		file descriptor. It contains a sequence of ``struct kvm_stats_desc``, each
		followed by a string of size ``name_size``.

		#define KVM_STATS_TYPE_SHIFT 0
		#define KVM_STATS_TYPE_MASK (0xF << KVM_STATS_TYPE_SHIFT)
		#define KVM_STATS_TYPE_CUMULATIVE (0x0 << KVM_STATS_TYPE_SHIFT)
		#define KVM_STATS_TYPE_INSTANT (0x1 << KVM_STATS_TYPE_SHIFT)
		#define KVM_STATS_TYPE_PEAK (0x2 << KVM_STATS_TYPE_SHIFT)

		#define KVM_STATS_UNIT_SHIFT 4
		#define KVM_STATS_UNIT_MASK (0xF << KVM_STATS_UNIT_SHIFT)
		#define KVM_STATS_UNIT_NONE (0x0 << KVM_STATS_UNIT_SHIFT)
		#define KVM_STATS_UNIT_BYTES (0x1 << KVM_STATS_UNIT_SHIFT)
		#define KVM_STATS_UNIT_SECONDS (0x2 << KVM_STATS_UNIT_SHIFT)
		#define KVM_STATS_UNIT_CYCLES (0x3 << KVM_STATS_UNIT_SHIFT)

		#define KVM_STATS_BASE_SHIFT 8
		#define KVM_STATS_BASE_MASK (0xF << KVM_STATS_BASE_SHIFT)
		#define KVM_STATS_BASE_POW10 (0x0 << KVM_STATS_BASE_SHIFT)
		#define KVM_STATS_BASE_POW2 (0x1 << KVM_STATS_BASE_SHIFT)

		struct kvm_stats_desc {
		__u32 flags;
		__s16 exponent;
		__u16 size;
		__u32 offset;
		__u32 unused;
		char name[];
		};

		The ``flags`` field contains the type and unit of the statistics data described
		by this descriptor. Its endianness is CPU native.
		The following flags are supported:

		Bits 0-3 of ``flags`` encode the type:
		* ``KVM_STATS_TYPE_CUMULATIVE``
		The statistics data is cumulative. The value of data can only be increased.
		Most of the counters used in KVM are of this type.
		The corresponding ``size`` field for this type is always 1.
		All cumulative statistics data are read/write.
		* ``KVM_STATS_TYPE_INSTANT``
		The statistics data is instantaneous. Its value can be increased or
		decreased. This type is usually used as a measurement of some resources,
		like the number of dirty pages, the number of large pages, etc.
		All instant statistics are read only.
		The corresponding ``size`` field for this type is always 1.
		* ``KVM_STATS_TYPE_PEAK``
		The statistics data is peak. The value of data can only be increased, and
		represents a peak value for a measurement, for example the maximum number
		of items in a hash table bucket, the longest time waited and so on.
		The corresponding ``size`` field for this type is always 1.

		Bits 4-7 of ``flags`` encode the unit:
		* ``KVM_STATS_UNIT_NONE``
		There is no unit for the value of statistics data. This usually means that
		the value is a simple counter of an event.
		* ``KVM_STATS_UNIT_BYTES``
		It indicates that the statistics data is used to measure memory size, in the
		unit of Byte, KiByte, MiByte, GiByte, etc. The unit of the data is
		determined by the ``exponent`` field in the descriptor.
		* ``KVM_STATS_UNIT_SECONDS``
		It indicates that the statistics data is used to measure time or latency.
		* ``KVM_STATS_UNIT_CYCLES``
		It indicates that the statistics data is used to measure CPU clock cycles.

		Bits 8-11 of ``flags``, together with ``exponent``, encode the scale of the
		unit:
		* ``KVM_STATS_BASE_POW10``
		The scale is based on power of 10. It is used for measurement of time and
		CPU clock cycles. For example, an exponent of -9 can be used with
		``KVM_STATS_UNIT_SECONDS`` to express that the unit is nanoseconds.
		* ``KVM_STATS_BASE_POW2``
		The scale is based on power of 2. It is used for measurement of memory size.
		For example, an exponent of 20 can be used with ``KVM_STATS_UNIT_BYTES`` to
		express that the unit is MiB.
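The three bit fields described above can be separated with the masks from the definitions. This is a sketch under stated assumptions: the macros are copied from the listing above so that the example is self-contained, while ``struct stats_flags`` and ``stats_decode_flags`` are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* Copied from the definitions above (only the parts used here). */
#define KVM_STATS_TYPE_SHIFT    0
#define KVM_STATS_TYPE_MASK     (0xF << KVM_STATS_TYPE_SHIFT)
#define KVM_STATS_TYPE_PEAK     (0x2 << KVM_STATS_TYPE_SHIFT)
#define KVM_STATS_UNIT_SHIFT    4
#define KVM_STATS_UNIT_MASK     (0xF << KVM_STATS_UNIT_SHIFT)
#define KVM_STATS_UNIT_SECONDS  (0x2 << KVM_STATS_UNIT_SHIFT)
#define KVM_STATS_BASE_SHIFT    8
#define KVM_STATS_BASE_MASK     (0xF << KVM_STATS_BASE_SHIFT)
#define KVM_STATS_BASE_POW10    (0x0 << KVM_STATS_BASE_SHIFT)
#define KVM_STATS_BASE_POW2     (0x1 << KVM_STATS_BASE_SHIFT)

/* Hypothetical container for the decoded fields (still shifted in place). */
struct stats_flags {
    uint32_t type;  /* compare against KVM_STATS_TYPE_* */
    uint32_t unit;  /* compare against KVM_STATS_UNIT_* */
    uint32_t base;  /* compare against KVM_STATS_BASE_* */
};

/* Split a descriptor's flags word into type, unit and base. */
struct stats_flags stats_decode_flags(uint32_t flags)
{
    struct stats_flags f = {
        .type = flags & KVM_STATS_TYPE_MASK,
        .unit = flags & KVM_STATS_UNIT_MASK,
        .base = flags & KVM_STATS_BASE_MASK,
    };
    return f;
}
```

A descriptor whose decoded unit is ``KVM_STATS_UNIT_SECONDS`` and base is ``KVM_STATS_BASE_POW10``, with an ``exponent`` of -9, would then be interpreted as a value in nanoseconds.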

		The ``size`` field is the number of values of this statistics data. Its
		value is 1 for most simple statistics, meaning that the data consists of a
		single unsigned 64-bit value.

		The ``offset`` field is the offset from the start of Data Block to the start of
		the corresponding statistics data.

		The ``unused`` field is reserved for future support for other types of
		statistics data, like log/linear histogram. Its value is always 0 for the types
		defined above.

		The ``name`` field is the name string of the statistics data. The name string
		starts at the end of ``struct kvm_stats_desc``. The maximum length, including
		the trailing ``'\0'``, is indicated by ``name_size`` in the header.

		The Stats Data block contains an array of 64-bit values in the same order
		as the descriptors in the Descriptors block.
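Putting the header, descriptor and data-block layout together, locating one statistic could look roughly like the sketch below. It operates on an in-memory copy of the header and descriptors blocks rather than on a live file descriptor. The struct names are local mirrors of the documented layouts, and ``stats_value_offset`` is a hypothetical helper, not an existing API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the structures documented above. */
struct stats_file_header {
    uint32_t flags, name_size, num_desc;
    uint32_t id_offset, desc_offset, data_offset;
};

struct stats_desc {
    uint32_t flags;
    int16_t  exponent;
    uint16_t size;
    uint32_t offset;
    uint32_t unused;
    char     name[];   /* name_size bytes follow each descriptor */
};

/*
 * Hypothetical lookup: `buf` holds the first `len` bytes of the stats
 * file (header plus descriptors). Returns the file offset of the first
 * 64-bit value of the statistic called `name`, or -1 if it is not
 * found or the buffer is too short.
 */
int64_t stats_value_offset(const uint8_t *buf, size_t len, const char *name)
{
    struct stats_file_header h;
    struct stats_desc d;

    if (len < sizeof(h))
        return -1;
    memcpy(&h, buf, sizeof(h));

    /* Descriptors are laid out back to back, each followed by its name. */
    size_t stride = sizeof(d) + h.name_size;

    for (uint32_t i = 0; i < h.num_desc; i++) {
        size_t pos = h.desc_offset + (size_t)i * stride;

        if (pos + stride > len)
            return -1;
        memcpy(&d, buf + pos, sizeof(d));

        const char *desc_name = (const char *)(buf + pos + sizeof(d));
        if (strncmp(desc_name, name, h.name_size) == 0)
            return (int64_t)h.data_offset + d.offset;
    }
    return -1;
}
```

The returned offset is the position a caller would pass to ``pread`` on the statistics file descriptor to fetch the current value; since the header and descriptors are immutable, the lookup only has to be done once.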

		5. The kvm_run structure
		========================

		@@ -6323,6 +6582,7 @@ KVM_RUN_BUS_LOCK flag is used to distinguish between them.
		This capability can be used to check / enable 2nd DAWR feature provided
		by POWER10 processor.


		7.24 KVM_CAP_VM_COPY_ENC_CONTEXT_FROM
		-------------------------------------

		@@ -6362,6 +6622,66 @@ default.

		See Documentation/x86/sgx/2.Kernel-internals.rst for more details.

		7.26 KVM_CAP_PPC_RPT_INVALIDATE
		-------------------------------

		:Capability: KVM_CAP_PPC_RPT_INVALIDATE
		:Architectures: ppc
		:Type: vm

		This capability indicates that the kernel is capable of handling the
		H_RPT_INVALIDATE hcall.

		In order to enable the use of H_RPT_INVALIDATE in the guest,
		user space might have to advertise it for the guest. For example,
		IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is
		present in the "ibm,hypertas-functions" device-tree property.

		This capability is enabled for hypervisors on platforms like POWER9
		that support radix MMU.

		7.27 KVM_CAP_EXIT_ON_EMULATION_FAILURE
		--------------------------------------

		:Architectures: x86
		:Parameters: args[0] whether the feature should be enabled or not

		When this capability is enabled, an emulation failure will result in an exit
		to userspace with KVM_INTERNAL_ERROR (except when the emulator was invoked
		to handle a VMware backdoor instruction). Furthermore, KVM will now provide up
		to 15 instruction bytes for any exit to userspace resulting from an emulation
		failure. When these exits to userspace occur, use the emulation_failure struct
		instead of the internal struct. They both have the same layout, but the
		emulation_failure struct matches the content better. It also explicitly
		defines the 'flags' field, which describes which fields in the struct are
		valid (i.e., if KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES is
		set in the 'flags' field, then both 'insn_size' and 'insn_bytes' have valid
		data in them).

		7.28 KVM_CAP_ARM_MTE
		--------------------

		:Architectures: arm64
		:Parameters: none

		This capability indicates that KVM (and the hardware) supports exposing the
		Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
		VMM before creating any VCPUs to allow the guest access. Note that MTE is only
		available to a guest running in AArch64 mode and enabling this capability will
		cause attempts to create AArch32 VCPUs to fail.

		When enabled the guest is able to access tags associated with any memory given
		to the guest. KVM will ensure that the tags are maintained during swap or
		hibernation of the host; however the VMM needs to manually save/restore the
		tags as appropriate if the VM is migrated.

		When this capability is enabled, all memory in memslots must be mapped as
		not-shareable (no MAP_SHARED); attempts to create a memslot with a
		MAP_SHARED mmap will result in an -EINVAL return.

		When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
		perform a bulk copy of tags to/from the guest.

		8. Other capabilities.
		======================

		@@ -6891,3 +7211,33 @@ This capability is always enabled.
		This capability indicates that the KVM virtual PTP service is
		supported in the host. A VMM can check whether the service is
		available to the guest on migration.

		8.33 KVM_CAP_HYPERV_ENFORCE_CPUID
		---------------------------------

		:Architectures: x86

		When enabled, KVM will disable emulated Hyper-V features provided to the
		guest according to the bits in the Hyper-V CPUID feature leaves. Otherwise, all
		currently implemented Hyper-V features are provided unconditionally when
		Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001)
		leaf.

		8.34 KVM_CAP_EXIT_HYPERCALL
		---------------------------

		:Capability: KVM_CAP_EXIT_HYPERCALL
		:Architectures: x86
		:Type: vm

		This capability, if enabled, will cause KVM to exit to userspace
		with KVM_EXIT_HYPERCALL exit reason to process some hypercalls.

		Calling KVM_CHECK_EXTENSION for this capability will return a bitmask
		of hypercalls that can be configured to exit to userspace.
		Right now, the only such hypercall is KVM_HC_MAP_GPA_RANGE.

		The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
		of the result of KVM_CHECK_EXTENSION. KVM will forward to userspace
		the hypercalls whose corresponding bit is in the argument, and return
		ENOSYS for the others.
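The subset requirement on the KVM_ENABLE_CAP argument reduces to a single bitwise test. A minimal sketch follows, where ``exit_hypercall_mask_ok`` is a hypothetical helper name and both masks are assumed to have been obtained by the caller (``supported`` from KVM_CHECK_EXTENSION, ``requested`` being the value userspace intends to pass):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical pre-check before calling KVM_ENABLE_CAP: the request is
 * acceptable only if every bit set in `requested` is also set in
 * `supported`, i.e. `requested` is a subset of `supported`.
 */
bool exit_hypercall_mask_ok(uint64_t supported, uint64_t requested)
{
    return (requested & ~supported) == 0;
}
```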

Documentation/virt/kvm/cpuid.rst

+7 −0

		@@ -96,6 +96,13 @@ KVM_FEATURE_MSI_EXT_DEST_ID 15 guest checks this feature bit
		before using extended destination
		ID bits in MSI address bits 11-5.

		KVM_FEATURE_HC_MAP_GPA_RANGE 16 guest checks this feature bit before
		using the map gpa range hypercall
		to notify the page state change

		KVM_FEATURE_MIGRATION_CONTROL 17 guest checks this feature bit before
		using MSR_KVM_MIGRATION_CONTROL

		KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side
		per-cpu warps are expected in
		kvmclock

Documentation/virt/kvm/hypercalls.rst

+21 −0

		@@ -169,3 +169,24 @@ a0: destination APIC ID

		:Usage example: When sending a call-function IPI-many to vCPUs, yield if
		any of the IPI target vCPUs was preempted.

		8. KVM_HC_MAP_GPA_RANGE
		-------------------------
		:Architecture: x86
		:Status: active
		:Purpose: Request KVM to map a GPA range with the specified attributes.

		a0: the guest physical address of the start page
		a1: the number of (4kb) pages (must be contiguous in GPA space)
		a2: attributes

		Where 'attributes' :
		* bits 3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
		* bit 4 - plaintext = 0, encrypted = 1
		* bits 63:5 - reserved (must be zero)

		Implementation note: this hypercall is implemented in userspace via
		the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability
		before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID. In
		addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace
		must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.
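Since the hypercall is handled in userspace, a VMM has to decode the ``a2`` attributes itself. The bit layout listed above could be unpacked as in the following sketch; ``struct gpa_range_attrs`` and ``decode_gpa_range_attrs`` are hypothetical names chosen for the example, not kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded form of the a2 'attributes' argument. */
struct gpa_range_attrs {
    unsigned page_size_enc;  /* bits 3:0: 0 = 4kb, 1 = 2mb, 2 = 1gb, ... */
    bool encrypted;          /* bit 4: plaintext = 0, encrypted = 1 */
};

/*
 * Decode a2 into `out`. Returns false, leaving `out` untouched, if any
 * reserved bit (63:5) is set.
 */
bool decode_gpa_range_attrs(uint64_t a2, struct gpa_range_attrs *out)
{
    if (a2 >> 5)
        return false;

    out->page_size_enc = (unsigned)(a2 & 0xf);
    out->encrypted = (a2 >> 4) & 1;
    return true;
}
```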

Documentation/virt/kvm/locking.rst

+5 −0

		@@ -16,6 +16,11 @@ The acquisition orders for mutexes are as follows:
		- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
		them together is quite rare.

		- Unlike kvm->slots_lock, kvm->slots_arch_lock is released before
		synchronize_srcu(&kvm->srcu). Therefore kvm->slots_arch_lock
		can be taken inside a kvm->srcu read-side critical section,
		while kvm->slots_lock cannot.

		On x86:

		- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock

Documentation/virt/kvm/mmu.rst

+2 −5

		@@ -180,8 +180,8 @@ Shadow pages contain the following information:
		role.gpte_is_8_bytes:
		Reflects the size of the guest PTE for which the page is valid, i.e. '1'
		if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
		role.nxe:
		Contains the value of efer.nxe for which the page is valid.
		role.efer_nx:
		Contains the value of efer.nx for which the page is valid.
		role.cr0_wp:
		Contains the value of cr0.wp for which the page is valid.
		role.smep_andnot_wp:
		@@ -192,9 +192,6 @@ Shadow pages contain the following information:
		Contains the value of cr4.smap && !cr0.wp for which the page is valid
		(pages for which this is true are different from other pages; see the
		treatment of cr0.wp=0 below).
		role.ept_sp:
		This is a virtual flag to denote a shadowed nested EPT page. ept_sp
		is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
		role.smm:
		Is 1 if the page is valid in system management mode. This field
		determines which of the kvm_memslots array was used to build this