Commit 63f4b210 authored by Paolo Bonzini

Merge remote-tracking branch 'kvm/next' into kvm-next-5.20

KVM/s390, KVM/x86 and common infrastructure changes for 5.20

x86:

* Permit guests to ignore single-bit ECC errors

* Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache

* Intel IPI virtualization

* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS

* PEBS virtualization

* Simplify PMU emulation by just using PERF_TYPE_RAW events

* More accurate event reinjection on SVM (avoid retrying instructions)

* Allow getting/setting the state of the speaker port data bit

* Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent

* "Notify" VM exit (detect microarchitectural hangs) for Intel

* Cleanups for MCE MSR emulation

s390:

* add an interface to provide a hypervisor dump for secure guests

* improve selftests to use TAP interface

* enable interpretive execution of zPCI instructions (for PCI passthrough)

* First part of deferred teardown

* CPU Topology

* PV attestation

* Minor fixes

Generic:

* new selftests API using struct kvm_vcpu instead of a (vm, id) tuple

x86:

* Use try_cmpxchg64 instead of cmpxchg64

* Bugfixes

* Ignore benign host accesses to PMU MSRs when PMU is disabled

* Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior

* x86/MMU: Allow NX huge pages to be disabled on a per-vm basis

* Port eager page splitting to shadow MMU as well

* Enable CMCI capability by default and handle injected UCNA errors

* Expose pid of vcpu threads in debugfs

* x2AVIC support for AMD

* cleanup PIO emulation

* Fixes for LLDT/LTR emulation

* Don't require refcounted "struct page" to create huge SPTEs

x86 cleanups:

* Use separate namespaces for guest PTEs and shadow PTEs bitmasks

* PIO emulation

* Reorganize rmap API, mostly around rmap destruction

* Do not workaround very old KVM bugs for L0 that runs with nesting enabled

* new selftests API for CPUID
parents 2e2e9115 7edc3a68
+1 −2
@@ -2418,8 +2418,7 @@
			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
			cleared.

			Eager page splitting currently only supports splitting
			huge pages mapped by the TDP MMU.
			Eager page splitting is only supported when kvm.tdp_mmu=Y.

			Default is Y (on).

+341 −3
@@ -1150,6 +1150,10 @@ The following bits are defined in the flags field:
  fields contain a valid state. This bit will be set whenever
  KVM_CAP_EXCEPTION_PAYLOAD is enabled.

- KVM_VCPUEVENT_VALID_TRIPLE_FAULT may be set to signal that the
  triple_fault_pending field contains a valid state. This bit will
  be set whenever KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled.

ARM64:
^^^^^^

@@ -1245,6 +1249,10 @@ can be set in the flags field to signal that the
exception_has_payload, exception_payload, and exception.pending fields
contain a valid state and shall be written into the VCPU.

If KVM_CAP_X86_TRIPLE_FAULT_EVENT is enabled, KVM_VCPUEVENT_VALID_TRIPLE_FAULT
can be set in the flags field to signal that the triple_fault field contains
a valid state and shall be written into the VCPU.

ARM64:
^^^^^^

@@ -2999,6 +3007,8 @@ Valid flags are::

  /* disable PIT in HPET legacy mode */
  #define KVM_PIT_FLAGS_HPET_LEGACY     0x00000001
  /* speaker port data bit enabled */
  #define KVM_PIT_FLAGS_SPEAKER_DATA_ON 0x00000002

This IOCTL replaces the obsolete KVM_GET_PIT.

@@ -5127,7 +5137,15 @@ into ESA mode. This reset is a superset of the initial reset.
	__u32 reserved[3];
  };

cmd values:
**Ultravisor return codes**
The Ultravisor return (reason) codes are provided by the kernel if an
Ultravisor call has been executed to achieve the results expected by
the command. Therefore they are independent of the IOCTL return
code. If KVM changes `rc`, its value will always be greater than 0,
so setting it to 0 before issuing a PV command is advised in order to
be able to detect a change of `rc`.

**cmd values:**

KVM_PV_ENABLE
  Allocate memory and register the VM with the Ultravisor, thereby
@@ -5143,7 +5161,6 @@ KVM_PV_ENABLE
  =====      =============================

KVM_PV_DISABLE

  Deregister the VM from the Ultravisor and reclaim the memory that
  had been donated to the Ultravisor, making it usable by the kernel
  again.  All registered VCPUs are converted back to non-protected
@@ -5160,6 +5177,117 @@ KVM_PV_VM_VERIFY
  Verify the integrity of the unpacked image. Only if this succeeds,
  KVM is allowed to start protected VCPUs.

KVM_PV_INFO
  :Capability: KVM_CAP_S390_PROTECTED_DUMP

  Presents an API that provides Ultravisor related data to userspace
  via subcommands. len_max is the size of the user space buffer,
  len_written is KVM's indication of how many bytes of that buffer
  were actually written to. len_written can be used to determine the
  valid fields if more response fields are added in the future.

  ::

     enum pv_cmd_info_id {
	KVM_PV_INFO_VM,
	KVM_PV_INFO_DUMP,
     };

     struct kvm_s390_pv_info_header {
	__u32 id;
	__u32 len_max;
	__u32 len_written;
	__u32 reserved;
     };

     struct kvm_s390_pv_info {
	struct kvm_s390_pv_info_header header;
	struct kvm_s390_pv_info_dump dump;
	struct kvm_s390_pv_info_vm vm;
     };

**subcommands:**

  KVM_PV_INFO_VM
    This subcommand provides basic Ultravisor information for PV
    hosts. These values are likely also exported as files in the sysfs
    firmware UV query interface but they are more easily available to
    programs in this API.

    The installed calls and feature_indication members provide the
    installed UV calls and the UV's other feature indications.

    The max_* members provide information about the maximum number of PV
    vcpus, PV guests and PV guest memory size.

    ::

      struct kvm_s390_pv_info_vm {
	__u64 inst_calls_list[4];
	__u64 max_cpus;
	__u64 max_guests;
	__u64 max_guest_addr;
	__u64 feature_indication;
      };


  KVM_PV_INFO_DUMP
    This subcommand provides information related to dumping PV guests.

    ::

      struct kvm_s390_pv_info_dump {
	__u64 dump_cpu_buffer_len;
	__u64 dump_config_mem_buffer_per_1m;
	__u64 dump_config_finalize_len;
      };

KVM_PV_DUMP
  :Capability: KVM_CAP_S390_PROTECTED_DUMP

  Presents an API that provides calls which facilitate dumping a
  protected VM.

  ::

    struct kvm_s390_pv_dmp {
      __u64 subcmd;
      __u64 buff_addr;
      __u64 buff_len;
      __u64 gaddr;		/* For dump storage state */
    };

  **subcommands:**

  KVM_PV_DUMP_INIT
    Initializes the dump process of a protected VM. If this call does
    not succeed all other subcommands will fail with -EINVAL. This
    subcommand will return -EINVAL if a dump process has not yet been
    completed.

    Not all PV VMs can be dumped; the owner needs to set the `dump
    allowed` PCF bit 34 in the SE header to allow dumping.

  KVM_PV_DUMP_CONFIG_STOR_STATE
     Stores `buff_len` bytes of tweak component values starting with
     the 1MB block specified by the absolute guest address
     (`gaddr`). `buff_len` needs to be `conf_dump_storage_state_len`
     aligned and at least as large as the `conf_dump_storage_state_len`
     value provided by the dump uv_info data. buff_user might be written
     to even if an error rc is returned, for instance if a fault is
     encountered after writing the first page of data.

  KVM_PV_DUMP_COMPLETE
    If the subcommand succeeds it completes the dump process and lets
    KVM_PV_DUMP_INIT be called again.

    On success `conf_dump_finalize_len` bytes of completion data will be
    stored to the `buff_addr`. The completion data contains a key
    derivation seed, IV, tweak nonce and encryption keys as well as an
    authentication tag all of which are needed to decrypt the dump at a
    later time.
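
The length rule for KVM_PV_DUMP_CONFIG_STOR_STATE above can be checked in
userspace before issuing the subcommand. The following is a minimal sketch,
not kernel code: the helper name `stor_state_len_ok` is hypothetical, and it
assumes that "aligned" means "an integer multiple of" the per-block length
reported by the dump info data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: check a buffer length against the storage state
 * length unit reported by the dump info data.
 * Assumption: "aligned" is read as "integer multiple of unit_len".
 */
static inline bool stor_state_len_ok(uint64_t buff_len, uint64_t unit_len)
{
	/* The unit comes from the Ultravisor info and must be nonzero. */
	assert(unit_len != 0);

	/* At least one unit, and no partial unit at the end. */
	return buff_len >= unit_len && buff_len % unit_len == 0;
}
```

A caller would typically round its buffer size down to a multiple of the
unit before filling in `buff_len`.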


4.126 KVM_X86_SET_MSR_FILTER
----------------------------

@@ -5811,6 +5939,78 @@ of CPUID leaf 0xD on the host.

This ioctl injects an event channel interrupt directly to the guest vCPU.

4.136 KVM_S390_PV_CPU_COMMAND
-----------------------------

:Capability: KVM_CAP_S390_PROTECTED_DUMP
:Architectures: s390
:Type: vcpu ioctl
:Parameters: none
:Returns: 0 on success, < 0 on error

This ioctl closely mirrors `KVM_S390_PV_COMMAND` but handles requests
for vcpus. It re-uses the kvm_s390_pv_dmp struct and hence also shares
the command ids.

**command:**

KVM_PV_DUMP
  Presents an API that provides calls which facilitate dumping a vcpu
  of a protected VM.

**subcommand:**

KVM_PV_DUMP_CPU
  Provides encrypted dump data like register values.
  The length of the returned data is provided by uv_info.guest_cpu_stor_len.

4.137 KVM_S390_ZPCI_OP
----------------------

:Capability: KVM_CAP_S390_ZPCI_OP
:Architectures: s390
:Type: vm ioctl
:Parameters: struct kvm_s390_zpci_op (in)
:Returns: 0 on success, <0 on error

Used to manage hardware-assisted virtualization features for zPCI devices.

Parameters are specified via the following structure::

  struct kvm_s390_zpci_op {
	/* in */
	__u32 fh;		/* target device */
	__u8  op;		/* operation to perform */
	__u8  pad[3];
	union {
		/* for KVM_S390_ZPCIOP_REG_AEN */
		struct {
			__u64 ibv;	/* Guest addr of interrupt bit vector */
			__u64 sb;	/* Guest addr of summary bit */
			__u32 flags;
			__u32 noi;	/* Number of interrupts */
			__u8 isc;	/* Guest interrupt subclass */
			__u8 sbo;	/* Offset of guest summary bit vector */
			__u16 pad;
		} reg_aen;
		__u64 reserved[8];
	} u;
  };

The type of operation is specified in the "op" field.
KVM_S390_ZPCIOP_REG_AEN is used to register the VM for adapter event
notification interpretation, which will allow firmware delivery of adapter
events directly to the vm, with KVM providing a backup delivery mechanism;
KVM_S390_ZPCIOP_DEREG_AEN is used to subsequently disable interpretation of
adapter event notifications.

The target zPCI function must also be specified via the "fh" field.  For the
KVM_S390_ZPCIOP_REG_AEN operation, additional information to establish firmware
delivery must be provided via the "reg_aen" struct.

The "pad" and "reserved" fields may be used for future extensions and should be
set to 0s by userspace.
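
As a minimal sketch of the zeroing requirement above, the structure can be
cleared as a whole before the caller fills in the fields it needs. The struct
below is a local mirror of the layout shown above using `<stdint.h>` types
instead of the kernel's `__u*` types, and `zpci_op_init` is a hypothetical
helper name; the numeric values of the op codes are not given in this text,
so the op is passed in by the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the documented layout (assumption: same field order). */
struct kvm_s390_zpci_op {
	uint32_t fh;		/* target device */
	uint8_t  op;		/* operation to perform */
	uint8_t  pad[3];
	union {
		struct {
			uint64_t ibv;
			uint64_t sb;
			uint32_t flags;
			uint32_t noi;
			uint8_t  isc;
			uint8_t  sbo;
			uint16_t pad;
		} reg_aen;
		uint64_t reserved[8];
	} u;
};

/* Hypothetical helper: zero everything, then set the common fields. */
static inline void zpci_op_init(struct kvm_s390_zpci_op *args,
				uint32_t fh, uint8_t op)
{
	assert(args != NULL);
	memset(args, 0, sizeof(*args));	/* clears pad and reserved too */
	args->fh = fh;
	args->op = op;
}
```

Clearing the whole structure first means that fields added in a later
extension are passed as 0 without further changes to the caller.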

5. The kvm_run structure
========================

@@ -6414,6 +6614,26 @@ array field represents return values. The userspace should update the return
values of SBI call before resuming the VCPU. For more details on RISC-V SBI
spec refer, https://github.com/riscv/riscv-sbi-doc.

::

    /* KVM_EXIT_NOTIFY */
    struct {
  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
      __u32 flags;
    } notify;

Used on x86 systems. When the VM capability KVM_CAP_X86_NOTIFY_VMEXIT is
enabled, a VM exit is generated if no event window occurs in VM non-root mode
for a specified amount of time. If KVM_X86_NOTIFY_VMEXIT_USER is set when
enabling the cap, KVM exits to userspace with the exit reason
KVM_EXIT_NOTIFY for further handling. The "flags" field contains more
detailed info.

The valid value for 'flags' is:

  - KVM_NOTIFY_CONTEXT_INVALID -- the VM context is corrupted and not valid
    in VMCS. Resuming the target VM would lead to an unknown result.

::

		/* Fix the size of the union. */
@@ -7357,8 +7577,71 @@ The valid bits in cap.args[0] are:
                                    hypercall instructions. Executing the
                                    incorrect hypercall instruction will
                                    generate a #UD within the guest.

KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS By default, KVM emulates MONITOR/MWAIT (if
                                    they are intercepted) as NOPs regardless of
                                    whether or not MONITOR/MWAIT are supported
                                    according to guest CPUID.  When this quirk
                                    is disabled and KVM_X86_DISABLE_EXITS_MWAIT
                                    is not set (MONITOR/MWAIT are intercepted),
                                    KVM will inject a #UD on MONITOR/MWAIT if
                                    they're unsupported per guest CPUID.  Note,
                                    KVM will modify MONITOR/MWAIT support in
                                    guest CPUID on writes to MISC_ENABLE if
                                    KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT is
                                    disabled.
=================================== ============================================

7.32 KVM_CAP_MAX_VCPU_ID
------------------------

:Architectures: x86
:Target: VM
:Parameters: args[0] - maximum APIC ID value set for current VM
:Returns: 0 on success, -EINVAL if args[0] is beyond KVM_MAX_VCPU_IDS
          supported in KVM or if it has been set.

This capability allows userspace to specify maximum possible APIC ID
assigned for current VM session prior to the creation of vCPUs, saving
memory for data structures indexed by the APIC ID.  Userspace is able
to calculate the limit to APIC ID values from designated
CPU topology.

The value can be changed only until KVM_ENABLE_CAP is set to a nonzero
value or until a vCPU is created.  Upon creation of the first vCPU,
if the value was set to zero or KVM_ENABLE_CAP was not invoked, KVM
uses the return value of KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPU_ID) as
the maximum APIC ID.
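
The calculation from CPU topology mentioned above can be sketched as follows.
This assumes the usual x86 layout in which each topology level occupies
ceil(log2(count)) bits of the APIC ID; the helper names are hypothetical, and
whether args[0] expects this largest ID or a limit one above it should be
checked against the KVM source.

```c
#include <assert.h>
#include <stdint.h>

/* Bits needed to encode IDs 0..count-1, i.e. ceil(log2(count)). */
static inline unsigned int id_field_width(uint32_t count)
{
	unsigned int bits = 0;

	assert(count > 0);
	while ((1ULL << bits) < count)
		bits++;
	return bits;
}

/*
 * Hypothetical helper: largest APIC ID for a topology of
 * sockets x cores-per-socket x threads-per-core.
 */
static inline uint32_t max_apic_id(uint32_t sockets, uint32_t cores,
				   uint32_t threads)
{
	unsigned int t_bits = id_field_width(threads);
	unsigned int c_bits = id_field_width(cores);

	assert(sockets > 0);
	return ((sockets - 1) << (c_bits + t_bits)) |
	       ((cores - 1) << t_bits) |
	       (threads - 1);
}
```

With non-power-of-two counts the APIC ID space is sparse, so the largest ID
can exceed the number of vCPUs: 2 sockets of 6 cores with 2 threads each give
24 vCPUs but a largest APIC ID of 27.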

7.33 KVM_CAP_X86_NOTIFY_VMEXIT
------------------------------

:Architectures: x86
:Target: VM
:Parameters: args[0] is the value of notify window as well as some flags
:Returns: 0 on success, -EINVAL if args[0] contains invalid flags or notify
          VM exit is unsupported.

Bits 63:32 of args[0] are used for notify window.
Bits 31:0 of args[0] are for some flags. Valid bits are::

  #define KVM_X86_NOTIFY_VMEXIT_ENABLED    (1 << 0)
  #define KVM_X86_NOTIFY_VMEXIT_USER       (1 << 1)

This capability allows userspace to configure the notify VM exit on/off
in per-VM scope during VM creation. Notify VM exit is disabled by default.
When userspace sets the KVM_X86_NOTIFY_VMEXIT_ENABLED bit in args[0], VMM will
enable this feature with the notify window provided, which will generate
a VM exit if no event window occurs in VM non-root mode for a specified
amount of time (notify window).

If KVM_X86_NOTIFY_VMEXIT_USER is set in args[0], KVM will exit to userspace
for handling when a notify VM exit happens.

This capability is aimed at mitigating the threat that malicious VMs can
cause a CPU to get stuck (because event windows do not open up) and make
the CPU unavailable to the host or other VMs.
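
The args[0] layout described above (window in bits 63:32, flags in bits 31:0)
can be packed with a small helper. This is a sketch for illustration:
`notify_vmexit_args0` and the two accessors are hypothetical names, and the
unit of the window value is not specified in this text.

```c
#include <assert.h>
#include <stdint.h>

#define KVM_X86_NOTIFY_VMEXIT_ENABLED	(1 << 0)
#define KVM_X86_NOTIFY_VMEXIT_USER	(1 << 1)

/* Hypothetical helper: pack the notify window and flags into args[0]. */
static inline uint64_t notify_vmexit_args0(uint32_t window, uint32_t flags)
{
	/* Only the two documented flag bits are valid. */
	assert((flags & ~(uint32_t)(KVM_X86_NOTIFY_VMEXIT_ENABLED |
				    KVM_X86_NOTIFY_VMEXIT_USER)) == 0);

	return ((uint64_t)window << 32) | flags;
}

/* Hypothetical accessors for the two halves of args[0]. */
static inline uint32_t notify_vmexit_window(uint64_t args0)
{
	return (uint32_t)(args0 >> 32);
}

static inline uint32_t notify_vmexit_flags(uint64_t args0)
{
	return (uint32_t)args0;
}
```

The resulting value would be placed in args[0] of the capability structure
passed to KVM_ENABLE_CAP on the VM file descriptor.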

8. Other capabilities.
======================

@@ -7965,6 +8248,61 @@ should adjust CPUID leaf 0xA to reflect that the PMU is disabled.
When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of
type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request.

8.37 KVM_CAP_S390_PROTECTED_DUMP
--------------------------------

:Capability: KVM_CAP_S390_PROTECTED_DUMP
:Architectures: s390
:Type: vm

This capability indicates that KVM and the Ultravisor support dumping
PV guests. The `KVM_PV_DUMP` command is available for the
`KVM_S390_PV_COMMAND` ioctl and the `KVM_PV_INFO` command provides
dump related UV data. Also the vcpu ioctl `KVM_S390_PV_CPU_COMMAND` is
available and supports the `KVM_PV_DUMP_CPU` subcommand.

8.38 KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
-------------------------------------

:Capability: KVM_CAP_VM_DISABLE_NX_HUGE_PAGES
:Architectures: x86
:Type: vm
:Parameters: args[0] must be 0.
:Returns: 0 on success, -EPERM if the userspace process does not
	  have CAP_SYS_BOOT, -EINVAL if args[0] is not 0 or any vCPUs have been
	  created.

This capability disables the NX huge pages mitigation for iTLB MULTIHIT.

The capability has no effect if the nx_huge_pages module parameter is not set.

This capability may only be set before any vCPUs are created.

8.39 KVM_CAP_S390_CPU_TOPOLOGY
------------------------------

:Capability: KVM_CAP_S390_CPU_TOPOLOGY
:Architectures: s390
:Type: vm

This capability indicates that KVM will provide the S390 CPU Topology
facility which consists of the interpretation of the PTF instruction for
the function code 2 along with interception and forwarding of both the
PTF instruction with function codes 0 or 1 and the STSI(15,1,x)
instruction to the userland hypervisor.

The stfle facility 11, CPU Topology facility, should not be indicated
to the guest without this capability.

When this capability is present, KVM provides a new attribute group
on vm fd, KVM_S390_VM_CPU_TOPOLOGY.
This new attribute allows getting, setting or clearing the Modified Change
Topology Report (MTCR) bit of the SCA through the kvm_device_attr
structure.

When getting the Modified Change Topology Report value, the attr->addr
must point to a byte where the value will be stored or retrieved from.

9. Known KVM API problems
=========================

+1 −0
@@ -10,3 +10,4 @@ KVM for s390 systems
   s390-diag
   s390-pv
   s390-pv-boot
   s390-pv-dump
+64 −0
.. SPDX-License-Identifier: GPL-2.0

===========================================
s390 (IBM Z) Protected Virtualization dumps
===========================================

Summary
-------

Dumping a VM is an essential tool for debugging problems inside
it. This is especially true when a protected VM runs into trouble as
there's no way to access its memory and registers from the outside
while it's running.

However when dumping a protected VM we need to maintain its
confidentiality until the dump is in the hands of the VM owner who
should be the only one capable of analysing it.

The confidentiality of the VM dump is ensured by the Ultravisor who
provides an interface to KVM over which encrypted CPU and memory data
can be requested. The encryption is based on the Customer
Communication Key which is the key that's used to encrypt VM data in a
way that the customer is able to decrypt.


Dump process
------------

A dump is done in 3 steps:

**Initiation**

This step initializes the dump process, generates cryptographic seeds
and extracts dump keys with which the VM dump data will be encrypted.

**Data gathering**

Currently there are two types of data that can be gathered from a VM:
the memory and the vcpu state.

The vcpu state contains all the important registers, general, floating
point, vector, control and tod/timers of a vcpu. The vcpu dump can
contain incomplete data if a vcpu is dumped while an instruction is
emulated with help of the hypervisor. This is indicated by a flag bit
in the dump data. For the same reason it is very important to not only
write out the encrypted vcpu state, but also the unencrypted state
from the hypervisor.

The memory state is further divided into the encrypted memory and its
metadata comprised of the encryption tweaks and status flags. The
encrypted memory can simply be read once it has been exported. The
time of the export does not matter as no re-encryption is
needed. Memory that has been swapped out and hence was exported can be
read from the swap and written to the dump target without need for any
special actions.

The tweaks / status flags for the exported pages need to be requested
from the Ultravisor.

**Finalization**

The finalization step will provide the data needed to be able to
decrypt the vcpu and memory data and end the dump process. When this
step completes successfully a new dump initiation can be started.
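
The ordering of the three steps can be mirrored by a small userspace tracker.
This is a sketch of the ordering rules only, with hypothetical names
throughout; it does not talk to KVM, and it assumes -EINVAL for every
out-of-order step, as the API description states for the dump subcommands.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the three steps of a dump. */
enum dump_step {
	DUMP_STEP_INIT,		/* initiation */
	DUMP_STEP_DATA,		/* data gathering (memory or vcpu state) */
	DUMP_STEP_COMPLETE,	/* finalization */
};

struct dump_tracker {
	bool in_progress;
};

/* Returns 0 if the step is allowed in the current state, else -EINVAL. */
static inline int dump_tracker_step(struct dump_tracker *t, enum dump_step step)
{
	assert(t != NULL);

	switch (step) {
	case DUMP_STEP_INIT:
		if (t->in_progress)
			return -EINVAL;	/* previous dump not completed */
		t->in_progress = true;
		return 0;
	case DUMP_STEP_DATA:
		return t->in_progress ? 0 : -EINVAL;
	case DUMP_STEP_COMPLETE:
		if (!t->in_progress)
			return -EINVAL;
		t->in_progress = false;	/* a new initiation is allowed */
		return 0;
	}
	return -EINVAL;
}
```

A VMM could keep one such tracker per protected VM to reject out-of-order
requests before issuing the corresponding ioctl.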
+1 −0
@@ -17594,6 +17594,7 @@ M: Eric Farman <farman@linux.ibm.com>
L:	linux-s390@vger.kernel.org
L:	kvm@vger.kernel.org
S:	Supported
F:	arch/s390/kvm/pci*
F:	drivers/vfio/pci/vfio_pci_zdev.c
F:	include/uapi/linux/vfio_zdev.h