!7110 [22.03-LTS-SP4] KVM TDP MMU new refactors (aedbc5c2) · Commits · EulixOS / Software / Kernel

Documentation/admin-guide/kernel-parameters.txt

+33 −5

Original line number	Diff line number	Diff line
		@@ -2408,13 +2408,34 @@
		kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs.
		Default is 0 (don't ignore, but inject #GP)

		+kvm.eager_page_split=
		+[KVM,X86] Controls whether or not KVM will try to
		+proactively split all huge pages during dirty logging.
		+Eager page splitting reduces interruptions to vCPU
		+execution by eliminating the write-protection faults
		+and MMU lock contention that would otherwise be
		+required to split huge pages lazily.
		+
		+VM workloads that rarely perform writes or that write
		+only to a small region of VM memory may benefit from
		+disabling eager page splitting to allow huge pages to
		+still be used for reads.
		+
		+The behavior of eager page splitting depends on whether
		+KVM_DIRTY_LOG_INITIALLY_SET is enabled or disabled. If
		+disabled, all huge pages in a memslot will be eagerly
		+split when dirty logging is enabled on that memslot. If
		+enabled, eager page splitting will be performed during
		+the KVM_CLEAR_DIRTY ioctl, and only for the pages being
		+cleared.
		+
		+Eager page splitting is only supported when kvm.tdp_mmu=Y.
		+
		+Default is Y (on).
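
Review note: the rules in the kvm.eager_page_split text can be summarized as a small decision function. This is only a sketch of the documented behavior, not kernel code; the enum labels and the function name `eager_split_point` are hypothetical and exist for illustration only.

```c
#include <stdbool.h>

/* Hypothetical labels for when huge pages get split; not kernel identifiers. */
enum split_point {
	SPLIT_LAZILY_ON_WRITE_FAULT,	/* no eager splitting takes place */
	SPLIT_WHEN_LOGGING_ENABLED,	/* whole memslot, when dirty logging is enabled */
	SPLIT_ON_CLEAR_DIRTY,		/* only the pages being cleared */
};

/*
 * Models the documented behavior:
 *  - eager splitting needs both kvm.eager_page_split=Y and kvm.tdp_mmu=Y;
 *  - KVM_DIRTY_LOG_INITIALLY_SET moves the split from "dirty logging
 *    enabled" to the KVM_CLEAR_DIRTY ioctl.
 */
static enum split_point eager_split_point(bool eager_page_split,
					  bool tdp_mmu,
					  bool initially_set)
{
	if (!eager_page_split || !tdp_mmu)
		return SPLIT_LAZILY_ON_WRITE_FAULT;

	return initially_set ? SPLIT_ON_CLEAR_DIRTY
			     : SPLIT_WHEN_LOGGING_ENABLED;
}
```

With the defaults described above (eager_page_split=Y) and the TDP MMU enabled, a VMM that does not use KVM_DIRTY_LOG_INITIALLY_SET lands in the "split when logging is enabled" case.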

		kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
		Default is false (don't support).

		-kvm.mmu_audit= [KVM] This is a R/W parameter which allows audit
		-KVM MMU at runtime.
		-Default is 0 (off)

		kvm.nx_huge_pages=
		[KVM] Controls the software workaround for the
		X86_BUG_ITLB_MULTIHIT bug.
		@@ -2432,7 +2453,14 @@
		[KVM] Controls how many 4KiB pages are periodically zapped
		back to huge pages. 0 disables the recovery, otherwise if
		the value is N KVM will zap 1/Nth of the 4KiB pages every
		-minute. The default is 60.
		+period (see below). The default is 60.
		+
		+kvm.nx_huge_pages_recovery_period_ms=
		+[KVM] Controls the time period at which KVM zaps 4KiB pages
		+back to huge pages. If the value is a non-zero N, KVM will
		+zap a portion (see ratio above) of the pages every N msecs.
		+If the value is 0 (the default), KVM will pick a period based
		+on the ratio, such that a page is zapped after 1 hour on average.

		kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM.
		Default is 1 (enabled)
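
Review note: the interaction between the recovery ratio and the new recovery period can be worked through with a little arithmetic. If 1/N of the pages is zapped per period and a page should be zapped after one hour on average, the period has to be one hour divided by N. The sketch below only encodes that reading of the text; the function name `recovery_period_ms` is hypothetical, and any clamping the kernel applies is not modeled.

```c
#include <stdint.h>

#define MSECS_PER_HOUR (60u * 60u * 1000u)

/*
 * ratio:     value of the recovery ratio parameter (0 disables recovery)
 * period_ms: value of kvm.nx_huge_pages_recovery_period_ms (0 = derive it)
 *
 * Returns the effective period in milliseconds, or 0 if recovery is off.
 */
static uint32_t recovery_period_ms(uint32_t ratio, uint32_t period_ms)
{
	if (ratio == 0)
		return 0;		/* recovery disabled */
	if (period_ms != 0)
		return period_ms;	/* explicit period wins */

	/* 1/ratio of the pages per period, whole set after about one hour. */
	return MSECS_PER_HOUR / ratio;
}
```

With the default ratio of 60 and the default period of 0, this gives 60000 ms, which matches the one-minute interval in the wording being replaced.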

Documentation/virt/kvm/api.rst

+8 −3

		@@ -700,9 +700,14 @@ MSRs that have been set successfully.
		Defines the vcpu responses to the cpuid instruction. Applications
		should use the KVM_SET_CPUID2 ioctl if available.

		-Note, when this IOCTL fails, KVM gives no guarantees that previous valid CPUID
		-configuration (if there is) is not corrupted. Userspace can get a copy of the
		-resulting CPUID configuration through KVM_GET_CPUID2 in case.
		+Caveat emptor:
		+- If this IOCTL fails, KVM gives no guarantees that previous valid CPUID
		+configuration (if there is) is not corrupted. Userspace can get a copy
		+of the resulting CPUID configuration through KVM_GET_CPUID2 in case.
		+- Using KVM_SET_CPUID{,2} after KVM_RUN, i.e. changing the guest vCPU model
		+after running the guest, may cause guest instability.
		+- Using heterogeneous CPUID configurations, modulo APIC IDs, topology, etc...
		+may cause guest instability.

		::

Documentation/virt/kvm/locking.rst

+6 −0

		@@ -21,6 +21,12 @@ The acquisition orders for mutexes are as follows:
		can be taken inside a kvm->srcu read-side critical section,
		while kvm->slots_lock cannot.

		+- kvm->mn_active_invalidate_count ensures that pairs of
		+invalidate_range_start() and invalidate_range_end() callbacks
		+use the same memslots array. kvm->slots_lock and kvm->slots_arch_lock
		+are taken on the waiting side in install_new_memslots, so MMU notifiers
		+must not take either kvm->slots_lock or kvm->slots_arch_lock.

		On x86:

		- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock

Documentation/virt/kvm/mmu.rst

+10 −9

		@@ -161,7 +161,7 @@ Shadow pages contain the following information:
		If clear, this page corresponds to a guest page table denoted by the gfn
		field.
		role.quadrant:
		-When role.gpte_is_8_bytes=0, the guest uses 32-bit gptes while the host uses 64-bit
		+When role.has_4_byte_gpte=1, the guest uses 32-bit gptes while the host uses 64-bit
		sptes. That means a guest page table contains more ptes than the host,
		so multiple shadow pages are needed to shadow one guest page.
		For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
		@@ -177,11 +177,11 @@ Shadow pages contain the following information:
		The page is invalid and should not be used. It is a root page that is
		currently pinned (by a cpu hardware register pointing to it); once it is
		unpinned it will be destroyed.
		-role.gpte_is_8_bytes:
		-Reflects the size of the guest PTE for which the page is valid, i.e. '1'
		-if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
		-role.nxe:
		-Contains the value of efer.nxe for which the page is valid.
		+role.has_4_byte_gpte:
		+Reflects the size of the guest PTE for which the page is valid, i.e. '0'
		+if direct map or 64-bit gptes are in use, '1' if 32-bit gptes are in use.
		+role.efer_nx:
		+Contains the value of efer.nx for which the page is valid.
		role.cr0_wp:
		Contains the value of cr0.wp for which the page is valid.
		role.smep_andnot_wp:
		@@ -192,9 +192,6 @@ Shadow pages contain the following information:
		Contains the value of cr4.smap && !cr0.wp for which the page is valid
		(pages for which this is true are different from other pages; see the
		treatment of cr0.wp=0 below).
		-role.ept_sp:
		-This is a virtual flag to denote a shadowed nested EPT page. ept_sp
		-is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
		role.smm:
		Is 1 if the page is valid in system management mode. This field
		determines which of the kvm_memslots array was used to build this
		@@ -205,6 +202,10 @@ Shadow pages contain the following information:
		Is 1 if the MMU instance cannot use A/D bits. EPT did not have A/D
		bits before Haswell; shadow EPT page tables also cannot use A/D bits
		if the L1 hypervisor does not enable them.
		+role.passthrough:
		+The page is not backed by a guest page table, but its first entry
		+points to one. This is set if NPT uses 5-level page tables (host
		+CR4.LA57=1) and is shadowing L1's 4-level NPT (L1 CR4.LA57=1).
		gfn:
		Either the guest page table containing the translations shadowed by this
		page, or the base page frame for linear translations. See role.direct.
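
Review note: the role.quadrant text says a guest page table with 32-bit gptes holds more entries than a shadow page with 64-bit sptes, so several shadow pages shadow one guest page. The sketch below works through that arithmetic for 4 KiB pages. The constants are named after kernel macros as I understand them, and the helper names are hypothetical; treat the whole block as an assumption-laden illustration rather than the kernel's implementation.

```c
#include <stdint.h>

#define PAGE_SHIFT	12
#define PT64_PT_BITS	 9	/* 512 eight-byte entries per 4 KiB page */
#define PT32_PT_BITS	10	/* 1024 four-byte entries per 4 KiB page */

/* Number of shadow pages needed to cover one guest table at this level. */
static unsigned int shadow_pages_per_guest_table(int level)
{
	return 1u << ((PT32_PT_BITS - PT64_PT_BITS) * level);
}

/*
 * Which of those shadow pages (the "quadrant") maps a given guest
 * address: take the address bits just above the range one shadow page
 * covers at this level, and keep as many as the count above requires.
 */
static unsigned int quadrant_for(uint64_t gaddr, int level)
{
	uint64_t q = gaddr >> (PAGE_SHIFT + PT64_PT_BITS * level);

	return (unsigned int)(q & (shadow_pages_per_guest_table(level) - 1));
}
```

For first-level pages this yields two shadow pages per guest page table, so the quadrant is 0 or 1, in line with the text above; under the same assumptions a second-level guest table needs four.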

arch/arm64/kvm/mmu.c

+1 −1

		@@ -502,7 +502,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
		{
		phys_addr_t addr;
		int ret = 0;
		-struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
		+struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
		struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
		enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
		KVM_PGTABLE_PROT_R |