Commit aedbc5c2 (unverified), authored by openeuler-ci-bot, committed by Gitee

!7110 [22.03-LTS-SP4] KVM TDP MMU new refactors

Merge Pull Request from: @yuzhang_intel 
 
Title: Add KVM TDP MMU new refactor code after 5.15

Content:
This PR adds the KVM TDP MMU refactoring code to openEuler-22.03-LTS-SP4. The major changes in this PR include:
- TDP MMU related fixes and optimizations
- KVM MMU unloading optimizations (to greatly reduce the chances of KVM MMU unloading for the TDP MMU)
- New workflows for KVM MMU page fault handling
- Memslot related optimizations
- TLB invalidation improvements
- Dirty logging optimization (eager page splitting)
- NX hugepage mitigation optimizations
- Miscellaneous fixes and cleanups

The conflicts are mostly due to upstream features that have not been backported, such as:
- Static calls for kvm_x86_ops
- Scalable memslots
- Miscellaneous shadow MMU refactors

Intel-kernel issue:
https://gitee.com/openeuler/intel-kernel/issues/I9HH9U

Test:
Ran all KVM selftests and kvm-unit-tests on branch intel/OLK-tdp-mmu-new-refactors-5.10 on a Skylake server, with and without the TDP MMU enabled; no new failures were found.

Known issue:
The latest code has already been integrated with the KVM dirty-ring changes (https://gitee.com/openeuler/kernel/pulls/5545). All tests have been re-run, and no new failures were found.

Default config change:
N/A 
 
Link: https://gitee.com/openeuler/kernel/pulls/7110

 

Reviewed-by: Jason Zeng <jason.zeng@intel.com>
Reviewed-by: Kevin Zhu <zhukeqian1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
parents d2efa95e 1fe859b8
+33 −5
@@ -2408,13 +2408,34 @@
	kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs.
			Default is 0 (don't ignore, but inject #GP)

	kvm.eager_page_split=
			[KVM,X86] Controls whether or not KVM will try to
			proactively split all huge pages during dirty logging.
			Eager page splitting reduces interruptions to vCPU
			execution by eliminating the write-protection faults
			and MMU lock contention that would otherwise be
			required to split huge pages lazily.

			VM workloads that rarely perform writes or that write
			only to a small region of VM memory may benefit from
			disabling eager page splitting to allow huge pages to
			still be used for reads.

			The behavior of eager page splitting depends on whether
			KVM_DIRTY_LOG_INITIALLY_SET is enabled or disabled. If
			disabled, all huge pages in a memslot will be eagerly
			split when dirty logging is enabled on that memslot. If
			enabled, eager page splitting will be performed during
			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
			cleared.

			Eager page splitting is only supported when kvm.tdp_mmu=Y.

			Default is Y (on).

	kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
				   Default is false (don't support).

	kvm.mmu_audit=	[KVM] This is a R/W parameter which allows audit
			KVM MMU at runtime.
			Default is 0 (off)

	kvm.nx_huge_pages=
			[KVM] Controls the software workaround for the
			X86_BUG_ITLB_MULTIHIT bug.
@@ -2432,7 +2453,14 @@
			[KVM] Controls how many 4KiB pages are periodically zapped
			back to huge pages.  0 disables the recovery, otherwise if
			the value is N KVM will zap 1/Nth of the 4KiB pages every
			minute.  The default is 60.
			period (see below).  The default is 60.

	kvm.nx_huge_pages_recovery_period_ms=
			[KVM] Controls the time period at which KVM zaps 4KiB pages
			back to huge pages. If the value is a non-zero N, KVM will
			zap a portion (see ratio above) of the pages every N msecs.
			If the value is 0 (the default), KVM will pick a period based
			on the ratio, such that a page is zapped after 1 hour on average.

	kvm-amd.nested=	[KVM,AMD] Allow nested virtualization in KVM/SVM.
			Default is 1 (enabled)
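
The interaction between kvm.nx_huge_pages_recovery_ratio and kvm.nx_huge_pages_recovery_period_ms described in the hunk above can be illustrated with a small sketch. It assumes that "zapped after 1 hour on average" means the default period is one hour divided by the ratio, which matches the old wording (ratio 60, one pass per minute). The function name, the macro, and the clamp that keeps the result non-zero are illustrative assumptions, not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MS_PER_HOUR (60u * 60u * 1000u)

/*
 * Hypothetical helper (not a kernel API): derive the effective recovery
 * period in milliseconds from the two module parameters.
 * Returns false when recovery is disabled (ratio == 0).
 */
bool demo_nx_recovery_period_ms(uint32_t ratio, uint32_t period_ms,
				uint32_t *out_ms)
{
	if (ratio == 0)
		return false;		/* 0 disables the recovery */

	if (period_ms != 0) {
		*out_ms = period_ms;	/* explicit period: zap every N msecs */
		return true;
	}

	/* Assumption of this sketch: clamp so the period never becomes 0. */
	if (ratio > DEMO_MS_PER_HOUR)
		ratio = DEMO_MS_PER_HOUR;

	/* Each pass zaps 1/ratio of the pages, so "ratio" passes span one hour. */
	*out_ms = DEMO_MS_PER_HOUR / ratio;
	return true;
}
```

With the default ratio of 60 and a period of 0, this yields 60000 ms, i.e. one pass per minute.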
+8 −3
@@ -700,9 +700,14 @@ MSRs that have been set successfully.
Defines the vcpu responses to the cpuid instruction.  Applications
should use the KVM_SET_CPUID2 ioctl if available.

Note, when this IOCTL fails, KVM gives no guarantees that previous valid CPUID
configuration (if there is) is not corrupted. Userspace can get a copy of the
resulting CPUID configuration through KVM_GET_CPUID2 in case.
Caveat emptor:
  - If this IOCTL fails, KVM gives no guarantees that previous valid CPUID
    configuration (if there is) is not corrupted. Userspace can get a copy
    of the resulting CPUID configuration through KVM_GET_CPUID2 in case.
  - Using KVM_SET_CPUID{,2} after KVM_RUN, i.e. changing the guest vCPU model
    after running the guest, may cause guest instability.
  - Using heterogeneous CPUID configurations, modulo APIC IDs, topology, etc...
    may cause guest instability.

::

+6 −0
@@ -21,6 +21,12 @@ The acquisition orders for mutexes are as follows:
  can be taken inside a kvm->srcu read-side critical section,
  while kvm->slots_lock cannot.

- kvm->mn_active_invalidate_count ensures that pairs of
  invalidate_range_start() and invalidate_range_end() callbacks
  use the same memslots array.  kvm->slots_lock and kvm->slots_arch_lock
  are taken on the waiting side in install_new_memslots, so MMU notifiers
  must not take either kvm->slots_lock or kvm->slots_arch_lock.

On x86:

- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
+10 −9
@@ -161,7 +161,7 @@ Shadow pages contain the following information:
    If clear, this page corresponds to a guest page table denoted by the gfn
    field.
  role.quadrant:
    When role.gpte_is_8_bytes=0, the guest uses 32-bit gptes while the host uses 64-bit
    When role.has_4_byte_gpte=1, the guest uses 32-bit gptes while the host uses 64-bit
    sptes.  That means a guest page table contains more ptes than the host,
    so multiple shadow pages are needed to shadow one guest page.
    For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
@@ -177,11 +177,11 @@ Shadow pages contain the following information:
    The page is invalid and should not be used.  It is a root page that is
    currently pinned (by a cpu hardware register pointing to it); once it is
    unpinned it will be destroyed.
  role.gpte_is_8_bytes:
    Reflects the size of the guest PTE for which the page is valid, i.e. '1'
    if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
  role.nxe:
    Contains the value of efer.nxe for which the page is valid.
  role.has_4_byte_gpte:
    Reflects the size of the guest PTE for which the page is valid, i.e. '0'
    if direct map or 64-bit gptes are in use, '1' if 32-bit gptes are in use.
  role.efer_nx:
    Contains the value of efer.nx for which the page is valid.
  role.cr0_wp:
    Contains the value of cr0.wp for which the page is valid.
  role.smep_andnot_wp:
@@ -192,9 +192,6 @@ Shadow pages contain the following information:
    Contains the value of cr4.smap && !cr0.wp for which the page is valid
    (pages for which this is true are different from other pages; see the
    treatment of cr0.wp=0 below).
  role.ept_sp:
    This is a virtual flag to denote a shadowed nested EPT page.  ept_sp
    is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
  role.smm:
    Is 1 if the page is valid in system management mode.  This field
    determines which of the kvm_memslots array was used to build this
@@ -205,6 +202,10 @@ Shadow pages contain the following information:
    Is 1 if the MMU instance cannot use A/D bits.  EPT did not have A/D
    bits before Haswell; shadow EPT page tables also cannot use A/D bits
    if the L1 hypervisor does not enable them.
  role.passthrough:
    The page is not backed by a guest page table, but its first entry
    points to one.  This is set if NPT uses 5-level page tables (host
    CR4.LA57=1) and is shadowing L1's 4-level NPT (L1 CR4.LA57=1).
  gfn:
    Either the guest page table containing the translations shadowed by this
    page, or the base page frame for linear translations.  See role.direct.
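
The role.quadrant description in this hunk follows from table sizes: a 4KiB guest page table of 4-byte gptes holds 1024 entries, while a shadow page of 8-byte sptes holds 512, so one guest table needs two shadow pages at the last level. The sketch below works through that arithmetic. All names are hypothetical, and the treatment of level 2 (quadrants 0..3) is an extrapolation of the same arithmetic that the excerpt above does not spell out.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT   12	/* 4KiB pages */
#define DEMO_SPTE_BITS     9	/* 512 8-byte sptes per shadow page */
#define DEMO_GPTE32_BITS  10	/* 1024 4-byte gptes per guest page table */

/*
 * Hypothetical helper: which shadow page (quadrant) covers guest address
 * gaddr at the given shadow level (1 = last level, 2 = the level above).
 */
unsigned int demo_quadrant(uint64_t gaddr, int level)
{
	/* One extra index bit per level that the shadow page cannot hold. */
	unsigned int extra_bits = (DEMO_GPTE32_BITS - DEMO_SPTE_BITS) * level;
	/* Drop the bits that a single shadow page at this level translates. */
	uint64_t q = gaddr >> (DEMO_PAGE_SHIFT + DEMO_SPTE_BITS * level);

	return (unsigned int)(q & ((1u << extra_bits) - 1));
}
```

At level 1 the quadrant is bit 21 of the address, so a shadow page covers 2MiB of the 4MiB mapped by one 32-bit guest page table.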
+1 −1
@@ -502,7 +502,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
{
	phys_addr_t addr;
	int ret = 0;
	struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
	struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
	struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
				     KVM_PGTABLE_PROT_R |
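
The kvm_phys_addr_ioremap() hunk above swaps a positional initializer for a designated one. A plausible motivation, which the commit text does not state, is that a positional initializer silently changes meaning if members are added or reordered, whereas a designated initializer names the member it sets and zero-initializes the rest. A minimal sketch with a hypothetical stand-in struct (the layout and the flag value are placeholders, not the kernel's definitions):

```c
#include <stddef.h>

#define DEMO_GFP_ZERO 0x100u	/* placeholder, not the kernel's __GFP_ZERO */

/* Hypothetical stand-in for struct kvm_mmu_memory_cache. */
struct demo_cache {
	int nobjs;
	unsigned int gfp_zero;
	void *kmem_cache;
};

struct demo_cache demo_cache_init(void)
{
	/*
	 * Only gfp_zero is named; C guarantees that every member not
	 * mentioned in the initializer is zero-initialized.
	 */
	struct demo_cache cache = { .gfp_zero = DEMO_GFP_ZERO };

	return cache;
}
```

If a new member were later inserted before gfp_zero, this initializer would still set the intended field, while the positional form { 0, DEMO_GFP_ZERO, NULL } would not.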