Unverified Commit 094e5940 authored by openeuler-ci-bot, committed by Gitee

!2831 [22.03-LTS-SP3] TDP MMU Support

Merge Pull Request from: @yuzhang_intel 
 
Title: Add TDP MMU support for 22.03-LTS-SP3

Content:
This PR adds TDP MMU support in KVM. The main features are as follows (see the sketch after this list):
- Parallel page fault support in KVM MMU.
- Parallel operations to interact with Linux MMU notifier.
- Fast page fault support in TDP MMU (for fast dirty logging).
- Lazy rmap allocation, which opts out of KVM's rmap data structures in TDP MMU mode and saves host memory.
- Misc cleanups and improvements for KVM MMU (e.g., dirty-logging fixes, MMU notifier optimizations, etc.).
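
A minimal sketch of the idea behind the parallel page fault feature, assuming the rwlock-based mmu_lock described in the locking documentation changes below; this is illustrative only, not code from this PR, and example_install_spte() is a hypothetical helper:

```c
/*
 * Illustrative sketch only -- not code from this PR. With the TDP MMU,
 * kvm->arch.mmu_lock is an rwlock, so several vCPUs can handle their
 * stage-2 faults concurrently under the shared (read) lock.
 */
static int example_handle_tdp_fault(struct kvm_vcpu *vcpu, gpa_t gpa)
{
	int r;

	/* Shared lock: other vCPUs may be handling faults at the same time. */
	read_lock(&vcpu->kvm->arch.mmu_lock);

	/*
	 * Walk/build the TDP page tables and install the missing SPTE.
	 * Individual SPTE updates must be atomic (cmpxchg), because other
	 * walkers can race with us while only the read lock is held.
	 */
	r = example_install_spte(vcpu, gpa);	/* hypothetical helper */

	read_unlock(&vcpu->kvm->arch.mmu_lock);
	return r;
}
```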

Intel-kernel issue:
https://gitee.com/open_euler/dashboard?issue_id=I7S3VQ

Test:
Ran all KVM selftest cases for branch intel/OLK-tdp-mmu-5.10 in 3 modes on a Skylake server:
- TDP MMU enabled
- TDP MMU disabled
- EPT disabled

Also compared the above results with those from branch intel/OLK-5.10 in 2 modes (w/ and w/o EPT). No new failures were found.
Note: the "triple_fault_event_test" case failed in both branches; the cause is unrelated to KVM MMU.

Known issue:
N/A

Default config change:
N/A 
 
Link: https://gitee.com/openeuler/kernel/pulls/2831

 

Reviewed-by: Mao Bibo <maobibo@loongson.cn>
Reviewed-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Jun Tian <jun.j.tian@intel.com>
Reviewed-by: Jason Zeng <jason.zeng@intel.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
parents 35e6f352 685f4d8a
+37 −26
@@ -16,7 +16,19 @@ The acquisition orders for mutexes are as follows:
- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.
- Unlike kvm->slots_lock, kvm->slots_arch_lock is released before
  synchronize_srcu(&kvm->srcu).  Therefore kvm->slots_arch_lock
  can be taken inside a kvm->srcu read-side critical section,
  while kvm->slots_lock cannot.

On x86:

- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock

- kvm->arch.mmu_lock is an rwlock.  kvm->arch.tdp_mmu_pages_lock and
  kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and
  cannot be taken without already holding kvm->arch.mmu_lock (typically with
  ``read_lock`` for the TDP MMU, thus the need for additional spinlocks).

Everything else is a leaf: no other lock is taken inside the critical
sections.
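
The ordering described above can be pictured with a short sketch; this is illustrative only (not an excerpt from KVM), and the list manipulation in the middle is a placeholder:

```c
/*
 * Illustrative ordering sketch: the nested spinlock may only be taken
 * while kvm->arch.mmu_lock is already held, typically for read on
 * TDP MMU paths.
 */
read_lock(&kvm->arch.mmu_lock);
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
/* ... add/remove entries on the TDP MMU page lists ... */
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
read_unlock(&kvm->arch.mmu_lock);
```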
@@ -31,25 +43,24 @@ the mmu-lock on x86. Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to
   restore the saved R/X bits. This is described in more detail later below.
   tracking. That means we need to restore the saved R/X bits. This is
   described in more detail later below.

2. Write-Protection: The SPTE is present and the fault is
   caused by write-protect. That means we just need to change the W bit of
   the spte.
2. Write-Protection: The SPTE is present and the fault is caused by
   write-protect. That means we just need to change the W bit of the spte.

What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
SPTE_MMU_WRITEABLE bit on the spte:
What we use to avoid all the race is the Host-writable bit and MMU-writable bit
on the spte:

- SPTE_HOST_WRITEABLE means the gfn is writable on host.
- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
  the gfn is writable on guest mmu and it is not write-protected by shadow
  page write-protection.
- Host-writable means the gfn is writable in the host kernel page tables and in
  its KVM memslot.
- MMU-writable means the gfn is writable in the guest's mmu and it is not
  write-protected by shadow page write-protection.

On fast page fault path, we will use cmpxchg to atomically set the spte W
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
is safe because whenever changing these bits can be detected by cmpxchg.
bit if spte.HOST_WRITEABLE = 1 and spte.WRITE_PROTECT = 1, to restore the saved
R/X bits if for an access-traced spte, or both. This is safe because whenever
changing these bits can be detected by cmpxchg.
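
As a rough illustration of the cmpxchg step (not the in-tree fast_page_fault(); EXAMPLE_W_BIT is a stand-in for the real writable bit):

```c
/*
 * Illustrative only: set the W bit with cmpxchg. If any bit of the SPTE
 * changed since it was read, the exchange fails and the fault falls back
 * to the slow path under the write lock.
 */
#define EXAMPLE_W_BIT	(1ULL << 1)	/* placeholder for the writable bit */

static bool example_fast_make_writable(u64 *sptep, u64 old_spte)
{
	u64 new_spte = old_spte | EXAMPLE_W_BIT;

	return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
}
```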

But we need carefully check these cases:

@@ -178,17 +189,17 @@ See the comments in spte_has_volatile_bits() and mmu_spte_update().
Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits. In this case, when the KVM MMU notifier is called to track accesses to a
page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
by clearing the RWX bits in the PTE and storing the original R & X bits in
some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
PTE (using the ignored bit 62). When the VM tries to access the page later on,
a fault is generated and the fast page fault mechanism described above is used
to atomically restore the PTE to a Present state. The W bit is not saved when
the PTE is marked for access tracking and during restoration to the Present
state, the W bit is set depending on whether or not it was a write access. If
it wasn't, then the W bit will remain clear until a write access happens, at
which time it will be set using the Dirty tracking mechanism described above.
bits. In this case, PTEs are tagged as A/D disabled (using ignored bits), and
when the KVM MMU notifier is called to track accesses to a page (via
kvm_mmu_notifier_clear_flush_young), it marks the PTE not-present in hardware
by clearing the RWX bits in the PTE and storing the original R & X bits in more
unused/ignored bits. When the VM tries to access the page later on, a fault is
generated and the fast page fault mechanism described above is used to
atomically restore the PTE to a Present state. The W bit is not saved when the
PTE is marked for access tracking and during restoration to the Present state,
the W bit is set depending on whether or not it was a write access. If it
wasn't, then the W bit will remain clear until a write access happens, at which
time it will be set using the Dirty tracking mechanism described above.
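
A rough sketch of the marking step described above; the bit positions and names here are placeholders, not the real SPTE layout:

```c
/*
 * Illustrative only: clear RWX so the SPTE is not-present to hardware,
 * and park the original R/X bits (W is deliberately not saved) in
 * ignored bits so a later fast page fault can restore them.
 */
#define EXAMPLE_RWX_MASK	0x7ULL	/* EPT R/W/X live in bits 0-2 */
#define EXAMPLE_SAVED_SHIFT	54	/* hypothetical ignored-bit area */

static u64 example_mark_for_access_tracking(u64 spte)
{
	u64 saved_rx = spte & (EXAMPLE_RWX_MASK & ~0x2ULL);	/* keep R and X only */

	spte &= ~EXAMPLE_RWX_MASK;			/* non-present to hardware */
	spte |= saved_rx << EXAMPLE_SAVED_SHIFT;	/* stash R/X for restore */
	return spte;
}
```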

3. Reference
------------
+1 −6
@@ -29,7 +29,6 @@

#define __KVM_HAVE_ARCH_INTC_INITIALIZED

#define KVM_USER_MEM_SLOTS 512
#define KVM_HALT_POLL_NS_DEFAULT 500000

#include <kvm/arm_vgic.h>
@@ -526,11 +525,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
			      struct kvm_vcpu_events *events);

#define KVM_ARCH_WANT_MMU_NOTIFIER
int kvm_unmap_hva_range(struct kvm *kvm,
			unsigned long start, unsigned long end, unsigned flags);
int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
#define KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS

void kvm_arm_halt_guest(struct kvm *kvm);
void kvm_arm_resume_guest(struct kvm *kvm);
+1 −1
@@ -5,12 +5,12 @@
#ifndef __ASM_SPINLOCK_H
#define __ASM_SPINLOCK_H

#include <asm/qrwlock.h>
#include <asm/qspinlock.h>
#include <asm/paravirt.h>

/* How long a lock should spin before we consider blocking */
#define SPIN_THRESHOLD			(1 << 15)
#include <asm/qrwlock.h>

/* See include/linux/spinlock.h */
#define smp_mb__after_spinlock()	smp_mb()
+31 −87
@@ -872,7 +872,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
	 * gfn_to_pfn_prot (which calls get_user_pages), so that we don't risk
	 * the page we just got a reference to gets unmapped before we have a
	 * chance to grab the mmu_lock, which ensure that if the page gets
	 * unmapped afterwards, the call to kvm_unmap_hva will take it away
	 * unmapped afterwards, the call to kvm_unmap_gfn will take it away
	 * from us again properly. This smp_rmb() interacts with the smp_wmb()
	 * in kvm_mmu_notifier_invalidate_<page|range_end>.
	 */
@@ -1106,126 +1106,70 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
	return ret;
}

static int handle_hva_to_gpa(struct kvm *kvm,
			     unsigned long start,
			     unsigned long end,
			     int (*handler)(struct kvm *kvm,
					    gpa_t gpa, u64 size,
					    void *data),
			     void *data)
{
	struct kvm_memslots *slots;
	struct kvm_memory_slot *memslot;
	int ret = 0;

	slots = kvm_memslots(kvm);

	/* we only care about the pages that the guest sees */
	kvm_for_each_memslot(memslot, slots) {
		unsigned long hva_start, hva_end;
		gfn_t gpa;

		hva_start = max(start, memslot->userspace_addr);
		hva_end = min(end, memslot->userspace_addr +
					(memslot->npages << PAGE_SHIFT));
		if (hva_start >= hva_end)
			continue;

		gpa = hva_to_gfn_memslot(hva_start, memslot) << PAGE_SHIFT;
		ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
	}

	return ret;
}

static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
{
	unsigned flags = *(unsigned *)data;
	bool may_block = flags & MMU_NOTIFIER_RANGE_BLOCKABLE;

	__unmap_stage2_range(&kvm->arch.mmu, gpa, size, may_block);
	return 0;
}

int kvm_unmap_hva_range(struct kvm *kvm,
			unsigned long start, unsigned long end, unsigned flags)
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
	if (!kvm->arch.mmu.pgt)
		return 0;

	trace_kvm_unmap_hva_range(start, end);
	handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, &flags);
	return 0;
}

static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
{
	kvm_pfn_t *pfn = (kvm_pfn_t *)data;

	WARN_ON(size != PAGE_SIZE);
	__unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
			     (range->end - range->start) << PAGE_SHIFT,
			     range->may_block);

	/*
	 * The MMU notifiers will have unmapped a huge PMD before calling
	 * ->change_pte() (which in turn calls kvm_set_spte_hva()) and
	 * therefore we never need to clear out a huge PMD through this
	 * calling path and a memcache is not required.
	 */
	kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, gpa, PAGE_SIZE,
			       __pfn_to_phys(*pfn), KVM_PGTABLE_PROT_R, NULL);
	return 0;
}

int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	unsigned long end = hva + PAGE_SIZE;
	kvm_pfn_t pfn = pte_pfn(pte);
	kvm_pfn_t pfn = pte_pfn(range->pte);

	if (!kvm->arch.mmu.pgt)
		return 0;

	trace_kvm_set_spte_hva(hva);
	WARN_ON(range->end - range->start != 1);

	/*
	 * We've moved a page around, probably through CoW, so let's treat it
	 * just like a translation fault and clean the cache to the PoC.
	 */
	clean_dcache_guest_page(pfn, PAGE_SIZE);
	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pfn);

	/*
	 * The MMU notifiers will have unmapped a huge PMD before calling
	 * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
	 * therefore we never need to clear out a huge PMD through this
	 * calling path and a memcache is not required.
	 */
	kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
			       PAGE_SIZE, __pfn_to_phys(pfn),
			       KVM_PGTABLE_PROT_R, NULL);

	return 0;
}

static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	pte_t pte;
	u64 size = (range->end - range->start) << PAGE_SHIFT;
	kvm_pte_t kpte;
	pte_t pte;

	if (!kvm->arch.mmu.pgt)
		return 0;

	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
	kpte = kvm_pgtable_stage2_mkold(kvm->arch.mmu.pgt, gpa);

	kpte = kvm_pgtable_stage2_mkold(kvm->arch.mmu.pgt,
					range->start << PAGE_SHIFT);
	pte = __pte(kpte);
	return pte_valid(pte) && pte_young(pte);
}

static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
{
	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
	return kvm_pgtable_stage2_is_young(kvm->arch.mmu.pgt, gpa);
}

int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	if (!kvm->arch.mmu.pgt)
		return 0;
	trace_kvm_age_hva(start, end);
	return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
}

int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
{
	if (!kvm->arch.mmu.pgt)
		return 0;
	trace_kvm_test_age_hva(hva);
	return handle_hva_to_gpa(kvm, hva, hva + PAGE_SIZE,
				 kvm_test_age_hva_handler, NULL);
	return kvm_pgtable_stage2_is_young(kvm->arch.mmu.pgt,
					   range->start << PAGE_SHIFT);
}

phys_addr_t kvm_mmu_get_httbr(void)
+0 −66
@@ -136,72 +136,6 @@ TRACE_EVENT(kvm_mmio_emulate,
		  __entry->vcpu_pc, __entry->instr, __entry->cpsr)
);

TRACE_EVENT(kvm_unmap_hva_range,
	TP_PROTO(unsigned long start, unsigned long end),
	TP_ARGS(start, end),

	TP_STRUCT__entry(
		__field(	unsigned long,	start		)
		__field(	unsigned long,	end		)
	),

	TP_fast_assign(
		__entry->start		= start;
		__entry->end		= end;
	),

	TP_printk("mmu notifier unmap range: %#016lx -- %#016lx",
		  __entry->start, __entry->end)
);

TRACE_EVENT(kvm_set_spte_hva,
	TP_PROTO(unsigned long hva),
	TP_ARGS(hva),

	TP_STRUCT__entry(
		__field(	unsigned long,	hva		)
	),

	TP_fast_assign(
		__entry->hva		= hva;
	),

	TP_printk("mmu notifier set pte hva: %#016lx", __entry->hva)
);

TRACE_EVENT(kvm_age_hva,
	TP_PROTO(unsigned long start, unsigned long end),
	TP_ARGS(start, end),

	TP_STRUCT__entry(
		__field(	unsigned long,	start		)
		__field(	unsigned long,	end		)
	),

	TP_fast_assign(
		__entry->start		= start;
		__entry->end		= end;
	),

	TP_printk("mmu notifier age hva: %#016lx -- %#016lx",
		  __entry->start, __entry->end)
);

TRACE_EVENT(kvm_test_age_hva,
	TP_PROTO(unsigned long hva),
	TP_ARGS(hva),

	TP_STRUCT__entry(
		__field(	unsigned long,	hva		)
	),

	TP_fast_assign(
		__entry->hva		= hva;
	),

	TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
);

TRACE_EVENT(kvm_set_way_flush,
	    TP_PROTO(unsigned long vcpu_pc, bool cache),
	    TP_ARGS(vcpu_pc, cache),